An easy and cheap way to achieve hot back­up and dis­as­ter re­cov­ery with XA...

This article is the second part of a series. Whereas the first part introduced the problem, this part presents a basic, cheap and easy solution that maintains a hot backup for zero data loss in case of disaster.

The ba­sic idea: syn­chro­nous DBMS repli­ca­tion with XA

All so­lu­tions pre­sent­ed in this ar­ti­cle are built upon the same ba­sic no­tion of syn­chro­nous DBMS repli­ca­tion with XA:

[Figure: hot backups with XA and a replica DBMS (hot-backups-with-xa-and-replica-dbms.png)]

Every update is committed to both the primary and the secondary DBMS within one XA transaction. Assuming that the primary and secondary DBMS are in different data centers, this ensures that we always have a copy of the data. It gives us a hot backup at all times, without the need for expensive DBMS replication technology. It's also ideal for cloud environments since you don't need heavy enterprise DBMS tools to set up the replication. Instead, our cloud-native transaction technology is all you need.
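To make the idea concrete, here is a minimal sketch of a two-phase commit that applies the same writes to both DBMS. It is a toy in-memory model, not our product's API: the class ToyDbms, the function replicate and the "available" flag are hypothetical names invented for this illustration, and the transaction log that a real XA coordinator writes is left out.

```python
class ToyDbms:
    """In-memory stand-in for an XA-capable DBMS (hypothetical, not a real driver)."""

    def __init__(self, name):
        self.name = name
        self.data = {}        # committed state
        self.prepared = {}    # txid -> pending writes (in-doubt until resolved)
        self.available = True

    def prepare(self, txid, writes):
        if not self.available:
            raise ConnectionError(f"{self.name} is unreachable")
        self.prepared[txid] = dict(writes)

    def commit(self, txid):
        self.data.update(self.prepared.pop(txid))

    def rollback(self, txid):
        self.prepared.pop(txid, None)


def replicate(txid, writes, primary, secondary):
    """Apply the same writes to both DBMS atomically; return True on commit."""
    participants = [primary, secondary]
    prepared = []
    try:
        for db in participants:      # phase 1: every participant must prepare
            db.prepare(txid, writes)
            prepared.append(db)
    except ConnectionError:
        for db in prepared:          # any failure: roll back whoever prepared
            db.rollback(txid)
        return False
    for db in participants:          # phase 2: commit on both
        db.commit(txid)
    return True


primary = ToyDbms("primary")
secondary = ToyDbms("secondary")
ok = replicate("tx1", {"account-42": 100}, primary, secondary)

secondary.available = False          # simulate an unreachable secondary
failed = replicate("tx2", {"account-42": 50}, primary, secondary)
```

In this sketch the first call commits on both sides, while the second call rolls back on the primary because the secondary could not prepare, so the two copies never diverge.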

The ap­pli­ca­tion could be a web ap­pli­ca­tion, a mi­cro-ser­vice or any oth­er serv­er-based ap­pli­ca­tion.

Note that cloud plat­forms al­ready in­clude back­up / failover so­lu­tions - but these are typ­i­cal­ly not hot back­ups, plus they de­pend on ven­dor-spe­cif­ic mech­a­nisms (mak­ing your cloud ap­pli­ca­tions less portable across plat­forms). Our cus­tomers (most­ly in fi­nan­cial ser­vices) of­ten pre­fer XA be­cause they al­ready use it and have a lot of ex­pe­ri­ence with it.

Deal­ing with dis­as­ter

Let's briefly go over how you can deal with disaster scenarios...

Sud­den and per­ma­nent loss of in­com­ing re­quests

We assume that requests can get lost; it is up to the client (not shown) to retry failed requests. This means we also assume that the client can consult the primary and/or secondary DBMS to determine whether a retry is needed. Clients can assume that requests are handled atomically, i.e. either both primary and secondary are updated or both have rolled back (how this works should become clear below).
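A client-side retry loop along these lines could look as follows. This is only a sketch under assumptions: the function submit_with_retry and its two hooks, send and was_applied, are hypothetical names, and we assume each request carries a unique id that is stored together with its update so that the client can look it up.

```python
def submit_with_retry(request_id, send, was_applied, max_attempts=3):
    """Send a request until it is known to have been applied.

    send(request_id)        -> True on a positive response; may raise on failure
    was_applied(request_id) -> True if the DBMS already holds the result
    Both callables are hypothetical hooks supplied by the caller.
    """
    for _ in range(max_attempts):
        try:
            if send(request_id):
                return True
        except ConnectionError:
            pass  # the response was lost; the update may or may not exist
        if was_applied(request_id):
            return True  # committed despite the lost response: do not retry
    return False


# Example: the update commits, but the response never reaches the client.
applied = set()
calls = []

def flaky_send(request_id):
    calls.append(request_id)
    applied.add(request_id)
    raise ConnectionError("response lost")

result = submit_with_retry("req-1", flaky_send, lambda rid: rid in applied)
```

Checking the DBMS before retrying is what keeps a lost response from turning into a duplicate update.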

Sud­den and per­ma­nent loss of the pri­ma­ry DBMS

The pri­ma­ry DBMS can be re­stored from the sec­ondary, since they are al­ways kept in sync.

Be­fore re­con­struct­ing it, the fol­low­ing steps are need­ed:

  • The former primary's pending transactions have to be purged from the transaction logs since they no longer have any use (the reconstructed primary will not remember any pending transactions) - this will soon be available as part of the LogCloud technology.
  • The system is temporarily put into read-only mode (meaning client requests for updates will temporarily fail, unless the technique explained in the next part is used).
  • Distributed transaction recovery for the secondary is allowed to terminate before the primary is reconstructed, so all pending transactions are terminated and a clean, stable database view is available without any pending locks. Again, this will soon be part of the LogCloud technology.

The last step is nec­es­sary be­cause at the time of dis­as­ter, al­most by de­f­i­n­i­tion there will be pend­ing trans­ac­tions in both the (lost) pri­ma­ry and the sec­ondary DBMS. This step en­sures that the sec­ondary DBMS is in a qui­et state (no pend­ing up­dates) when a new pri­ma­ry is cre­at­ed from it. Other­wise, we would risk run­ning into locks be­cause of re­main­ing in-doubt trans­ac­tions - which will ef­fec­tive­ly be cleaned up by the trans­ac­tion re­cov­ery.
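The three steps above can be sketched in a few lines. This is a toy model under loudly stated assumptions: the function rebuild_primary, the dictionary shapes and the "COMMIT" marker are all invented for this illustration, and we assume that an in-doubt transaction without a logged commit decision is rolled back.

```python
import copy


def rebuild_primary(secondary_data, secondary_prepared, tx_log, system):
    """Return the data for a new primary, following the three steps above.

    secondary_data     : dict of committed rows on the surviving secondary
    secondary_prepared : dict txid -> pending writes (in-doubt on the secondary)
    tx_log             : dict txid -> {"decision": str, "participants": set}
    system             : dict with a "read_only" flag
    All of these shapes are assumptions made for this sketch.
    """
    # Step 1: forget the lost primary in every logged transaction.
    for entry in tx_log.values():
        entry["participants"].discard("primary")

    # Step 2: refuse updates while the rebuild is in progress.
    system["read_only"] = True

    # Step 3: terminate in-doubt transactions on the secondary.
    for txid in list(secondary_prepared):
        writes = secondary_prepared.pop(txid)
        entry = tx_log.pop(txid, None)
        if entry is not None and entry["decision"] == "COMMIT":
            secondary_data.update(writes)  # commit was decided: apply it
        # otherwise the transaction is rolled back by dropping its writes

    # The secondary is now quiet, so a snapshot is safe to copy.
    new_primary_data = copy.deepcopy(secondary_data)
    system["read_only"] = False
    return new_primary_data


data = {"a": 1}
prepared = {"t1": {"b": 2}, "t2": {"c": 3}}
log = {"t1": {"decision": "COMMIT", "participants": {"primary", "secondary"}}}
system = {"read_only": False}
new_primary = rebuild_primary(data, prepared, log, system)
```

In this example t1 had a logged commit decision and is applied, while t2 had none and is dropped, so the new primary starts from a state with no pending transactions.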

This works cor­rect­ly, be­cause at the time of a dis­as­ter:
  • All pos­i­tive re­spons­es pre­vi­ous­ly re­turned to the client are still tak­en into ac­count in the sec­ondary DBMS state (be­cause a pos­i­tive re­turn val­ue is only sent af­ter suc­cess­ful com­mit, mean­ing af­ter both DBMS have com­mit­ted) - so there is no data loss
  • All pend­ing trans­ac­tions are ter­mi­nat­ed cor­rect­ly by the trans­ac­tion re­cov­ery and the re­sults copied to the new pri­ma­ry, and
  • Th­ese pend­ing trans­ac­tions be­come vis­i­ble in both DBMS af­ter the re­store is done - i.e., "even­tu­al con­sis­ten­cy"

Sud­den and per­ma­nent loss of the sec­ondary DBMS

In a sim­i­lar way, the sec­ondary DBMS can be re­stored from the pri­ma­ry.

Sud­den and per­ma­nent loss of the trans­ac­tion logs

For now, there is not much that can be done if the transaction logs are lost, so mirrored or otherwise replicated disks are highly recommended. Disk replication is presumably cheaper than full DBMS replication technology, so we think this is acceptable. In the future, we may be able to eliminate the logs, but for now this is what has to be done. With LogCloud, only the dedicated logging and recovery service needs these mirrored disks. This can be done in a private cloud data center, for instance.
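Mirroring is normally done at the disk or storage level, as recommended above. Purely to illustrate the principle (no log record counts as written until every copy is durable), here is an application-level sketch that swaps in a dual write to two files. The function append_mirrored, the file names and the record format are hypothetical.

```python
import os
import tempfile


def append_mirrored(record, paths):
    """Append one log record to every path and flush each copy to disk."""
    line = (record + "\n").encode("utf-8")
    for path in paths:
        with open(path, "ab") as f:
            f.write(line)
            f.flush()
            os.fsync(f.fileno())  # do not continue until this copy is durable


# Two files standing in for two mirrored disks (assumption for this sketch).
log_dir = tempfile.mkdtemp()
paths = [os.path.join(log_dir, "log-a"), os.path.join(log_dir, "log-b")]
append_mirrored("COMMIT tx1", paths)
append_mirrored("COMMIT tx2", paths)
```

If one copy is lost, the other still holds every record that was acknowledged.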

Com­bined loss­es

Deal­ing with com­bined loss­es means we have to cope with two or more sys­tems fail­ing to­geth­er. While this is cer­tain­ly hard­er to deal with, the whole as­sump­tion be­hind a pri­ma­ry and sec­ondary DBMS is that it is ex­treme­ly un­like­ly that two sys­tems will fail at the same time. So by de­f­i­n­i­tion, com­bined loss­es are be­yond the scope of this ar­chi­tec­ture be­cause they make the idea of pri­ma­ry / sec­ondary hot back­ups point­less in the first place.

Of course, to pre­vent com­bined loss­es all of the re­sources (pri­ma­ry DBMS, sec­ondary DBMS and trans­ac­tion logs) should be host­ed in dif­fer­ent data cen­ters.

Wrap­ping up

That's it: we've outlined a cheap and easy way to set up a hot-backup architecture with zero data loss when disaster strikes one of the two DBMS. What required very expensive enterprise software in the past can now be done much more cheaply and easily thanks to our cloud-native transaction processing software!

Note that while XA seemed to be a draw­back in part 1 (mak­ing the prob­lem a bit more com­plex), it ac­tu­al­ly turned out to be an ad­van­tage for the so­lu­tion.

What's next?

Stay tuned for the next part in this se­ries, where we will show an­oth­er cheap and easy trick to scale things up hor­i­zon­tal­ly - and even avoid failed client up­date re­quests when the sys­tem is do­ing failover.

Can't wait?

Do you pre­fer to get start­ed and try things on your own?

Down­load our FREE JTA/XA here

Your take

So what is your ex­pe­ri­ence with dis­as­ter re­cov­ery? Feel free to share any com­ments be­low…

Cor­po­rate In­for­ma­tion

Atomikos Cor­po­rate Head­quar­ters
Hove­niersstraat, 39/1, 2800
Meche­len, Bel­gium

Copy­right 2026 Atomikos BVBA | Our Pri­va­cy Pol­i­cy