One of the lat­est chal­lenges in com­put­er sci­ence seems to be the CAP the­o­rem. It ad­dress­es a per­ceived im­pos­si­bil­i­ty of build­ing large-scale and clus­tered (web) ser­vice ar­chi­tec­tures. The fact that it (sup­pos­ed­ly) has been proven to be true makes what I am go­ing to write here all the more un­like­ly. Still, read on be­cause I will show that I am right and CAP is not an im­pos­si­bil­i­ty af­ter all... While the im­pos­si­bil­i­ty proof of CAP is math­e­mat­i­cal­ly cor­rect, it is based on as­sump­tions that are too strict. By re­lax­ing these as­sump­tions, I found the so­lu­tion pre­sent­ed here.

What is CAP?

The CAP the­o­rem (short for con­sis­ten­cy, avail­abil­i­ty, par­ti­tion-tol­er­ant) es­sen­tial­ly states that you can­not have a clus­tered sys­tem that sup­ports all of the fol­low­ing three qual­i­ties:

Con­sis­ten­cy
Con­sis­ten­cy is a qual­i­ty mean­ing (in­for­mal­ly speak­ing) that reads and writes hap­pen cor­rect­ly. In oth­er words, the over­all ef­fect of ex­e­cut­ing thou­sands or mil­lions of trans­ac­tions con­cur­rent­ly is the same as if they had been ex­e­cut­ed one-at-a-time. Usu­al­ly, this is done with the help of a trans­ac­tion man­ag­er of some sort.
Avail­abil­i­ty
Avail­abil­i­ty es­sen­tial­ly means that every op­er­a­tion (that makes it to a non-fail­ing node) even­tu­al­ly re­turns a re­sult.
Par­ti­tion-tol­er­ant
This qual­i­ty refers to the pos­si­bil­i­ty of tol­er­at­ing par­ti­tions on the net­work. Note that we sup­pose a clus­ter ar­chi­tec­ture (which is where the net­work comes in).

CAP is a con­jec­ture orig­i­nal­ly for­mu­lat­ed by Eric Brew­er (Ink­to­mi) and has in­flu­enced many of to­day's larg­er-scale web­sites like . In oth­er words, the im­pact of CAP is very large. To make it worse, the per­ceived im­pos­si­bil­i­ty of a CAP sys­tem (one that has all three de­sir­able prop­er­ties) has lead peo­ple to ad­vo­cate some­thing called BASE (Ba­si­cal­ly Avail­able, Soft-state and Even­tu­al­ly Con­sis­tent) - see this talk by Wern­er Vo­gels (CTO at Ama­zon).

As far as I know (but I could be wrong), a the­o­ret­i­cal foun­da­tion of BASE does not ex­ist yet (it seems more of an in­for­mal ap­proach which to me rais­es se­ri­ous ques­tions con­cern­ing cor­rect­ness). In this post I will present:
  • a CAP so­lu­tion
  • how this con­forms to what BASE wants to achieve
  • a "de­sign pat­tern" for build­ing cor­rect sys­tems that (in a way) of­fer both CAP and BASE qual­i­ties
Be­cause CAP is per­ceived as im­pos­si­ble and be­cause BASE lacks for­mal treat­ment, I con­sid­er this to be a sig­ni­fi­ca­tion con­tri­bu­tion to the state of to­day's en­gi­neer­ing;-)

What about the proof of Brew­er's the­o­rem?

Brew­er's proof has been pub­lished by Nan­cy Lynch et al and dis­cussed by me (see my ear­li­er post and also this one).

While the the­o­ret­i­cal proof of the im­pos­si­bil­i­ty of CAP is valid, it has a big lim­i­ta­tion: it as­sumes that all three CAP prop­er­ties have to be sup­plied at the same mo­ment in time. If you drop this as­sump­tion, then all of a sud­den you get into a new spec­trum of pos­si­bil­i­ties. This is what I will do here.

A CAP so­lu­tion

Enough talk, let's get to the core of the mat­ter. Here is my so­lu­tion to CAP. To make it con­crete, I will use the con­cept of a web-shop like Ama­zon. Here are the rules that are suf­fi­cient to en­sure CAP:
  1. Pro­cess reads from the data­base if pos­si­ble, or use a cached val­ue if need­ed for avail­abil­i­ty (if the DB is un­reach­able).
  2. All reads use ver­sion­ing or an­oth­er mech­a­nism that al­lows op­ti­mistic lock­ing.
  3. Up­dates sup­plied by clients (or­ders in case of Ama­zon) are queued for ex­e­cu­tion, and in­clude the ver­sion­ing in­for­ma­tion of the reads that lead to the up­date.
  4. Queued up­dates are processed when the num­ber of par­ti­tions is low enough to do so. The eas­i­est way to do this is with a clus­ter-wide dis­trib­uted trans­ac­tion across all repli­cas (more on scal­a­bil­i­ty lat­er), but oth­er more re­fined ways are pos­si­ble (such as quo­rum-based repli­ca­tion or any oth­er smart way of repli­cat­ing). The ver­sion in­for­ma­tion in the up­date is used to val­i­date it: if the data in the data­base has been mod­i­fied since the orig­i­nal read(s) that lead to the up­date, the up­date is re­ject­ed and a can­cel­la­tion is re­port­ed back to the client. Other­wise the or­der is processed and a con­fir­ma­tion is re­port­ed back to the client.
  5. The re­sults (con­fir­ma­tion or can­cel­la­tion) are sent asyn­chro­nous­ly to the clients. This can be ei­ther email, mes­sage queu­ing, or any oth­er asyn­chro­nous de­liv­ery method.
That's it. Ad­here to these guide­lines, and you have a CAP ar­chi­tec­ture. I will not pro­vide a for­mal proof here (I in­tend to do that else­where, in a re­search pa­per), but in­tu­itive­ly the proof is as fol­lows:
  • This sys­tem is con­sis­tent be­cause reads are based on snap­shots and in­cor­rect up­dates are re­ject­ed be­fore they are ap­plied. In oth­er words: there are no in­cor­rect ex­e­cu­tions.
  • This sys­tem is avail­able since reads al­ways re­turn a val­ue, and so do writes (even though they are queued and it may take a while).
  • This sys­tem is par­ti­tion-tol­er­ant be­cause it al­lows net­work and node fail­ures.
Grant­ed, this sys­tem does not pro­vide all three at the same mo­ment in time (which is how we go around the im­pos­si­bil­i­ty), but nev­er­the­less the re­sult is quite strong IMHO.

The lim­i­ta­tions

There are some lim­i­ta­tions to this so­lu­tion - all of which seem rea­son­able:
  1. Read-only re­quests may be pre­sent­ed with stale in­for­ma­tion (due to up­dates that have yet-to-be-ap­plied). In that sense, their re­sults could be "in­con­sis­tent": for in­stance, the avail­abil­i­ty of an Ama­zon item can change be­tween two page views. I do not see this as a ma­jor re­stric­tion, since no web­site that I know of will of­fer read con­sis­ten­cy for the du­ra­tion of a user ses­sion. It all de­pends on what you con­sid­er to be with­in the scope of one trans­ac­tion;-) Note that this al­most cor­re­sponds to snap­shot iso­la­tion found in Or­a­cle.
  2. Par­ti­tions should not last for­ev­er: in or­der for this to work, par­ti­tions should be re­solved with­in a rea­son­able time (rea­son­able be­ing: with­in the ex­pect­ed con­fir­ma­tion time for up­dates). The du­ra­tion of any par­ti­tions also af­fects the time win­dow in which reads can pro­duce stale data.
  3. The up­dates have to be ap­plied in the same rel­a­tive or­der at all clus­ter nodes. This puts some re­stric­tions on the al­go­rithm used to do this.
Note that up­dates are al­ways based on cor­rect reads thanks to the ver­sion­ing check be­fore they are ap­plied. So up­date trans­ac­tions are al­ways con­sis­tent.

How does this re­late to BASE?

You could see this as a de­sign pat­tern for BASE if you like. The so­lu­tion ad­heres to BASE in the sense that it uses cached reads (if need­ed) and that the up­dates are de­layed (so you could say they are "even­tu­al­ly" ap­plied and the sys­tem be­comes "con­sis­tent").

Re­flec­tions in scal­a­bil­i­ty

So far the CAP fo­cus was on pos­si­bil­i­ty. I think my so­lu­tion shows that it is pos­si­ble. Now how about scal­ing up?

The naive so­lu­tion (a huge dis­trib­uted trans­ac­tion to up­date all clus­ter nodes in-sync) is un­like­ly to scale: as you add more nodes, more up­dates are need­ed. Now I am a big fan of trans­ac­tions, but not to use them in an ar­bi­trary mat­ter. So how to prop­a­gate these up­dates through the clus­ter?

While smarter so­lu­tions for this ex­ist (such as the work by Bet­ti­na Kemme), a triv­ial first try would be to push up­dates (lazi­ly) to all nodes in the clus­ter. This can be done with a smart queu­ing mech­a­nism. The dis­ad­van­tage is that up­dates are not ap­plied every­where at once (rather, the all-or-noth­ing qual­i­ty just "rip­ples" through the sys­tem). So you get into the "even­tu­al­ly" style again.

Note that this lat­ter sug­ges­tion makes the sys­tem be­have much like the READ COMMITTED iso­la­tion lev­el (which, by the way, is the de­fault in Or­a­cle). So this ap­proach sac­ri­fices con­sis­ten­cy/iso­la­tion a bit in fa­vor of scal­a­bil­i­ty.

Fu­ture work

Ad­di­tion­al re­search could/should be done in the fol­low­ing ar­eas:
  • Im­prov­ing read con­sis­ten­cy through ses­sion affin­i­ty
  • The best way to push the up­dates through the clus­ter
  • Per­for­mance eval­u­a­tion in real life im­ple­men­ta­tions
Fi­nal note and dis­claimer

I did not see Brew­er's orig­i­nal pre­sen­ta­tion of the CAP the­o­rem - so it could be that what he meant with con­sis­ten­cy also in­volved all reads (see the lim­i­ta­tions of the so­lu­tion I pre­sent­ed here). In that case I did not find a so­lu­tion for CAP but at least it is a frame­work and proof out­line for BASE winking face

UPDATE 15/3/2012:

It seems like Greg Young and Udi Da­han have been work­ing along sim­i­lar lines and gave this pat­tern/so­lu­tion a name: CQRS.

RSS

6 Com­ments

1 picture-0ec82e2df0e78b48e3d5dc7f6582457a.jpegAnony­mous|10 Nov 2008 - 14:10|
0

Hel­lo Guy,

Thanks for your very in­ter­est­ing pa­pers on trans­ac­tion. The in­ter­est of the CAP the­o­rem is to say that Con­sis­ten­cy, Avail­abil­i­ty and par­ti­tion­ing can­not be reach per­fect­ly AT THE SAME TIME. More the par­ti­tion­ing is im­por­tant more the CAP the­o­rem be­comes ac­cu­rate. I have been work­ing for many years on dis­trib­uted cache is­sues and when our cus­tomers want to share dat­a­cache be­tween Lon­don, New York and Tokyo, the CAP the­o­rem be­comes fun­da­men­tal. It is easy to say we can get C.A.P at the same time when all your de­vice are at the same place and the Par­ti­tion­ing is weak.

Best re­gards

Paul Perez

2 picture-6335faf0051bc60a4bbb512df5818b68.jpegGuy|10 Nov 2008 - 17:10|
0

Hi Paul,

Thanks. You are right, the im­por­tant no­tion seems to be "at the same time". It's just that that no­tion seems flex­i­ble if you have the op­tion to be asyn­chro­nous (which I as­sume is not eas­i­ly the case for dis­trib­uted caches).

How­ev­er, I do think it is pos­si­ble to present a suite of al­go­rithms that of­fer dif­fer­ent char­ac­ter­is­tics and qual­i­ties of ser­vice. That is what I in­tend to do in the near fu­ture - if time al­lows.

3 picture-0ec82e2df0e78b48e3d5dc7f6582457a.jpegAnony­mous|01 Dec 2008 - 12:24|
0

Hi Guy,
I don't like to be anony­mous (anony­mous said...) My name is Paul Perez, my email is paul.perez@pym­ma.com.

As you said The key point on CAP the­o­rem is the "At the same time" It looks like the Heisen­berg un­cer­tain­ty prin­ci­ple, more you ap­proach quan­tum size more the Heisen­berg prin­ci­ple is ac­cu­rate. With the CAP the­o­rem, more Par­ti­tioned is your sys­tem, less you can get A&P at the same time.

Best re­gards

PPe

I had like you pat­tern on TCC. I try to im­ple­ment it with BPEL or­ches­tra­tion. Are there tech­ni­cal sam­ple on that top­ic. thanks (on my email please)

4 picture-6ee22ae254c2dcbde534869c1578ec98.jpegEn­der|18 Dec 2008 - 13:08|
0

Hi Guy,

I was cu­ri­ous, how re­al­is­tic would this so­lu­tion be? As I un­der­stand it, when you process an up­date from the queue, the ver­sion of the data­base changes, and hence all the oth­er up­dates in the queue (who would have a dif­fer­ent ver­sion num­ber) are re­ject­ed, so you would have a lot of re­ject­ed up­dates, which would mean a lot of emails to send to cus­tomers to tell them to place their or­der again. Or am I mis­un­der­stand­ing some­thing?

Re­gards

5 picture-6335faf0051bc60a4bbb512df5818b68.jpegGuy|18 Dec 2008 - 15:14|
0

Hi,

Ac­tu­al­ly, what I pro­pose is sim­i­lar to op­ti­mistic lock­ing (used in Hiber­nate, and in most web ap­pli­ca­tions any­way): up­dates in the queue are in­deed based on some "snap­shot" of the data and are ap­plied af­ter­wards (when that data might have changed al­ready). This is also how many web ap­pli­ca­tions work - only there the HTTP ses­sion plays the role of the 'queue'.

This tech­nique works pret­ty well ex­cept if you have 'hot spot' data - mean­ing data that changes ex­treme­ly fre­quent due to high con­cur­ren­cy. There has been ex­ten­sive re­search on this (just google for "op­ti­mistic lock­ing" - it has been a while since my re­search days;-)

Guy

6 picture-75dfb2c0adf2a7d6e44916b36f31b559.jpegHow to lever­age Atomikos in the cloud « blog.atom­ikos.com|04 Jun 2011 - 11:24|
0

[...] Luck­i­ly, the so­lu­tion to the prob­lem can be as sim­ple as out­lined in our so­lu­tion to the CAP prob­lem. [...]

Cor­po­rate In­for­ma­tion

Atomikos Cor­po­rate Head­quar­ters
Hove­niersstraat, 39/1, 2800
Meche­len, Bel­gium

Con­tact Us

Copy­right 2026 Atomikos BVBA | Our Pri­va­cy Pol­i­cy
By us­ing this site you agree to our cook­ies. More info. That's Fine