The CAP theorem (short for consistency, availability, partition-tolerant) essentially states that you cannot have a clustered system that supports all of the following three qualities:
You could see this as a design pattern for BASE if you like. The solution adheres to BASE in the sense that it uses cached reads (if needed) and that the updates are delayed (so you could say they are "eventually" applied and the system becomes "consistent").
Reflections in scalabilitySo far the CAP focus was on possibility. I think my solution shows that it is possible. Now how about scaling up?
The naive solution (a huge distributed transaction to update all cluster nodes in-sync) is unlikely to scale: as you add more nodes, more updates are needed. Now I am a big fan of transactions, but not to use them in an arbitrary matter. So how to propagate these updates through the cluster?
While smarter solutions for this exist (such as the work by Bettina Kemme), a trivial first try would be to push updates (lazily) to all nodes in the cluster. This can be done with a smart queuing mechanism. The disadvantage is that updates are not applied everywhere at once (rather, the all-or-nothing quality just "ripples" through the system). So you get into the "eventually" style again. Note that this latter suggestion makes the system behave much like the READ COMMITTED isolation level (which, by the way, is the default in Oracle). So this approach sacrifices consistency/isolation a bit in favor of scalability. Future work Additional research could/should be done in the following areas:What makes it even better is that this methodology seems to achieve equal productivity as with the 4GL environment in Forté, which is pretty good given that Java is a 3GL and is not widely known as a productivity miracle.
There is a way around this limitation - although it may sound exotic: just make sure that there are no partitions when requests are served.
How? By simply doing the following:
Since no partition (hopefully) lasts forever, this solution does not lead to livelock.
Also, note that quorum solutions exist to avoid that the complete cluster has to be up at the same time.
Is this the capitulation of CAP? Who knows…
The CAP theorem is essentially a limitation on what you can do with clustered (web) services in the fashionable context of SOA.
The word 'cluster' is important here since that is what it is all about. In particular, the theorem states that you can't have all three properties (Consistency, Availability, Partitioning) in one and the same system (read: service). This implies that there is no perfect solution to building a high-throughput popular service, or is there? Let's first explore what each thing means…
ConsistencyBy consistency, the theorem refers to the property that changes (updates) to the service back-end are visible to later queries. Simplifying: if you add something to your shopping basket then it will appear there next time you retrieve your basket status. That sounds trivial, but it is not if the basket is spread over multiple physical server processes… Consistency is commonly ensured (between processes) by having some sort of distributed transaction coordinator, or (assuming a central back-end) a single centralized database.
AvailabilityThe Lynch paper uses a very simple but sufficient definition of "availability": a system is available if every request to it returns. In other words: there is no infinite blocking.
PartitioningPartitioning means the cut-off between two segments of the cluster. In other words, one or more nodes become unreachable for at least some time.
What is the Theorem saying?You can't have all three of the above qualities, period. However, you can combine any two of them if you like. This is proven in the paper by Lynch et al. Also (and this is important) you can apply different combinations of qualities to parts of your system. Meaning: you can stress consistency in one part, availability in another part, and so on. For instance, order processing or payment processing can be done consistently and available (sacrificing partition tolerance) whereas querying the product catalog can be done differently (stressing partition tolerance in favor of consistency).
Does this contradict or invalidate Atomikos?Not at all, quite the contrary: it makes Atomikos (and its third generation of TP monitors) all the more relevant. Why? Because Atomikos products can help you in making those parts consistent when you want them to be.
Virtually achieving all three qualitiesIf you embrace asynchronous messaging (a la JMS or email) and extreme transaction processing (XTP) then it is possible to asymptotically realize all three qualities (consistency, availability, partition-tolerance) provided that you do use a callback mechanism to communicate results (e.g., by sending a confirmation email). Here is how:
Now did I just break the CAP impossibility? More on this in a next post…