You are here: Blog » Vision

Vision

My take on CAP

03 Sep 2008 | Guy Pardon | Vision
The CAP theorem (Consistency, Availability, Partitioning) has been receiving quite a lot of interest lately, just to mention one of the many references.

What is CAP about?

First let me give credits here: I am deriving my inspiration from the theoretical insights found in this paper co-authored by one of my favorite woman scientists, Nancy Lynch from MIT. If you get a chance to read this paper, go ahead it will bring you some very useful fundamental understanding…

The CAP theorem is essentially a limitation on what you can do with clustered (web) services in the fashionable context of SOA.

The word 'cluster' is important here since that is what it is all about. In particular, the theorem states that you can't have all three properties (Consistency, Availability, Partitioning) in one and the same system (read: service). This implies that there is no perfect solution to building a high-throughput popular service, or is there? Let's first explore what each thing means…

Consistency

By consistency, the theorem refers to the property that changes (updates) to the service back-end are visible to later queries. Simplifying: if you add something to your shopping basket then it will appear there next time you retrieve your basket status. That sounds trivial, but it is not if the basket is spread over multiple physical server processes… Consistency is commonly ensured (between processes) by having some sort of distributed transaction coordinator, or (assuming a central back-end) a single centralized database.

Availability

The Lynch paper uses a very simple but sufficient definition of "availability": a system is available if every request to it returns. In other words: there is no infinite blocking.

Partitioning

Partitioning means the cut-off between two segments of the cluster. In other words, one or more nodes become unreachable for at least some time.

What is the Theorem saying?

You can't have all three of the above qualities, period. However, you can combine any two of them if you like. This is proven in the paper by Lynch et al. Also (and this is important) you can apply different combinations of qualities to parts of your system. Meaning: you can stress consistency in one part, availability in another part, and so on. For instance, order processing or payment processing can be done consistently and available (sacrificing partition tolerance) whereas querying the product catalog can be done differently (stressing partition tolerance in favor of consistency).

Does this contradict or invalidate Atomikos?

Not at all, quite the contrary: it makes Atomikos (and its third generation of TP monitors) all the more relevant. Why? Because Atomikos products can help you in making those parts consistent when you want them to be.

Virtually achieving all three qualities

If you embrace asynchronous messaging (a la JMS or email) and extreme transaction processing (XTP) then it is possible to asymptotically realize all three qualities (consistency, availability, partition-tolerance) provided that you do use a callback mechanism to communicate results (e.g., by sending a confirmation email). Here is how:

  • Queue requests in JMS.
  • Process each request transactionally (so failures will leave the request queued for retries).
  • The process that digests each request can be arbitrarily complex and use transactions (consistency) and return whenever it likes (thanks to the queuing, no reply is expected within a preset time frame).
  • Any lack of availability of the processing is recovered by the queues: failed requests will stay queued until the process in the back-end is in fact available again.

Now did I just break the CAP impossibility? More on this in a next post…

Suppose you want to develop a high-volume transaction processing system in Java/J2EE. How would you do it? Most people would say: don't use JTA/XA transactions because they kill performance. Wrong. And they would also say: use an appserver to scale. Again, they couldn't be more wrong.

Here is the magic recipe on how we build systems with virtually unlimited scalability at Atomikos:

  • Kick out your appserver as soon as you can, as explained here. J2EE is not limited to an appserver. J2EE is a set of APIs. The appserver ties these APIs to a programming model that almost nobody needs. Conclusion: drop the latter.
  • Use a persistent JMS queue to store transaction requests. This allows easy load-balancing and provides crash resilience for ongoing requests. It also de-couples the clients from the transaction processing system.
  • Use ExtremeTransactions to process the requests (stored in JMS). This allows for reliable, exactly-once message processing as outlined here. Make sure to use the supplied JMS and JDBC drivers!
  • To add more power, just add a second VM (process) on a separate CPU.
  • Repeat until performance is high enough.

You will reach the required performance because of the intra-VM nature of each process you add. The only potential bottlenecks are your own database or JMS backend. So scaling comes down to scaling your backends, which is much simpler than scaling your application itself (which has already been done in a natural way as outlined above).

So don't let anybody fool you: transactions do scale - even without limits!.

Extra: Elastic scaling

Our commercial product ExtremeTransactions includes elastic scaling features for cloud deployments. You can apply for a free trial below…

Try ExtremeTransactions For Free

I have talked to a number of people who claim to be doing SOA, when in the end all they do is loosely-coupled design. Let me explain what I mean by an example.

A team of enterprise architects was designing an SOA infrastructure for a bank I know. The system they were building would be based on interfaces, so that it would be possible to deploy parts of the system as separate instances later on. This was their notion of SOA…

The good thing about it is that there are interfaces in their design, meaning it is likely to be loosely-coupled. The bad news is that this is not SOA, at least not in my view: one of the biggest advantages of SOA - reuse in place - is never realized in this way. So, whereas this approach to 'SOA' may be loosely coupled in design, it is not loosely coupled in deployment (which is at least as important).

The consequence? Whenever a 'service' is upgraded, they will need to upgrade all the dependent services and redeploy them. This is because each 'service' is really an embedded module inside other parts of the system.

I guess this also holds for the debate on cloud vs grid computing: in my view, a cloud is more loosely coupled than a grid in its deployment.

Is BPEL a good tool for implementing compensation? It really depends, and you really have to know what you are doing - which (with all respect) doesn't seem the case for most people (not even BPEL specialists). So if not even those experts know, how can we expect the rest of us to know? Hence this blog entry.

For instance, on repeated occasions I have heard renowned BPEL and workflow experts mention that compensating transactions are "perhaps" best modeled at the business logic level. This, by the way, includes Bill Burke in the case of JBoss/jBPM - see here. Note that I emphasized the word "perhaps": this indicates the shade of misunderstanding usually present in the arguments.

I have been saying this here and there in the past (and in fine detail in this article), but I want to repeat it again: BPEL, nor workflow nor WS-BA are ideal for compensation unless the compensating party doesn't care whether it needs to compensate eventually. In other words, if the compensation is business as usual to the provider of the compensatable service then BPEL might be OK (though certainly not desirable - see below).

Why is that? Put yourself in the place of a service that is asked to compensate by a BPEL engine somewhere. Also suppose that you are in a B2B ecosystem where you don't necessarily trust the party that owns the BPEL engine. Now what would you rather do: trust the BPEL to compensate - eventually (which might be never!) or rather deal with compensation yourself, say after a timeout? I would definitely choose the latter. I don't want someone else to decide when I need to compensate. I want to decide for myself, and the Atomikos TCC model allows for that. BPEL and jBPM don't.

So BPEL is ruled out for me - at least as far as compensation goes. What about WS-BA? It is a step in the right direction, but unfortunately it is a bloated protocol, very inefficient and loaded with application-level messages that pollute the compensating part. Even worse, it also suffers in a large part from the lack of timeout and depends on the BPEL to at least trigger compensation.

Also, WS-BA doesn't allow for application logic on close - I won't go and bother you with the entire spec details but it is like a try..catch…finally where the exception is raised by the client (ugly!) and where the finally block can only be empty! Again, Atomikos TCC is far superior, more efficient and more elegant. It is also more natural for compensation than any BPEL engine will ever be.

One last note on BPEL and this supposed "modeling the compensation in the business process": I was talking to an IBM architect the other day. He said that they were doing a large telco project with BPEL to co-ordinate things. One of the things he complained about was exactly this: they have to model the compensation and error logic as explicit workflow paths, and it was literally overloading everything with complexity. Moreover, this complexity is hard to test. As he correctly put it, they were implementing a transaction manager at the business logic (BPEL) level, over and again in every process model. In addition, this was also hard to test he said and that it was virtually killing the project - especially if there were change requests to consider. I believe him:-) I gave him the URL to our TCC article above.

Atomikos and TCC allow you to focus on the happy path of your workflow models. We take care of the rest. Now imagine what a reduction in complexity that is, and how much more reliable things get! So no, compensation should NOT be modeled at the business level. Except on rare occasions maybe.

Whenever I see a presentation on REST I am impressed by its simplicity. With just four operations (GET, POST, PUT, DELETE) it seems to accomplish a simple model for service-oriented architectures, where every business resource has a URL.

With this simplicity, REST also leverages the ubiquitous HTTP protocol as the underlying mechanism. More and more people seem to like this, including me.

However, the big question for me is: how do you make this reliable? Imagine that you integrate 4 systems in a REST style. You would be using HTTP and a synchronous invocation mechanism for each service. Now comes the question: how reliable is this? The answer: less than the least reliable system that you are using! More precisely, availability goes down quickly because your aggregated service fails as soon as one of the services fails…

With transports like JMS you can improve reliability, but how do you do REST of JMS, given its close relationship with HTTP and URLs? That is the problem with REST for me.

Corporate Information

Atomikos Corporate Headquarters
Hoveniersstraat, 39/1, 2800
Mechelen, Belgium

Contact Us

Copyright 2026 Atomikos BVBA | Our Privacy Policy
By using this site you agree to our cookies. More info. That's Fine