Atomikos - Comments on The Achilles heel of the CAP theorem

» Only process reque ...

barry — 2009-03-28T21:45:54+01:00

» Only process requests when there is no partition problem.

Doesn't this mean that you are sacrificing availability? You've turned a failure in partitioning into a failure in availability. While the answers and responses are queued so no request or response is lost, that doesn't mean all is well. A response may take a long time to come back which is as much of a problem as getting an error.

Hi Dan, Sure have I ...

Guy — 2009-01-20T20:30:00+01:00

Hi Dan,

Sure have I read "Impossibility of distributed consensus with one faulty process" - it is at the basis of the heuristic exceptions in all two-phase commit solutions (including Atomikos).

However, what I am saying is that the failure usually only lasts for so long, and afterward things can move on. Exploiting the right tools to do that can help availability.

That is the main advantages of (persistent) queues and that is all I am saying. Lynch et al do not seem to exploit it as much as they could...

Guy

Have you read: http ...

PetrolHead — 2009-01-20T20:11:00+01:00

Have you read:

http://portal.acm.org/citation.cfm?id=214121

"Impossibility of distributed consensus with one faulty process"?

This is an important result and has significance to your comments and the CAP theorem. Essentially one can't tell the difference between a genuine failure and a slow running machine or busy network.

Thus your solution might work for a very small number of machines all in a single data-centre but for larger installations, failure of machines, routers, switches, cables etc will happen several times a day and thus quorums and clusters become considerably less practical and loose consistency more attractive.

Note also that the theorem isn't just about clustered services in the traditional sense but also services that run across multiple data-centres.

I also have a specific observation:

"....note that quorum solutions exist to avoid that the complete cluster has to be up at the same time."

This is true but they are limited by a number of factors practically:

(1) The assumption that you will have a majority - seemingly this is straightforward but a partition plus a loss of a machine can leave you without a majority.

(2) Getting all members back into sync. Can require all sorts of special admin involvement and it can go wrong.

(3) Performance - quorum protocols especially across enough nodes to ensure survival can be slow.

(4) Ensuring that clients don't continue to make use of the minority during a partition e.g. reporting out-of-date information.

(5) You can have a cluster capable of achieving consensus but you can't reach it because the network is broken between cluster and clients.

Best,

Dan.
http://www.dancres.org/blitzblog