The CAP theorem (short for consistency, availability, partition-tolerant) essentially states that you cannot have a clustered system that supports all of the following three qualities:
- Consistency is a quality meaning (informally speaking) that reads and writes happen correctly. In other words, the overall effect of executing thousands or millions of transactions concurrently is the same as if they had been executed one-at-a-time. Usually, this is done with the help of a transaction manager of some sort.
- Availability essentially means that every operation (that makes it to a non-failing node) eventually returns a result.
- This quality refers to the possibility of tolerating partitions on the network. Note that we suppose a cluster architecture (which is where the network comes in).
- a CAP solution
- how this conforms to what BASE wants to achieve
- a "design pattern" for building correct systems that (in a way) offer both CAP and BASE qualities
- Process reads from the database if possible, or use a cached value if needed for availability (if the DB is unreachable).
- All reads use versioning or another mechanism that allows optimistic locking.
- Updates supplied by clients (orders in case of Amazon) are queued for execution, and include the versioning information of the reads that lead to the update.
- Queued updates are processed when the number of partitions is low enough to do so. The easiest way to do this is with a cluster-wide distributed transaction across all replicas (more on scalability later), but other more refined ways are possible (such as quorum-based replication or any other smart way of replicating). The version information in the update is used to validate it: if the data in the database has been modified since the original read(s) that lead to the update, the update is rejected and a cancellation is reported back to the client. Otherwise the order is processed and a confirmation is reported back to the client.
- The results (confirmation or cancellation) are sent asynchronously to the clients. This can be either email, message queuing, or any other asynchronous delivery method.
- This system is consistent because reads are based on snapshots and incorrect updates are rejected before they are applied. In other words: there are no incorrect executions.
- This system is available since reads always return a value, and so do writes (even though they are queued and it may take a while).
- This system is partition-tolerant because it allows network and node failures.
- Read-only requests may be presented with stale information (due to updates that have yet-to-be-applied). In that sense, their results could be "inconsistent": for instance, the availability of an Amazon item can change between two page views. I do not see this as a major restriction, since no website that I know of will offer read consistency for the duration of a user session. It all depends on what you consider to be within the scope of one transaction;-) Note that this almost corresponds to snapshot isolation found in Oracle.
- Partitions should not last forever: in order for this to work, partitions should be resolved within a reasonable time (reasonable being: within the expected confirmation time for updates). The duration of any partitions also affects the time window in which reads can produce stale data.
- The updates have to be applied in the same relative order at all cluster nodes. This puts some restrictions on the algorithm used to do this.
You could see this as a design pattern for BASE if you like. The solution adheres to BASE in the sense that it uses cached reads (if needed) and that the updates are delayed (so you could say they are "eventually" applied and the system becomes "consistent").Reflections in scalability
So far the CAP focus was on possibility. I think my solution shows that it is possible. Now how about scaling up?
The naive solution (a huge distributed transaction to update all cluster nodes in-sync) is unlikely to scale: as you add more nodes, more updates are needed. Now I am a big fan of transactions, but not to use them in an arbitrary matter. So how to propagate these updates through the cluster?While smarter solutions for this exist (such as the work by Bettina Kemme), a trivial first try would be to push updates (lazily) to all nodes in the cluster. This can be done with a smart queuing mechanism. The disadvantage is that updates are not applied everywhere at once (rather, the all-or-nothing quality just "ripples" through the system). So you get into the "eventually" style again. Note that this latter suggestion makes the system behave much like the READ COMMITTED isolation level (which, by the way, is the default in Oracle). So this approach sacrifices consistency/isolation a bit in favor of scalability. Future work Additional research could/should be done in the following areas:
- Improving read consistency through session affinity
- The best way to push the updates through the cluster
- Performance evaluation in real life implementations