Developers working with data have likely heard of the CAP Theorem. Proposed by Dr. Eric Brewer, professor of Computer Science at UC Berkeley, it states that it is impossible for a distributed data system to simultaneously provide all three of the following guarantees:
- Strong Consistency
- High Availability
- Partition Tolerance
Conceptually, you could have two of these three. However, as a practical matter, networks are going to fail from time to time. Therefore, partition tolerance is a must-have in a distributed system, as you wouldn’t want the entire system to crash due to a fault in one node or server. This basically means that the real choice is between Strong Consistency and High Availability.
The easiest way to understand CAP is to think of two nodes on opposite sides of a partition. Allowing at least one node to update state will cause the nodes to become inconsistent, thus forfeiting C (consistency). Likewise, if the choice is to preserve consistency, one side of the partition must act as if it is unavailable, thus forfeiting A (availability). Only when nodes communicate is it possible to preserve both consistency and availability, thereby forfeiting P. The general belief is that for wide-area systems, designers cannot forfeit P and therefore have a difficult choice between C and A.
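The two-node choice above can be sketched in a few lines of code. This is a purely illustrative toy (the `Node` class and its methods are made up for this post, not taken from any real database): once the link between replicas is down, a node can either refuse reads to stay consistent (forfeiting A) or serve a possibly stale value (forfeiting C).

```python
# Toy sketch of the choice each side of a partition faces.
# All names here are hypothetical, invented for illustration.

class Node:
    def __init__(self, value):
        self.value = value
        self.peer_reachable = True  # network link to the other node

    def read_cp(self):
        # CP choice: refuse to answer rather than risk a stale value.
        if not self.peer_reachable:
            raise RuntimeError("unavailable during partition")
        return self.value

    def read_ap(self):
        # AP choice: always answer, even if the value may be stale.
        return self.value

a, b = Node("v1"), Node("v1")
b.value = "v2"             # b accepts a write...
a.peer_reachable = False   # ...but the partition blocks replication to a
b.peer_reachable = False

print(a.read_ap())  # "v1" -- available, but stale (forfeits C)
try:
    a.read_cp()     # raises -- consistent, but unavailable (forfeits A)
except RuntimeError as err:
    print(err)
```

Neither behavior is a bug; each is a deliberate answer to the same partition.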
As Dr. Brewer says, this is a difficult choice. Choosing between a Strongly Consistent database and a Highly Available one comes with pros and cons, so let’s take a look at where it makes sense to choose one over the other in relation to the CAP Theorem.
Highly Available Databases
Highly Available databases are a good choice when it is essential that all clients have the ability to read and write to the database at all times. This doesn’t mean that what is written to the database will be instantly available to all who might want to read the data. There will be a delay of some sort, but eventually, it should be available. This is sometimes referred to as eventual consistency.
In the real world, we can see this eventual consistency happening with some social media channels. You make a post, but there might be a delay of a few minutes or more before everyone can see it. This isn’t mission critical, and generally, we as users tend to prefer that as opposed to not being able to access the social network at all.
At some point, of course, everyone will be able to see the post in question – it’s just a matter of time. This can be called Eventual Consistency, and it works well enough for this type of use case.
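The social-media example above can be mimicked with a tiny in-memory store. This is a hedged sketch (the class and its replication pass are invented for illustration): a write lands on one replica immediately and reaches the other only after a later replication step, so a reader on the lagging replica briefly sees nothing.

```python
# Illustrative toy, not a real database: one replica accepts the write,
# the other catches up asynchronously -- i.e., eventual consistency.

class EventuallyConsistentStore:
    def __init__(self):
        self.replicas = [{}, {}]   # two replicas of the data
        self.pending = []          # writes not yet propagated

    def write(self, key, value):
        self.replicas[0][key] = value      # primary applies immediately
        self.pending.append((key, value))  # the other replica sees it later

    def read(self, key, replica):
        return self.replicas[replica].get(key)

    def replicate(self):
        # Later pass that drains the log to the lagging replica.
        for key, value in self.pending:
            self.replicas[1][key] = value
        self.pending.clear()

store = EventuallyConsistentStore()
store.write("post", "Hello!")
print(store.read("post", replica=1))  # None -- the post isn't visible yet
store.replicate()
print(store.read("post", replica=1))  # "Hello!" -- everyone sees it eventually
```

The gap between the two reads is exactly the "delay of a few minutes" a social network user experiences.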
Strongly Consistent Databases
We can all probably agree that a slight delay in seeing a social media post isn’t a significant problem. It will still be more or less relevant by the time it is seen. But what about other use cases? What if, for example, you need to be absolutely sure that the data in question is accurate and definitive? Per the constraints of the CAP Theorem, this is where you’d want to work with a Strongly Consistent database.
If you were to watch a streaming video on Netflix or a similar platform, you might pause the video in mid-stream. Later, you might wish to pick up where you left off on another device. You may well find that there is a 5-10 second bit of the video that you have already seen. Again, not a big deal, right? Being off by 5-10 seconds is not going to hurt anyone.
But let’s look at a different use case – let’s say, for the sake of example, you have a bank account dedicated to the needs of two children in college. Both of your children have the ability to access the account via an ATM card. Even though the two children may be using different ATMs, what would happen if they were each to attempt to withdraw $80 from a $100 balance?
With a strongly consistent database, even if the attempts were separated by a millisecond, only the first transaction should go through. The second child should get a message stating that the requested transaction exceeds the available balance. A sad fate for the second child (and perhaps you as your cell phone rings with a plea for more funds), but with a strongly consistent database, the risk of overdrafts and related fees is reduced.
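The ATM scenario can be sketched with two concurrent withdrawals against one account. This is a simplified illustration (the `Account` class is invented; a lock stands in for whatever serialization mechanism a strongly consistent database actually uses): because the balance check and the debit happen atomically, exactly one $80 withdrawal from the $100 balance succeeds.

```python
import threading

# Hypothetical sketch: a lock stands in for a serializable transaction,
# so the check-and-debit is atomic across concurrent ATM requests.

class Account:
    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()

    def withdraw(self, amount):
        with self.lock:  # check and debit cannot interleave
            if amount > self.balance:
                return False  # "requested amount exceeds available balance"
            self.balance -= amount
            return True

account = Account(100)
results = []
threads = [
    threading.Thread(target=lambda: results.append(account.withdraw(80)))
    for _ in range(2)
]
for t in threads: t.start()
for t in threads: t.join()

print(sorted(results))  # [False, True] -- exactly one withdrawal succeeds
print(account.balance)  # 20
```

Without that atomicity, both checks could read the $100 balance before either debit landed, and both withdrawals would "succeed" – the overdraft the strongly consistent design prevents.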
Equally important, it might be critical to know at the precise date and time when a specific action took place. This is useful to detect fraud or other activities which might warrant additional scrutiny.
At Cinchapi we have chosen to develop a Strongly Consistent database, which we have named Concourse. It is primarily intended for use by developers who need to be certain that the data they are working with is accurate. We call this ‘definitive data’.
Is there much of a trade-off? After all, the whole point of High Availability is to be available as much as possible. Doesn’t that mean the trade-off with a Strongly Consistent database is speed?
Not necessarily. We’d suggest that the trade-off is more about accuracy as opposed to speed. If you have ever attempted to take advantage of a flash sale on an online reseller’s website, you may understand where they make the trade-off. The moment the flash sale begins, users across the country are all trying to get the item into their shopping cart.
But, with limited numbers of items available for sale, some people are going to be disappointed. It may have appeared that an item was in the shopping cart, but by the time the user in question goes to check out, an error message appears to indicate that the item is no longer available.
Frustrating to the user, but for the reseller, it is better to disappoint a few customers rather than have the site go down due to an overload.
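The flash-sale behavior can be sketched as well. This toy (the `FlashSale` class and its two methods are made up for illustration) shows the design choice described above: adding to the cart is optimistic and reserves nothing, and the authoritative stock check is deferred to checkout, so a late buyer gets the "no longer available" message instead of the site falling over.

```python
# Illustrative toy of the reseller's trade-off: the cart is optimistic,
# and only checkout enforces the real (limited) stock count.

class FlashSale:
    def __init__(self, stock):
        self.stock = stock

    def add_to_cart(self):
        # Optimistic: always succeeds; nothing is reserved yet.
        return True

    def checkout(self):
        # The authoritative check happens here, not at cart time.
        if self.stock <= 0:
            return "item no longer available"
        self.stock -= 1
        return "order confirmed"

sale = FlashSale(stock=2)
carts = [sale.add_to_cart() for _ in range(3)]      # all three "have" the item
outcomes = [sale.checkout() for _ in range(3)]
print(outcomes)
# ['order confirmed', 'order confirmed', 'item no longer available']
```

Every shopper stayed on the site the whole time (availability preserved); the cost is that the cart briefly showed a state that turned out not to be definitive.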
In an ideal world, we’d be able to have High Availability while being Strongly Consistent, but the laws of physics and the CAP Theorem dictate that hard choices must be made. Think about your target audience, the numbers of anticipated users, and how important it is to be working with definitive data at all times. The answers to those questions should lead you to the right decision.