CAP Theorem - Consistency Availability and Partition Tolerance

Nayan Pahuja - Aug 26 '23 - - Dev Community

Introduction:

In the past when we had just started out our little technology world, If we wanted to do work faster or our current machines were slow, we were limited to vertical scaling which is getting a more Powerful machine or further optimizing of our code to reduce load.

But in the modern era, we have come so far from that. With the help of Parallel Processing and distributed systems, Now if we are ever facing issues, we can also scale horizontally.
Horizontal Scaling means to get more machines to do the same task simultaneously.

Scalability and It's need:

So to put in a real life example difference between horizontal and vertical scaling is:

Suppose you have a good restaurant and you're the best in your town.

Suddenly you are not able to serve all the customers you want as you are running out of capacity to fill them all in one restaurant.

You come up with an idea: You think of a buying a new restaurant with increased capacity and multiple floors for extra spacing solving the limited capacity problem (Vertical Scaling).

Just as you were going to execute this, your friend comes with an idea and suggests you that instead of buying a new restaurant , that you should open up a franchise in the town and divide your customers so that the load on current restaurant decreases and you can reach maximum potential (Horizontal Scaling).

Let's get back to our original goal, what is CAP Theorem?

  • CAP is an acronym for Consistency, Availability, and Partition Tolerance. The theorem states that any system can have and excel in only two of the three properties.
  • We can already see a bunch of data manipulation tools in the Apache project like Spark, Kafka and Storm. However, in order to effectively pick the tool of choice, a basic idea of CAP Theorem is necessary.

Let's define all the three properties.

Consistency:

Being consistent means guaranteeing that every node in a distributed cluster returns the same, most recent, successful write. Consistency refers to every client having the same view of the data.

To put it simply: Every franchise of our restaurant has the same menu and same taste across all our franchises, so that the customer/client experiences the same thing across all franchises.

Availability:

Availability in the context of distributed systems refers to the guarantee that every request made to the system receives a response, either successful or unsuccessful, without delays. An available system ensures that users can access the system's services and data when needed.

To put it simply: Just like each franchise of your restaurant is always open and ready to serve customers, an available distributed system is always ready to respond to user requests, providing the necessary information or performing the requested actions.

Partition Tolerance:

Partition tolerance refers to a system's ability to continue functioning and providing consistent responses even when communication failures occur between different nodes or components within a distributed system. In other words, it's the system's ability to handle situations where nodes can't communicate with each other due to network issues, hardware failures, or other reasons.

To put it simply: Imagine each franchise of your restaurant operates independently and can continue serving customers even if they can't communicate with each other. For example, if one franchise loses its internet connection, it can still take orders and serve food without relying on the others.


CAP Theorem is very important in the Big Data world, especially when we need to make trade off’s between the three, based on our unique use case. On this blog, I will try to explain each of these concepts and the reasons for the trade off. I will avoid using specific examples as DBMS are rapidly evolving.


Relational databases exhibit vertical scalability, whereas non-relational databases excel in horizontal scalability. The latter adopts a distributed design, expanding across the network by connecting with multiple nodes. Typically, relational databases align with the CA model, prioritizing Consistency and Availability. In contrast, non-relational databases adhere to either the AP model, focusing on Availability and Partition Tolerance, or the CP model, emphasizing Consistency and Partition Tolerance. To gain a more comprehensive comprehension, let's explore these characteristics individually.

1. CP Database:

Imagine a group project where everyone needs to agree on the same information before moving forward. In a CP database, if there's a disagreement between two teams (nodes), the project stops. They wait until they all agree, and this makes some parts of the project unavailable.
This is what it means to be consistent and available.

2. AP Database:

Think of a big party where people can keep dancing even if the music temporarily stops for some groups. In an AP database, if there's a glitch, some people might dance to slightly different rhythms for a while, but they keep dancing. When the music comes back, they all sync up again, so they're in harmony once more.

We are still available and once the music get's back we are all in sync. But when the music has stopped everyone is doing random moves asynchronously to keep the rhythm going hence being inconsistent.

3. CA Database:

Imagine a sport team that can never play if even one player is missing. In a CA database, if that player can't join the game, the whole team sits out of the match. They value everyone being there together. This is like a band that can only perform when all the members are present, ensuring everyone is on the same page.

Conclusion:

The CAP theorem is a fundamental trade-off in distributed systems design. There is no one-size-fits-all solution, and the best choice of database will depend on the specific application requirements.

Every distributed system can choose two from the three properties according to the requirement. If you are okay with your system getting down occasionally, we can let go of availability and use CP, if not go with either AP or CA. Further, it is crucial that in the system with authentic transactional data related to finance, you need to value Consistency.

I suggest giving this great article a read if you still have any doubts:
CAP THEOREM

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .