30 Why Can’t We Design a Perfect Distributed Caching System for Massive Data #

With the development of the Internet, distributed systems have become increasingly important, and almost all of today's large and medium-sized Internet systems are evolving in a distributed direction. Simply put, a distributed system is one whose software and hardware are distributed across networked computers in different data centers or regions and communicate and coordinate with one another only through message passing. A distributed system needs to rely on its distributed services to provide stable service to the outside world while ensuring data consistency.

The Birth of CAP Theorem #

In the development of distributed systems, no theory has had a greater or more extensive impact than the CAP theorem. It can fairly be called the theoretical cornerstone of distributed systems. As early as 1998, Eric Brewer, a computer scientist at the University of California, Berkeley, proposed three measures for distributed systems. Building on that foundation, two years later he put forward the CAP conjecture. Another two years passed, and in 2002 Seth Gilbert and Nancy Lynch of MIT proved the conjecture theoretically. The CAP conjecture thus became the CAP theorem, also known as Brewer’s theorem, and it has since exerted a widespread and profound influence on the development of distributed systems.

CAP Theorem #


The CAP theorem, in simple terms, states that a distributed system cannot simultaneously guarantee Consistency, Availability, and Partition Tolerance. The theorem is called CAP after the initial letters of these three properties: C, A, and P.

Consistency #


The first element of the CAP theorem is Consistency. Consistency means that “all nodes see the same data at the same time”: whenever the system is accessed, every node returns the same data. Another explanation given by Brewer, the author of CAP, is that after a write operation completes, any subsequent read must return the value just written or a newer value. From the server's perspective, this means that once a client performs a write, the server must propagate the new value across the whole system so that the data stays consistent. From the client's perspective, it means that under concurrent access, once the data has been modified, every later read observes the latest value.
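
To make the read-after-write requirement concrete, here is a minimal, hypothetical sketch; the `Replica` class and the `is_consistent` check are invented names for illustration, not part of any real system.

```python
class Replica:
    """A toy in-memory replica; the class and methods are illustrative only."""
    def __init__(self):
        self.store = {}

    def write(self, key, value):
        self.store[key] = value

    def read(self, key):
        return self.store.get(key)

def is_consistent(replicas, key, expected):
    """Consistency in the CAP sense: every node returns the same, latest value."""
    return all(r.read(key) == expected for r in replicas)

node1, node2 = Replica(), Replica()
node1.write("balance", 100)
node2.write("balance", 100)   # pretend replication delivered the write everywhere
print(is_consistent([node1, node2], "balance", 100))   # True only when all agree
```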

Availability #


The second element of the CAP theorem is Availability. Availability means that “reads and writes always succeed”: the service cluster can always respond to user requests. Another explanation given by Brewer is that any node that is not down and not misbehaving must always respond to user requests. In other words, when a user accesses a properly functioning node, the system guarantees that the node returns a response, which may be a correct response, an old response, or even a wrong response, but it cannot be no response at all. From the server's perspective, service nodes always answer user requests rather than swallowing or indefinitely blocking them. From the client's perspective, every request gets a response; the client never finds the whole service cluster unreachable, timing out, or silent.
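
From the client's side, availability boils down to "some response arrives within a bounded time". Below is a minimal sketch under that assumption; the `call_with_deadline` helper is a made-up name, and the 0.5-second deadline is arbitrary. A stale answer still counts as available, while a timeout does not.

```python
import concurrent.futures
import time

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)

def call_with_deadline(fn, timeout_s=0.5):
    """Client-side availability check: any response within the deadline counts,
    even a stale one; a timeout means the request was effectively not served."""
    future = _pool.submit(fn)
    try:
        return True, future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return False, None

# A node answering promptly with an old value V0 is still "available" in CAP terms.
print(call_with_deadline(lambda: "V0"))                    # (True, 'V0')
# A node that takes too long to answer is, from the client's view, unavailable.
print(call_with_deadline(lambda: time.sleep(2) or "V1"))   # (False, None)
```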

Partition Tolerance #


The third element is Partition Tolerance. Partition tolerance means that “the system continues to operate despite arbitrary message loss or failure of part of the system”: the system should still provide service externally even when partitions occur or communication between partitions is abnormal. In a distributed environment, no individual service node is fully reliable, and communication between nodes can fail. When some nodes fail, or lose the ability to communicate with other nodes, the system as a whole is partitioned. From the server's perspective, good partition tolerance means the service cluster can still provide stable service when nodes fail or the network misbehaves. From the client's perspective, it means that failures on the server side are transparent to the client.

Normal Service Scenario #


According to the CAP theorem, it is impossible to satisfy all three elements in a distributed system. At most, only two of them can be satisfied simultaneously. Next, we will use the simplest scenario with two service nodes to briefly prove the CAP theorem.

As shown in the diagram, there are two service nodes, Node1 and Node2, connected by a network to form a distributed system. In a normal working business scenario, Node1 and Node2 always operate normally, and the network connection is always good.

Suppose that at some initial moment the data in both nodes is the same value, V0. When a user accesses Node1 or Node2, they immediately receive V0 in response. When the user updates the data on Node1, changing V0 to V1, the distributed system constructs a data synchronization operation M to propagate V1 to Node2. Since Node1 and Node2 are both working properly and communicating well, the V0 on Node2 is also changed to V1. From this point on, requests to either Node1 or Node2 return V1, so the data stays consistent and every request gets a response.
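
The normal-operation case can be captured in a few lines. The sketch below is purely illustrative: the `Node` class and the `replicate` function are hypothetical names standing in for the synchronization operation M.

```python
class Node:
    """A toy node holding a single value; names and behavior are illustrative."""
    def __init__(self, name, value):
        self.name, self.value = name, value

    def read(self):
        return self.value

    def write(self, value):
        self.value = value

def replicate(source, target, connected=True):
    """The synchronization operation M: copy the source's value to the target.
    Returns True only when the target actually received the update."""
    if not connected:
        return False
    target.write(source.read())
    return True

node1, node2 = Node("Node1", "V0"), Node("Node2", "V0")
node1.write("V1")                    # the user updates Node1: V0 -> V1
replicate(node1, node2)              # the network is healthy, so M succeeds
print(node1.read(), node2.read())    # V1 V1 -> consistent and available
```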

Network Anomaly Scenarios #


A distributed system always has multiple nodes connected over a network, and the more nodes there are and the more complex the network becomes, the greater the probability of node failures and network anomalies. To fully satisfy the three elements of CAP, the system must keep working when the network between nodes misbehaves, that is, it must support partition tolerance, while at the same time remaining consistent and available. Let's see whether that is feasible.

Now let's continue to assume that at the initial moment both Node1 and Node2 hold V0, and that the network between Node1 and Node2 then becomes disconnected. A user sends a change request to Node1 to change V0 to V1, and the distributed system prepares to initiate the synchronization operation M. However, because the network between Node1 and Node2 is down, M cannot reach Node2 in time, so the data on Node2 remains V0.

At this time, if a user sends a request to Node2, because Node2 is disconnected from Node1 and the data is not synchronized, Node2 cannot immediately return the correct result V1 to the user. So what should be done? There are two options.

  • The first option is to sacrifice consistency, and Node2 responds to the requesting user with the old data V0.
  • The second option is to sacrifice availability: Node2 keeps blocking the request until the network connection between Node1 and Node2 is restored and the data update operation M has been applied on Node2; only then does Node2 return the correct value V1 to the user. Both options are sketched in code below.
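
Here is a minimal, hypothetical sketch of the two options; the `read_from_node2` function and `PartitionError` are made up for illustration. The "AP" mode returns whatever Node2 has locally, while the "CP" mode refuses to answer until synchronization completes.

```python
class PartitionError(Exception):
    """Raised when a node refuses to answer rather than return stale data."""

def read_from_node2(local_value, synced, mode):
    """Node2's choice during a partition (illustrative, not a real protocol):
    mode="AP" answers immediately with whatever it has, even a stale V0;
    mode="CP" refuses (or blocks) until replication catches up."""
    if synced:
        return local_value                      # no divergence: just answer
    if mode == "AP":
        return local_value                      # option 1: sacrifice consistency
    raise PartitionError("waiting for sync")    # option 2: sacrifice availability

print(read_from_node2("V0", synced=False, mode="AP"))    # -> V0 (stale but served)
try:
    read_from_node2("V0", synced=False, mode="CP")
except PartitionError as err:
    print("blocked:", err)                               # -> no answer until healed
```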

This completes the brief proof sketch. The analysis shows that once a distributed system must tolerate partitions, it cannot satisfy both consistency and availability at the same time; it has to choose one or the other. This in turn proves that a distributed system cannot simultaneously satisfy consistency, availability, and partition tolerance.

CAP Trade-offs #

CA #


According to the CAP theory and the analysis above, we know that distributed systems cannot simultaneously satisfy the three elements of consistency, availability, and partition tolerance. So when building distributed systems, how should we make choices?

Since all three elements are important to distributed systems, and since they cannot be satisfied simultaneously, we should aim to satisfy two of them as much as possible and only abandon one element.

The first option is CA: give up partition tolerance and support only consistency and availability. Not supporting partition tolerance means assuming that partition exceptions never occur, that is, devices and networks are always in an ideal, fully available state, which allows the whole distributed system to achieve both consistency and availability.

However, since distributed systems are built by connecting numerous nodes over networks, device failures and network anomalies objectively exist, and the more distributed nodes there are and the more widely they are spread, the greater the probability of failures and anomalies. Therefore, for a distributed system, partition tolerance (P) is unavoidable; the only way to truly avoid P is to fall back to a single-machine, single-instance system.
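
A rough back-of-the-envelope calculation shows why partitions become the norm as the node count grows. The per-component failure probability used below (0.1% per day) is purely an assumed, illustrative number, and failures are assumed independent.

```python
# Probability that at least one of n nodes or links fails, assuming independent
# failures with per-component failure probability p (the numbers are made up).
def prob_any_failure(n, p=0.001):
    return 1 - (1 - p) ** n

for n in (2, 100, 1000):
    print(n, round(prob_any_failure(n), 3))
# 2    -> 0.002  a two-node toy system rarely sees a failure
# 100  -> 0.095  roughly one day in ten, something is broken
# 1000 -> 0.632  a failure somewhere becomes the normal state
```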

CP #


The second option is CP. Since partition tolerance (P) objectively exists, this means giving up availability in exchange for consistency: when the system encounters a partition exception, it blocks the affected service until the partition problem is resolved, and only then resumes serving requests, thereby guaranteeing data consistency. CP is chosen in many business scenarios, especially those that are particularly sensitive to data consistency. For example, in payment and transaction processing, and in distributed databases such as HBase, data consistency must come first, and the system will pause service to deal with network exceptions. Likewise, ZooKeeper, which is used in distributed systems to publish and subscribe to metadata, prioritizes CP, because maintaining data consistency is a basic requirement for these systems; otherwise serious problems would follow, such as a banking system suddenly showing a large number of zero balances, or a database randomly returning a mix of old and new data.
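
One common way to obtain CP-style behavior is to accept writes only when a majority of replicas can acknowledge them, so that a partitioned minority refuses writes rather than risk divergence. The sketch below is a simplified illustration of that idea, not the actual protocol used by ZooKeeper or HBase; all names are invented.

```python
class QuorumError(Exception):
    """Write rejected because a majority of replicas is unreachable."""

def quorum_write(reachable, total, key, value, stores):
    """CP-style write (illustrative): accept the write only when a majority of
    replicas can acknowledge it; otherwise refuse and keep the data consistent."""
    if reachable <= total // 2:
        raise QuorumError("no majority; refusing the write to protect consistency")
    for store in stores[:reachable]:
        store[key] = value
    return "ok"

stores = [{}, {}, {}]
print(quorum_write(3, 3, "balance", 100, stores))    # healthy cluster: ok
try:
    quorum_write(1, 3, "balance", 200, stores)       # partitioned minority: rejected
except QuorumError as err:
    print("rejected:", err)
```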

AP #


The third option is AP. Since partition tolerance (P) objectively exists, this means abandoning data consistency in exchange for availability. When the system encounters a partition exception and nodes cannot communicate with each other, the data is in an inconsistent state; to stay available, service nodes respond to user requests immediately, but each node can only return its own local data, which may be stale and may differ from node to node. Sacrificing consistency to preserve availability in the face of partitions is very common in Internet systems. For example, Weibo is deployed across multiple regions; if the network between regions is interrupted, users within a region can still post, comment on each other's posts, and like them, but they temporarily cannot see new posts and interactions from users in other regions. WeChat Moments behaves similarly. Another example is the 12306 train ticket booking system during peak holiday periods: occasionally you may repeatedly see that a certain train has seats available, yet every time you actually click to purchase, it tells you there are none left. Although a small part of the functionality is restricted in this way, the overall stability of the system is maintained and the impact is very limited; compared with CP, the user experience is better.
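
A common AP pattern is for each region to keep serving its own copy during the partition and to reconcile the diverged copies after the network heals. The sketch below uses a simple last-write-wins merge on logical timestamps; this is an assumed, simplified technique for illustration, not how Weibo or 12306 actually reconcile data.

```python
# AP-style behavior (illustrative): each region keeps serving its own copy during
# a partition, then reconciles with a last-write-wins rule once the network heals.
region_a = {"post_count": (10, 100)}   # (value, logical timestamp)
region_b = {"post_count": (12, 105)}   # diverged while partitioned

def read_local(region, key):
    return region[key][0]              # always answers, even if the value is stale

def reconcile(a, b):
    """Merge two diverged copies: for each key, the higher timestamp wins."""
    merged = {}
    for key in a.keys() | b.keys():
        merged[key] = max(a.get(key, (None, -1)), b.get(key, (None, -1)),
                          key=lambda pair: pair[1])
    return merged

print(read_local(region_a, "post_count"), read_local(region_b, "post_count"))  # 10 12
print(reconcile(region_a, region_b))   # {'post_count': (12, 105)} after healing
```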

CAP Problems and Misunderstandings #


The CAP theorem has greatly promoted the development of distributed systems. However, as distributed systems evolve, it has been discovered that the classic CAP theory is overly idealistic and has many problems and misunderstandings.

Firstly, take the Internet as an example: medium-to-large-scale Internet systems run on a large number of hosts and are deployed across multiple regions, with multiple IDCs per region. Node failures and network anomalies, and the partitions they cause, are common. To protect the user experience, theory says availability must be guaranteed, so choosing AP and temporarily sacrificing data consistency looks like the best choice.

However, when a partition occurs, a poorly designed system does not get to make a clean choice between availability and consistency. For example, if the service in one region must call a dependent sub-service in another region in order to serve requests, a network anomaly that cuts off that remote dependency makes the service unavailable, so availability cannot be supported. At the same time, consistency cannot be guaranteed either: during the partition the data in each region diverges, and after the network recovers, the volume and complexity of the data to be synchronized make residual inconsistencies likely. Moreover, some business operations are order-dependent, so even if all data is eventually propagated to every region, applying the operations in different orders produces different final results. After many partition events over a long period, a large amount of inconsistent data accumulates and keeps degrading the user experience.
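
The order-dependence problem is easy to demonstrate. In the toy example below (the operations are assumed for illustration, not taken from any real system), both regions eventually receive the same two operations, yet they end up with different values because they apply them in different orders.

```python
# Order-dependent operations (illustrative): both regions eventually receive the
# same two operations but apply them in different orders, so their results differ.
def apply(value, op):
    kind, arg = op
    return arg if kind == "set" else value + arg   # "set" overwrites, "add" adds

ops = [("set", 100), ("add", 10)]

region_a = 0
for op in ops:                 # region A applies: set 100, then add 10 -> 110
    region_a = apply(region_a, op)

region_b = 0
for op in reversed(ops):       # region B applies: add 10, then set 100 -> 100
    region_b = apply(region_b, op)

print(region_a, region_b)      # 110 100: same operations, different outcomes
```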

Secondly, partitions will certainly occur in a distributed system, but relative to the time spent operating normally they are rare or have a very low probability. When no partition exists, the system should not limit itself to only C or only A; it can provide both consistency and availability at the same time.

Furthermore, within the same system, different businesses, and even different stages of the same business, may choose differently between consistency and availability when a partition occurs. In the 12306 ticketing system mentioned earlier, for example, the train-inquiry function chooses AP, and the ticket-purchase flow also chooses AP during its inquiry stage, but switches to CP during the payment stage. Therefore, when designing a system's architecture or functionality, one cannot simply pick AP or CP wholesale.
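
In practice this often looks like a per-operation policy rather than a system-wide setting. The sketch below is a hypothetical policy table; the operation names and the `handle` function are invented for illustration.

```python
# Choosing C or A per function rather than per system (an illustrative policy table).
POLICY = {
    "query_trains":   "AP",   # show possibly stale seat counts rather than block
    "purchase_query": "AP",   # the inquiry stage of purchasing may also be stale
    "purchase_pay":   "CP",   # payment must see the authoritative seat inventory
}

def handle(operation, partitioned):
    policy = POLICY[operation]
    if not partitioned:
        return "serve with fresh data"
    if policy == "AP":
        return "serve with possibly stale data"
    return "reject or block until the partition heals"

for op in POLICY:
    print(op, "->", handle(op, partitioned=True))
```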

Moreover, in the actual operation of a system, the elements of CAP are not black and white. Consistency, for instance, can be strong or weak; even if a large amount of inconsistent data appears temporarily, the amount of inconsistency and the inconsistency rate can shrink over time. Similarly, availability can manifest as some functions being broken while others work normally, or as serving only a portion of users under heavy load. Partitions, too, have a range of intermediate states: a complete interruption of a regional network is rare, while the quality of network communication can vary continuously between 0% and 100%. In addition, different businesses, functions, and components within a system may define and treat partitions differently.

Finally, the classic CAP theory does not consider network latency, which real business scenarios cannot escape. Latency is always present, and a partition can be viewed as an extreme form of latency that may appear at any time; should a delay of 1 second, 1 minute, 1 hour, or 1 day count as a partition? When designing a system's architecture and functionality, you have to decide how to define and distinguish these situations and how to respond to them.

The traditional CAP theory offers no answers to these problems, and developers who simply treat system design as a rigid three-way, pick-two choice will fall into these misunderstandings and run into a multitude of issues during system operation.