31 How to Design a Reliable Distributed Caching System to Satisfy the Needs of Large and Medium-sized Mobile Internet Systems #

In the previous lesson, we learned why it is impossible to design a distributed system that simultaneously satisfies consistency, availability, and partition tolerance. In this lesson, we will discuss how to design a distributed system in practice to meet the requirements of large and medium-sized Internet systems.

Breaking Through Traditional CAP Theory #

With the continuous evolution of distributed systems, new problems keep appearing. In the evolution of large and medium-sized Internet systems in particular, private and public clouds are developing in parallel and being integrated with each other, and deployments have long since outgrown a single region, spanning multiple regions across a country or even the globe. While respecting the classic CAP theorem, we need to recognize the nuance of its three properties rather than reducing it to a simple choice of two out of three, and instead adapt, extend, and move beyond it based on the characteristics of the business and its deployment.

Even Eric Brewer, the creator of the CAP theorem, revisited it in 2012, twelve years after proposing it, with corrections, expansions, and further explanation. Brewer pointed out that the classic "two out of three" formulation is misleading, and that the traditional reading of CAP oversimplifies the three properties and the relationships among them. He also compared CAP with ACID and BASE, analyzed the relationship between CAP and latency, and finally focused on how distributed systems should deal with partition anomalies.

To move beyond traditional CAP practice, we need to recognize that none of the three properties of CAP is black and white; each exists along a continuum of possibilities. Designing a distributed system architecture for real business scenarios under these constraints is a significant challenge.

During actual operation, partitions do not occur most of the time, so the system can provide good consistency and availability. Even so, when designing the architecture, besides working out how to implement business functions and meet the system's SLA targets, we also need to consider how the entire architecture, across its businesses, modules, functions, and deployments, will handle potential partitions.

To handle potential partitioning issues effectively, the following steps can be taken.

First, consider how to detect that a partition has occurred. Detection can rely on active probing, state reporting, alerts tied to special times or events, forecasting from historical data, and so on, so that partitions are identified promptly.
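
As an illustration only, here is a minimal Python sketch of active probing combined with a failure-count threshold to flag a suspected partition. The peer addresses, probe timeout, and threshold are hypothetical values, not recommendations.

```python
import socket
import time

PEERS = [("10.0.1.5", 7000), ("10.0.2.5", 7000)]  # hypothetical peer addresses
PROBE_TIMEOUT = 1.0   # seconds to wait for each probe
FAIL_THRESHOLD = 3    # consecutive failures before suspecting a partition

def probe(host, port):
    """Return True if a TCP connection to the peer succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=PROBE_TIMEOUT):
            return True
    except OSError:
        return False

def detect_partitions():
    """Probe every peer periodically and report peers that look partitioned."""
    failures = {peer: 0 for peer in PEERS}
    while True:
        for peer in PEERS:
            if probe(*peer):
                failures[peer] = 0
            else:
                failures[peer] += 1
                if failures[peer] >= FAIL_THRESHOLD:
                    # In a real system this would trigger partition-mode handling.
                    print(f"suspected partition: no response from {peer}")
        time.sleep(5)  # probe interval

if __name__ == "__main__":
    detect_partitions()
```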

Second, decide how business processes should run once a partition is detected. Data can be buffered in memory or written to a queue service so that the system keeps serving, sensitive functions can be suspended outright, or the partition can be classified further: for a short-lived delay, some functions or requests can block and wait for results while others quickly return local stale data; if the partition lasts beyond a certain threshold, some functions can be taken offline so that only core functionality remains.
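
The following sketch shows, under stated assumptions, what this partition mode could look like: writes are buffered locally for later replay, reads return possibly stale local data, and peripheral functions go offline once the partition has lasted too long. The core-key set, threshold, and data structures are all illustrative.

```python
import queue
import time

local_store = {"feed": "cached feed"}   # stand-in for the local cache/storage
local_buffer = queue.Queue()            # writes waiting to be synchronized after recovery
partition_started_at = None             # set when the detection logic flags a partition
DEGRADE_AFTER = 60                      # seconds before peripheral functions go offline
CORE_KEYS = {"feed", "post"}            # hypothetical core functions that must stay up

def partition_duration():
    return time.time() - partition_started_at if partition_started_at else 0

def handle_write(key, value, partitioned):
    """Accept writes even during a partition by buffering them for later replay."""
    local_store[key] = value
    if partitioned:
        local_buffer.put((key, value))  # replayed by the repair job after recovery
    # otherwise the write would also be propagated to remote replicas immediately

def handle_read(key, partitioned):
    """Serve reads during a partition, returning possibly stale local data."""
    if partitioned and partition_duration() > DEGRADE_AFTER and key not in CORE_KEYS:
        raise RuntimeError("peripheral feature temporarily offline during partition")
    return local_store.get(key)         # may be stale while the partition lasts
```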

Finally, plan how to synchronize and repair data after the partition heals, and establish a compensation mechanism for errors made while running in partition mode. For example, a message queue can be built into the design: data modified during the partition is saved in the queue, and a message processing engine reads and repairs it once the partition is restored. Alternatively, the synchronization mechanism can record the last synchronized position when the partition occurs and resume synchronizing from that point afterwards. Another approach is for each region of the distributed system to keep track of its own unsynchronized data during the partition and then actively compare and merge the data across regions after recovery. Lastly, once the failure is resolved, a data scan can be used to compare and repair the partitioned data.
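
Here is a minimal sketch of the checkpoint-based variant, assuming an ordered in-memory operation log; in a real system the log would be a persistent queue or binlog, and `apply_remote` would be a network call.

```python
def synchronize_after_partition(oplog, last_synced_index, apply_remote):
    """Replay every local change made after the recorded position to the remote side.

    oplog:             ordered list of (index, key, value) changes made locally
    last_synced_index: position recorded when the partition was detected
    apply_remote:      callable that applies one change to the remote region
    """
    for index, key, value in oplog:
        if index <= last_synced_index:
            continue                      # already synchronized before the partition
        apply_remote(key, value)          # repair the remote copy
        last_synced_index = index         # advance the checkpoint as we go
    return last_synced_index

if __name__ == "__main__":
    remote = {}
    oplog = [(1, "a", 1), (2, "b", 2), (3, "a", 3)]
    checkpoint = synchronize_after_partition(
        oplog, last_synced_index=1,
        apply_remote=lambda k, v: remote.update({k: v}))
    print(remote, checkpoint)             # {'b': 2, 'a': 3} 3
```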

BASE Theory #

The BASE theory was originally proposed by Brewer and his colleagues. Although it dates back many years, it is thriving in today's Internet industry. When building large and medium-scale distributed Internet systems, including systems based on private clouds, public clouds, and multi-cloud combinations, major Internet companies draw on CAP theory and practice while also thoroughly validating and applying BASE. It serves as an extension of CAP and is applied successfully across a wide range of Internet systems.

BASE, in both theory and practice, is the result of trading off consistency against availability in distributed systems. The basic idea is to weigh the system's functions appropriately so that, even in the face of local failures and anomalies, the system as a whole remains stable and available, its main functions keep working, and its data eventually becomes consistent.

The BASE theory also includes three elements: Basically Available, Soft state, and Eventual Consistency.

Basically Available #

Basically available means that a distributed system is allowed to lose part of its availability when a failure occurs. For example, it may give up part of its service level agreement (SLA), such as slightly higher response times or slightly lower throughput, and it may sacrifice peripheral functions or even some core functions, as long as the basic stability of the system's main functions is preserved. During peak periods such as Singles' Day, Taobao and JD may respond more slowly, but after a short delay they still return correct results, and some requests may be redirected to degraded pages. Another example is Weibo, which may disable some peripheral functions during an unexpected failure and shift resources to core functions such as refreshing feeds and posting.
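
One common way to implement this kind of partial availability is a set of degradation switches that turn peripheral features off under failure or overload while core features keep their capacity. The sketch below is purely illustrative; the feature names and switch mechanism are assumptions, not how Taobao, JD, or Weibo actually implement degradation.

```python
# Illustrative degradation switches: peripheral features can be turned off under
# load or failure so that core features keep their capacity.
FEATURE_SWITCHES = {
    "post_feed": True,        # core feature
    "refresh_feed": True,     # core feature
    "hot_topics": True,       # peripheral: first candidate for degradation
    "recommendations": True,  # peripheral
}
CORE_FEATURES = {"post_feed", "refresh_feed"}

def degrade_peripheral_features():
    """Switch off everything that is not a core feature."""
    for name in FEATURE_SWITCHES:
        if name not in CORE_FEATURES:
            FEATURE_SWITCHES[name] = False

def serve(feature, handler, fallback):
    """Serve a request, falling back to a degraded response when switched off."""
    if FEATURE_SWITCHES.get(feature, False):
        return handler()
    return fallback()          # e.g. a static degraded page or a cached response
```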

Soft State #

Soft state means that the system is allowed to exist in an intermediate state. When a failure occurs, data synchronization between replicas in different regions may be delayed or paused, and the data in each region may be inconsistent, but this state does not prevent the system from continuing to serve requests. This inconsistency among node states is called soft state.

Eventual Consistency #

Eventual consistency means that a distributed system does not need to maintain strong consistency in real-time. It can tolerate data inconsistency when a system failure occurs. After the system failure is resolved, data is synchronized and eventually reaches a consistent state again.

The BASE theory was formulated for large and medium-scale distributed systems, and it is particularly well suited to today's large and medium-scale Internet systems:

  • First, user experience comes first, so availability should take priority in system design.
  • Second, in the event of a failure, the availability of certain functions and the strong consistency of data can be sacrificed to keep the core functions of the system available.
  • Finally, after the failure is resolved, various strategies are employed to bring the system back to a consistent state.

Consistency Issues and Countermeasures #

In distributed systems, data is kept in multiple replicas spread across different physical machines to preserve availability and performance. When servers or networks fail, a write may succeed on some replicas and fail on others, leaving the replicas with conflicting contents, that is, inconsistent data. The key to maintaining consistency in a distributed system is therefore solving this data consistency problem.

There are several solutions to maintain data consistency, including distributed transactions, master-slave replication, and business-level message buses.

Distributed Transactions #

In distributed transactions, the transaction is only committed if all nodes can execute the series of operations within the transaction successfully; otherwise, it will be rolled back. This ensures strong consistency of data within the system. Distributed transactions are widely used in various applications. For example, in a cross-bank transfer scenario, when user A transfers money to user B, the decrease in account A and the corresponding increase in account B must be part of a distributed transaction. Other scenarios, such as ticket purchase on a railway ticketing website or buying funds on Alipay, also require transactional operations.
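
To make the all-or-nothing requirement concrete, here is a toy single-process sketch in which the debit and the credit either take effect together or are rolled back together. It only illustrates the semantics; a real cross-bank transfer would involve a distributed transaction protocol such as the two-phase commit described below.

```python
class TransactionAborted(Exception):
    pass

accounts = {"A": 100, "B": 50}   # toy account balances

def transfer(src, dst, amount):
    """Debit src and credit dst as one all-or-nothing unit."""
    snapshot = dict(accounts)            # remember state so we can roll back
    try:
        if accounts[src] < amount:
            raise TransactionAborted("insufficient balance")
        accounts[src] -= amount          # operation 1: debit
        accounts[dst] += amount          # operation 2: credit
    except Exception:
        accounts.clear()
        accounts.update(snapshot)        # roll back: neither change is visible
        raise

if __name__ == "__main__":
    transfer("A", "B", 30)
    print(accounts)                      # {'A': 70, 'B': 80}
```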

There are various specific solutions for distributed transactions, including two-phase commit (2PC), three-phase commit (3PC), Paxos, Zab, Raft, etc.

In the two-phase commit protocol, the system consists of two types of nodes: coordinators and transaction participants. There is usually only one coordinator, while there can be multiple participants, which can be understood as the number of data replicas.

The execution of the two-phase commit is divided into the request phase and the commit phase. In the request phase, the coordinator notifies the transaction participants to prepare to commit or cancel the transaction. After receiving the notification, each participant votes: if it executes its local operation successfully, it votes to agree; if the execution fails, it votes to cancel. The votes are sent back to the coordinator, and the protocol then enters the commit phase.

In the commit phase, the coordinator decides whether to commit or cancel the transaction based on the first-phase votes: the transaction is committed only if every participant voted to agree. The coordinator then broadcasts the decision to all participants, and each participant carries out the corresponding operation upon receiving it.
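
The toy Python sketch below follows these two phases: the coordinator collects votes in the request phase and broadcasts commit or abort in the commit phase. Networking, write-ahead logging, locking, and timeout handling are deliberately omitted, so this is only an outline of the protocol's control flow.

```python
class Participant:
    """A toy transaction participant holding one replica of the data."""

    def __init__(self, name):
        self.name = name
        self.pending = None
        self.committed = {}

    def prepare(self, key, value):
        """Request phase: try the operation locally and vote agree/cancel."""
        try:
            self.pending = (key, value)   # in reality: write-ahead log + lock
            return True                   # vote "agree"
        except Exception:
            return False                  # vote "cancel"

    def commit(self):
        key, value = self.pending
        self.committed[key] = value
        self.pending = None

    def abort(self):
        self.pending = None               # discard the prepared change

def two_phase_commit(participants, key, value):
    """Coordinator: commit only if every participant votes to agree."""
    votes = [p.prepare(key, value) for p in participants]   # phase 1: request/vote
    if all(votes):                                           # phase 2: broadcast decision
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"

if __name__ == "__main__":
    replicas = [Participant("r1"), Participant("r2"), Participant("r3")]
    print(two_phase_commit(replicas, "user:1:balance", 100))   # committed
```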

The three-phase commit is similar to the two-phase commit, but introduces timeouts for both the coordinator and the participants and divides the first phase of the two-phase commit into two steps: first, it asks for votes and then locks the resources.

Consensus-based solutions such as Paxos, Zab, and Raft share a similar idea. Each data replica carries version information; every write must reach a majority of the replicas (more than N/2), and every read must consult a majority, so that the majority determines the final result. This quorum-style arbitration is widely used in industry. Amazon's Dynamo storage, for example, uses a similar but simpler rule: it only requires that the number of replicas written (W) plus the number read (R) exceed the total number of replicas (N), that is, W + R > N. The arbitration is usually performed by a coordinator, but it can also be left to the business client, as Dynamo allows. After arbitration, the arbiter can choose the correct version of the data and, in certain scenarios, even merge different versions into a new value.
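
As a concrete illustration of the quorum rule, the sketch below uses N = 3 replicas with W = 2 and R = 2, so W + R = 4 > N and every read quorum overlaps every write quorum in at least one replica; the highest version seen by a read therefore reflects the latest successful write. This is a simplification of what quorum systems like Dynamo do, not their actual implementation.

```python
N, W, R = 3, 2, 2                       # replica count and quorum sizes: W + R = 4 > N = 3

replicas = [dict() for _ in range(N)]   # each replica maps key -> (version, value)

def quorum_write(key, value, version, down=()):
    """Write to every reachable replica; succeed only if at least W acknowledge."""
    acks = 0
    for i, replica in enumerate(replicas):
        if i in down:
            continue                     # simulate an unreachable replica
        replica[key] = (version, value)
        acks += 1
    return acks >= W

def quorum_read(key, down=()):
    """Read from R reachable replicas and keep the highest version seen."""
    responses = []
    for i, replica in enumerate(replicas):
        if i in down:
            continue
        responses.append(replica.get(key, (0, None)))
        if len(responses) == R:
            break
    return max(responses)                # overlap with the write quorum guarantees freshness

if __name__ == "__main__":
    quorum_write("k", "v1", version=1)
    quorum_write("k", "v2", version=2, down={0})   # replica 0 misses the latest write
    print(quorum_read("k", down={1}))              # still returns (2, 'v2')
```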

Master-Slave Replication #

Master-slave replication is another widely used consistency solution. Databases such as MySQL rely on it, and Redis, covered earlier in this course, also uses master-slave replication to keep the master and its slaves consistent.
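
As a small practical illustration, the snippet below uses the redis-py client against a master/replica pair. The hostnames are placeholders, and the replica is assumed to have been configured with `replicaof` pointing at the master; because replication is asynchronous, a read from the replica may briefly lag behind the master.

```python
import redis

# Hypothetical hosts: the replica is configured with "replicaof master-host 6379".
master = redis.Redis(host="master-host", port=6379)
replica = redis.Redis(host="replica-host", port=6379)

# All writes go to the master; the master ships them to the replica asynchronously.
master.set("user:1:name", "alice")

# Reads can be served by the replica to spread load, at the cost of possible
# replication lag: immediately after the write this may still return the old value.
print(replica.get("user:1:name"))
```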

In addition to ensuring consistency at the data layer, it is also possible to update the cache and storage systems through a message bus at the business layer. This is often considered in the design of geo-redundancy solutions by internet companies.

With a message bus, messages are distributed across regions using either a push or a pull model. In general, the pull model gives each consumer better visibility into and control over what has been delivered, so it is better able to guarantee data consistency when network failures occur.
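
The sketch below shows why the pull model copes well with failures: each region's consumer tracks its own consumption offset and simply resumes from that offset after an error, so no message is skipped. The shared log and its interface are simplified stand-ins, not the API of any particular message bus.

```python
import time

class RegionConsumer:
    """Pulls messages from a shared log and tracks its own consumption offset."""

    def __init__(self, log, apply_message):
        self.log = log                    # ordered list of messages (the "bus")
        self.offset = 0                   # position of the next message to consume
        self.apply_message = apply_message

    def pull_once(self, batch_size=10):
        batch = self.log[self.offset:self.offset + batch_size]
        for message in batch:
            self.apply_message(message)   # e.g. update the local cache and DB
            self.offset += 1              # advance only after a successful apply

    def run(self, interval=1.0):
        while True:
            try:
                self.pull_once()
            except Exception:
                pass                      # offset unchanged on failure, so retry later
            time.sleep(interval)
```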

Case Study on Consistency of Data in Distributed Systems Across Multiple Regions #

[Figure: Multi-region data synchronization at Weibo through the WMB message bus]

As shown in the figure above, this is a case study of how Weibo keeps data consistent across multiple regions. Messages are distributed through the message middleware WMB, with one region of the distributed system on each side of it. Every user write in a region is encapsulated as a message: business messages are first written to the message queue service, and a message queue processor then reads them and updates the cache and the database (DB). When a business message is written to the local queue service, WMB simultaneously distributes it to the subsystems in the other regions: the local WMB component writes the message to its local queue, and the remote WMB client reads it from there. If a partition failure occurs and the remote read fails, the message still sits in each region's queue, so it is not lost. During the partition, the subsystem in each region processes only local events. Once the partition is resolved, the WMB client resumes reading remote messages, the message processor executes them, and the data eventually becomes consistent.
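
WMB is Weibo's internal middleware, so the sketch below is only a generic stand-in for the flow just described: every local write becomes a message, a message processor updates the cache and the DB, and messages from other regions are processed once they become reachable again. All names and data structures are illustrative.

```python
from collections import deque

local_queue = deque()      # messages produced by local user writes
remote_queue = deque()     # messages from other regions (may stall during a partition)
cache, db = {}, {}

def on_user_write(key, value):
    """Every write is encapsulated as a message and enqueued locally first."""
    local_queue.append({"key": key, "value": value})
    # The message bus would also ship this message to the queues of other regions.

def process_messages(message_queue):
    """The message processor reads messages and updates the cache and the DB."""
    while message_queue:
        message = message_queue.popleft()
        db[message["key"]] = message["value"]
        cache[message["key"]] = message["value"]

def after_partition_heals():
    """Remote messages buffered during the partition are processed once reachable."""
    process_messages(remote_queue)

if __name__ == "__main__":
    on_user_write("status:1", "hello")
    process_messages(local_queue)          # local events are always processed
    remote_queue.append({"key": "status:2", "value": "from another region"})
    after_partition_heals()                # eventual consistency after recovery
    print(db, cache)
```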

Because WMB synchronizes at the business level through the message queue, each region executes its own events first during a partition and the remote events only after the partition is restored. The order in which events are executed can therefore differ between regions, which in extreme scenarios may leave the data inconsistent. For this reason, Weibo uses WMB only for cache updates, while the DB layer still relies on master-slave replication for strong consistency. Even if a small amount of cache data is temporarily inconsistent during the recovery period, it still reaches eventual consistency once it is reloaded from the DB, mainly because a shorter expiration time is used for data written during recovery. Weibo also does not drive cache updates from DB data, because the cache's data structures are too complex and often need to be extended for business needs: a single cache record can involve multiple DB records and multiple Redis records, and too many factors would have to be synchronized from the DB to trigger cache updates, making the process hard to control. After the DB-driven approach to cache updates failed, Weibo switched to updating the cache through the WMB message queue.