16 Revisiting Consistency – Consistency Models Beyond CAP #

In “05 | Consistency and CAP Model: Why Distributed Consistency is Needed”, we discussed the important concept of consistency models in distributed databases. Due to space constraints, I only mentioned a few strongly consistent models for data nodes (servers) in that section. So in this lecture, we will continue to discuss the remaining consistency models, including client (session) consistency, eventual consistency, and so on.

Now let’s complete the knowledge system of consistency models together.

Complete Consistency Model #

The complete consistency model is shown in the following diagram.

[Figure: the complete consistency model, with colors indicating the degree of availability]

The different colors represent the degree of availability. Let me explain in detail.

  1. Pink represents complete unavailability after network partition. This corresponds to CP-type databases.
  2. Yellow represents strict availability. When a client continuously accesses the same database node, the system remains available to that client even after a network partition. It is considered an AP database at the data (server) side, but a CP database from the client’s perspective.
  3. Blue represents complete availability. As you can see, they are all client consistency models, so they are generally considered AP databases.

We can see that the consistency level decreases from top to bottom in the diagram. I introduced the first three consistency models in Lecture 05, and now I will introduce the remaining ones, starting with the client consistency models.

Client Consistency #

Client consistency refers to observing system consistency from the perspective of a client. We previously studied consistency from the dimension of “ordering” because it focuses on the consistency of data between multiple nodes. However, if we only consider one client, we only need to study “latency”.

In distributed databases, a client is likely to connect to multiple replicas (nodes), and replication latency can cause inconsistencies when reading data from different replicas. Client consistency is designed to define and solve this problem, and it includes Writes Follow Reads, Pipelined Random Access Memory (PRAM)/FIFO, Monotonic Reads, and Monotonic Writes.

Writes Follow Reads (WFR)

WFR is also known as session causal consistency. It differs from causal consistency in that it only concerns a single client: if a client reads value V1 in one operation and then writes V2 afterwards, then from the perspective of every other node, the write of V1 must be ordered before the write of V2.

The latency aspect of WFR can be described as follows: replication latency is allowed when V1 is written, but once V1 has been read, the system must behave as if V1 were already present on every replica, so that the subsequent write of V2 is ordered correctly on whichever replica receives it.
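To make this concrete, below is a minimal sketch of one way a session can enforce WFR with timestamps. The Session and Replica classes and the timestamp scheme are illustrative assumptions rather than any particular system’s implementation, and tie-breaking between different clients is ignored.

```python
# A minimal Writes Follow Reads (session causal consistency) sketch.
# Session and Replica are hypothetical names used for illustration.

class Replica:
    def __init__(self):
        self.value = None
        self.ts = 0

    def read(self):
        return self.value, self.ts

    def apply(self, value, ts):
        # Writes are ordered by timestamp, so a write that causally
        # follows a read always lands after the value that was read.
        if ts > self.ts:
            self.value, self.ts = value, ts


class Session:
    """Tracks the largest timestamp this client has observed on reads."""

    def __init__(self):
        self.last_seen_ts = 0

    def read(self, replica):
        value, ts = replica.read()
        self.last_seen_ts = max(self.last_seen_ts, ts)
        return value

    def write(self, replicas, value):
        # Tag the write so it sorts after everything this session has read.
        ts = self.last_seen_ts + 1
        for r in replicas:
            r.apply(value, ts)
        self.last_seen_ts = ts
```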

Pipelined Random Access Memory (PRAM)/FIFO

The name “Pipelined Random Access Memory” (PRAM) comes from the shared-memory access model. As mentioned in Lecture 05, distributed systems borrow concepts from concurrent memory access consistency to describe their own problems. People later found the name strange and renamed the analogous consistency level in distributed systems First-In-First-Out (FIFO).

PRAM states that write operations initiated by one node are executed by every other node in the order in which that node issued them. The biggest difference between PRAM and sequential consistency is that sequential consistency requires a single fixed order for the writes of all nodes, while PRAM only requires that each node’s own writes stay in order and imposes no ordering between writes from different nodes.

PRAM can be divided into the following three consistencies.

  1. Read Your Writes (RYW): After a node writes data, it can read that data on the same node or any other node.
  2. Monotonic Read (MR): It emphasizes that if a value is read, any subsequent read will also read that value or a value after it.
  3. Monotonic Write (MW): If two values are written from a node and their execution order is V1, V2, then the execution order observed from any node should be V1, V2.

Consistency that satisfies RYW, MR, and MW at the same time is PRAM. The implementation of PRAM generally involves clients continuously connecting to the same node, so there is no latency issue when reading and writing to the same node.
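As a rough illustration of the sticky-session approach, here is a minimal sketch; the Node and StickyClient names and the hash-based routing are assumptions made for illustration, and asynchronous replication to the other nodes is omitted.

```python
# A minimal sticky-session sketch of PRAM. Pinning a client to one node
# gives it RYW, MR, and MW "for free", because it always observes that
# node's local order. Replication to other nodes is omitted here.

class Node:
    def __init__(self, name):
        self.name = name
        self.store = {}

    def write(self, key, value):
        self.store[key] = value

    def read(self, key):
        return self.store.get(key)


class StickyClient:
    def __init__(self, nodes, client_id):
        # Route every request from this client to the same node.
        self.node = nodes[hash(client_id) % len(nodes)]

    def write(self, key, value):
        self.node.write(key, value)

    def read(self, key):
        # Reads see all of this client's earlier writes (RYW) and never
        # go "back in time" (MR), since they always hit a single node.
        return self.node.read(key)
```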

We can combine PRAM with WFR to achieve stronger causal consistency. That is, by having a client connect to the same node while maintaining session-level causal consistency, we obtain ordinary causal consistency. This pattern differs from the one introduced in Lecture 05: this time we build causal consistency by composing models, which makes the model hierarchy easier to remember. However, this does not mean that causal consistency must be built this way; on the contrary, the timestamp-based approach introduced in Lecture 05 is more common.

As we just mentioned, PRAM is strictly available, but not completely available. If complete availability is required, RYW can generally be sacrificed, leaving only MR and MW. This scenario is suitable for cases where writes and reads are initiated by different clients.

So far, we have introduced all the strong consistency models. If you understand the diagram above, you have a complete understanding of consistency models. Now let’s move on to the weakest consistency model, eventual consistency.

Eventual Consistency #

Eventual consistency is a very famous concept. With the development of the internet and large-scale distributed systems, this concept has been widely spread. It is described as asynchronous data replication between replicas. If the data stops being modified, the replicas will eventually become completely consistent. This eventual consistency can take milliseconds, days, months, or even “forever”.

Eventual consistency has the highest degree of concurrency, as data writes and reads are not constrained by any other conditions. If concurrent writes modify the same data, various conflict resolution methods mentioned earlier are generally used, such as last-write-wins or vector clocks.

However, eventual consistency by itself is often not usable in distributed databases. It can cause various anomalies, such as not being able to read data that was just written, or reads that sometimes return the new value and sometimes do not. Because the database is a core underlying system on which many applications are built, such instability is difficult to accept in the design of distributed databases. Therefore, we often use tunable consistency to implement AP-type distributed databases.

Tunable Consistency #

In general distributed systems, a write or a read targets a single node. Tunable consistency addresses the drawbacks of eventual consistency by introducing the ability to write to and read from multiple nodes concurrently. Now let’s introduce the three variables in the design of tunable consistency.

  1. Number of Replicas (N): This is the total number of nodes in the distributed cluster, i.e., the total number of replicas.
  2. Minimum Concurrent Writes (W): When a piece of data is synchronized to this number of nodes, we consider the data to have been successfully written.
  3. Minimum Concurrent Reads (R): When reading data, at least this many nodes must be read; the results are compared and the most recent value is chosen. Only then is the read considered successful.

When the concurrent read and write count of a distributed system satisfies the following formula:

W + R > N

Then we consider that the system can guarantee that the latest data can always be read. As you can see, the set of nodes written and the set of nodes read must overlap, so each read is guaranteed to include at least one node that holds the latest write.

A common example is N=3, W=2, and R=2. In this case, the system can tolerate the failure of one node. Under normal circumstances, all three nodes can provide read and write services. If one node fails, the minimum number of nodes for read and write can still be met. When all three nodes are functioning, the latest data is present in at least two of them, so we can randomly read two out of three nodes, which ensures that at least one node contains the latest data.

You may have noticed that I used the term “minimum” frequently in the previous paragraphs. This indicates that when implementing such a distributed database in practice, we can write to three nodes simultaneously when performing writes. However, as long as two of the nodes return successfully, we consider the write operation successful. The same applies to reading, where we can initiate three concurrent reads, but only need to retrieve the fastest two results.
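The sketch below ties N, W, and R together, assuming hypothetical ReplicaNode and QuorumClient classes with last-write-wins by timestamp; a real system would issue the requests in parallel, wait only for the first W or R responses, and add mechanisms such as read repair and hinted handoff.

```python
# A toy quorum client for the N/W/R scheme described above.
import time

class ReplicaNode:
    def __init__(self):
        self.data = {}  # key -> (value, timestamp)

    def put(self, key, value, ts):
        current = self.data.get(key)
        if current is None or ts > current[1]:
            self.data[key] = (value, ts)
        return True

    def get(self, key):
        return self.data.get(key)


class QuorumClient:
    def __init__(self, replicas, w, r):
        assert w + r > len(replicas), "W + R must exceed N"
        self.replicas, self.w, self.r = replicas, w, r

    def write(self, key, value):
        ts = time.time()
        acks = sum(1 for rep in self.replicas if rep.put(key, value, ts))
        # The write succeeds once at least W replicas acknowledge it.
        return acks >= self.w

    def read(self, key):
        # Read any R replicas; the W/R overlap guarantees at least one
        # of them holds the latest committed write.
        results = [rep.get(key) for rep in self.replicas[: self.r]]
        results = [item for item in results if item is not None]
        return max(results, key=lambda item: item[1])[0] if results else None


# Example: N=3, W=2, R=2, as in the scenario above.
cluster = [ReplicaNode() for _ in range(3)]
client = QuorumClient(cluster, w=2, r=2)
client.write("k", "v1")
print(client.read("k"))  # -> "v1"
```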

So, some may ask, why not write to or read from all nodes every time? That is also possible. For example, in scenarios with a high write load, we can choose W=1 and R=N; conversely, in scenarios with a high read load, we can choose W=N and R=1. You may notice that these two patterns correspond to the strong consistency discussed earlier: the former is client consistency, and the latter is data-side consistency (synchronous replication). Tunable consistency therefore spans a range from weak consistency to strong consistency.

How do we choose W and R? Increasing W and R strengthens the guarantee but increases latency and lowers concurrency; decreasing them has the opposite effect. A commonly used approach is the Quorums method, where a Quorum is a majority of the nodes in the cluster; for example, in a cluster of 3 nodes the Quorum is 2. This concept appears repeatedly in distributed databases, for instance in leader elections and consensus mechanisms.

For tunable consistency, if both W and R are set to the Quorum size, data can still be read and written as long as the number of failed nodes is smaller than the Quorum. This choice strikes the best balance between concurrency and availability: if W and R were smaller than the Quorum, the condition W + R > N would no longer hold, while making them larger only lowers concurrency without improving availability, since the larger W and R are, the fewer node failures the system can tolerate.

However, there is a classic consideration when using the Quorums method: the number of nodes should be odd; otherwise a network partition can split the cluster into two equal halves, neither of which can form a majority.
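A quick calculation also shows why an even node count buys nothing extra; the snippet below is only an illustration of the arithmetic.

```python
# Quorum size and fault tolerance for a few cluster sizes.
for n in (3, 4, 5, 6):
    quorum = n // 2 + 1
    tolerated = n - quorum
    print(f"N={n}: quorum={quorum}, tolerates {tolerated} failed node(s)")
# N=3 and N=4 both tolerate 1 failure; N=5 and N=6 both tolerate 2,
# so the extra even-numbered node adds cost without adding tolerance.
```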

Witness Replica #

In the previous paragraphs, I introduced the Quorums method to improve availability: multiple replicas are written when performing writes and multiple replicas are read when performing reads, and as long as these two sets of replicas intersect, consistency can be ensured. Although not every replica is touched by a write, the data is usually propagated to all replicas through background replication. For example, if there are 9 nodes and the Quorum is 5, even if only 5 nodes are written initially, the data will eventually be present on all 9 nodes. This increases disk consumption without substantially improving availability.

To address this, we can introduce Witness replicas to improve the situation described above. The nodes in the cluster are divided into replication nodes and Witness nodes. Replication nodes store the actual data, while Witness nodes do not store data under normal circumstances. However, when the number of available nodes in the cluster decreases, we can temporarily convert some Witness nodes into nodes capable of storing data. When the nodes in the cluster recover, we can convert them back into Witness nodes, thereby releasing the stored data.

So, how many Witness replicas are needed to ensure consistency? Let’s assume we have r replication replicas and w Witness replicas. The total number of replicas is r+w, and the following two rules need to be satisfied:

  1. Read and write operations must involve a Quorum of nodes, that is, (r+w)/2+1 of them (rounding the division down).
  2. Among the nodes in rule 1, at least one must be a replication node.

As long as these two rules are met, adding Witness nodes does not compromise consistency, as the small check below illustrates.
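This is a minimal sketch of the two rules, assuming a hypothetical helper function; it only checks the counting argument and ignores which specific nodes are contacted.

```python
# Check whether a read or write involving some replication nodes and
# some Witness nodes satisfies the two rules above.
def witness_quorum_ok(contacted_replication, contacted_witness, r, w):
    quorum = (r + w) // 2 + 1
    total = contacted_replication + contacted_witness
    # Rule 1: a full Quorum of nodes must take part in the operation.
    # Rule 2: at least one of them must be a replication node.
    return total >= quorum and contacted_replication >= 1


# Example: r=3 replication replicas plus w=2 Witness replicas -> Quorum of 3.
print(witness_quorum_ok(1, 2, r=3, w=2))  # True: quorum met, one real replica
print(witness_quorum_ok(0, 3, r=3, w=2))  # False: no replication node involved
```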

Nowadays, distributed databases widely use Witness nodes to improve storage efficiency, including Apache Cassandra, Spanner, and TiDB, among others. However, their usage details differ, so if you are interested, you can research them further on your own.

CRDT Algorithm #

In the previous discussion of eventual consistency, we mentioned using tunable consistency to maintain consistency. We can also use Conflict-Free Replicated Data Types (CRDTs) to resolve data conflicts under eventual consistency.

Eric Brewer, the proposer of the CAP theorem, noted in his retrospective on CAP that C and A are not entirely mutually exclusive, and he recommended using CRDTs to ensure consistency. Since then, various distributed systems and applications have started exploring CRDTs; Microsoft’s CosmosDB uses CRDTs as its multi-master consistency solution, and many cloud providers use CRDTs to build multi-master consistency solutions for Redis.

As CRDT algorithms are still developing rapidly, I will focus on Ctrip’s internal Redis cluster consistency solution, whose technical choices have been proven in practice. If you are interested in CRDTs, you can explore further on your own; here I will not go into CRDTs such as PN-Counter and G-Set.

Since the most common way Redis data is written is by setting string values, we need a CRDT register to handle it. The Ctrip team chose the classic LWW Register, that is, “last write wins”, as the conflict resolution strategy.

In this solution, the most important thing is for the data to carry timestamps. We use the following figure to illustrate how it works.

[Figure: LWW Register merge, where each node holds a (value, timestamp) tuple and the value with the larger timestamp wins]

From the figure, we can see that each node’s data is a tuple consisting of a value and a timestamp. Merging between nodes is based on the timestamps: the value from the node with the largest timestamp determines the merged result. Using the LWW Register ensures that the merged results are eventually consistent even under high concurrency.
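A minimal sketch of such a register is shown below, assuming a hypothetical LWWRegister class; the (value, timestamp) tuple and the merge rule follow the description above.

```python
# A last-write-wins (LWW) register: the larger timestamp always wins.
class LWWRegister:
    def __init__(self, value=None, ts=0):
        self.value, self.ts = value, ts

    def set(self, value, ts):
        if ts > self.ts:
            self.value, self.ts = value, ts

    def merge(self, other):
        # Merging is commutative, associative, and idempotent, so replicas
        # converge no matter the order in which they exchange state.
        if other.ts > self.ts:
            self.value, self.ts = other.value, other.ts
```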

However, deletion requires a different algorithm: the Observed-Remove Set (OR-Set), which mainly solves the problem that simpler approaches cannot handle deleting a value and then adding it back.

Compared to the LWW Register, it is a bit more complex: in addition to a timestamp, each value also carries a unique tag. For example, in the figure above, P1 sets (1,3), but in reality it is stored as (1α,3). If 1 is then deleted, the set becomes empty. When 1 is added again, the tag must differ from the previous one, making it (1β,5). This ensures that the deletion in step 2 does not cancel the addition in step 3: although the two entries share the same numerical value, their tags differ, so they are distinct.
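Below is a compact OR-Set sketch that shows only the tagging behaviour described above; the class name and the use of uuid4 for tags are illustrative assumptions, and the per-value timestamps are omitted for brevity.

```python
# An Observed-Remove Set: removes only cancel the tagged copies that
# were actually observed, so a later add with a fresh tag survives.
import uuid

class ORSet:
    def __init__(self):
        self.added = set()    # (value, tag) pairs observed as added
        self.removed = set()  # (value, tag) pairs observed as removed

    def add(self, value):
        # Each add gets a fresh tag, so re-adding after a remove creates
        # a new element that the earlier remove cannot cancel.
        self.added.add((value, uuid.uuid4().hex))

    def remove(self, value):
        self.removed |= {(v, t) for (v, t) in self.added if v == value}

    def contains(self, value):
        return any(v == value for (v, _) in self.added - self.removed)

    def merge(self, other):
        self.added |= other.added
        self.removed |= other.removed
```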

These are the core techniques behind Redis’s cross-IDC asynchronous replication. Of course, there are many more details that you can study on your own if you are interested.

Conclusion #

We have now covered the consistency models of distributed databases. With this knowledge, you should be able to read the consistency model diagram and grasp both data-side and client-side consistency; combined with the CAP theorem, it also lets you appreciate the trade-offs between the different consistency levels and availability.

Eventual consistency is generally applied where nodes span data centers or regions and there is no synchronous replication from a single primary. Tunable consistency and CRDT algorithms can then be used to keep the replicas consistent.

After learning about consistency, we can then assess the consistency concepts in various distributed database documentation and understand the design principles behind them. In the final lecture of this module, I will provide examples to explain the underlying logic in the consistency aspects of distributed databases.

I welcome you to think along with me and wish you continuous growth every day. In the next lecture, we will explore how data is reliably transmitted, and I hope to see you on time.