
05 Consistency and the CAP Model - Why We Need Distributed Consistency #

In the previous lecture, we discussed the concept of replication, which included the notion of “consistency”. In this lecture, let’s talk about the CAP theorem and the concept of consistency. I will focus on the consistency model, as it is the theoretical foundation for replication consistency and distributed transactions.

Before we begin, let’s first discuss what “consistency” in distributed databases, and even in general distributed systems, really means.

Consistency is a Prerequisite for High Availability #

In the real world, the nodes of a distributed database are not always active and able to communicate with each other. However, these failures should not affect the availability of the database. In other words, from the user’s perspective, the system must continue to operate as if no failures have occurred. System availability is a crucial characteristic of distributed databases, and even in software engineering, we always strive to achieve high availability and minimize downtime as much as possible.

To make a system highly available, it needs to be designed to tolerate the failure or unavailability of one or more nodes. To achieve this, we introduce the replication techniques mentioned in the previous lecture, using multiple redundant replicas to improve the availability of the system. However, once these replicas are added, we face the challenges of keeping multiple data replicas in sync and of recovering the system after a failure.

This is where the concept of RPO (Recovery Point Objective), which came up when we discussed the evolution of MySQL replication, comes in: the system not only needs to be available, but the data also needs to be consistent. Therefore, high availability must strive to satisfy two indicators: business continuity and data consistency.

The CAP theorem, which we are about to introduce, tells us that there is a third factor to consider: the impact of network partitions on availability. It shows that availability and consistency cannot be achieved simultaneously in the presence of network partitions.

The CAP Theorem and Considerations #

First of all, availability is the ability of a system to successfully process and respond to each request. Availability is defined as the overall perception of the system’s response by the users. However, in practice, we want each component of the system to maintain availability.

Secondly, we want every operation to be consistent. In this context, consistency refers to either atomic consistency or linearizability. Linearizability can be understood as follows: within a distributed system, the operation history on all identical replicas can be viewed as a log, and the order of operations in the log is the same. Linearizability simplifies the calculation of possible system states and makes the distributed system appear as if it is running on a single machine.

Finally, we want to achieve both consistency and availability while tolerating network partitions. Networks are highly volatile and often split into multiple independent subnetworks that cannot communicate with each other. In these partitioned nodes, certain messages sent between them will not reach their destination.

So, to summarize, availability requires that any fault-free node can provide service, while consistency requires that the results be linearly consistent. The CAP theorem, proposed by Eric Brewer, discusses the trade-offs between consistency, availability, and partition tolerance.

It states that in an asynchronous system it is impossible to satisfy both availability and consistency in the presence of network partitions. However, we can build systems that prioritize availability while ensuring consistency to the best of their ability, or systems that prioritize consistency while ensuring availability to the best of their ability. “Best effort” here means that when everything is normal the system can guarantee the property, but under network partitioning it is allowed to weaken or violate that guarantee. In other words, the CAP theorem describes a trade-off in which choices must be made. Based on the CAP theorem, we can have the following types of systems:

  • CP system: a consistent and partition-tolerant system. It prefers to reduce service availability rather than serve inconsistent data. Some NewSQL databases built for transactional scenarios tend to adopt this strategy, such as TiDB, Alibaba Cloud PolarDB, and AWS Aurora. That said, they implement their own form of “A”, which means they still strive for high availability.
  • AP system: an available and partition-tolerant system. It relaxes the consistency requirement and allows potentially inconsistent values to be returned. Wide-column and other NoSQL databases tend to favor AP, such as Apache Cassandra, and they typically offer tunable consistency levels so that stronger consistency can be provided when needed.

The implementation approach for CP systems involves introducing consensus algorithms, where a majority of nodes need to participate to ensure consistency. If consistency needs to be maintained at all times, then in the case of network partitioning, some nodes may become unavailable.
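To make the majority requirement concrete, here is a minimal Go sketch of the quorum check a CP system performs; the function and counts are illustrative, not taken from any particular database. A write is acknowledged only when a majority of replicas confirm it, and the side of a partition that cannot assemble a majority must reject writes, trading availability for consistency.

```go
package main

import "fmt"

// hasQuorum reports whether the acknowledged replicas form a strict majority.
// Real CP systems (Paxos, Raft) track quorums per log entry rather than as a
// single counter; this only illustrates the arithmetic.
func hasQuorum(acks, replicas int) bool {
	return acks >= replicas/2+1
}

func main() {
	replicas := 5

	// With 3 of 5 replicas reachable, a majority exists: writes can proceed.
	fmt.Println(hasQuorum(3, replicas)) // true

	// After a partition leaves only 2 reachable replicas, the minority side
	// must reject writes to preserve consistency, sacrificing availability.
	fmt.Println(hasQuorum(2, replicas)) // false
}
```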

On the other hand, an AP system can start with just one replica, and the database always accepts read and write requests. It may eventually lose data or return inconsistent results. Client-centric or session consistency models can be used here to provide consistency guarantees.
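As an illustration of the session approach, below is a minimal sketch of a read-your-writes guarantee layered on top of eventually consistent replicas. The Versioned and Session types and the version numbers are hypothetical and exist only to show the idea.

```go
package main

import "fmt"

// Versioned is a value returned by a replica together with a monotonically
// increasing write version (illustrative only).
type Versioned struct {
	Value   string
	Version int64
}

// Session tracks the highest version this client has written or read, so
// later reads can reject replicas that are too stale (read-your-writes).
type Session struct {
	lastSeen int64
}

// AcceptRead returns the value only if the replica is at least as new as
// anything this session has already observed.
func (s *Session) AcceptRead(v Versioned) (string, bool) {
	if v.Version < s.lastSeen {
		return "", false // replica is behind this session: retry elsewhere
	}
	s.lastSeen = v.Version
	return v.Value, true
}

// RecordWrite remembers the version assigned to this client's own write.
func (s *Session) RecordWrite(version int64) {
	if version > s.lastSeen {
		s.lastSeen = version
	}
}

func main() {
	s := &Session{}
	s.RecordWrite(42) // the client wrote version 42

	if _, ok := s.AcceptRead(Versioned{Value: "old", Version: 41}); !ok {
		fmt.Println("stale replica, read rejected")
	}
	if v, ok := s.AcceptRead(Versioned{Value: "new", Version: 42}); ok {
		fmt.Println("read:", v)
	}
}
```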

There are some limitations to consider when using the CAP theorem.

CAP discusses network partitioning, not node crashes or other types of failure. Nodes on either side of a partition may continue to accept requests, and this is what leads to inconsistency, whereas a crashed node does not respond at all and cannot cause such inconsistency. In other words, not every failure results in the consistency issues described above; conversely, network partitioning does not cover all the failure types encountered in real scenarios.

CAP means that even if all nodes are operational, we may still encounter consistency issues because of connectivity problems between them. The CAP theorem is often represented by a triangle, as if we can match any three parameters at will. However, although we can adjust availability and consistency, partition tolerance is something we cannot actually sacrifice.

If we choose CA and give up P, then in the event of partitioning, to ensure C, the system needs to prohibit write operations. In other words, the system is unavailable when there are write requests. This conflicts with A, which requires the system to be available. Therefore, it is theoretically impossible for a distributed system to choose the CA architecture and can only choose CP or AP architecture.

As shown in the figure below, in fact, CA-type systems do not exist, and you need to pay special attention to this.

Drawing 0.png

Figure 1: The CAP Theorem

The availability in CAP is different from the high availability discussed above: CAP places no restriction on request latency. Additionally, unlike CAP, the high availability of a database does not require every online node to be able to provide service. The “C” in CAP stands for linear consistency (linearizability). Besides linearizability, there are other consistency models, which we will discuss in detail below.

Consistency Models #

Consistency models are classic concepts in distributed systems and important knowledge points when studying distributed databases. However, few people know that consistency models actually originated from shared memory in single-machine theory.

From the perspective of users, a distributed database is like a single-machine database with shared memory. The communication and message passing between nodes are hidden inside the database, which may give users the illusion that a distributed database is a form of shared memory. A single storage unit that supports read and write operations is usually called a register, and we can consider the shared storage that represents a distributed database as a group of such registers.

Each read and write operation on a register is abstracted as a “call” and “completion”. If a failure occurs after a “call” but before the “completion”, we define the operation as a failure. If the call and completion events of an operation occur before another operation is called, we say that this operation happens before the other operation and that these two operations are sequential; otherwise, we say they are concurrent.

As shown in the figure below, a) represents sequential operations, and b) and c) represent concurrent operations.

Drawing 1.png

Figure 2: Sequential operations & concurrent operations

Multiple read or write operations can access a register at the same time. The read and write operations on a register are not instant, and they take some time, namely the time between the call and completion. Concurrent read/write operations performed by different processes are not serialized. Depending on the behavior of registers when operations overlap, their order may differ, and different results may be produced.
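Based on the call/completion definition above, the following minimal Go sketch classifies two register operations as sequential (one happens before the other) or concurrent; the Op type and the logical time values are purely illustrative.

```go
package main

import "fmt"

// Op records the call and completion instants of one register operation,
// using abstract logical times for illustration.
type Op struct {
	Call, Complete int64
}

// HappensBefore is true when a completes before b is called, i.e. the two
// operations are sequential in the sense defined above.
func HappensBefore(a, b Op) bool {
	return a.Complete < b.Call
}

// Concurrent is true when neither operation happens before the other,
// i.e. their [call, complete] intervals overlap.
func Concurrent(a, b Op) bool {
	return !HappensBefore(a, b) && !HappensBefore(b, a)
}

func main() {
	w := Op{Call: 1, Complete: 3}
	r1 := Op{Call: 4, Complete: 5} // starts after w completes
	r2 := Op{Call: 2, Complete: 6} // overlaps with w

	fmt.Println(HappensBefore(w, r1)) // true: sequential
	fmt.Println(Concurrent(w, r2))    // true: concurrent
}
```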

When we discuss database consistency, we can distinguish it from two dimensions.

  1. Latency: It is the time difference between the moment when data changes and the moment when its copies receive the data. This is the replication delay scenario introduced in the previous lecture, which is generally classified as “client consistency”. We will further discuss it in “15 | Consistency Models Other Than CAP”.
  2. Order: It refers to the sequential state in which various operations are performed on all copies of the system. This is the focus of this lecture on consistency models.

Now let’s further explore the concept of order.

When faced with a series of read and write operations, we humans have a subjective judgment about their execution order. Even on a single-machine database, the order of these operations can be determined. However, making such judgments in a distributed system is not so easy, because it is difficult to know exactly when something happened and difficult to synchronize these operations immediately across the entire cluster.

To reason about the order of operations and indicate the true result, we must define a consistency model to guarantee the order.

How should we understand the word “guarantee” in the model? Think of the consistency model as a contract between the user and the database: it specifies how each database replica must behave to meet this ordering guarantee, and what users can expect when reading and writing data. In other words, even when data is read and written concurrently, users can obtain predictable results.

Note that we will discuss single-object and single-operation consistency models, but real database transactions involve multiple steps, which we will discuss further in the “Transaction Isolation Levels and Consistency Models” section below.

Below, I will introduce the consistency models in order of their order guarantee, from strong to weak.

Strong Consistency #

Strong consistency is similar to the absence of replication: any write on any node is immediately available for subsequent reads on all nodes. It involves the concept of a global clock: if any node writes new data A at time T1, then all nodes should read the newly written A at any time T2 where T2 > T1.

Unfortunately, this is only a theoretical model that cannot be achieved in reality, because various physical limitations make it impossible for distributed data to synchronize instantaneously.

Linear Consistency #

Linear consistency (linearizability) is the strictest achievable consistency model for single objects and single operations. In this model, the written value becomes readable by other nodes at some point between the call and the completion of the write, and all nodes see the write as an atomic operation, without observing any intermediate, incomplete state.

Linear consistency requires that once newly written data is read, all subsequent read operations should be able to read this data. In other words, once a read operation reads a value, all subsequent read operations will read this value or at least the “most recent” value.

The above definition comes from early academic papers. I will summarize the key points as follows:

  1. A global clock is needed to achieve the so-called “most recent” concept. Because without a globally consistent time, two independent processes do not have the same concept of “most recent”.
  2. Any read should be able to read the “most recent” value.

Let me explain linearizability using an example.

Suppose we have three nodes: two of them perform write operations on the shared variable x (writing 1 and 2, respectively), while the third node performs three reads that may return the following values:

  1. The first read operation can return 1, 2, or null (the initial value, before the two write operations), as the two write operations are still in progress. The first read can occur before both writes, between the first and second writes, and after both writes.
  2. Since the first write operation has completed but the second write operation has not, the second read operation can only return 1 or 2.
  3. The third read can only return 2, as the second write occurs after the first write.

The following diagram illustrates linearizability:

Drawing 2.png

Figure 3: Linearizability

Linearizability comes at a high cost, and even CPUs do not provide it by default. Readers with concurrent programming experience are probably familiar with the Compare-and-Swap (CAS) operation, which is crucial for high-performance concurrent programming: it lets us achieve linearizable updates through programming techniques.
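As a small, concrete example, the sketch below uses Go's standard sync/atomic package to build a lock-free counter with a CAS retry loop; the counter scenario is my own illustration rather than something from the lecture.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	var counter int64
	var wg sync.WaitGroup

	// Many goroutines increment the counter via a CAS retry loop: each
	// successful increment appears to take effect atomically at one point.
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				old := atomic.LoadInt64(&counter)
				if atomic.CompareAndSwapInt64(&counter, old, old+1) {
					return // our update won; otherwise reload and retry
				}
			}
		}()
	}

	wg.Wait()
	fmt.Println(counter) // always 100
}
```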

A common misconception is that consensus algorithms such as Paxos and Raft automatically provide linearizability. In reality, this is not the case. Taking Raft as an example, the algorithm only guarantees that the log is replicated in a consistent order; it does not prescribe how or when the log is applied to the final state machine, so reads from the state machine are not automatically linearizable.
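To show how linearizable reads can be layered on top of Raft, here is a schematic sketch of the widely used “read index” approach: record the current commit index, confirm leadership, wait until the local state machine has applied up to that index, and only then serve the read. The Leader interface and all names are assumptions made for illustration; this is not TiKV's actual code.

```go
// Package raftread sketches a ReadIndex-style linearizable read on a Raft
// leader, under assumed interfaces.
package raftread

import "errors"

// Leader is a hypothetical view of the functionality a Raft leader exposes.
type Leader interface {
	CommitIndex() uint64         // index of the latest committed log entry
	ConfirmLeadership() bool     // e.g. via a heartbeat round to a quorum
	WaitApplied(idx uint64)      // block until the state machine applied idx
	ReadLocal(key string) string // read from the local state machine
}

func LinearizableRead(l Leader, key string) (string, error) {
	readIndex := l.CommitIndex()

	// Make sure we are still the leader; otherwise a newer leader may have
	// committed writes we have not seen, and the read could be stale.
	if !l.ConfirmLeadership() {
		return "", errors.New("leadership not confirmed, retry")
	}

	// Serve the read only after the local state machine has caught up to
	// the recorded commit index.
	l.WaitApplied(readIndex)
	return l.ReadLocal(key), nil
}
```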

I recommend reading about TiKV’s implementation details of linearizability. Since linearizability is not cost-effective, I won’t go into further detail here. Let’s now talk about sequential consistency and causal consistency.

Sequential Consistency #

Since achieving strict consistency with a global clock is challenging, people came up with the idea of sequential consistency, which relaxes the constraints of a global clock and instead uses distributed logical clocks. Sequential consistency means that all processes see modifications in the same order. Read operations may not receive timely updates from other processes on the same data, but the order in which each process reads different values of the data is consistent.

The following diagram shows P3 and P4 reading after P1 and P2 have written. In real time, 1 is written before 2, but under sequential consistency 1 can be ordered after 2. In addition, even though P3 has already read the value 1, P4 can still read 2. What matters is that, whichever order is chosen, 1->2 or 2->1, P3 and P4 both observe the same one. The diagram below illustrates one possible read order: 2->1.

Drawing 3.png

Figure 4: Sequential Consistency

To further distinguish between linearizability and sequential consistency, let’s use the following diagram:

Drawing 4.png

Figure 5: Distinguishing Linearizability and Sequential Consistency

In figure (a), sequential consistency is satisfied, but linearizability is not. From the perspective of a global clock, the read operation of process P2 on variable x occurs after the write operation of process P1 on variable x, yet it returns an old value. Nevertheless, the diagram satisfies sequential consistency, because the operation orders observed by processes P1 and P2 do not conflict.

In figure (b), linearizability is satisfied because each read operation reads the most recently written result for that variable, and the operation order seen by both processes matches the order of the global clock.

In figure (c), sequential consistency is not satisfied. From the perspective of process P1, its read operation on variable y returns 0, which means that P1's read of y must be ordered before P2's write of y; by the same reasoning, P2's read of x must be ordered before P1's write of x. Combined with each process's own program order, these constraints form a cycle, so no single total order of operations exists and sequential consistency is violated.

In practice, you can use the consensus algorithms mentioned earlier to achieve sequential consistency. These algorithms ensure that operations are executed in the same order on each node, which is exactly what sequential consistency requires.
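Here is a minimal sketch of that idea, with illustrative types only: every replica applies the same agreed-upon log in order, so all replicas observe writes in the same sequence even if that sequence differs from wall-clock order.

```go
package main

import "fmt"

// Op is one write in the replicated log: set Key = Value.
type Op struct {
	Key   string
	Value int
}

// Replica applies the shared log in order; because every replica consumes
// the same log, they all observe writes in the same sequence.
type Replica struct {
	store map[string]int
}

func NewReplica() *Replica { return &Replica{store: map[string]int{}} }

func (r *Replica) ApplyLog(log []Op) {
	for _, op := range log {
		r.store[op.Key] = op.Value
	}
}

func main() {
	// The agreed-upon order (e.g. decided by Paxos/Raft) puts 2 before 1,
	// even if wall-clock time says 1 was written first.
	log := []Op{{Key: "x", Value: 2}, {Key: "x", Value: 1}}

	r1, r2 := NewReplica(), NewReplica()
	r1.ApplyLog(log)
	r2.ApplyLog(log)

	// Both replicas end in the same state and saw the same order.
	fmt.Println(r1.store["x"], r2.store["x"]) // 1 1
}
```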

Systems like Google Megastore use the Paxos algorithm to achieve sequential consistency: within Megastore, whenever data is updated, all nodes synchronize the update and apply operations in the same order.

Causal Consistency #

Compared to sequential consistency, the requirements for causal consistency are lower. It only requires that operations with a causal relationship have a consistent order, while operations without a causal relationship have a random order.

The requirements for causal ordering are as follows:

  1. Local order: events executed within a single process follow their local causal order.
  2. Remote order: if a read operation returns the value written by some write operation, then that write is ordered before the read.
  3. Transitive closure: as with vector clocks, if a->b and b->c, then a->c.

So why do we need causal relationships, and what happens when writes without them are propagated? In the diagram below, the write operations performed by processes P1 and P2 have no causal relationship, so only eventual consistency holds: the results may reach the readers at different times and in different orders. Process P3 sees the value 1 before 2, while P4 sees 2 first and then 1.

Drawing 5.png

Figure 6: Causal Consistency

In contrast, the following diagram shows write operations by P1 and P2 that are causally related and are therefore propagated to P3 and P4 in their logical order. Besides the data itself, a causal write carries a logical clock that establishes the causal relationship between the two writes, which prevents the situation shown in the previous diagram. You can compare the histories observed by P3 and P4 in the two diagrams.

Drawing 6.png

Figure 7: Logical Clock

One major way to implement this logical clock is through vector clocks. The vector clock algorithm utilizes the vector data structure to broadcast the logical timestamps of all processes to each process. When a process sends an event, it writes all known process times into a vector and then propagates it.
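The following minimal Go sketch shows the operations just described: a process increments its own entry for local and send events, merges element-wise on receive, and causal (“happens before”) order is decided by element-wise comparison. The VC type and process names are illustrative.

```go
package main

import "fmt"

// VC maps a process ID to its logical time.
type VC map[string]uint64

// Tick increments the local process's entry before a local or send event.
func (v VC) Tick(self string) { v[self]++ }

// Merge takes the element-wise maximum with a received clock, then ticks
// the local entry, as a process does when it receives a message.
func (v VC) Merge(self string, other VC) {
	for p, t := range other {
		if t > v[p] {
			v[p] = t
		}
	}
	v.Tick(self)
}

// Before reports whether v happened before o: no entry of v exceeds the
// corresponding entry of o, and o is strictly ahead somewhere.
func (v VC) Before(o VC) bool {
	strictly := false
	for p, t := range v {
		if t > o[p] {
			return false
		}
		if t < o[p] {
			strictly = true
		}
	}
	for p, t := range o {
		if _, ok := v[p]; !ok && t > 0 {
			strictly = true
		}
	}
	return strictly
}

func main() {
	a, b := VC{}, VC{}
	a.Tick("P1")     // P1 performs a write: a = {P1:1}
	b.Merge("P2", a) // P2 receives P1's write: b = {P1:1, P2:1}

	fmt.Println(a.Before(b), b.Before(a)) // true false: causally ordered
}
```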

A typical example of causal consistency is the COPS system, a KV database built on the causal+ consistency model, which tracks dependencies between operations to achieve causal consistency. This is very helpful for implementing causality over distributed data in business applications. In addition, Amazon Dynamo, which is based on vector clocks, also achieves causal consistency.

Transaction Isolation Levels and Consistency Models #

Now that we have discussed consistency models, how are they related to transactions in the database field? I will start with the conclusion: they are both related and unrelated.

How should we understand this? Let’s first discuss the lack of relationship between them.

ACID and CAP both contain a “C” for consistency, but their meanings are completely different. In ACID, the A, I, and D (atomicity, isolation, and durability) are capabilities provided by the database to help maintain consistency, while the “C” (consistency) itself is not: it is a logical constraint at the business level.

Take the classic example of transferring money as an illustration. Person A has 100 RMB and person B has 0 RMB. Now A wants to transfer 30 RMB to B. After the transfer, A has 70 RMB and B has 30 RMB, resulting in a total of 100 RMB. Obviously, this is just a logical constraint imposed by the business.

As for the “C” in CAP, I have already made it clear earlier, namely linearizability. It represents the immediacy of reading data from replicas, that is, the guarantee of “when” we can read the “correct” data. The more immediate it is, the more consistent the overall system is when reading data.

So how are they related? The connection lies between transaction isolation levels and consistency models.

If we consider the above example of linearizability as multiple concurrent transactions, we will find that they do not have isolation. This is because consistency models focus on individual operations, while transactions consist of a group of operations.

Now let’s look at another example that demonstrates the problem caused by the lack of consistency in transactions.

Drawing 7.png

Figure 8: Transactions and Consistency

In the figure, the three transactions satisfy isolation: T2 reads the value written by T1. However, the system lacks consistency guarantees, which allows T3 to read a stale value even though T2 has already read the newer one. This can cause bugs in applications.

So here’s the conclusion: transaction isolation describes the behavior between parallel transactions, while consistency describes the behavior between non-parallel transactions. In fact, the broad concept of transaction isolation should be a mixture of classical isolation theories and consistency models.

For example, in some literature you may see terms like “one-copy serializability” or “strong snapshot isolation.” The former refers to the serializable isolation level plus sequential consistency, while the latter refers to the snapshot isolation level plus linearizability.

Therefore, for distributed databases, the original isolation levels are not abandoned but rather enriched by introducing consistency models.

Summary #

This lecture is quite lengthy, but it has been condensed a lot. We started with high availability and discussed the impact of CAP theorem on evaluating distributed models. Then we focused on discussing consistency models, which is the core of this lecture, to help you evaluate the characteristics of distributed databases.

Finally, I explained the differences and connections between transaction isolation levels and consistency models, helping you understand the concept of transaction isolation levels in a distributed database.