20 Common Algorithms Explained - Clarifying Differences Among Algorithms like Paxos and Raft #

Now, we’re going to dive into the final topic of distributed systems: consensus algorithms. In our previous discussions on various distributed technologies, you may recall that we covered failure models, failure detection, leader election, and consistency models. While these technologies can be used independently, it would be wonderful if we could use a single set of techniques to achieve all of these functionalities. After years of effort by researchers in the field of distributed databases and distributed systems, a breakthrough was finally made with the birth of consensus algorithms.

Although consensus algorithms are the essence of distributed systems theory, you already have most of the background needed to understand them from our previous discussions. They tackle the hardest problem in distributed systems, failures: built-in failure detection identifies failed nodes, leader election ensures data is processed efficiently, and consistency models keep messages consistent across nodes.

In this lesson, I will introduce the distinguishing features of several commonly used consensus algorithms. I won’t go into the detailed execution process of each algorithm, as those details are abstract and not particularly helpful for practical usage. My goal is to explain these algorithms from a higher-level perspective, giving you a clear mental picture of them and helping you apply them in practice. If you’re interested, you can study the implementation details on your own.

Before we delve into the consensus protocols, let’s discuss the three properties of consensus algorithms.

  1. Validity: The value agreed upon by honest nodes must be a value proposed by an honest node.
  2. Agreement: All honest nodes must reach a consensus on the same value.
  3. Termination: Honest nodes must eventually reach a consensus on some value.

You may have noticed that consensus algorithms require “honest” nodes. The concept of “honest” nodes refers to nodes that do not exhibit the “arbitrary failures” or “Byzantine failures” described by the failure models. Since database nodes generally fulfill this assumption, we can assume that all nodes discussed below are honest.

We can rephrase the above properties by using the terms introduced in Lesson 15, “Leader Election: How to Safely Coordinate Operations within a Distributed System.” Validity and agreement determine safety, while termination corresponds to liveness. Let’s review these two characteristics.

  1. Safety: In the event of failures, the consensus system must not generate incorrect results.
  2. Liveness: The system must continue to make progress and generate outputs, meaning it should not be stuck in an intermediate state indefinitely.

Based on these characteristics, let’s start discussing the commonly used consensus algorithms.

Atomic broadcast and ZAB #

Broadcast protocols are a class of protocols that synchronize data from one node to multiple nodes. In Lesson 17, “Reliable Data Dissemination: How to Ensure Reliable Operations in a Database,” I discussed how eventual consistency systems ensure consistent data dissemination through various anti-entropy mechanisms. Specifically, gossip protocols ensure large-scale data synchronization, and under normal circumstances, Gossip disseminates data using a broadcast mode.

However, the broadcast process poses a problem: the coordinating node is an obvious single point of failure, and its reliability is crucial. To ensure its reliability, the first problem to solve is checking the health status of this node. We can discover its health condition using various health monitoring methods.

If the coordinating node fails, the message may be propagated to some nodes but not others, violating “agreement.” How can we solve this problem? A simple approach is the “flooding” mechanism: once a message is broadcast to a node, that node is obliged to rebroadcast it to every node that has not yet received it. Like water flooding a field, the data eventually reaches the entire system.

Of course, this pattern has an obvious drawback: it generates on the order of N^2 messages, where N is the number of nodes that have not yet received the message. Our optimization goal is therefore to reduce the total number of messages.
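
To make the flooding idea concrete, here is a minimal sketch in Go; the Node and Message types are hypothetical and not taken from any real system. A node that sees a message for the first time relays it to all of its peers, which is where the roughly N^2 message count comes from.

```go
package main

import "fmt"

// Message carries an ID so nodes can deduplicate repeated deliveries.
type Message struct {
	ID      int
	Payload string
}

// Node is a hypothetical cluster member that relays messages it has not seen.
type Node struct {
	Name  string
	Peers []*Node
	seen  map[int]bool
}

// Deliver handles an incoming message: if it is new, record it and
// flood it to every peer. In the worst case every node forwards to
// every other node, which yields on the order of N^2 messages.
func (n *Node) Deliver(m Message) {
	if n.seen[m.ID] {
		return // already received, stop the flood here
	}
	n.seen[m.ID] = true
	fmt.Printf("%s accepted message %d\n", n.Name, m.ID)
	for _, p := range n.Peers {
		p.Deliver(m)
	}
}

func main() {
	a := &Node{Name: "a", seen: map[int]bool{}}
	b := &Node{Name: "b", seen: map[int]bool{}}
	c := &Node{Name: "c", seen: map[int]bool{}}
	a.Peers = []*Node{b, c}
	b.Peers = []*Node{a, c}
	c.Peers = []*Node{a, b}

	a.Deliver(Message{ID: 1, Payload: "hello"})
}
```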

Although broadcasting can deliver data reliably, we know from our study of consistency that every node must also receive messages in the same order to achieve strict consistency. We therefore define an atomic broadcast protocol to meet this requirement.

  1. Atomicity: Either all participating nodes receive and deliver the message, or none of them do.
  2. Orderliness: All participating nodes deliver messages in the same order.

A protocol that satisfies both conditions is referred to as an atomic broadcast protocol. Now let me introduce the most common atomic broadcast protocol: ZooKeeper Atomic Broadcast (ZAB).

ZAB #

The ZAB protocol has become very popular thanks to its widespread use in ZooKeeper. It is an atomic broadcast protocol that guarantees messages are delivered in order, and the atomicity of its broadcasts ensures message consistency.

In the ZAB protocol, nodes have two roles.

  1. Leader node: The leader is a temporary role tied to a term, which ensures the liveness of the role. The leader controls the execution of the algorithm, broadcasts messages, and ensures that messages are propagated in order. Both read and write operations go through it, guaranteeing that operations are performed on the latest data. If a client is not connected to the leader, the messages it sends are forwarded to the leader.
  2. Follower node: Its main job is to receive the messages sent by the leader and to monitor the leader’s health.

Since there needs to be a leader node, we need a leader election algorithm. Here, we need to clarify two IDs: data ID and node ID. The former can be seen as the timestamp of the message, and the latter is the priority of the node. The principle of the election is: within the same term, the larger the data ID of a node, the newer the data of that node, and the node with the largest data ID is given priority in the voting. If all nodes have the same data ID, the node with the largest node ID is given priority in the voting. Once a node receives more than half of the votes, it becomes the leader node.
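
As a rough illustration, not the actual ZooKeeper implementation, the comparison rule above can be sketched in Go like this, where DataID stands for the message ID a node holds and NodeID for its priority:

```go
package main

import "fmt"

// Vote is a hypothetical ballot exchanged during leader election.
type Vote struct {
	Term   int64 // election term
	DataID int64 // ID of the newest data the node holds
	NodeID int64 // node priority, used only to break ties
}

// Better reports whether candidate a should win over candidate b,
// following the rule described above: within the same term, the larger
// data ID (newer data) wins, and the larger node ID breaks ties.
func Better(a, b Vote) bool {
	if a.Term != b.Term {
		return a.Term > b.Term
	}
	if a.DataID != b.DataID {
		return a.DataID > b.DataID
	}
	return a.NodeID > b.NodeID
}

func main() {
	x := Vote{Term: 3, DataID: 100, NodeID: 1}
	y := Vote{Term: 3, DataID: 100, NodeID: 2}
	fmt.Println(Better(y, x)) // true: same term and data ID, larger node ID wins
}
```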

Once the leader node is elected, it needs to do two things.

  1. Declare the term: The leader notifies all followers of the latest term, and the followers acknowledge it, bringing every node onto the same term. After this step, followers no longer accept messages from older terms.
  2. Synchronize the state: This step is crucial. The leader first announces its leadership to all followers, so they stop trying to elect themselves as leader. It then synchronizes the message history across the cluster, ensuring the latest messages are replicated to every node. This is necessary because the newly elected leader may not itself hold the most recently accepted data.

After these initialization steps, the leader can receive messages normally, order them, and broadcast them. When broadcasting, a quorum (a majority of the nodes in the cluster) must acknowledge a message before it is considered correctly broadcast. To preserve ordering, the previous message must be fully broadcast before the next one is sent.
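
Here is a minimal sketch of that quorum check in Go; the cluster size and acknowledgement counts are hypothetical. A message counts as broadcast only once more than half of the nodes have acknowledged it, and the leader moves on to the next message only then.

```go
package main

import "fmt"

// quorumReached reports whether a majority of clusterSize nodes
// has acknowledged the current message.
func quorumReached(acks, clusterSize int) bool {
	return acks > clusterSize/2
}

func main() {
	const clusterSize = 5
	// The leader broadcasts messages strictly in order: message i+1
	// is only sent after message i has been acknowledged by a quorum.
	pendingAcks := []int{5, 3, 2} // hypothetical ack counts per message
	for i, acks := range pendingAcks {
		if !quorumReached(acks, clusterSize) {
			fmt.Printf("message %d lacks a quorum (%d/%d); stop here\n", i, acks, clusterSize)
			break
		}
		fmt.Printf("message %d committed with %d acks\n", i, acks)
	}
}
```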

The leader and followers use a heartbeat mechanism to check each other’s health. If the leader realizes it has lost contact with a quorum of nodes, for example due to a network partition, it voluntarily steps down and a new round of election starts. Similarly, when a follower observes a significant delay from the leader, it also triggers a new election. One advantage of ZAB’s election is that if the leader remains healthy, it will be re-elected even after the current term expires, so no leader switch is triggered; this keeps the algorithm stable. In addition, node recovery is relatively efficient: by comparing the message IDs of the nodes and finding the largest one, the latest data can be recovered from that node. Finally, its message broadcast can be understood as a two-phase commit without the voting phase, since only two rounds of messages are needed to broadcast a message.

So what is the relationship between the atomic broadcast protocol and the consensus algorithms introduced in this lecture? I will leave this question open for now, first introduce the classic consensus algorithm Paxos, and then come back to explain the relationship between them.

Paxos #

The Paxos algorithm is a coordination algorithm that lets all nodes in a cluster reach consensus on a value submitted by a client to any node in the cluster. The value carries a version number, which ensures that messages are ordered consistently across all nodes in the cluster.

The basic Paxos algorithm is very simple, and it consists of three roles.

  1. Proposer: There can be multiple proposers, and each proposer proposes a proposal (value). The value can be any action, such as “set the value of a certain variable to value”. Different proposers can propose different values, but for the same round of Paxos process, only one value can be approved.
  2. Acceptor: There are N acceptors, and the value proposed by a proposer must be approved by a quorum of acceptors to pass. The acceptors are completely equal and independent of each other.
  3. Learner: As mentioned above, once a quorum of acceptors approves a value, the learner’s job is to learn that chosen value and synchronize it to the acceptors that have not yet determined it.

These three roles actually describe the whole process of a value being committed. However, the basic Paxos is just a theoretical model, because in real scenarios, we need to handle many consecutive values, and these values are concurrent. If we fully execute the process described above, the performance cost would be unbearable for any production system. Therefore, what we generally use is Multi-Paxos.
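
Before moving on to Multi-Paxos, here is a minimal single-process sketch in Go of the acceptor side of basic Paxos; the field and method names are my own, not from any library. An acceptor promises to ignore proposals numbered lower than the highest one it has seen, and reports any value it has already accepted so that the proposer must adopt it.

```go
package main

import "fmt"

// Acceptor holds the durable state of a basic Paxos acceptor.
type Acceptor struct {
	promised  int    // highest proposal number this acceptor has promised
	acceptedN int    // proposal number of the accepted value (0 = none)
	acceptedV string // the accepted value itself
}

// Prepare handles phase 1: for a proposal numbered n, the acceptor
// promises to ignore lower-numbered proposals and returns any value it
// has already accepted, which the proposer is then obliged to reuse.
func (a *Acceptor) Prepare(n int) (ok bool, prevN int, prevV string) {
	if n <= a.promised {
		return false, 0, ""
	}
	a.promised = n
	return true, a.acceptedN, a.acceptedV
}

// Accept handles phase 2: the acceptor accepts (n, v) unless it has
// already promised a higher-numbered proposal.
func (a *Acceptor) Accept(n int, v string) bool {
	if n < a.promised {
		return false
	}
	a.promised = n
	a.acceptedN = n
	a.acceptedV = v
	return true
}

func main() {
	acc := &Acceptor{}
	ok, _, _ := acc.Prepare(1)
	fmt.Println("prepare(1):", ok)               // true: nothing promised yet
	fmt.Println("accept(1,x):", acc.Accept(1, "x"))
	ok, prevN, prevV := acc.Prepare(2)
	fmt.Println("prepare(2):", ok, prevN, prevV) // true, but value "x" must be reused
}
```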

Multi-Paxos allows multiple Paxos instances to run concurrently. Its key optimization is to merge the Propose (Prepare) phase across proposals, which is achieved by introducing a leader role, also known as the leader node. The leader handles all reads and writes. As in ZAB, the leader has a term, and heartbeats are used for mutual liveness detection between the leader and the other nodes. Do you see the resemblance? Below, I will compare the similarities and differences between the two.

In addition, Multi-Paxos introduces two important concepts: replicated log and state snapshot.

  1. Replicated log: After a value is committed, it is written into a log. This log structure not only provides persistent storage, but more importantly, it ensures the order of the messages. The goal of the Paxos algorithm is to ensure the strong consistency of the log content for each node.
  2. State snapshot: Since the log stores every value, it grows ever larger over time. The algorithm therefore implements a state snapshot, which captures the latest state produced by the log. Once a snapshot is generated, the log entries before it can be safely deleted (see the sketch after this list).
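
Here is a rough sketch of the two structures in Go, assuming a simple in-memory representation rather than any particular database’s format: committed values are appended to the log in order, and a snapshot lets us drop everything before it.

```go
package main

import "fmt"

// Entry is one committed value in the replicated log.
type Entry struct {
	Index int    // position in the log; defines the global order
	Value string // the committed value itself
}

// Log is a replicated log with snapshot-based compaction.
type Log struct {
	snapshotIndex int               // everything up to this index is folded into the snapshot
	snapshotState map[string]string // hypothetical materialized state at snapshotIndex
	entries       []Entry           // entries after the snapshot
}

// Append adds a committed entry at the next index.
func (l *Log) Append(value string) {
	next := l.snapshotIndex + len(l.entries) + 1
	l.entries = append(l.entries, Entry{Index: next, Value: value})
}

// Compact folds all current entries into the snapshot and drops them,
// so the log does not grow without bound.
func (l *Log) Compact(state map[string]string) {
	if len(l.entries) == 0 {
		return
	}
	l.snapshotIndex = l.entries[len(l.entries)-1].Index
	l.snapshotState = state
	l.entries = nil
}

func main() {
	l := &Log{}
	l.Append("set x = 1")
	l.Append("set x = 2")
	l.Compact(map[string]string{"x": "2"})
	l.Append("set y = 7")
	fmt.Println("snapshot at index", l.snapshotIndex, "remaining entries:", l.entries)
}
```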

If you are familiar with Raft, you may notice that this structure already looks very similar to it. After discussing the relationship between atomic broadcast and consensus, we will introduce Raft.

Atomic Broadcast and Consensus #

As I mentioned at the beginning, this lecture is not about the details of the algorithms but about why they are the way they are today. From the brief introductions above, we have already found that ZAB is actually very similar to Multi-Paxos: both require a majority of nodes to “agree” on a value, and both have a temporary leader node. The more we look, the more similar they seem, yet in essence they are different.

In simple terms, ZAB originated from the primary-backup replication scenario, the replication technology we introduced before, while a consensus algorithm implements state machine replication. A state machine replication system is one in which every node in the cluster is a state machine: if a group of clients concurrently submits different values to different state machines, the system ensures that every state machine executes the client requests in the same order. Once a request is committed, its order is fixed; before it is committed, however, the order is chosen by the leader and can be arbitrary. If a new leader is elected, it may reorder the uncommitted values arbitrarily.
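
To make “state machine replication” concrete, here is a minimal sketch in Go with hypothetical types: every replica applies the same committed commands in the same index order, so all replicas converge to the same state.

```go
package main

import "fmt"

// StateMachine is a trivial replicated state machine: a key-value map.
type StateMachine struct {
	applied int // index of the last applied command
	state   map[string]string
}

// Apply executes committed commands strictly in index order; applying
// the same sequence on every replica yields identical state everywhere.
func (sm *StateMachine) Apply(index int, key, value string) {
	if index != sm.applied+1 {
		return // out-of-order or duplicate command: ignore
	}
	sm.state[key] = value
	sm.applied = index
}

func main() {
	sm := &StateMachine{state: map[string]string{}}
	sm.Apply(1, "x", "1")
	sm.Apply(2, "x", "2")
	fmt.Println(sm.state) // map[x:2] on every replica that applies the same log
}
```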

On the other hand, ZAB, which originates from primary-backup replication, emphasizes that the order of messages is determined by the leader and strictly followed by the followers without further coordination. The most important difference is that after a leader re-election, the new leader still broadcasts data in the order established by the original leader, without any reordering.

Therefore, it can be said that ZAB achieves linearizability, whereas Multi-Paxos, because it allows concurrent writes, achieves only a form of sequential consistency: the order is fixed only when the data is committed, rather than being allocated in advance by the leader so that it matches the order of submission. For more information on linearizability and sequential consistency, please refer to “05 | Consistency and CAP Theorem: Why Do We Need Distributed Consistency?”.

Consensus algorithms such as Paxos introduce a leader for efficiency. Under normal operation, the differences between the two are not significant; the main difference shows up in the leader election process.

After learning about ZAB and Multi-Paxos, I will now introduce the main topic of this session, the Raft algorithm, which is currently the most important algorithm in the field of distributed databases.

Highlights of Raft #

Raft can be seen as an improved version of Multi-Paxos: the author of Raft gave a presentation comparing Raft with Multi-Paxos during his time at Stanford University, so we can regard them as closely related algorithms.

The Raft algorithm can be considered the most successful distributed consensus algorithm at present, and it is used in technologies such as TiDB, FaunaDB, and Redis. The reason is that Multi-Paxos does not specify implementation details. While this leaves developers room for creativity, consensus algorithms occupy such a critical position that any flaw can have catastrophic consequences for the system. Raft, in contrast, provides a wealth of implementation details and has two advantages over Multi-Paxos.

  1. Log writes are sequential: Raft must append to its log in a strictly continuous order, while Multi-Paxos allows concurrent, out-of-order modifications to the log, which is the “Multi” in its name.
  2. Only a node whose log is the most up-to-date and complete can be elected leader, the same principle as in ZAB, whereas in Multi-Paxos the elected leader’s log may be arbitrary (see the sketch after this list). Raft can therefore be seen as a simplified version of Multi-Paxos, and it is precisely this simplification that has made Raft popular.
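
The election restriction in the second point can be sketched as follows; this mirrors the comparison described in the Raft paper, with hypothetical type names. A voter grants its vote only to a candidate whose log is at least as up-to-date as its own.

```go
package main

import "fmt"

// LogPosition identifies the last entry in a node's log.
type LogPosition struct {
	LastTerm  int64
	LastIndex int64
}

// candidateUpToDate reports whether a candidate's log is at least as
// up-to-date as the voter's: compare the term of the last entry first,
// then the log length. Because votes are only granted when this holds,
// the elected leader is guaranteed to hold all committed entries.
func candidateUpToDate(candidate, voter LogPosition) bool {
	if candidate.LastTerm != voter.LastTerm {
		return candidate.LastTerm > voter.LastTerm
	}
	return candidate.LastIndex >= voter.LastIndex
}

func main() {
	voter := LogPosition{LastTerm: 4, LastIndex: 10}
	fmt.Println(candidateUpToDate(LogPosition{LastTerm: 5, LastIndex: 3}, voter)) // true: newer term wins
	fmt.Println(candidateUpToDate(LogPosition{LastTerm: 4, LastIndex: 9}, voter)) // false: same term, shorter log
}
```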

The randomness in Multi-Paxos means that no single node is guaranteed to hold the complete, up-to-date log, so its recovery process is extremely complex and requires synchronizing history among the nodes, whereas Raft can easily find the most up-to-date node, which speeds up recovery. On the other hand, out-of-order commits and non-continuous logs greatly improve write concurrency and therefore throughput, so these two features are not simply disadvantages but the result of weighing trade-offs. TiKV, for example, runs Raft in a multi-Raft-group mode, which works around the limited concurrency of a single Raft group and can be seen as borrowing from Multi-Paxos.

Both Raft and Multi-Paxos use term-based leader election. The advantage is high performance, but the drawback is that service is disrupted during leader changes, reducing availability. That is why we generally treat consensus services as CP services in terms of the CAP theorem. Some teams, however, choose basic Paxos to improve availability; WeChat’s PaxosStore, for example, runs a separate Paxos round for each request.

These two improvements make Raft much easier to implement correctly, and it is fair to say that almost all of the latest distributed databases use this algorithm. If you want to learn more about the algorithm’s details, please refer to https://raft.github.io/. There you can not only study the details of the algorithm but, more importantly, browse many complete implementations; reading the code alongside the material will leave a much deeper impression.

Summary #

Consensus algorithms are a relatively large topic. This session focuses on three common consensus algorithms, highlighting their core features. By comparing their similarities and differences, I hope to deepen your understanding of their characteristics.

Consensus algorithms are also core components of modern distributed databases. Fortunately, their APIs are relatively easy to understand, and there are already mature implementations available. Therefore, I believe that the details of the algorithms are not the focus of this session. Understanding why they are the way they are can help us understand the basis for choosing databases.

With that, we have covered all the key points of this module. In the next session, I will review the content of this module with you and demonstrate the relationship between typical distributed database features and the knowledge we have learned through several case studies. See you then.