05 Common Knowledge of Raft How to Ensure Data Consistency Across Multiple Data Centers #

Hello, I am Xu Changlong.

In the previous lesson, we talked about how to achieve database synchronization between dual-active data centers using Otter, but this approach cannot guarantee strong transactional consistency between the dual-active data centers.

If data center A changes a piece of data while data center B modifies the same data at the same time, Otter falls back to merge logic to reconcile the conflicting rows or fields. To avoid this kind of conflict, in the previous lesson we imposed a requirement on the client side: within a given period of time, a user's client may access only one data center.

However, if the business demands both transactions and strong consistency, for example never allowing inventory to be oversold, we usually have only two choices: one is to keep the service local to a single data center, which does not suit every business; the other is to span multiple data centers and rely on a distributed consensus algorithm to keep the multiple replicas consistent.

In the industry, the best-known distributed consensus algorithm is Paxos, but its principles are highly abstract, and after repeated modifications in practice an implementation can drift far from the original design, leaving many people unsure whether their changes are still sound. On top of that, it often takes one to two years of hands-on experience to truly master the algorithm.

With the increasing demand for distributed multi-replica synchronization, the excessively abstract Paxos can no longer meet market needs. Therefore, the Raft algorithm was born.

Compared with Paxos, Raft is not only easier to understand, it also guarantees the order of data operations, so it is widely used in distributed data services. Well-known foundational components such as etcd, and Kafka in its KRaft mode, are built on the Raft algorithm.

In today’s lesson, we will explore the implementation principles of Raft. It can be said that understanding Raft is equivalent to understanding half of the world of distributed strongly consistent data services. Almost all elections, data updates, and synchronizations across multiple data service nodes are implemented in a similar way, with adjustments made for different scenarios and applications.

How to Elect a Leader? #

To help you quickly understand how Raft works, I will walk through it using the official Raft example.

Image

As shown in the figure, we start five Raft distributed data services: S1, S2, S3, S4, S5. Each node has the following three states:

  • Leader: Responsible for data modification and actively synchronizes the modified changes to the Followers.
  • Follower: Receives the modified data pushed by the Leader.
  • Candidate: If there is no Leader in the cluster, it enters the election mode.

If a Follower in the cluster does not receive the Leader's heartbeat within the specified time, it assumes the Leader has failed and that the cluster can no longer update data. The Follower then switches to election mode, and a new Leader is elected from among the Followers. This guarantees that the group of services always has a Leader, so data modification always has a single decision-making point.

So how is the Leader elected? After entering election mode, each of the five services waits for a random period of time. When its wait expires, the service increments the current term by 1 (term: 4 in the figure means the fourth term), votes for itself first, and then sends RequestVote RPCs (vote requests) to the other services to campaign.
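To make this flow more concrete, here is a minimal Go sketch of the candidate side. All names (`node`, `startElection`, `requestVote`) are assumptions made for illustration, not the API of etcd or any other real Raft library:

```go
package raft

import (
	"math/rand"
	"time"
)

// node is an illustrative Raft peer; field names are assumptions for this sketch.
type node struct {
	id          int
	currentTerm int
	votedFor    int   // -1 means "has not voted in the current term"
	peers       []int // ids of the other services (e.g. S2..S5)
}

// electionTimeout is randomized so peers rarely time out at the same moment,
// which reduces the chance of a split vote.
func electionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(150)) * time.Millisecond
}

// startElection: after the election timer fires, the Follower becomes a
// Candidate, increments its term, votes for itself, then asks every peer for a vote.
func (n *node) startElection(requestVote func(peer, term, candidateID int) bool) bool {
	n.currentTerm++   // e.g. term 3 -> term 4
	n.votedFor = n.id // vote for itself first
	votes := 1        // count its own vote

	for _, p := range n.peers {
		if requestVote(p, n.currentTerm, n.id) {
			votes++
		}
	}
	// a majority means more than half of all nodes (itself plus peers)
	return votes*2 > len(n.peers)+1
}
```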

Image

If a service receives a vote request, and the candidate (the service that sent the request) has a term and log progress that are ahead of, or equal to, its own, it votes for that candidate and updates its own term to the newer one. A service that has already granted its vote does not start a campaign of its own in that term; it simply waits for requests from other services.
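The voter side can be sketched just as briefly. The rule below (at most one vote per term, and only for a candidate whose log is at least as up to date) follows the behaviour described above; the type and field names are assumptions for illustration:

```go
package raft

// voteRequest is what a Candidate sends when campaigning; field names loosely
// follow the Raft paper and are otherwise assumptions of this sketch.
type voteRequest struct {
	term         int
	candidateID  int
	lastLogIndex int
	lastLogTerm  int
}

type voter struct {
	currentTerm  int
	votedFor     int // -1: has not voted in currentTerm
	lastLogIndex int
	lastLogTerm  int
}

// handleRequestVote grants at most one vote per term, and only to a candidate
// whose log is at least as up to date as the voter's own.
func (v *voter) handleRequestVote(req voteRequest) bool {
	if req.term < v.currentTerm {
		return false // stale candidate
	}
	if req.term > v.currentTerm {
		v.currentTerm = req.term // adopt the newer term
		v.votedFor = -1
	}
	// "up to date": a higher last log term wins; with equal terms, the longer log wins
	upToDate := req.lastLogTerm > v.lastLogTerm ||
		(req.lastLogTerm == v.lastLogTerm && req.lastLogIndex >= v.lastLogIndex)

	if upToDate && (v.votedFor == -1 || v.votedFor == req.candidateID) {
		v.votedFor = req.candidateID
		return true
	}
	return false
}
```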

Note that each service votes only once in a given term. If no service obtains a majority of votes (votes from more than half of the service nodes), the candidates wait out the current election timeout, increase the term by 1, and hold another election. Eventually, the service whose countdown finishes first and which gathers a majority of votes becomes the Leader.

The elected Leader broadcasts a notification to the other services and synchronizes the new term and its log progress to them. During its term, the new Leader also sends periodic heartbeats so that the Followers do not time out and fall back into election mode. If, during the election, a service receives a heartbeat from a previous (stale-term) Leader, it rejects it (as shown for S1 in the figure).
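A hedged sketch of how a Follower treats heartbeats in this situation might look as follows; the `heartbeat` message and its fields are illustrative only:

```go
package raft

import "time"

// heartbeat is an illustrative message a Leader sends periodically during its
// term; only the fields needed for this sketch are included.
type heartbeat struct {
	term     int
	leaderID int
}

type followerState struct {
	currentTerm   int
	electionTimer *time.Timer // assumed to be armed elsewhere in this sketch
}

// onHeartbeat: a heartbeat from the current (or a newer) term resets the
// election timer, so the Follower stays a Follower; a heartbeat carrying an
// older term comes from a stale Leader and is rejected (the situation shown
// for S1 in the figure).
func (f *followerState) onHeartbeat(hb heartbeat, timeout time.Duration) bool {
	if hb.term < f.currentTerm {
		return false // stale Leader, reject
	}
	f.currentTerm = hb.term
	f.electionTimer.Reset(timeout) // stay in Follower mode
	return true
}
```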

Image

After the election is over, all services enter the data synchronization state.

How to ensure consistency when writing multiple replicas? #

During data synchronization, a Follower keeps its log fully consistent with the Leader's. As you can see, Raft also uses a master-slave style of synchronization; the difference is that the Leader is not a fixed service but an elected one.

In this way, when an individual node fails, it does not affect the overall service. However, this mechanism also has a drawback: if the Leader becomes disconnected, the overall service will be busy with elections for a period of time and unable to provide data services.

Generally speaking, a client's data modification requests are sent to the Leader node (S1 in the figure below) for unified decision-making; if a request reaches a Follower, the Follower redirects it to the Leader. So how does Raft achieve strong consistency among the replicas of the same partition?

Image

Specifically, when the Leader handles a modification, it first records the change as a log entry (pre-committed, not yet visible), and then sends a synchronization message for this log entry to all Followers. As soon as a majority of Followers report that they have synchronized the entry, the Leader commits the pre-committed log entry and returns "modification successful" to the client.
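A simplified, synchronous Go sketch of this write path is shown below. It assumes illustrative names (`entry`, `leader`, `replicate`) and hides all RPC details; it only demonstrates the "append, push to all, commit on majority" decision:

```go
package raft

// entry is one write recorded in the Leader's log before it becomes visible.
type entry struct {
	term    int
	index   int
	command []byte
}

// leader is an illustrative stand-in for the Leader node; pushToFollower hides
// all RPC details and simply reports whether a Follower stored the entry.
type leader struct {
	log            []entry
	pushToFollower []func(entry) bool
}

// replicate walks through the write path described above: append the entry
// locally (pre-committed), push it to every Follower, and commit only after a
// majority of all nodes (the Leader included) has acknowledged it.
func (l *leader) replicate(e entry) bool {
	l.log = append(l.log, e) // pre-committed: stored, but not yet served to clients

	acks := 1 // the Leader's own copy counts toward the majority
	for _, push := range l.pushToFollower {
		if push(e) {
			acks++
		}
	}
	total := len(l.pushToFollower) + 1
	// committed only when a majority has the entry; only then does the
	// client receive "modification successful"
	return acks*2 > total
}
```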

Then, in the leaderCommit field of the next heartbeat message, the Leader tells every Follower the index of the latest committed log entry (the commit progress), and each Follower serves external reads only up to that index. Data that the Leader has not yet committed is never exposed externally.
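The Follower side of this can be sketched as follows, again with assumed names; the point is simply that only entries at or below the announced commit index are served to clients:

```go
package raft

// followerProgress sketches how a Follower decides which data it may expose:
// only entries at or below commitIndex are served; field names are assumptions.
type followerProgress struct {
	lastIndex   int // index of the last entry stored locally
	commitIndex int // latest index the Leader has announced as committed
}

// onHeartbeatCommit is called when a heartbeat carrying the Leader's
// leaderCommit value arrives: the Follower advances its own commitIndex,
// but never beyond what it has actually stored.
func (f *followerProgress) onHeartbeatCommit(leaderCommit int) {
	if leaderCommit > f.commitIndex {
		f.commitIndex = leaderCommit
		if f.commitIndex > f.lastIndex {
			f.commitIndex = f.lastIndex
		}
	}
}

// canServe reports whether the entry at the given index is committed and may
// be returned to a client.
func (f *followerProgress) canServe(index int) bool {
	return index <= f.commitIndex
}
```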

If there are other data modification requests sent to the Leader during the data synchronization period, these requests will be queued because the Leader is then blocked and waiting for responses from other nodes.

Image

However, this design of blocking and waiting also makes the Raft algorithm highly dependent on network performance, because every modification requires concurrent requests to multiple nodes, waiting for the majority of nodes to successfully synchronize.

In the worst case, the response round-trip time (RTT) is determined by the slowest of the nodes the Leader has to wait for (in a "two sites, three centers" deployment, one round of synchronization takes roughly 100 ms). Add to this that a Raft group has only a single master node, and the performance of one Raft service group is clearly limited. To work around this, we can shard the data so that each group holds less of it, which improves the modification performance of the cluster as a whole.
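One common way to apply this advice is to route keys to several independent Raft groups (shards), so that no single Leader has to serialize every write. The sketch below only illustrates that idea; the hashing scheme, `raftGroup`, and `shardRouter` are assumptions of this sketch, not part of Raft itself:

```go
package raft

import "hash/fnv"

// raftGroup stands for one independent Raft replica group; in practice each
// group has its own Leader, Followers, and log.
type raftGroup struct {
	id int
}

// propose hands a write to this group's Leader (details omitted in the sketch).
func (g *raftGroup) propose(key, value string) error {
	return nil
}

// shardRouter spreads keys across several Raft groups so that writes are not
// all funneled through a single Leader.
type shardRouter struct {
	groups []*raftGroup
}

func (r *shardRouter) groupFor(key string) *raftGroup {
	h := fnv.New32a()
	h.Write([]byte(key))
	return r.groups[int(h.Sum32())%len(r.groups)]
}

func (r *shardRouter) write(key, value string) error {
	return r.groupFor(key).propose(key, value)
}
```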

Please note that when the log progress of a majority of Followers lags too far behind the Leader, new modification requests have to wait until more than half of the Followers have caught up with the Leader's progress before success can be returned. Of course, this situation is relatively rare.

How do services synchronize log progress? #

As we can see, logs play an important role in the data synchronization mechanism of Raft. When synchronizing data, Raft uses an ordered instruction log called Write Ahead Log (WAL), similar to MySQL’s binlog. This log records the instructions for each data modification, along with the modification term, and is marked by a Log Index to indicate the current progress of synchronization.
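A log entry of this kind can be modeled roughly as below; the field names are illustrative, not taken from etcd or any other implementation:

```go
package raft

// logEntry mirrors the structure described above: every modification is an
// ordered instruction tagged with the term in which it was written and a
// monotonically increasing index that marks synchronization progress.
type logEntry struct {
	Term    int    // term of the Leader that produced this entry
	Index   int    // position in the log: 1, 2, 3, ...
	Command []byte // the actual modification instruction (like a binlog event)
}

// appendCommand is how a Leader extends its write-ahead log: the new entry's
// index always follows the previous one, so the log stays strictly ordered.
func appendCommand(log []logEntry, term int, cmd []byte) []logEntry {
	next := 1
	if n := len(log); n > 0 {
		next = log[n-1].Index + 1
	}
	return append(log, logEntry{Term: term, Index: next, Command: cmd})
}
```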

Image

The Leader's log is never deleted, every Follower keeps itself fully consistent with the Leader, and any divergence is forcibly overwritten. In addition, each log entry goes through two stages, "written" and "committed". During an election, services compare log progress, including entries that have not yet been committed, and vote only for candidates whose logs are at least as up to date as their own, which ensures that the elected Leader holds the latest and most complete data.

During its term, the Leader keeps sending synchronization requests to every node, which effectively pushes the log to each node in order. If the Leader's synchronization progress is ahead of a Follower's, so that the Follower is missing the entries just before the ones being pushed, the Follower rejects the synchronization request.

Upon receiving a rejection, the Leader works backwards from the newest entries to the oldest, identifying where the logs are missing or diverge, and then overwrites the Follower's log from that point to bring it back in sync.
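On the Leader side this backtracking is often tracked with a per-Follower "next index to send", as in the following sketch (the names are assumptions, loosely borrowed from the Raft paper):

```go
package raft

// leaderView tracks, per Follower, the index of the next log entry the Leader
// will try to push.
type leaderView struct {
	nextIndex map[string]int // follower id -> next index to send
}

// onAppendRejected is called when a Follower rejects a push because its log
// does not match. The Leader steps back one entry and retries from an older
// position, repeating until it reaches a point where both logs agree; from
// there it re-sends (and thereby overwrites) everything that follows.
func (l *leaderView) onAppendRejected(follower string) {
	if l.nextIndex[follower] > 1 {
		l.nextIndex[follower]--
	}
}
```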

Image

The synchronization progress of the leader and the follower is confirmed by the log index. The leader has absolute decision-making power over the content and order of the logs. When it discovers differences between its own logs and the follower’s logs, in order to ensure that the data of multiple replicas are completely consistent, it will forcibly overwrite the follower’s logs.

So how does the Leader detect whether a Follower's log differs from its own? In fact, when the Leader pushes log entries to a Follower, it also attaches the term and index of the entry immediately preceding the ones being pushed, and these are compared against the Follower's current progress.

The comparison has two parts: on one hand, the index and term of each entry being pushed are compared with what the Follower currently holds at those positions; on the other hand, the index and term of the entry just before the pushed batch are compared with the Follower's current progress.

If either comparison fails, the Leader concludes that the Follower's log is inconsistent with its own. It then compares entry by entry in reverse order until it finds an index where both content and term match exactly, and overwrites the Follower's log forward from that index. Meanwhile, during log synchronization, the Leader only directly commits entries from its own term; entries from earlier terms are brought up to date as a side effect of this backward comparison and overwrite.
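Putting the comparison rules together, a simplified Follower-side check might look like this. It deliberately overwrites the whole suffix once the preceding entry matches, which is a simplification of the behaviour described above; all type and field names are assumptions of the sketch:

```go
package raft

// walEntry is a minimal log record for this sketch (term + index only).
type walEntry struct {
	Term  int
	Index int
}

// appendArgs carries what the Leader attaches to each push: the entries to be
// stored plus the term and index of the entry immediately preceding them
// (prevLogIndex / prevLogTerm), so the Follower can detect divergence.
type appendArgs struct {
	prevLogIndex int
	prevLogTerm  int
	entries      []walEntry
}

// followerStore holds the Follower's local log; log[0] has Index 1 here.
type followerStore struct {
	log []walEntry
}

// handleAppend returns false when the Follower's log has no entry matching
// prevLogIndex/prevLogTerm (the Leader will then back up and retry from an
// older index). Otherwise it truncates any conflicting suffix and overwrites
// it with the Leader's entries, so both logs end up identical.
func (f *followerStore) handleAppend(a appendArgs) bool {
	if a.prevLogIndex > 0 {
		if len(f.log) < a.prevLogIndex || f.log[a.prevLogIndex-1].Term != a.prevLogTerm {
			return false // progress or term differs: reject and let the Leader back up
		}
	}
	// simplified overwrite: keep the agreed prefix, take the Leader's version of the rest
	f.log = append(f.log[:a.prevLogIndex], a.entries...)
	return true
}
```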

As you may have noticed, pushing entries one by one like this is fairly slow, which makes Raft unfriendly to newly started services. The Leader therefore takes periodic snapshots that compact the earlier log entries into a single state image, keeping the log from growing without bound. A Follower that has fallen far behind restores its data from the Leader's latest snapshot and then catches up from the index recorded in that snapshot.
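A snapshot can be thought of as a compacted prefix of the log plus the index and term it covers. The sketch below shows only the bookkeeping; how the state image itself is produced is omitted, and the names are assumptions:

```go
package raft

// snapshot is an illustrative compaction of the log: everything up to
// lastIncludedIndex is merged into one state image, so a Follower that is far
// behind can load the state directly instead of replaying every old entry.
type snapshot struct {
	lastIncludedIndex int    // log progress the snapshot already covers
	lastIncludedTerm  int    // term of the entry at lastIncludedIndex
	state             []byte // merged result of all modifications up to that index
}

// compact drops the prefix of the log that the snapshot already covers and
// keeps only the tail that still has to be replicated entry by entry.
// indexes holds the Index of each retained log entry, in order.
func compact(indexes []int, s snapshot) []int {
	keep := indexes[:0:0]
	for _, idx := range indexes {
		if idx > s.lastIncludedIndex {
			keep = append(keep, idx)
		}
	}
	return keep
}
```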

How to ensure strong consistency when reading data? #

From the previous explanation, we learned how Leader and Follower synchronize data. But from the perspective of the Follower, how does it ensure that the data it provides externally is the latest?

Here is a little trick: when a Follower receives a query request, it also asks the Leader for the index of the latest committed log entry. If that index is greater than the Follower's own synchronization progress, the Follower's local data is not up to date; in that case the Follower fetches the latest data from the Leader and returns it to the client. As you can see, guaranteeing strong consistency comes at a high cost.
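The trick can be expressed as a small helper. Everything here (the callbacks and their signatures) is an assumption made for the sketch; it merely shows the "compare commit index, fall back to the Leader when behind" decision:

```go
package raft

import "errors"

// readOnFollower sketches the read path described above: before answering a
// query, the Follower asks the Leader for the latest committed index. If the
// Follower has not caught up to that index, it fetches the value from the
// Leader instead of serving possibly stale local data.
func readOnFollower(
	key string,
	localCommitIndex int,
	askLeaderCommitIndex func() (int, error),
	readLocal func(string) string,
	readFromLeader func(string) (string, error),
) (string, error) {
	leaderCommit, err := askLeaderCommitIndex()
	if err != nil {
		return "", errors.New("leader unreachable, cannot guarantee a strongly consistent read")
	}
	if leaderCommit > localCommitIndex {
		// local data is behind: fall back to the Leader for the latest value
		return readFromLeader(key)
	}
	return readLocal(key), nil
}
```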

You may wonder how to guarantee strongly consistent reads in business code. In fact, Raft's mechanism of waiting for the Leader to commit the log entry already ensures this: we simply submit data modifications to the Leader as usual, and when a Follower serves a read it returns the latest data.

Summary #

Many people call Raft a distributed consistency algorithm, but strictly speaking it is a consensus algorithm: it makes multiple nodes reach agreement on the same decisions. Through terms, randomized timeouts, and voting, it gives a service group high availability and allows its membership to change dynamically.

Thanks to Raft's strictly ordered log synchronization, multiple replicas of the data stay strongly consistent. If we implement the data storage layer with the Raft algorithm, then storing and operating on data (insert, delete, update) is strongly consistent across data centers. The business layer no longer needs to worry about consistency and can operate on the data directly, easily achieving strongly consistent synchronization across multiple data centers.

However, because this style of synchronization is costly and adds latency, it is best used in scenarios where the data volume and modification rate are relatively small. The industry also offers many libraries tuned for different scenarios, such as ParallelRaft, Multi-Paxos, and SOFAJRaft; see the list of open-source implementations at the bottom of the Raft documentation for more options.

Image

Reflection Questions #

Finally, please take some time to reflect on why adding or removing members in a Raft cluster requires special procedures.

Feel free to share your thoughts and discuss with me in the comments section. See you in the next lesson!