33 A Strange Case of Data Loss #

During the use of a master-slave cluster, I once ran into an issue: the cluster consisted of 1 master, 5 slaves, and 3 sentinel instances. During operation, we found that some of the data sent by clients was lost, which directly affected data reliability at the business layer.

Through a series of troubleshooting steps, we learned that this problem was caused by a split-brain issue in the master-slave cluster.

In a split-brain scenario, a master-slave cluster contains two masters at the same time, and both can receive write requests. The most direct impact of split-brain is that clients do not know which master to write to, so different clients end up writing data to different masters. In severe cases, split-brain can go further and cause data loss.

So why does split-brain occur in a master-slave cluster? Why does it lead to data loss? And how can we avoid it? In this lesson, starting from the real problem I encountered, I will analyze and pinpoint the issue with you, so that you understand the causes, consequences, and countermeasures of split-brain.

Why does split-brain occur? #

As I mentioned above, the initial symptom was that data sent by clients was lost in the master-slave cluster. So first we had to understand why the data was lost: was there a problem with data synchronization?

Step 1: Confirm if there is a problem with data synchronization #

The most common cause of data loss in a master-slave cluster is that the master’s data has not yet been synchronized to the slave when the master fails; the slave is then promoted to be the new master, and the unsynchronized data is lost.

As shown in the diagram below, the newly written data a:1 and b:3 on the master were lost because they had not been synchronized to the slave before the master failed.

If data loss happens this way, we can detect it by comparing the replication progress of the master and the slave, that is, the difference between master_repl_offset and slave_repl_offset. If the slave’s slave_repl_offset is less than the original master’s master_repl_offset, we can conclude that the data loss was caused by incomplete data synchronization.
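In our case we compared the values that our monitoring had recorded before the switch, but for reference, the sketch below shows where the two offsets come from, using the redis-py client with hypothetical addresses.

```python
# Minimal sketch (hypothetical addresses): read the replication offsets
# that the master and a slave report in the INFO replication section.
import redis

master = redis.Redis(host="old-master.example", port=6379)
slave = redis.Redis(host="slave-1.example", port=6379)

master_offset = master.info("replication")["master_repl_offset"]
slave_offset = slave.info("replication")["slave_repl_offset"]

if slave_offset < master_offset:
    # The slave has not caught up: a failover now would lose this many
    # bytes of the replication stream.
    print(f"replication gap: {master_offset - slave_offset} bytes")
else:
    print("offsets are consistent; incomplete synchronization is not the cause")
```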

When deploying the master-slave cluster, we also monitored master_repl_offset on the master and slave_repl_offset on the slaves. After discovering the data loss, however, we compared the slave_repl_offset that the slave had reported before being promoted to the new master with the original master’s master_repl_offset, and they were consistent. In other words, the slave that was promoted to be the new master was already consistent with the original master’s data at the time of the switch. So why was the client’s data still lost?

At this point, our first hypothesis was ruled out. We then realized that all data operations are sent from the client to the Redis instances, so perhaps the client’s operation logs could tell us something. We immediately turned our attention to the client.

Step 2: Investigate the client’s operation logs and discover split-brain #

While investigating the client’s operation logs, we found that for a period of time after the master-slave switch, one client was still communicating with the original master and had not yet interacted with the newly promoted master. In other words, the master-slave cluster contained two masters at the same time. Based on this clue, we thought of a problem that can occur when a distributed master-slave cluster fails over: split-brain.
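As a side note, a quick way to confirm a split-brain from the operations side is to ask every instance for its replication role: during a split-brain, more than one instance reports itself as a master at the same time. Below is a minimal redis-py sketch with hypothetical addresses.

```python
# Minimal sketch (hypothetical addresses): list the role each instance reports.
import redis

hosts = ["old-master.example", "new-master.example", "slave-1.example"]
for host in hosts:
    role = redis.Redis(host=host, port=6379).info("replication")["role"]
    print(f"{host}: {role}")  # two hosts printing "master" indicates split-brain
```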

However, if different clients send write operations to two different masters, in theory the newly written data would merely be scattered across the two masters; it should not be lost. So why did we still lose data?

At this point, our investigation was stalled again. Still, when analyzing problems we have always believed that “going back to the underlying principles is a good way to trace the root cause.” Split-brain happens during a master-slave switch, so we guessed that some link in the switch process must have gone wrong, and we focused our research on the execution flow of the master-slave switch.

Step 3: Discover that the split-brain was caused by a false failure of the original master #

We use the sentinel mechanism for master-slave switching. A switch starts only after a certain number of sentinel instances (defined by the quorum configuration item) have each found their heartbeats to the master timing out; the master is then judged to be objectively offline, and the sentinels begin the switch. After the switch is completed, clients communicate with the new master and send their requests to it.
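For context, a sentinel-aware client normally asks the sentinels which instance is currently the master instead of using a fixed address. The sketch below shows this with redis-py, using hypothetical sentinel addresses and an assumed service name mymaster; the client in our incident did not necessarily work this way.

```python
# Minimal sketch (hypothetical addresses, assumed service name "mymaster"):
# let the sentinels tell the client which instance is the current master.
from redis.sentinel import Sentinel

sentinel = Sentinel(
    [("sentinel-1.example", 26379),
     ("sentinel-2.example", 26379),
     ("sentinel-3.example", 26379)],
    socket_timeout=0.5,
)

print(sentinel.discover_master("mymaster"))  # (host, port) of the current master

# Connections obtained this way are re-resolved through the sentinels,
# so they follow the master after a switch.
master = sentinel.master_for("mymaster", socket_timeout=0.5)
master.set("a", 1)
```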

However, the client was still communicating with the original master during the switch, which means the original master had not actually failed (for example, its process had not crashed). We speculated that the master had been unable to process requests for some reason and did not respond to the sentinels’ heartbeats, so the sentinels mistakenly judged it to be objectively offline. After being judged offline, the original master recovered and resumed processing requests; the sentinels had not yet completed the master-slave switch at that point, the client could still reach the original master, and so the client’s writes were written to the original master.

To verify that the original master had only suffered a “false failure,” we also checked the resource usage monitoring records of the server hosting the original master.

Sure enough, we saw that CPU utilization on the original master’s machine had spiked for a period of time. The spike was caused by a data collection program we had deployed on that machine. Because this program used up essentially all of the machine’s CPU, the Redis master could not respond to heartbeats, and during this period the sentinels judged the master to be objectively offline and started the master-slave switch. The data collection program soon returned to normal and CPU usage dropped, at which point the original master resumed serving requests.

Because the original master had not actually failed, we could see records of communication with it in the client’s operation logs. By the time the slave was promoted to the new master, there were two masters in the cluster. At this point, we had figured out the cause of the split-brain.

To help you deepen your understanding, here is another diagram showing how the split-brain occurred.

Once we understood how the split-brain occurred and combined that with an analysis of the master-slave switch process, we quickly found the cause of the data loss.

Why does split-brain lead to data loss? #

After the master-slave switch, once a slave is promoted to the new master, the sentinel instructs the original master to execute the slaveof command and perform a full synchronization with the new master. In the final stage of full synchronization, the original master has to clear its local data and load the RDB file sent by the new master. As a result, the new writes that the original master saved during the master-slave switch are lost.
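To make the demotion step concrete: what the sentinel sends to the original master is essentially a slaveof (replicaof) command, and the full synchronization it triggers replaces the instance’s local dataset. A hedged sketch with hypothetical addresses:

```python
# Sketch only (hypothetical addresses): the sentinel's demotion of the original
# master boils down to turning it into a replica of the new master.
import redis

old_master = redis.Redis(host="old-master.example", port=6379)

# Writes accepted during the split-brain window exist only on the old master...
old_master.set("a", 1)

# ...and are discarded once it is demoted: in the final stage of the full
# synchronization it clears its local data and loads the new master's RDB file.
old_master.slaveof("new-master.example", 6379)
```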

The following diagram vividly illustrates the process of data loss in the original master database.

Diagram

At this point, we fully understand the process and reason for this issue.

During a master-slave switch, if the original master has only suffered a “false failure,” the sentinels will still start the switch. Once the original master recovers from the false failure and begins processing requests again, it coexists with the new master, and that is the split-brain. When the sentinel then instructs the original master to perform a full synchronization with the new master, the data the original master saved during the switch is lost.

At this point, you must be eager to know how to deal with the data loss caused by split-brain.

How to deal with split-brain problems? #

As mentioned earlier, the data loss in our master-slave cluster ultimately happened because of split-brain. Therefore, we needed a strategy for dealing with split-brain problems.

Since the root of the problem is that the original master can still receive requests after a false failure, we started by checking the configuration options of the master-slave cluster mechanism to see whether any of them can restrict the master from accepting requests.

Upon investigation, we found that Redis provides two configuration options to limit the master’s request processing:

  • min-slaves-to-write: This configuration option sets the minimum number of slaves the master must have for data synchronization.
  • min-slaves-max-lag: This configuration option sets the maximum delay (in seconds) allowed for the acknowledgement (ACK) messages that slaves send to the master during master-slave data replication.

With these two configuration options, we can easily handle split-brain problems. How can we do that specifically?

We can use min-slaves-to-write and min-slaves-max-lag in combination by setting thresholds for them, say N and T. Together, the two options require that the master has at least N slaves connected and that the ACK delay of master-slave replication does not exceed T seconds; otherwise, the master stops accepting client write requests.

Even if the original master has only falsely failed, it cannot respond to sentinel heartbeats during the false-failure period, nor can it synchronize data with its slaves, so it cannot exchange ACK messages with them. The combined requirement of min-slaves-to-write and min-slaves-max-lag is then not met, the original master is barred from accepting client requests, and clients cannot write new data to it.

Once the new master comes online, only the new master can accept and process client requests, and newly written data goes directly to it. The original master is then demoted to a slave by the sentinels; even though its data is cleared in the process, no new data is lost.

Let me give you an example.

Suppose we set min-slaves-to-write to 1, min-slaves-max-lag to 12 seconds, and the sentinels’ down-after-milliseconds to 10 seconds. If the master gets stuck for 15 seconds for some reason, the sentinels will judge it to be objectively offline and initiate a master-slave switch; at the same time, because no slave can replicate data with the original master within 12 seconds, the original master is also barred from accepting client requests. After the master-slave switch completes, only the new master can accept requests, so split-brain is avoided, and with it the data loss.
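As a reference, the master-side options in this example can be set in redis.conf or at runtime; the sketch below uses redis-py with a hypothetical address (down-after-milliseconds is configured separately on each sentinel). Note that since Redis 5.0 the same options are also available under the names min-replicas-to-write and min-replicas-max-lag.

```python
# Minimal sketch (hypothetical address): apply the example's master-side settings.
import redis

master = redis.Redis(host="master.example", port=6379)

# Require at least one slave whose replication ACK is no older than 12 seconds;
# otherwise the master rejects client writes.
master.config_set("min-slaves-to-write", 1)
master.config_set("min-slaves-max-lag", 12)
```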

Summary #

In this lesson, we learned about the split-brain problem that can occur during a master-slave switch. Split-brain means that two masters exist in a master-slave cluster at the same time and both can receive write requests. If split-brain occurs during a Redis master-slave switch, client data is written to the original master; when the original master is later demoted to a slave, that newly written data is lost.

The main cause of split-brain is a false failure of the original master. Let’s summarize the two common causes of false failures:

  1. Other programs deployed on the same server as the master temporarily occupy a large amount of resources (such as CPU), squeezing the master’s resources so that it cannot respond to heartbeats for a short period of time. Once the other programs stop consuming those resources, the master returns to normal.
  2. The master itself becomes blocked, for example while processing bigkeys or during memory swapping (you can review the causes of instance blocking summarized in Lesson 19), so it cannot respond to heartbeats for a short period of time. After the blocking is resolved, the master resumes normal request processing.

To deal with split-brain, you can prevent its occurrence by configuring the parameters min-slaves-to-write and min-slaves-max-lag when deploying the master-slave cluster.

In practical applications, however, temporary network congestion may also cause the ACK messages between the slaves and the master to time out. That is not a false failure of the master, and in that case there is no need to stop the master from accepting requests.

Therefore, my suggestion is as follows: assuming there are K slaves, set min-slaves-to-write to K/2 + 1 (or to 1 if K equals 1), and set min-slaves-max-lag to a dozen or so seconds (for example, 10-20 s). With this configuration, the master is prevented from accepting client write requests only when more than half of the slaves have an ACK delay of more than a dozen seconds.
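The arithmetic of this rule of thumb is easy to wire into a deployment script; below is a hedged sketch with a hypothetical address, using the K = 5 slaves from this lesson’s cluster.

```python
# Minimal sketch (hypothetical address): apply the K/2 + 1 rule of thumb.
import redis

K = 5  # number of slaves in the cluster
master = redis.Redis(host="master.example", port=6379)

master.config_set("min-slaves-to-write", K // 2 + 1)  # already 1 when K == 1
master.config_set("min-slaves-max-lag", 15)           # a dozen or so seconds, e.g. 10-20 s
```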

In this way, we can avoid data loss caused by split-brain, and at the same time we won’t block the master just because a few slaves are temporarily cut off from it by network congestion, which makes the whole system more robust.

One question per lesson #

As usual, I have a small question for you. Suppose we set min-slaves-to-write to 1, min-slaves-max-lag to 15 seconds, and the sentinels’ down-after-milliseconds to 10 seconds, and the sentinels need 5 seconds to carry out the master-slave switch. If the master gets stuck for 12 seconds for some reason, will split-brain still occur? Will any data be lost after the master-slave switch is completed?

Please feel free to share your thoughts and answers in the comments section. Let’s exchange and discuss together. If you find today’s content helpful, you are welcome to share it with your friends or colleagues. See you in the next lesson.