
14 Error Diagnosis - How to Ensure Stability in a Distributed System #

After the previous lecture, I believe you already understand the main problems that distributed systems in the field of distributed databases focus on solving: designing algorithms around failure models to address various stability issues.

To solve a problem, we must first be able to detect it. In this lecture, I will talk about how to detect errors within a system, which is a precondition for the algorithms introduced later. For example, as mentioned in the previous lecture, without a means of failure detection we cannot solve the Byzantine Generals Problem, and we fall into the situation described by the FLP impossibility result, making a usable consensus algorithm unattainable. It is important to note that failure is not limited to node crashes; more commonly, from the perspective of other nodes, a node fails to respond or responds with increased latency, reducing the overall availability of the system.

In this lecture, I will first discuss several groups of characteristics that affect the performance of detection algorithms. I will then introduce the well-known heartbeat algorithm, followed by several of its improved variants. Finally, I will introduce the Gossip protocol, which is popular in large-scale distributed databases, especially leaderless ones.

Now let’s start with the factors that affect algorithm performance.

Factors that Affect Algorithm Performance #

Failures can occur in the connections between nodes, such as packet loss or increased latency, or within the node processes themselves, such as crashes or slow processing. In practice it is difficult to distinguish a node that is processing slowly from one that cannot process requests at all, so every detection algorithm must balance between these two states. For example, after finding that a node is unresponsive, the algorithm usually probes it again after a specific delay to determine more accurately which state the node is in.

For these reasons, we need to measure the characteristics of an algorithm through a series of indicators. First, any algorithm must satisfy a pair of properties, liveness and safety, which are necessary conditions for the algorithm.

  • Liveness means that every failure is eventually reported: if a node fails and cannot respond to normal requests, the algorithm will definitely detect it and not miss it.
  • Safety means that the algorithm never classifies a normal node as faulty: if it reports a node as failed, the node has truly failed and is not just temporarily slow, as described above.

Another necessary condition is the completeness of the algorithm. Completeness means that the algorithm produces a result within the expected time: it eventually generates a detection result that satisfies both liveness and safety, rather than staying indefinitely in some state without any result. This is in fact a property that any distributed algorithm needs to achieve.

The three characteristics mentioned above are necessary conditions for failure detection. The following pair of concepts that I will introduce can be chosen between depending on the different usage scenarios.

The first is the efficiency of the algorithm, which refers to how quickly it can obtain a failure-detection result. The second is accuracy, which represents the precision of that result, that is, the degree to which liveness and safety are achieved. An inaccurate algorithm either fails to detect nodes that have already failed or mistakenly labels healthy nodes as failed.

Efficiency and accuracy are a trade-off: improving the efficiency of an algorithm tends to decrease its accuracy, and vice versa. Therefore, when designing a failure detection algorithm, we need to balance these two characteristics and set different criteria for different scenarios.

Based on the above criteria, let me start by introducing the most commonly used failure detection algorithm—the heartbeat detection method—and its various variants.

Heartbeat Detection Method #

The heartbeat detection method is widely used mainly because it is simple and intuitive. We can think of it as a portable heart-rate monitor: once the device fails to detect a heartbeat, an alarm is triggered.

There are many ways to perform heartbeat detection. Here, I will introduce two methods based on timeout and an indirect detection method to improve detection accuracy.

Based on Timeout #

Timeout-based heartbeat detection generally includes two methods.

  1. Send a ping packet to a remote node. If the node returns a correct response within a specified time, we consider it online; otherwise, it is marked as failed.
  2. A node sends specific data packets (called heartbeat packets) to surrounding nodes at a fixed frequency, and the surrounding nodes judge its health from the frequency of packets received. If no packet arrives within the specified time, the node is considered offline.

As you can see, although these two methods differ in implementation details, both involve the concept of a “specified time,” which is the timeout mechanism. Now let’s take a closer look at the first method, using the figure below as an example.

Figure 1 Two consecutive heartbeat accesses

Figure 1 simulates two consecutive heartbeat accesses. Node 1 sends a ping packet, and node 2 returns a pong packet within the specified time. As a result, node 1 determines that node 2 is alive. However, in real scenarios, situations like Figure 2 often occur.


Figure 2 Heartbeat access in a real scenario

As you can see, after node 1 sends a ping, node 2 does not return a pong within the specified time, so node 1 sends another ping. This indicates that node 2 is experiencing a delay. Occasional delays are extremely common in distributed scenarios, so the timeout-based heartbeat detection algorithm needs a timeout threshold: only when the number of consecutive timeouts exceeds this threshold is the remote node considered offline. This prevents occasional delays from hurting the accuracy of the algorithm.
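The ping-with-threshold scheme above can be sketched in a few lines of Python. This is a minimal illustration, not a production detector: a plain TCP connect stands in for the ping/pong exchange, and the function name and parameters are my own choices.

```python
import socket

def is_node_alive(host, port, timeout_s=1.0, max_timeouts=3):
    """Probe a remote node; declare it offline only after max_timeouts
    consecutive failed probes, so an occasional delay does not mark a
    healthy node as failed."""
    failures = 0
    while failures < max_timeouts:
        try:
            # A successful TCP connect within the deadline counts as a "pong".
            with socket.create_connection((host, port), timeout=timeout_s):
                return True
        except OSError:
            failures += 1  # timeout or refusal: count one failed probe
    return False  # threshold exceeded: mark the node offline
```

Raising `max_timeouts` improves accuracy (fewer false positives) at the cost of efficiency, which is exactly the trade-off discussed above.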

From the above description, we can see that the timeout-based heartbeat detection method sacrifices efficiency to improve accuracy. Is there a way to improve efficiency instead? Next, I will introduce a heartbeat detection algorithm that is not based on timeouts.

Not based on Timeout #

The non-timeout-based heartbeat detection algorithm draws on the theory of asynchronous systems. It maintains a global list of node heartbeats, which records the heartbeat status of each node and makes it easy to see the health of every node in the system. Therefore, besides improving detection efficiency, this algorithm also makes it easy to obtain the health status of all nodes. How is this global list generated? The following diagram shows how the list circulates among the nodes.

Figure 3 Flow process of the global list between nodes

From the figure, we can see that this algorithm needs a main path between nodes: the path most frequently traversed by data flowing between nodes, covering all the nodes in the cluster. As shown in the figure above, this path starts from node 1, passes through node 2, and finally reaches node 3.

At the beginning of the algorithm, node 1 records itself in the table and then sends the table to node 2. Node 2 first increments node 1’s counter in the table by 1, then records itself, and sends the table to node 3. Similarly, node 3 increments the counters of all the nodes already in the table by 1 and then records itself. Once node 3 finds that all nodes have been recorded, it stops propagating the table.
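The table-passing steps above can be sketched as follows. This is a simplified model under the assumption that the table is a plain counter map and the main path is a fixed list; real implementations would carry timestamps and handle retransmission.

```python
def propagate(table, path):
    """Pass the heartbeat table along the main path.

    `table` maps node name -> heartbeat counter; `path` is the ordered
    list of nodes on the main path (e.g. node 1 -> node 2 -> node 3)."""
    for node in path:
        for seen in table:            # bump every node already recorded
            table[seen] += 1
        table.setdefault(node, 0)     # record ourselves with a fresh counter
        if len(table) == len(path):   # all nodes recorded: stop propagating
            break
    return table
```

Running the three-node example from the figure yields counters of 2, 1, and 0 for nodes 1, 2, and 3 respectively, reflecting how many hops each entry has survived.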

In a real environment, nodes are rarely arranged linearly as in the example; a node is very likely to connect to many other nodes. One advantage of this algorithm is that even if the connection between two particular nodes occasionally fails, a remote node still has the opportunity to be recorded in the list as long as at least one node can reach it.

Because this algorithm is not based on timeouts, it can quickly identify failed nodes within the cluster, and it can judge a node’s health from the observations of other nodes. However, it is vulnerable to misbehaving nodes: the counters recorded by an abnormal node may incorrectly label a normal node as abnormal, reducing the accuracy of the algorithm.

So, is there a method to improve the judgment of a single node? Now I will introduce an indirect detection method.

Indirect Detection #

The indirect detection method can effectively improve the stability of the algorithm. It divides the network into groups, so we no longer need to know the health status of every node in the network. Instead, we select some nodes in each subnet, and they probe their neighboring nodes and report those nodes’ health status.


Figure 4 Indirect detection method

As shown in the diagram, node 1 cannot directly determine whether node 2 is alive, so it asks its neighboring node 3 for information. Node 3 then asks node 2 about its health status and returns this information to node 1.
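The exchange in the diagram can be sketched as below. This is an illustration only: the `reachable` map stands in for real network probes, and all names are my own.

```python
def indirect_probe(requester, target, helpers, reachable):
    """Return True if `target` is reachable, either directly by
    `requester` or indirectly via any of the helper nodes.

    `reachable` maps (from_node, to_node) pairs to True/False and
    stands in for actual network probes."""
    if reachable.get((requester, target), False):
        return True  # direct probe succeeded, no helpers needed
    # Ask each helper (e.g. node 3) to probe the target on our behalf.
    return any(reachable.get((h, target), False) for h in helpers)
```

In the figure's scenario, the link from node 1 to node 2 is down, but node 3 can still reach node 2, so node 1 learns that node 2 is alive.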

The advantage of this algorithm is that it does not need to broadcast the heartbeat detection. Instead, through limited network connections, it can detect the health status of each group in the cluster and therefore know the overall health status of the entire cluster. Due to the use of multiple nodes within the group for detection, the accuracy of the algorithm is greatly improved compared to having a single node perform the detection. At the same time, we can perform the detection in parallel, so the convergence speed of the algorithm is also fast. Therefore, it can be said that the indirect detection method achieves a good balance between accuracy and efficiency.

However, in large-scale distributed databases, the heartbeat detection method faces challenges in efficiency. So, what algorithm is better at handling these challenges? Now let me introduce the Gossip protocol detection method.

Gossip Protocol Detection #

In addition to heartbeat detection, the Gossip protocol detection method is a commonly used detection solution in large-scale distributed databases. The principle of Gossip is that each node detects the nodes adjacent to it, so as to quickly discover abnormal nodes in the system.

In detail, each node maintains a global node list, from which it selects some nodes to probe. If a probe succeeds, that node’s success counter is incremented and the latest detection time is recorded. The node then periodically synchronizes its detection list with neighboring nodes, and a neighbor that receives the list merges it with its own local list. Eventually, every node in the system knows the health status of the entire system. Nodes that fail to respond correctly are marked as failed and handled accordingly. Note that it is important to set an appropriate threshold to prevent normal nodes from being marked as failed. The Gossip algorithm is widely used in leaderless distributed systems; Cassandra is a famous example that employs this detection technique.
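The merge step at the heart of this scheme can be sketched as follows. This is a simplified model under the assumption that each list entry is a `(heartbeat_counter, last_seen_time)` pair and that a larger counter always means a fresher observation.

```python
def merge_views(local, remote):
    """Merge a neighbour's detection list into our own.

    Both lists map node name -> (heartbeat counter, last detection time).
    For each node we keep the entry with the larger counter, i.e. the
    more recent observation, so fresher information wins."""
    for node, (count, seen_at) in remote.items():
        if node not in local or count > local[node][0]:
            local[node] = (count, seen_at)
    return local
```

Because every gossip round merges lists pairwise, information about a failed node spreads through the cluster in a number of rounds roughly logarithmic in the cluster size.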

We can see that this detection method absorbs some advantages of the indirect detection methods mentioned earlier. The determination of whether a node should be considered failed is derived from the results of multiple nodes, rather than being made by a single node. This significantly enhances the stability of the system. However, this detection method will greatly increase the number of messages in the system, so it is crucial to select appropriate data packets to optimize this mode. I will provide the answer to this problem in “Chapter 17 | Reliable Data Propagation: How Does Entropy Theory Help Databases Work Reliably” when I introduce the Gossip protocol in detail.

Cassandra, the main example of Gossip detection above, also uses another method to evaluate whether a node has failed: the φ value detection method, also known as the φ accrual failure detector.

φ Value Detection #

Most of the detection methods mentioned earlier produce binary results: a node is either healthy or failed, black or white. The φ value detection method instead introduces a numerical value that evaluates the likelihood that a node has failed. Now let’s see how this value is calculated.

First, we maintain a time window of detection-message arrivals, which stores the latencies of the most recent messages. Based on the values in this window, we use an algorithm to “predict” the latency of future messages. When a message actually arrives, we compute the φ value from the actual and predicted values.

Next, we set a threshold for the φ value. Once it exceeds this threshold, we can mark the node as failed. This detection mode allows dynamic adjustment of the threshold based on the actual situation, enabling the optimization of the detection scheme. Additionally, if combined with the Gossip detection method, it can ensure that the data within the window is more representative and that the calculation of the φ value is not affected by the anomalies of individual nodes. Therefore, this evaluation detection method and Gossip detection have a natural connection.

From the algorithm details above, we can readily identify the components this algorithm requires.

  1. Latency Collector: Collects the latency of nodes to construct the latency window.
  2. Analyzer: Calculates the φ value based on collected data and determines if a node has failed based on the threshold.
  3. Result Executor: Once a node is marked as failed, the subsequent processing flow is triggered by the result executor.

You can see that this detection mode transforms a binary judgment into a continuous value judgment, changing a switch into a progress bar. This mode is actually widely used in the field of status judgment, such as the Apdex metric in the APM field, which abstracts application health into a score for more granular performance judgment. We can see that although these algorithms are somewhat complex, they can effectively determine the status of a system.

Summary #

The content of this lecture is relatively simple and easy to understand, but it is extremely important and widely applicable. As the foundation of most distributed algorithms, the failure detection process discussed today is included in all the algorithms I will introduce later.

The algorithms in this lecture strike a balance between accuracy and efficiency. Some use point-to-point heartbeat patterns, some employ Gossip and message broadcast patterns, some rely on a single metric for judgment, and others use estimated continuous transformation values… They each have their own advantages and disadvantages, but all strive to balance these two characteristics. Of course, simplicity is also used as a metric to measure the practicality of algorithms, which aligns with the Unix philosophy that simplicity is often the best tool for complexity.

Most distributed databases use a master-slave mode, so the failure detection is generally performed by the master node. This approach effectively controls the number of messages within the cluster. In the next lecture, I will introduce how to select a leader node in a cluster.