
13 Overview - What Issues Does a Distributed System Need to Solve #

After covering storage engines, we move from this lecture into a new module on the most central part of distributed databases: distributed systems.

Compared with traditional databases, the defining characteristic of a distributed database is its distributed nature, which grows out of distributed systems theory, in particular the theory of data distribution. And compared with stateless distributed systems, a stateful database faces far more challenges in this domain.

As an introduction to this module, I will raise a series of questions here and then answer them one by one in the lectures that follow. Let’s begin our discussion of databases in a distributed setting with the failure model.

Failure Model #

A distributed system involves multiple nodes that are connected via a network. Each node maintains its local state and synchronizes these states with each other through the network. At the same time, nodes need to access a time component to obtain the current time. For a distributed system, time can be divided into logical time and physical time. Logical time is generally implemented as a monotonically increasing counter, while physical time corresponds to real-world time provided by the operating system.
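To make the distinction concrete, here is a minimal sketch of a logical clock implemented as a monotonically increasing counter, in the style of a Lamport clock. It is my own illustration rather than code from the course, and the class and method names are hypothetical.

```python
import threading

class LogicalClock:
    """A Lamport-style logical clock: a monotonically increasing counter."""

    def __init__(self):
        self._time = 0
        self._lock = threading.Lock()

    def tick(self):
        """Advance the clock for a local event and return the new timestamp."""
        with self._lock:
            self._time += 1
            return self._time

    def update(self, received_time):
        """Merge a timestamp received from another node.

        The local clock jumps past the received value, so causally later
        events always carry larger timestamps.
        """
        with self._lock:
            self._time = max(self._time, received_time) + 1
            return self._time

# Node A stamps a message with its clock; node B merges the timestamp on receipt.
clock_a, clock_b = LogicalClock(), LogicalClock()
ts = clock_a.tick()        # A performs a local event: ts == 1
print(clock_b.update(ts))  # B's clock moves past ts: prints 2
```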

Nodes, local state, the network, and time: these are the basic concepts involved in a distributed system. They look simple, yet the industry consensus is that none of them can be fully relied upon. “Unreliability” permeates the entire lifecycle of a distributed system, and summarizing these uncertainties is precisely the job of the failure model.

Before getting into the failure model itself, let’s broaden our view and look at what, concretely, causes these reliability problems in distributed systems.

Causes of Failure #

When people think about sources of instability in a distributed system, the network comes to mind first. What is easily overlooked is that remote nodes can also fail while processing requests. A common misconception is that a remote call will return its result immediately; this assumption is unreliable, because we know neither the processing capacity nor the runtime environment of a remote node, and so we cannot assume it will always respond to our requests in a fixed pattern.

Another situation is that a request arriving at a remote node may not be processed immediately but instead be placed in a queue for buffering. This improves the remote node’s throughput, but it also introduces latency, which significantly changes the interaction pattern. The way to handle these problems is to introduce fault detection (which I will discuss in the next lecture), observe the running status of remote nodes, and take different countermeasures for different problems.
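As a rough illustration of the caller’s point of view, the sketch below wraps a remote call in a deadline. The function fetch_from_remote is a hypothetical stand-in for a real network request, and the timeout value is arbitrary; the point is only that silence within the deadline does not tell us whether the remote node is slow, queued up, or down.

```python
import concurrent.futures
import random
import time

def fetch_from_remote():
    """Hypothetical remote call; on a real network it may be queued or slow."""
    time.sleep(random.uniform(0.0, 2.0))  # simulated queueing / processing delay
    return "result"

def call_with_deadline(pool, timeout_seconds=0.5):
    """Issue the request, but only wait up to a deadline for the answer."""
    future = pool.submit(fetch_from_remote)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        # The remote node may be slow, queued, or crashed; the caller cannot
        # tell which. Deciding what to do next is the job of fault detection.
        return None

pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
print(call_with_deadline(pool))  # either "result" or None
pool.shutdown(wait=False)
```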

The second common misconception is that the clocks on all nodes agree. This misunderstanding is both widespread and dangerous. Tools can synchronize time within a cluster, but keeping system clocks truly consistent is very difficult, and if we use the physical time produced by different nodes to order or compare events, the result is highly unreliable. For this reason, most distributed databases use a dedicated node to generate globally unique logical timestamps, while some, such as Spanner, rely on high-precision services like atomic clocks to address time consistency.

Another problem with local physical time is backtracking: you read the time, perform a few steps, read the current time again, and the second reading may be earlier than the first. In other words, we cannot assume that a node’s physical clock is monotonically increasing. This is another important reason to use logical time.
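The usual safeguard on a single node is to measure durations with a monotonic clock rather than wall-clock time. The snippet below contrasts the two using the Python standard library; it illustrates the principle and is not code from the lecture.

```python
import time

# Wall-clock time: can jump backwards if NTP or an operator adjusts the clock,
# so the difference between two readings may even be negative.
wall_start = time.time()

# Monotonic time: guaranteed never to go backwards within this process, which
# makes it the right choice for measuring durations such as request timeouts.
mono_start = time.monotonic()

sum(range(1_000_000))  # some local work

print("wall-clock elapsed:", time.time() - wall_start)      # not guaranteed >= 0
print("monotonic elapsed:", time.monotonic() - mono_start)  # always >= 0
```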

That said, local physical time still plays an important role in parts of a distributed system, for example in judging whether a remote node has timed out. Based on the two points above, however, we must handle time carefully when implementing distributed algorithms and avoid assuming that clocks are synchronized or monotonic.

The problems mentioned so far are concentrated at the node level; the other major category of problems is caused by the network. The classic example is a network partition, in which the nodes of a distributed system are split into separate groups by network failures. The tricky part is that the nodes within each group can still provide service, but because the groups cannot perceive each other, inconsistencies arise. We discussed this in more detail in Module One, “05 | Consistency and CAP Theorem: Why Do We Need Distributed Consistency”.

It is worth noting that problems caused by network partitions are hard to solve because they are hard to detect in the first place, a consequence of complex network topologies and the large number of participants. We therefore need carefully designed algorithms, together with practices such as chaos engineering, to handle them. One final point to emphasize is that a single failure can trigger a large-scale cascading effect that amplifies its impact, the famous “avalanche” phenomenon. Importantly, this amplification is often caused by a mechanism that was designed to stabilize the system. For example, when the system hits a bottleneck, a new node is added, but it must synchronize data before it can serve traffic; that large-scale data synchronization can exhaust the resources of other nodes, especially network bandwidth, leaving the entire system unable to provide service.

There are two common ways to contain cascading failures: backoff algorithms and circuit breakers. Backoff algorithms are widely used in API design. As mentioned earlier, remote nodes may suffer temporary failures, so retries are needed to maximize the chance that a request succeeds; but aggressive retries can exhaust the remote node’s resources and crash it. Backoff spreads retries out over time, relying on the client to help preserve the server’s availability. On the callee’s side, a more direct protection against cascading failures is the circuit breaker: if requests to a service keep failing or pile up beyond a threshold, further requests to that service are cut off for a while, relieving the pressure on the system.
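To make both mechanisms concrete, here is a minimal sketch of exponential backoff with jitter and a simple circuit breaker. The function call_remote, the thresholds, and the delays are hypothetical; production systems usually rely on mature libraries for this.

```python
import random
import time

def call_remote():
    """Hypothetical remote call that fails intermittently."""
    if random.random() < 0.3:
        raise ConnectionError("remote node unavailable")
    return "ok"

def retry_with_backoff(func, max_attempts=5, base_delay=0.1, max_delay=2.0):
    """Client side: retry with exponential backoff plus random jitter, so many
    clients do not retry in lockstep and overwhelm a recovering server."""
    for attempt in range(max_attempts):
        try:
            return func()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # full jitter

class CircuitBreaker:
    """Caller-side protection: after several consecutive failures the circuit
    'opens' and rejects calls immediately until a cool-down period passes."""

    def __init__(self, failure_threshold=3, reset_timeout=5.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.consecutive_failures = 0
        self.opened_at = None

    def call(self, func):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial request through
        try:
            result = func()
        except ConnectionError:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.consecutive_failures = 0
        return result

# Backoff spreads retries out instead of hammering the struggling server.
try:
    print(retry_with_backoff(call_remote))
except ConnectionError:
    print("gave up after repeated failures")

# Once the breaker opens, further calls fail fast and relieve the pressure.
breaker = CircuitBreaker()
for _ in range(6):
    try:
        print(breaker.call(call_remote))
    except (ConnectionError, RuntimeError) as exc:
        print("call failed:", exc)
```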

These are the common failures in distributed systems. Although these failures may seem straightforward, the approaches to solving them can be quite scattered. Fortunately, previous research has summarized some models to address these problems. Next, I will introduce three typical failure models.

Crash Failure #

When a process stops working entirely after a failure, we call it a crash failure. This is the simplest failure scenario, and its outcome is very predictable. It is also known as crash-stop failure, emphasizing that the failed node never participates in the distributed system again. This model is easy to reason about: once the other nodes detect the failure, they can continue providing service without worrying about the complications of the failed node rejoining.

Despite these advantages, the crash-stop model is rarely assumed in real distributed systems, because simply discarding failed nodes is an obvious waste of resources. Instead, we generally adopt the crash-recovery model so that resources can be reused. The most familiar form of recovery is to restart the failed node, run it through a recovery procedure, and rejoin it to the cluster. Although this is a mainstream approach, the most popular one in practice is to maintain backup nodes through data replication and switch over to them quickly when a node fails.

Crash failure can be considered a special case of omission failure: from the perspective of other nodes, it is hard to tell whether an unresponsive node has crashed or is merely dropping messages. So what exactly is an omission failure?

Omission Failure #

Compared with crash failure, omission failure is less predictable. This model focuses on whether a message was actually processed by the remote node.

The failure can happen in one of the following ways:

  1. The message is not delivered to the remote node after being sent.
  2. The remote node skips processing the message or is unable to execute it (crash failure is the special case in which the node can no longer process any messages).
  3. The remote node processes the message, but its response never reaches the other nodes.

In short, from the perspective of other nodes, the message sent to the node seems to disappear without any response.

As mentioned earlier, a network partition is a typical omission failure: messages between some pairs of nodes flow normally, while messages between other pairs cannot get through. (Under a crash failure, by contrast, no node in the cluster can communicate with the failed node at all.) Another typical case is a node whose processing speed is far below the system average, so its data is perpetually stale, yet it never crashes and keeps sending that stale data to other nodes in the cluster.

When a remote node misses a message, reliable-delivery techniques such as retransmission can alleviate the problem. But if the message still cannot be delivered in the end while the node continues to provide service, an omission failure has occurred. Beyond the two scenarios above, omission failures can also arise from situations such as network overload and full message queues.

Now let me introduce the last type of failure model, which is the Byzantine failure.

Byzantine Failure #

Byzantine failure, also known as arbitrary failure, is the least predictable of the three. “Arbitrary” means that participating nodes may give inconsistent responses to the same request, one node claiming the current data is A while another claims it is B.

This kind of failure is often caused by program bugs and can be mitigated by rigorous software development processes. We all know, however, that bugs are hard to eliminate from production systems, and problems caused by version differences between nodes are especially common. Therefore, at runtime, some systems do not blindly trust data obtained from a single remote node; instead they cross-check it to obtain correct results as far as possible.
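A crude way to picture cross-checking is to ask several replicas the same question and accept only an answer that a majority agrees on. The sketch below assumes a hypothetical list of replica read functions and is meant only to illustrate the idea.

```python
from collections import Counter

def cross_check(replica_reads):
    """Query every replica and return the value reported by a strict majority.

    If no value reaches a majority, raise, since an arbitrary (Byzantine)
    fault may be feeding us inconsistent answers.
    """
    answers = []
    for read in replica_reads:
        try:
            answers.append(read())
        except Exception:
            pass  # an unreachable replica simply casts no vote
    if not answers:
        raise RuntimeError("no replica answered")
    value, votes = Counter(answers).most_common(1)[0]
    if votes * 2 <= len(replica_reads):
        raise RuntimeError("no majority: possible arbitrary failure")
    return value

# Hypothetical replicas: one returns a wrong value (a bug, version skew, or tampering).
replicas = [lambda: "A", lambda: "A", lambda: "B"]
print(cross_check(replicas))  # prints "A", the majority answer
```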

Another kind of arbitrary failure is malicious: some nodes deliberately send incorrect messages in order to disrupt the system and profit from doing so. Blockchain-based digital currency systems, for example, use positive incentive mechanisms to ensure that the majority of nodes do not “misbehave” (the reward for behaving correctly outweighs the gain from misbehaving), thereby achieving Byzantine fault tolerance (BFT).

These are the three common failure models. Most of the content in Module Three focuses on the crash-recovery scenario. Now let’s go over the order in which this module will proceed.

Error Detection and Leader Election #

To solve a failure, the first step is to detect it. At the beginning of this module we will study the methods available for detecting faults in a system. The industry offers many failure-detection techniques, each striking a different balance among ease of use, accuracy, and performance.
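As a small preview of the next lecture, here is a minimal sketch of one of the simplest techniques, a heartbeat detector with a fixed timeout. The class, the node IDs, and the timeout value are my own illustrative assumptions.

```python
import time

class HeartbeatDetector:
    """A timeout-based failure detector: every node is expected to send a
    heartbeat periodically, and a node that stays silent for longer than the
    timeout is suspected (not proven) to have failed."""

    def __init__(self, timeout_seconds=3.0):
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat = {}  # node id -> monotonic time of last heartbeat

    def record_heartbeat(self, node_id):
        self.last_heartbeat[node_id] = time.monotonic()

    def suspected_nodes(self):
        now = time.monotonic()
        return [node for node, seen in self.last_heartbeat.items()
                if now - seen > self.timeout_seconds]

# Usage sketch: heartbeats would arrive over the network; here they are simulated.
detector = HeartbeatDetector(timeout_seconds=1.0)
detector.record_heartbeat("node-1")
detector.record_heartbeat("node-2")
time.sleep(1.5)
detector.record_heartbeat("node-2")  # node-2 keeps beating, node-1 goes quiet
print(detector.suspected_nodes())    # prints ['node-1']
```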

One important application of error detection is leader election: by using detection techniques to check the health of the leader node, we can decide whether a new node should be elected to replace a failed leader. A key role of the leader is to reduce the likelihood of system failures in the first place. Synchronizing state peer-to-peer among all nodes is very expensive, so electing a leader to coordinate synchronization greatly reduces system load and avoids some failures.

Once a failure is detected, the next step is to solve it.

Replication and Consistency #

Fault-tolerant systems generally use replication to create multiple replicas and keep the system available, ensuring that it can still respond normally when some nodes fail. Having multiple replicas creates the need for data synchronization, and consistency is a prerequisite for that synchronization. As I described in Module One, without replication there would be no consistency or synchronization to speak of.

In Module One we discussed the CAP theorem and strong consistency models, both of which fall under data consistency. In this module we will go on to discuss client-centric consistency, also known as session consistency, as well as eventual consistency, a weak consistency model. Eventual consistency tolerates temporarily inconsistent state, but we still want the system to converge, so anti-entropy mechanisms are introduced to repair inconsistencies between replicas.
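As a preview, the sketch below shows anti-entropy in its most stripped-down form: two replicas exchange versioned entries and keep the newer one for every key. The per-key version counter and the last-writer-wins rule are simplifying assumptions of mine; real systems typically use vector clocks or Merkle trees.

```python
def anti_entropy_sync(replica_a, replica_b):
    """Pairwise anti-entropy: for every key, keep the entry with the highest
    version on both replicas.

    Each replica is a plain dict mapping key -> (version, value).
    """
    for key in set(replica_a) | set(replica_b):
        entry_a = replica_a.get(key, (0, None))
        entry_b = replica_b.get(key, (0, None))
        newest = entry_a if entry_a[0] >= entry_b[0] else entry_b
        replica_a[key] = newest
        replica_b[key] = newest

# Two replicas that have temporarily diverged.
a = {"x": (2, "apple"), "y": (1, "yellow")}
b = {"x": (1, "apricot"), "z": (3, "zebra")}
anti_entropy_sync(a, b)
print(a == b)  # True: the replicas have converged
print(a)       # {'x': (2, 'apple'), 'y': (1, 'yellow'), 'z': (3, 'zebra')}
```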

We will then move on to distributed transactions, which are related to consistency but clearly distinct from it. Compared with the classical transactions discussed in Module Two, distributed transactions are more involved because they must cope with the failure scenarios described earlier, for example handling split-brain situations through transaction coordination.

Consensus #

Finally, we will introduce the essence of distributed systems: consensus algorithms. Many of the concepts we’ve discussed so far, including error detection, leader election, consistency, and distributed transactions, are covered under the umbrella of consensus algorithms. These algorithms are important components of modern distributed databases.

Consensus algorithms were developed to address the Byzantine Generals’ Problem. In simple terms, the Byzantine Generals’ Problem is a logical dilemma that illustrates the communication issues that arise when a group of Byzantine generals tries to reach a unanimous decision on its next course of action.

The dilemma assumes that each general has their own army, and each army is positioned in different locations surrounding the city they intend to attack. These generals need to reach a consensus on whether to attack or retreat. As long as all the generals reach a consensus and coordinate their actions accordingly, it does not matter whether they attack or retreat.

According to research on the famous FLP (Fischer-Lynch-Paterson) impossibility result, the Byzantine generals face three difficulties:

  1. The generals do not have a unified clock or a way to synchronize time.
  2. They cannot determine whether other generals have been defeated.
  3. Communication between the generals is completely asynchronous.

Because of these difficulties, the generals cannot be guaranteed to reach consensus within any bounded time. In other words, under the conditions above, consensus is impossible to guarantee.

In practice, however, consensus algorithms use logical clocks to provide a unified notion of time and introduce error-detection techniques to judge the state of participating nodes, which allows distributed systems to reach agreement even though communication is otherwise fully asynchronous. In the last part of this module, I will introduce several classic consensus algorithms and present their use cases.

Consensus also addresses omission failures: as long as a majority of nodes reach agreement, the system can still serve correct data externally even if the remaining nodes missed the message.
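The arithmetic behind that claim is just the majority quorum, sketched below; the cluster size and acknowledgement counts are arbitrary examples.

```python
def majority(cluster_size):
    """Smallest number of nodes that constitutes a majority (quorum)."""
    return cluster_size // 2 + 1

def is_committed(acknowledgements, cluster_size):
    """A value counts as decided once a majority has acknowledged it, even if
    the remaining nodes missed the message entirely."""
    return acknowledgements >= majority(cluster_size)

# In a 5-node cluster, 3 acknowledgements suffice; 2 nodes may miss the
# message (an omission failure) without affecting correctness.
print(majority(5), is_committed(3, 5), is_committed(2, 5))  # 3 True False
```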

Summary #

This lecture serves as an introduction to Module Three. I first introduced the failure models, which describe the various ways a distributed database can misbehave, and then, based on those models, outlined the structure of this module.

According to different objectives, distributed algorithms can exhibit various behavioral patterns, which are summarized in the table below along with the corresponding lectures.

[Table: behavioral patterns of distributed algorithms and the corresponding lectures]