
17 Reliable Data Transmission - How Anti-Entropy Helps Ensure Database Reliability #

In the previous lecture, we introduced the concept of consistency, with eventual consistency being the weakest level. Under eventual consistency, nodes need to synchronize data over a period of time before the latest data reaches all of them. This requires that newly generated data propagate reliably between these nodes.

However, reality can be harsh: all kinds of faults can occur during data propagation, such as node crashes, network anomalies, and high latency caused by large volumes of data to synchronize. The result can be significant data discrepancies between nodes in an eventually consistent cluster, and over time the cluster descends into an increasingly chaotic state.

The scenario described above is an example of “entropy increase.” This is a concept from physics that was popularized by the 2020 film “Tenet,” where entropy is tied to time: entropy increase corresponds to time flowing forward, and entropy decrease to time running in reverse.

In reality, entropy and time have an indirect relationship. In the 19th century, scientists discovered that in a closed system with no external forces, heat always propagates from hot objects to cold objects, leading to a theory: In a closed system without external forces, entropy always increases, and time flows forward in tandem with entropy increase. The film hypothesized that if entropy could be reduced, time should be able to flow in reverse.

The concept of entropy extends into various fields, generally indicating that a system tends to change towards a state of disorder. In an eventual consistency system, it represents the tendency for data to evolve towards disorder. At this point, we need to introduce an “anti-entropy” mechanism to apply “external forces” and eliminate the impact of “entropy increase” caused by the natural state.

After all this discussion, in simple terms, anti-entropy means using external means to bring the data on the nodes of a distributed database back into consistency. The means of anti-entropy include foreground synchronization, background asynchronous synchronization, and the Gossip protocol. Now let me introduce them one by one.

Foreground Synchronization #

Foreground synchronization repairs data inconsistencies through two foreground operations: read repair and hinted handoff.

Read Repair #

As entropy gradually increases, the system becomes increasingly chaotic. But if there are no read operations, this chaos will not be exposed. So, people came up with an idea: we can repair inconsistent data when a read operation occurs.

The specific operation is as follows: a read request is handled by a coordinating node, which queries data from a set of nodes. If some of these nodes are missing data, the coordinating node sends the missing data to them, repairing the data and achieving the goal of anti-entropy.

Some of you may notice that this idea is related to the adjustable consistency we discussed in the previous lecture. With adjustable consistency, to meet consistency requirements, a read operation reads data from multiple nodes to find the latest result. Read repair goes one step further: after that, it synchronizes the latest data to the lagging nodes to repair them, and finally sends the latest result back to the client. The process is shown in the following diagram.

Figure: blocking read repair, where the coordinating node repairs lagging nodes before returning the result to the client

When repairing data, read repair can run in either blocking or asynchronous mode. Blocking mode is shown in the diagram above: the final result is returned to the client only after the data repair is completed. Asynchronous mode instead starts an asynchronous task to repair the data and returns to the client without waiting for the repair to finish.
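To make the blocking mode concrete, here is a minimal sketch in Python. The `Replica` class and `coordinator_read` function are hypothetical names used for illustration; the sketch assumes each replica stores a (timestamp, value) pair per key and resolves conflicts by last-write-wins, which is only one of several possible conflict-resolution rules.

```python
class Replica:
    """A toy replica storing (timestamp, value) pairs per key."""
    def __init__(self, name):
        self.name = name
        self.store = {}                      # key -> (timestamp, value)

    def read(self, key):
        return self.store.get(key)

    def write(self, key, ts, value):
        current = self.store.get(key)
        if current is None or ts > current[0]:   # last-write-wins
            self.store[key] = (ts, value)


def coordinator_read(replicas, key):
    """Blocking read repair: find the newest version, push it to lagging
    replicas, and only then return the result to the client."""
    results = {r: r.read(key) for r in replicas}
    versions = [v for v in results.values() if v is not None]
    if not versions:
        return None
    latest_ts, latest_value = max(versions, key=lambda v: v[0])
    for r, v in results.items():
        if v is None or v[0] < latest_ts:
            r.write(key, latest_ts, latest_value)   # repair before replying
    return latest_value


n1, n2, n3 = Replica("n1"), Replica("n2"), Replica("n3")
n1.write("k", 2, "new"); n2.write("k", 1, "old")    # n3 has no copy at all
print(coordinator_read([n1, n2, n3], "k"))          # "new"; n2 and n3 repaired
```

An asynchronous variant would hand the repair loop to a background task and return the latest value immediately, trading the stronger read guarantee for lower latency.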

You might recall that the blocking mode of read repair actually satisfies the “monotonic reads” property mentioned in the previous lecture. This is because once a value has been read, the next read is always based on at least that value; in other words, the synchronously repaired data reaches the target nodes before the next read operation. Asynchronous repair does not guarantee this. However, blocking repair sacrifices a certain level of availability, because it has to wait for remote nodes to finish repairing the data, while asynchronous repair does not have this problem.

When comparing data between nodes, a common optimization is to compare hashes instead of full values. For example, after the coordinating node receives a request from the client, it sends a full read request to only one node and hash (digest) requests to the other nodes. It then calculates the hash of the complete result and compares it with the hash values returned by the other nodes. If they are equal, it returns the response directly; if they are not equal, the repair process described above is carried out.

One obvious advantage of this hashing pattern is that when the system is in a stable state, the cost of checking data consistency is small, which speeds up reads and effectively reduces system load. Commonly used hash algorithms include MD5 and others. Of course, hash algorithms can theoretically collide, which means some inconsistent states would go undetected. In real scenarios, the probability of such collisions is very low, and even if a collision occurs, other detection mechanisms will eventually repair the difference.
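Below is a hedged sketch of this digest pattern, reusing the hypothetical `Replica` and `coordinator_read` from the previous sketch; MD5 is chosen only to mirror the text. In a real system each replica would compute its own digest and return just the hash over the network; here the coordinator hashes the replies locally to keep the sketch short.

```python
import hashlib

def digest(value):
    """Hash a replica's (timestamp, value) pair."""
    return hashlib.md5(repr(value).encode()).hexdigest()

def coordinator_digest_read(replicas, key):
    """Send one full read and hash-only (digest) reads to the rest."""
    full_node, *digest_nodes = replicas
    full_value = full_node.read(key)                 # complete data
    expected = digest(full_value)
    if all(digest(r.read(key)) == expected for r in digest_nodes):
        return full_value        # steady state: only cheap hash comparisons
    # Hashes disagree: fall back to the full read-repair path shown earlier.
    return coordinator_read(replicas, key)
```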

The above is the anti-entropy operation performed during the read operation. So how do we perform repair during the write phase? Now let me introduce hinted handoff.

Hinted Handoff #

The name “Hinted Handoff” sounds mystical. In fact, the principle is very clear. Let’s take a look at the process, as shown in the figure below.

Figure: hinted handoff, where writes destined for a failed node are buffered in the hinted handoff queue

The client first writes to the coordinating node, which then distributes the data to two nodes, a process similar to the write consistency discussed earlier. Under normal circumstances, this guarantees that the data on the two nodes is consistent. If one of the nodes fails, another node takes over receiving the data destined for the failed node. This structure is generally implemented as a queue, namely the hinted handoff queue (HHQ).

Once the failed node recovers, the HHQ synchronizes the data from the period when the node was offline back to that node, thus repairing the data lost while it was down. This is the anti-entropy operation performed on the write path.
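Here is a minimal sketch of a hinted handoff queue, again based on the hypothetical `Replica` interface from the read-repair sketch. The class and function names are invented for illustration, and the small capacity in the usage example makes the limitation discussed next easy to see.

```python
from collections import deque

class HintedHandoffQueue:
    """Buffers writes destined for failed nodes, with bounded capacity."""
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.hints = {}                 # target node name -> deque of hints

    def add_hint(self, target, key, ts, value):
        q = self.hints.setdefault(target, deque(maxlen=self.capacity))
        q.append((key, ts, value))      # oldest hints are dropped when full

    def replay(self, target, node):
        """On recovery, push every buffered write back to the target node."""
        for key, ts, value in self.hints.pop(target, deque()):
            node.write(key, ts, value)


def coordinator_write(live_nodes, down_node_names, hhq, key, ts, value):
    """Write to healthy replicas; store hints for the ones that are down."""
    for n in live_nodes:
        n.write(key, ts, value)
    for name in down_node_names:
        hhq.add_hint(name, key, ts, value)


hhq = HintedHandoffQueue(capacity=2)
coordinator_write([n1], ["n2"], hhq, "k2", 3, "v")  # n2 is offline right now
hhq.replay("n2", n2)                                # n2 recovers, data repaired
```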

The foreground synchronization operations described above share a limitation: they assume that the probability and scope of this entropy increase are small. If entropy increase occurs on a large scale, read repair will increase read latency, and even asynchronous repair will produce a high rate of conflicts. The problem with the hinted handoff queue is its limited capacity, which means that for a node that has been offline for a long time, the HHQ may not be able to hold all of its missed writes.

So is there any way to handle such large-scale and long-term inconsistencies? Now I’m going to introduce some solutions to this problem using the background asynchronous approach.

Background Asynchronous Synchronization #

The synchronous solutions introduced earlier mainly address recently accessed data, while the background asynchronous solutions below mainly target data that was written long ago, i.e., inactive data. With this approach, we can also perform full-data consistency repair.

The focus of the background approach differs from that of the foreground approach. The foreground approach focuses on repairing data, while the background approach, because it must compare and process a large amount of inactive data, focuses on how to perform the comparison with as few resources as possible. I am going to introduce two comparison techniques: the Merkle tree and bitmap version vectors.

Merkle Tree #

If you want to check the difference between two sets of data, the most intuitive way is a full comparison. But this approach is very inefficient and impractical in real production. With Merkle trees, we can quickly locate the differences between two sets of data. The diagram below shows a typical Merkle tree.

Figure: a typical Merkle tree built from four data segments

The process of constructing the tree is:

  1. Divide the data into multiple consecutive segments. Then calculate the hash value for each segment, obtaining the four values hash1 to hash4.
  2. Then, group these four values in pairs, calculate hash5 using hash1 and hash2, and calculate hash6 using hash3 and hash4.
  3. Finally, use hash5 and hash6 to calculate the top hash.

You will find that the way differences are identified is similar to binary search: first compare the top hashes of the two data sets, and if they are not the same, proceed to the next level of comparison. This ultimately narrows down the range of differing data and reduces the number of comparisons. Because even a partial difference between the two data sets changes the top hash, we can quickly determine whether the two sets are consistent at all.
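The following rough Python sketch shows both steps: building the tree bottom-up and descending from the top hash, binary-search style, to locate the differing segments. The helper names are made up for illustration, MD5 is used only to match the text, and the two trees are assumed to cover the same number of segments.

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

def build_merkle(segments):
    """Build the tree bottom-up; returns a list of levels, leaves first."""
    level = [h(s) for s in segments]
    levels = [level]
    while len(level) > 1:
        pairs = [level[i:i + 2] for i in range(0, len(level), 2)]
        level = [h("".join(p).encode()) for p in pairs]  # hash each pair
        levels.append(level)
    return levels                           # levels[-1][0] is the top hash

def find_diff(a, b, level=None, index=0):
    """Descend from the top hash toward the differing leaf segments."""
    if level is None:
        level = len(a) - 1
    if a[level][index] == b[level][index]:
        return []                           # this whole subtree matches
    if level == 0:
        return [index]                      # a differing data segment
    diffs = []
    for child in (2 * index, 2 * index + 1):
        if child < len(a[level - 1]):
            diffs += find_diff(a, b, level - 1, child)
    return diffs


t1 = build_merkle([b"seg1", b"seg2", b"seg3", b"seg4"])
t2 = build_merkle([b"seg1", b"segX", b"seg3", b"seg4"])
print(find_diff(t1, t2))                    # [1]: only segment 2 needs re-sync
```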

Merkle trees combine the characteristics of checksum verification and binary trees, enabling us to quickly determine whether two sets of data have differences. However, if we are willing to sacrifice some accuracy in order to control the range of data to be compared, the bitmap version vector introduced below is an ideal choice.

Bitmap Version Vector #

Recent research has found that most data differences occur within a short period of time. Therefore, we can optimize for this scenario and avoid calculating the full set of data like Merkle trees do. The bitmap version vector algorithm was developed based on this idea.

This algorithm utilizes bitmaps, a memory-friendly and high-density data format, to record the synchronization status of recent data for each node. By comparing the bitmap data between nodes, differences can be discovered and the data can be repaired. Let me show you how this algorithm works with an example in the following diagram.

Figure: node 2's bitmap version vector view of data synchronization on nodes 1 to 3

Suppose there are three nodes, and each node maintains a set of vectors describing its data synchronization status with the other nodes. The diagram shows the synchronization status as seen from node 2. There are currently 8 data entries in the system, and from node 2's perspective, none of the nodes has the complete data. The dark gray area indicates continuously synchronized data, which is represented by a single compressed value; the compressed values for nodes 1 to 3 are 3, 5, and 2 respectively. As we can see, node 2's own data is contiguous.

Once synchronization becomes discontinuous, that is, gaps appear, we switch to a bitmap to record the synchronization status; these are the light gray and white areas in the diagram. For example, from node 2's perspective, node 1 has three consecutive synchronized entries, and the remaining entries are recorded by the bitmap 00101 (light gray represents 1, white represents 0), where 1 indicates the entry has been synchronized and 0 indicates it has not. Node 2 can obtain the complete set of 8 entries by combining the data from nodes 1 and 3.
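The sketch below encodes this idea: for every peer, the local node keeps a compressed base (the count of leading, continuously synchronized entries) plus a bitmap for the entries after the first gap. The class and method names are invented for illustration and do not come from any specific database.

```python
class BitmapVersionVector:
    """Per-peer (base, bitmap) slots, as seen from the local node.

    `base` counts the leading run of continuously synchronized entries;
    bit i of `bitmap` says whether entry (base + i + 1) has been synced."""

    def __init__(self, peers):
        self.slots = {p: [0, 0] for p in peers}     # peer -> [base, bitmap]

    def mark_synced(self, peer, seq):
        base, bitmap = self.slots[peer]
        if seq <= base:
            return                                   # already covered by base
        bitmap |= 1 << (seq - base - 1)
        while bitmap & 1:                            # fold leading 1-bits back
            base += 1                                # into the compressed base
            bitmap >>= 1
        self.slots[peer] = [base, bitmap]

    def missing(self, peer, total):
        """Entries 1..total this peer still lacks, from the local view."""
        base, bitmap = self.slots[peer]
        return [s for s in range(base + 1, total + 1)
                if not (bitmap >> (s - base - 1)) & 1]


view = BitmapVersionVector(["node1", "node3"])
for seq in (1, 2, 3, 6, 8):                  # node1 has entries 1-3, 6 and 8
    view.mark_synced("node1", seq)
print(view.slots["node1"])                   # [3, 20]: base 3 plus bitmap 10100
print(view.missing("node1", 8))              # [4, 5, 7]: entries to repair
```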

This vector list not only has advantages in terms of memory utilization, it also lets us easily identify the targets that need data repair. However, it has an obvious drawback similar to the hinted handoff queue (HHQ): limited storage. If there is significant data skew, the vector will eventually overflow and can no longer be used to compare data differences. But not to worry, we can fall back to the aforementioned Merkle tree for a full comparison.

Above, I have introduced some common anti-entropy measures, all of which can solve the problem of data consistency. However, compared with traditional leader-based data synchronization, it is difficult to predict how quickly these measures converge, and some nodes may go unsynchronized for extended periods of time. So, is there a method that can improve the efficiency of data synchronization? The answer is yes, and it is the Gossip protocol.

Gossip Protocol #

The Gossip protocol can be considered as a widely adopted distributed protocol. Its name is quite expressive and can be humorously described as “spreading gossip.” You can imagine a village in Northeast China, where everyone gathers under a tree and shares news about Zhang’s family or Li’s family. Within a short while, everyone in the village knows about the news.

The Gossip protocol is similar to this scenario. Nodes actively exchange information with each other, ultimately achieving the rapid dissemination of messages. This protocol is based on the model of virus transmission. In the year 2020, we gained a deep understanding of virus transmission due to the COVID-19 pandemic. Therefore, I will now explain the message propagation pattern of the Gossip protocol using the model of virus transmission.

At the beginning, a node in the cluster generates a message, and its status is “infected.” We consider other nodes as “susceptible nodes,” similar to the susceptible population in the case of COVID-19. Once the message propagates from an infected node to a susceptible node, the susceptible node changes its status to “infected” and then continues to propagate the message. Here, a random function is used to select the target for spreading, which allows the “virus” to be well expanded to the entire cluster. Of course, if an infected node is unwilling to infect other nodes, it will be isolated, and the message on it will be removed after a period of time.
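The following sketch simulates this infected/susceptible propagation with a random fan-out of two peers per round. The function names, the fan-out value, and the fixed round limit are illustrative assumptions rather than a standard Gossip implementation.

```python
import random

def gossip_round(nodes, state, fanout=2):
    """One round: every infected node pushes the message to `fanout` random peers."""
    for n in [x for x in nodes if state[x] == "infected"]:
        peers = [p for p in nodes if p != n]
        for peer in random.sample(peers, min(fanout, len(peers))):
            if state[peer] == "susceptible":
                state[peer] = "infected"     # the peer starts spreading too
            # else: a duplicate delivery; counting these is what later drives
            # "interest attenuation" and eventually stops the propagation

def spread(nodes, origin, max_rounds=50):
    """Propagate one message from `origin`; return the rounds it took."""
    state = {n: "susceptible" for n in nodes}
    state[origin] = "infected"
    rounds = 0
    while any(s == "susceptible" for s in state.values()) and rounds < max_rounds:
        gossip_round(nodes, state)
        rounds += 1
    return rounds


cluster = [f"node{i}" for i in range(16)]
print(spread(cluster, "node0"))              # typically only a handful of rounds
```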

We can see that the Gossip pattern is very well suited to data synchronization in leaderless clusters: no matter how many nodes the cluster contains, messages can be propagated robustly within it. Of course, messages will be repeatedly delivered to the same nodes, so when implementing the algorithm we need to minimize this duplicate data.

Another important factor affecting the success or failure of the algorithm is how quickly messages propagate within the cluster. The faster the propagation, the shorter the window of inconsistency and the less likely messages are to be lost. Now I will describe the behavior of the algorithm through several characteristics.

  1. Number of exchanges. This represents how many neighboring nodes a node selects to propagate data to. It is easy to understand that as this value increases, data propagates faster. However, increasing it also increases the proportion of duplicate data, which raises cluster load and lowers throughput. Therefore, we need to monitor duplicate data to adjust the number of exchanges in real time.
  2. Propagation delay. This delay is different from the replication delay mentioned earlier: it describes the time it takes for a message to propagate to all nodes in the cluster. It depends on the number of exchanges and the cluster size. In a large-scale cluster, we should appropriately increase the number of exchanges to reduce the propagation delay.
  3. Propagation stop threshold. When a node consistently receives duplicate data, we should consider weakening or even stopping the propagation of this data in the cluster. This process is vividly referred to as “interest attenuation”. We generally need to calculate the number of duplicates for each node and determine whether the data needs to stop propagating based on a threshold.

These are some characteristics of the Gossip propagation pattern. However, in practical production, we cannot construct a propagation network purely at random, as it would cause information overload in the network. We generally adopt some network optimization methods.

Network Optimization #

As mentioned earlier, one key to the success of the Gossip protocol is controlling the number of duplicate messages. At the same time, a certain amount of duplication helps ensure that messages are actually delivered, making the cluster more robust.

A balanced solution is to construct a temporarily stable topology. Nodes detect relatively stable peers in the network and build subnets from them; the subnets are then connected to one another to form a one-way, acyclic, tree-like propagation structure. This achieves a propagation structure similar to a network with a master node, which effectively controls duplicate messages while ensuring that every node in the cluster reliably receives the data.

However, this structure has an obvious weakness: the nodes connecting the subnets become potential bottlenecks. Once such a node fails, its subnet becomes an information island, losing the robustness that the Gossip algorithm provides.

Is there a way to solve this island problem? We can use a hybrid mode, that is, use the tree structure and the traditional random Gossip propagation structure at the same time. When the system runs stably, the tree structure is used to accelerate information propagation and reduce duplicate data; once a failure is detected, the system degrades to Gossip mode and performs large-scale information synchronization to repair the problem.

Summary #

Eventual consistency allows temporary inconsistencies in state between nodes, and anti-entropy mechanisms help repair these inconsistencies.

We can use foreground read repair and hinted handoff to quickly fix recently introduced inconsistencies, or we can use background methods like Merkle trees and bitmap version vectors to fix global consistency issues. If you need to synchronize data on a large scale and in a stable manner, the Gossip protocol is an excellent choice.

With this, we can say that all the topics related to replication and consistency in distributed systems have been covered. In the next lecture, we will enter the core territory of distributed data: distributed transactions. I hope to see you there. Thank you.