
15 Leader Election - How to Safely Coordinate Operations in a Distributed System #

In this lecture, let’s talk about how to synchronize data in a distributed database, and in general, in distributed systems.

You may have noticed that data synchronization is a very costly operation: if all participating nodes must communicate with one another during synchronization, the communication overhead becomes tremendous.

As shown in the following figure, the communication cost rises sharply as the number of participating nodes grows, which can eventually lead to data inconsistency within the cluster. This effect is further amplified in very large, geographically dispersed cluster networks.

[Figure: Image 8.png]

In order to reduce synchronization communication overhead and the number of participating nodes, some algorithms introduce a “leader” (sometimes called a coordinator) to coordinate data synchronization in a distributed system.

Leadership Election #

Usually, all nodes in a distributed system are equal, and any node can assume the role of a leader. Once a node becomes a leader, it generally holds this role for a considerable period of time, but it is not a permanent role. The reason is quite easy to understand: nodes may crash and fail to fulfill their leadership responsibilities.

In real life, if your leader is unable to perform his duties due to personal issues, the organization will select a new person to replace him. Similarly, after the leader node crashes, any other node can start a new round of elections. If elected, it takes on the leadership responsibility and continues to work from where the previous leader node left off.

The leader node plays the role of coordinating the entire cluster, and its general responsibilities include:

  • Controlling the total order of broadcast messages;
  • Collecting and storing global state;
  • Receiving messages and propagating and synchronizing them among nodes;
  • Performing system resets, generally after failures, during initialization, or when important system states are updated.

The cluster does not often go through the leader election process, but it usually triggers elections in the following two scenarios:

  • Triggering the first leader election during initialization;
  • The current leader crashes or cannot communicate.

Key Properties in the Election Algorithm #

When a cluster enters the election process, its nodes run an election algorithm to choose a leader. Such algorithms are generally judged by two properties: “safety” and “liveness”. These are two very important, fundamental properties originally proposed by Leslie Lamport, a pioneer of distributed computing.

Before explaining the meanings of these two properties, let’s imagine how you elect a leader in your work life. Usually, the leader comes from a group of candidates, and the election rules should include the following two points.

  1. The election must produce a leader. If there are two leaders, whom should the subordinates obey? The leader election is originally intended to solve the coordination problem, and having multiple leaders not only fails to solve this problem but also brings about bigger problems.
  2. The election must have a result. The ideal state is that the leader election should have a result within an acceptable time for everyone. If the leader is not elected for a long time, it will inevitably cause the organization to be unable to carry out normal work. Without anyone to coordinate and arrange the work, the organization will become chaotic and disorderly.

The above two rules correspond to the two properties of the algorithm.

The first rule corresponds to the “safety” property of the algorithm, which ensures that there is at most one leader at a time and rules out “split brain” (the cluster splitting into two or more parts, each producing a leader that is unaware of the others). In practice, however, many leader election algorithms violate this property. We will explain how to address this problem in detail when we introduce “split brain”.

The second rule represents the “liveness” property of the election algorithm, which ensures that most of the time, there will be a leader in the cluster, and the election will eventually be completed to produce this leader, so the system should not be indefinitely stuck in the election state.

An algorithm that satisfies the above two properties is considered an effective leader election algorithm.

Leadership elections and distributed locks #

Leadership elections and distributed locks have a high degree of overlap at the algorithm level. The former selects a node as the leader, while the latter serves as the lock holder. Therefore, many developers often confuse the two. Now, let’s compare the differences between leadership elections and distributed locks.

Distributed locks ensure that in a concurrent environment, some mutually exclusive resources, such as transactions and shared variables, can only be operated by one node at a time. They also need to satisfy the safety and liveness properties mentioned earlier, which means that an exclusive lock can only be allocated to one node at a time, and that node will not hold the lock indefinitely.

Although the two have many theoretical similarities, there are also significant differences. If a node holds an exclusive lock, other nodes do not need to know who currently holds it; it suffices to guarantee that the lock will eventually be released so that others can acquire it, which is the “liveness” property mentioned earlier.

In contrast, the election process is completely different. Nodes in the cluster must know who the leader node is in the system, because other nodes in the cluster need to perceive the activity of the leader node to determine whether they need to enter the election process. Therefore, the newly elected leader must inform its subordinates of its role.

Another difference is that if a distributed lock algorithm exhibits bias towards certain nodes or groups of nodes, i.e., unfair locks, it will ultimately cause some non-priority nodes to never obtain shared resources, which contradicts “liveness”. Conversely, we generally hope that the leader node will remain in the leadership role for as long as possible until it stops or crashes, because the “senior” leader is more popular among others.

Solving the single point of failure problem #

In a distributed system, having a stable leader can help reduce the cost of state synchronization for remote nodes and reduce the number of message exchanges. At the same time, some operations can be performed within a single leader node, avoiding synchronization operations within the cluster. In systems that adopt a leadership mechanism, a potential problem is that the leader, being a single node, may become a performance bottleneck.

To overcome this, many systems partition their data into disjoint, independent subsets, each with its own leader, rather than having a single global leader. An example of a system that uses this approach is Spanner (which will be introduced in Lecture 17, “Distributed Transactions”). Because each leader node can also fail, detection and reporting of such failures is necessary. When this happens, the system must elect another leader to replace the failed leader.

The above provides an overview of the use cases and algorithm characteristics of leadership elections. So how does leadership election work?

Typical algorithms include the Bully algorithm, ZAB (Zookeeper Atomic Broadcast), Multi-Paxos, and Raft. Except for the Bully algorithm, each of these uses its own methods to address leader election, fault detection, and conflict resolution between competing leaders at the same time, so their scope extends well beyond leader election alone. Due to space limitations, I will cover them in detail in the next lecture.

Based on the above reasons, I will now use the Bully Algorithm and its improved versions as an example to illustrate a typical leadership election process. The Bully Algorithm is simple and easy to converge, making it well-suited for ensuring “liveness”. At the same time, it also meets the requirements of “safety” in the absence of network partitions.

Classic Leadership Election Algorithm: Bully Algorithm #

This is the most commonly used leader election algorithm. It ranks nodes by ID: among all active nodes, the node with the largest (or, in some variants, the smallest) ID is selected as the leader.

The logic used here is that “the larger the ID, the higher the priority”:

Each node is assigned a unique ID. During an election, the node with the largest ID becomes the leader. Because this node “forces” the other nodes to accept it as leader, the scheme is also known as a monarchy-style leader election: much like royal succession, the highest-ranking member inherits the throne. A node starts an election when it realizes the system has no leader, or when the previous leader has stopped responding to requests.

The algorithm consists of 4 steps:

  1. Each active node in the cluster searches for nodes with a larger ID than itself. If none exists, it sends a Victory message to other nodes, indicating that it is the leader.
  2. If there are nodes with larger IDs, it sends Election messages to these nodes and waits for their response.
  3. If no response is received from these nodes within a given time, it becomes the leader and sends a Victory message to nodes with smaller IDs.
  4. If a node receives an Election message from a node with a smaller ID, it replies with an Alive message.
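The four steps above can be condensed into a minimal in-memory sketch. This is illustrative only: the `Cluster` class and its methods are made-up names, and a real implementation would exchange Election/Alive/Victory messages over the network, treating nodes that fail to respond in time as crashed.

```python
# Minimal single-process sketch of the Bully election (illustrative names,
# not a real networked implementation). Crashed nodes simply don't respond.

class Cluster:
    def __init__(self, node_ids):
        self.alive = set(node_ids)   # IDs of nodes still responding

    def crash(self, node_id):
        self.alive.discard(node_id)

    def elect(self, starter_id):
        """Run a Bully election initiated by `starter_id`; return the leader ID."""
        assert starter_id in self.alive
        # Steps 1-2: contact every live node with a larger ID.
        higher = [n for n in self.alive if n > starter_id]
        if not higher:
            # Steps 1/3: nobody outranks us (or none responded) -> Victory.
            return starter_id
        # Step 4: higher nodes reply Alive; the highest of them takes over
        # and repeats the process until no larger live node exists.
        return self.elect(max(higher))

cluster = Cluster([1, 2, 3, 4, 5, 6])
cluster.crash(6)            # the previous leader fails
leader = cluster.elect(3)   # node 3 notices and starts an election
print(leader)               # -> 5, the largest live ID
```

Note how the election converges on the largest live ID no matter which node starts it, which is why the algorithm satisfies liveness in the absence of partitions.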

The figure below illustrates the Bully leader election algorithm:

  • Node 3 realizes that the previous leader 6 has crashed and starts a new election by sending election messages to nodes with larger IDs.
  • Nodes 4 and 5 respond with Alive because their IDs are larger than 3.
  • Node 3 notifies Node 5, the node with the largest ID that responded in this round.
  • Node 5 is elected as the new leader and broadcasts election information to notify nodes with lower rankings of the election result.

[Figure: 图片7.png]

One obvious problem with this algorithm is that it violates the “safety” property (i.e., at most one leader can be chosen at a time). During a network partition, when nodes are divided into two or more subsets that work independently, each subset elects its own leader.

Another problem with this algorithm is that it strongly favors nodes with larger IDs; if those nodes are unstable, the election itself becomes unstable, and flapping nodes may be re-elected over and over. An unstable high-ranked node may propose itself as leader, fail shortly afterwards, win the next round, fail again, and so on, repeating the election endlessly. This can be mitigated by monitoring node stability and factoring those metrics into the election.

Improvement of the Bully Algorithm #

Although the Bully algorithm is classic, it often does not perform well in practical applications due to its relative simplicity. Therefore, we will see various evolved versions of the Bully algorithm in distributed databases to address some real-world problems. However, it is important to note that its core remains the classic Bully algorithm.

Improvement 1: Fault-Tolerant Node List #

Many variations of the Bully algorithm have been developed to improve its performance in particular scenarios. For example, we can designate backup nodes as ready replacements for a crashed leader, which shortens the re-election time. Each elected leader publishes a list of fault-tolerant candidate nodes. When a node in the cluster detects an anomaly in the leader, it sends a message to the highest-ranked candidate in the list provided by that leader to initiate a new round of election. If one of the candidates responds, it becomes the new leader without going through a complete election process.

If the process itself that detects the leader’s failure is the highest-ranked process in the list, it can immediately notify other nodes that it is the new leader.

[Figure: Image 6.png]

The above figure shows the process using this optimization, where:

  • Node 6 is the leader, with the designated candidate list {5, 4}; it crashes and exits. Node 3 notices the failure and contacts candidate node 5, the highest-ranked node in the list.
  • Node 5 responds to node 3 that it is alive, which stops node 3 from contacting other nodes in the candidate list.
  • Node 5 notifies the other nodes that it is the new leader.

Therefore, if the first node in the candidate list is active, we will need fewer steps during the election process.
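Under those assumptions, the candidate-list shortcut might look like the following sketch. The function `elect_from_candidates` and the `is_alive` probe are hypothetical names standing in for a real failure detector.

```python
# Sketch of the candidate-list optimization (hypothetical helper names).
# The crashed leader had published candidates in priority order; the node
# that detects the failure contacts them one by one.

def elect_from_candidates(candidates, is_alive, self_id):
    """Pick the new leader from the old leader's candidate list.

    `candidates` is ordered highest-priority first; `is_alive` is a
    predicate standing in for a real liveness probe.
    """
    for c in candidates:
        if c == self_id or is_alive(c):
            # Found the highest-ranked live candidate: it announces itself
            # as leader, skipping a full election round.
            return c
    return None  # no candidate reachable: fall back to a full election

alive = {1, 2, 3, 4, 5}          # node 6 (the old leader) has crashed
new_leader = elect_from_candidates([5, 4], alive.__contains__, self_id=3)
print(new_leader)                # -> 5
```

Returning `None` models the fallback case: if no published candidate is reachable, the cluster still needs a complete Bully election.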

Improvement 2: Node Role Partitioning #

Another algorithm attempts to reduce the number of messages by dividing nodes into two subsets: candidates and ordinary nodes, where only one candidate node can eventually become the leader. Ordinary nodes contact the candidate nodes, choose the highest-priority node among them as the leader, and then notify the remaining nodes of the election result.

To solve the problem of concurrent elections, this algorithm introduces a random start-up delay, which causes different nodes to have different start times, ultimately resulting in one node initiating the election before other nodes. This delay time is usually greater than the round-trip time of messages between nodes. Nodes with higher priorities have lower delays, while nodes with lower priorities often have significant delays.
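A hedged sketch of such a priority-dependent delay follows; the `RTT` constant and the exact formula are assumptions for illustration, as real systems tune these values empirically.

```python
import random

# Sketch of the priority-dependent start-up delay (assumed constants).
RTT = 0.05  # estimated round-trip time between nodes, in seconds

def election_delay(priority, max_priority):
    """Higher-priority candidates wake up sooner, so one of them usually
    starts (and wins) the election before lower-priority nodes react."""
    base = RTT * (max_priority - priority + 1)   # rank-based component
    jitter = random.uniform(0, RTT)              # break ties between equals
    return base + jitter

delays = {p: election_delay(p, max_priority=5) for p in (1, 3, 5)}
# The priority-5 candidate always gets the shortest delay window.
```

Because the gap between priority tiers exceeds the jitter range, a higher-priority candidate reliably moves first, which is what prevents concurrent elections.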

[Figure: Image 5.png]

The above figure shows the steps of the election process, where:

  • Node 4 comes from the ordinary set and discovers the crashed leader 6, so it starts a new round of election by contacting all remaining nodes in the candidate set.
  • Candidate nodes respond and notify 4 that they are still alive.
  • 4 notifies all nodes that the new leader is 2.

This algorithm reduces the number of participating nodes in leader election, which speeds up the convergence of the algorithm in large clusters.

Improvement 3: Invitation Algorithm #

The invitation algorithm allows nodes to “invite” other processes to join their groups instead of prioritizing between groups. This algorithm allows multiple leaders to be defined, resulting in each group having its own leader. Each node starts as the leader of a new group, with itself as the only member.

Group leaders contact other nodes that do not belong to their groups and invite them to join. If the invited node itself is a leader, the two groups are merged; otherwise, the invited node replies with the ID of the leader of its group, allowing the two leaders to contact each other directly and merge the groups, greatly reducing the steps required to merge.

[Figure: Image 4.png]

The above figure shows the steps of the invitation algorithm, where:

  • Four nodes form four independent groups, with each node being the leader of its group. Node 1 invites node 2 to join its group, and node 3 invites node 4 to join its group.
  • Node 2 joins the group of node 1, and node 4 joins the group of node 3. Node 1, as the leader of the first group, contacts the leader of the second group, node 3, and the remaining members (in this case, 4) learn about the new leader, node 1.
  • The two groups are merged, and node 1 becomes the leader of the extended group. When a merger occurs, either group’s leader could lead the new group; to minimize the number of messages needed after the merge, the leader with the larger ID is generally chosen, so that only nodes from the smaller-ID group need to update their leader.

Similar to other algorithms discussed, this algorithm uses a “divide and conquer” approach to converge the leader election. The invitation algorithm allows for the creation and merging of node groups without triggering a new election from scratch, thereby reducing the number of messages required to complete the election.
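The larger-ID merge rule described above could be sketched as follows; the `Group` class and `merge` helper are illustrative names, not from any library, and the node IDs are arbitrary.

```python
# Sketch of the invitation algorithm's group merge (illustrative classes).

class Group:
    def __init__(self, leader):
        self.leader = leader
        self.members = {leader}   # every node starts as a one-member group

def merge(a, b):
    """Merge two groups; the leader with the larger ID leads the merged
    group, so only members of the smaller-ID group must be re-notified."""
    winner, loser = (a, b) if a.leader > b.leader else (b, a)
    winner.members |= loser.members     # absorb the smaller group's members
    return winner

g2, g5 = Group(2), Group(5)
g2.members.add(1)   # node 2 invited node 1 into its group
g5.members.add(4)   # node 5 invited node 4
merged = merge(g2, g5)
print(merged.leader, sorted(merged.members))   # -> 5 [1, 2, 4, 5]
```

Each merge halves the number of groups, which is why the approach converges without restarting the election from scratch.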

Improvement 4: Ring Algorithm #

In the ring algorithm, all nodes in the system form a ring, and each node knows the topology of the ring, including its adjacent nodes. When a node detects a leader failure, it starts a new election, and the election message is forwarded throughout the ring by each node contacting its successor (the next node in the ring closest to it). If the node is unavailable, it skips that node and tries to contact the next node in the ring until one of them responds.

As the election message travels around the ring, each node adds itself to a set of active nodes before forwarding the message to its successor; in this way the message accumulates the set of all available nodes.

The algorithm performs a complete traversal of the ring. When the message returns to the node that started the election, the highest-ranked node is selected as the leader from the active set.

[Figure 3]

As shown in the above figure, you can see an example of such a traversal:

  • The previous leader 6 failed, and each node in the ring has saved a copy of the current topology of the ring from its own perspective.
  • Taking node 3 as an example, it initiates an election round by starting the traversal. At each step, the message follows the established route; since node 6 is unreachable, it is skipped and the message passes from 5 directly to 1.
  • Since 5 is the node with the highest rank, node 3 initiates another round of messages to distribute information about the new leader.

An optimization of this algorithm is for each node to announce only the node it currently considers highest-ranked, instead of the whole set of active nodes, in order to save space: because the max function is commutative and associative, carrying a single running maximum is enough. When the message returns to the node that started the election, that running maximum identifies the new leader.
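A sketch of the ring traversal with this running-maximum optimization follows; it is an in-memory simulation, where `ring` lists node IDs in ring order and the `alive` set stands in for real reachability checks.

```python
# Sketch of the ring election with the "carry only the maximum" optimization
# (in-memory simulation; `ring` lists node IDs in ring order).

def ring_elect(ring, alive, starter):
    """Traverse the ring once from `starter`, skipping dead nodes, and
    return the highest-ranked live node as the new leader."""
    n = len(ring)
    start = ring.index(starter)
    best = starter                      # running maximum instead of a full set
    i = (start + 1) % n
    while ring[i] != starter:           # one full traversal back to the starter
        if ring[i] in alive:            # unreachable nodes are skipped
            best = max(best, ring[i])
        i = (i + 1) % n
    return best                         # a second pass would announce `best`

ring = [1, 2, 3, 4, 5, 6]
alive = {1, 2, 3, 4, 5}                 # the old leader 6 has crashed
print(ring_elect(ring, alive, starter=3))   # -> 5
```

The announcement pass (distributing the winner to all nodes) is omitted here; it is simply a second traversal of the same ring.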

In addition, since the ring can split into two or more parts, each part may elect its own leader. This algorithm therefore does not guarantee “safety” either.

As mentioned earlier, in order for the system with a leader to function properly, we need to know the current status of the leader. Therefore, for the overall stability of the system, the leader must be continuously active and able to fulfill its responsibilities. To detect leader failure, fault detection algorithms, as I introduced in my previous lesson, can be used.

Solving Split-Brain Issue in Election Algorithms #

We need to be aware that all the algorithms discussed in this lesson are prone to the split-brain problem, which means that in the end, there may be two leaders in independent subnets, and these two leaders are not aware of each other’s existence.

To avoid the split-brain problem, we generally need to introduce a quorum to elect a leader. For example, in Elasticsearch for electing the cluster leader, the Bully algorithm is used in conjunction with a minimum quorum to solve the split-brain problem.

[Figure 1]

As shown in the above figure, there are currently 2 networks and 5 nodes. Let’s assume the minimum quorum is 3. A is currently the leader of the cluster; A and B are in one network, while C, D, and E are in another, and the two networks are connected.

When this connection fails, A and B can still connect to each other, but lose contact with C, D, and E. Similarly, C, D, and E can also be aware of each other, but cannot connect to A and B.

[Figure 2]

At this point, C, D, and E cannot connect to the original leader A. At the same time, the three of them meet the minimum quorum requirement of 3, so they start a new round of election. Let’s assume C is elected as the new leader, and these three nodes can work normally.

In the other network, although A was the previous leader, the number of nodes in this network is 2, which is less than the minimum quorum. Therefore, A will voluntarily give up its leader role, leading to the nodes in this network being marked as unavailable and refusing to provide services. This effectively avoids the problem caused by split-brain.
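The quorum rule behind this scenario can be sketched in a few lines; the helper names are illustrative, and the majority formula is the standard strict-majority quorum.

```python
# Sketch of the quorum rule that prevents split brain (illustrative only).

def quorum(cluster_size):
    """Minimum number of mutually reachable nodes needed to elect a leader:
    a strict majority, so two partitions can never both reach it."""
    return cluster_size // 2 + 1

def may_elect(reachable_nodes, cluster_size):
    return len(reachable_nodes) >= quorum(cluster_size)

# The 5-node cluster from the figure splits into {A, B} and {C, D, E}.
print(may_elect({"A", "B"}, 5))        # -> False: A steps down
print(may_elect({"C", "D", "E"}, 5))   # -> True: this side elects C
```

Because a strict majority can exist in at most one partition, at most one side of any split can ever elect a leader, which is exactly the safety guarantee the plain Bully algorithm lacks.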

Summary #

Leader election is an important topic in distributed systems because having a fixed leader greatly helps reduce coordination overhead and improve system performance. The election process may be costly, but it doesn’t occur very frequently, so it doesn’t significantly impact the overall system performance. Having a single leader can become a bottleneck, but this problem can be solved by partitioning the data and using a leader for each partition, or by using different leaders for different operations.

Many consensus algorithms, including Multi-Paxos and Raft, typically involve an election process. However, the connotation of consensus algorithms is more comprehensive than simple election algorithms, so I will discuss them specifically in “Lecture 19 | Consensus Algorithms: Clearing up the Differences Between Paxos, Raft, and Other Algorithms in One Go”.

With greater capabilities come greater responsibilities. Although a leader node solves the problem of data synchronization within the system, it bears significant responsibilities and can cause severe impact if any problems occur. Therefore, a stable and efficient leader election algorithm is key to the leader pattern.

The status of the leader may change without the knowledge of the nodes, so the nodes in the cluster need to be promptly informed whether the leader is still active. In order to achieve this, we need to combine leader election with fault detection. For example, the stable leader election algorithm uses a unique stable leader and round-based timeout fault detection to ensure that the leader can be re-elected and retain its leadership position, as long as it does not crash and is accessible.