06 ZooKeeper and Curator: Please Stop Using ZkClient (Part 1) #

In our previous discussion on simplifying the architecture of Dubbo, we mentioned that a Dubbo provider registers its service information as a URL with the registry center when it starts up. On the other hand, a Dubbo consumer subscribes to the provider information it is interested in from the registry center when it starts up. Only after this process can the provider and consumer establish a connection and proceed with further interactions. Thus, a stable and efficient registry center is crucial for microservices based on Dubbo.

Currently, Dubbo supports multiple open-source components, including Consul, etcd, Nacos, ZooKeeper, and Redis, as registry centers. Dubbo’s source code also includes corresponding integration modules for these components, as shown in the diagram below:

[Figure: Dubbo registry integration modules]

ZooKeeper is the registry center recommended by the Dubbo team and the implementation most commonly used in production, which is why this lesson introduces the core principles of ZooKeeper.

To interact with a ZooKeeper cluster, we can use the native ZooKeeper client or third-party open-source clients such as ZkClient and Apache Curator. As we will see later when discussing the specific implementation of the dubbo-registry-zookeeper module, Apache Curator is the underlying client used by Dubbo. Apache Curator is the most widely used ZooKeeper client in practice.
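
To give a concrete feel for the client side before we dive into ZooKeeper itself, below is a minimal Curator connection sketch (Curator is covered in detail in the next lesson). The address 127.0.0.1:2181 and the retry settings are illustrative assumptions, not values taken from Dubbo; the sketch only assumes the curator-framework dependency is on the classpath.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CuratorQuickStart {
    public static void main(String[] args) throws Exception {
        // Retry policy: start with a 1s back-off and retry at most 3 times.
        ExponentialBackoffRetry retryPolicy = new ExponentialBackoffRetry(1000, 3);

        // Connect to a (hypothetical) local standalone ZooKeeper instance.
        try (CuratorFramework client =
                 CuratorFrameworkFactory.newClient("127.0.0.1:2181", retryPolicy)) {
            client.start();               // establish the session asynchronously
            client.blockUntilConnected(); // wait until the session is actually up
            System.out.println("Connected, client state: " + client.getState());
        } // close() releases the session and the underlying connection
    }
}
```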

Core Concepts of ZooKeeper #

Apache ZooKeeper is a reliable and scalable coordination service for distributed systems. It is commonly used for purposes such as unified naming, configuration management, distributed cluster management, distributed locking, leader election, and more. Many distributed systems rely on ZooKeeper clusters for coordination and scheduling, including Dubbo, HDFS 2.x, HBase, Kafka, and so on. ZooKeeper has become a standard component of modern distributed systems.

ZooKeeper itself is a distributed application. The diagram below illustrates the core architecture of a ZooKeeper cluster.

[Figure: Core architecture of a ZooKeeper cluster]

  • Client node: From the perspective of the business, a client node is a node in a distributed application. It maintains a persistent connection with a server instance in the ZooKeeper cluster through ZkClient or other ZooKeeper clients and sends heartbeats periodically. From the perspective of the ZooKeeper cluster, it is a client that can actively query or operate on the data in the ZooKeeper cluster. It can also add listeners to certain ZooKeeper nodes (ZNodes). When the data of a monitored ZNode changes, such as when the ZNode is deleted, new child nodes are added, or the data in the ZNode is modified, the ZooKeeper cluster will immediately notify the client through the persistent connection.
  • Leader node: The leader node is the master node of the ZooKeeper cluster. It is responsible for handling write operations in the entire cluster and ensuring the ordering of transaction processing. Additionally, it is responsible for synchronizing data between all follower nodes and observer nodes in the cluster.
  • Follower node: Follower nodes are slave nodes in the ZooKeeper cluster. They can receive read requests from clients and return results to clients. However, they do not handle write requests; instead, they forward write operations to the leader node. Additionally, follower nodes participate in leader election.
  • Observer node: Observer nodes are special types of follower nodes in the ZooKeeper cluster. They do not participate in leader election but have the same functionality as follower nodes. The purpose of introducing observer nodes is to increase the read throughput of the ZooKeeper cluster. If the read throughput of ZooKeeper is increased solely by adding more follower nodes, there is a significant side effect: the write capability of the ZooKeeper cluster will be greatly reduced. This is because ZooKeeper needs the leader node to synchronize write operations with more than half of the follower nodes. By introducing observer nodes, the read throughput of the ZooKeeper cluster is significantly improved without reducing its write capability.

Now that we have an overview of ZooKeeper’s architecture, let’s look at the logical structure of the data stored in a ZooKeeper cluster. ZooKeeper stores data in a tree-like hierarchical structure (shown in the diagram below). Each node in this tree is called a ZNode, and each ZNode is identified by its path from the root of the tree, with path segments separated by “/”. Similar to a directory tree in a file system, each node in the ZooKeeper tree can have child nodes.

[Figure: ZooKeeper tree-like storage structure]

There are four types of ZNodes (the sketch after this list shows how each is created with Curator):

  • Persistent node. Once created, a persistent node will exist until it is explicitly deleted, even if the client session that created the node expires.
  • Persistent sequential node. Like a persistent node, a persistent sequential node will also exist until it is explicitly deleted. The ZooKeeper server automatically appends a monotonically increasing numeric suffix to the name of the node during creation.
  • Ephemeral node. An ephemeral node will be automatically deleted by the ZooKeeper cluster when the ZooKeeper client session that created the node expires. One key difference from a persistent node is that no child nodes can be created under an ephemeral node.
  • Ephemeral sequential node. Similar to an ephemeral node, an ephemeral sequential node will be automatically deleted by the ZooKeeper cluster when the ZooKeeper client session that created the node expires. The ZooKeeper server also appends a monotonically increasing numeric suffix to the name of the node during creation.
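
To make the four node types concrete, here is a hedged Curator sketch that creates one ZNode of each type. The paths and payload are made-up examples, and `client` is assumed to be a started CuratorFramework instance like the one in the earlier sketch.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.zookeeper.CreateMode;

public class ZNodeTypesDemo {
    static void createExamples(CuratorFramework client) throws Exception {
        byte[] data = "demo".getBytes();

        // Persistent node: survives until explicitly deleted.
        client.create().creatingParentsIfNeeded()
              .withMode(CreateMode.PERSISTENT).forPath("/demo/persistent", data);

        // Persistent sequential node: the server appends a monotonically
        // increasing suffix, e.g. /demo/pseq0000000001.
        client.create().withMode(CreateMode.PERSISTENT_SEQUENTIAL).forPath("/demo/pseq", data);

        // Ephemeral node: removed automatically when the creating session expires.
        client.create().withMode(CreateMode.EPHEMERAL).forPath("/demo/ephemeral", data);

        // Ephemeral sequential node: ephemeral + numeric suffix.
        client.create().withMode(CreateMode.EPHEMERAL_SEQUENTIAL).forPath("/demo/eseq", data);
    }
}
```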

Each ZNode maintains a stat structure, which contains metadata of the ZNode, including version numbers, the access control list (ACL), timestamps, and data length, as shown in the following table:

[Table: fields of the ZNode stat structure]
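
As a hedged sketch of how this metadata is read in practice: Curator’s `checkExists()` call returns the Stat of a ZNode (or null if the node does not exist). The path is illustrative and `client` is again assumed to be a started CuratorFramework.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.zookeeper.data.Stat;

public class StatDemo {
    static void printStat(CuratorFramework client, String path) throws Exception {
        Stat stat = client.checkExists().forPath(path); // null if the ZNode does not exist
        if (stat != null) {
            System.out.println("data version : " + stat.getVersion());
            System.out.println("data length  : " + stat.getDataLength());
            System.out.println("children     : " + stat.getNumChildren());
            System.out.println("created zxid : " + stat.getCzxid());
        }
    }
}
```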

In addition to the basic create, delete, update, and query operations on ZNodes, the ZooKeeper client can register a watcher to monitor changes to a ZNode itself, its data, and its child nodes. Once a change occurs, the corresponding watcher is triggered and the client is notified immediately (a registration sketch follows the list below). Watchers have the following characteristics:

  • Active push. When a watcher is triggered, the ZooKeeper cluster actively pushes the updates to the client without the need for the client to poll.
  • One-time trigger. When data changes, a watcher is triggered only once. If the client wants to be notified of subsequent updates, it needs to re-register a new watcher after the previous watcher is triggered.
  • Visibility. If a client includes a watcher in a read request, the client cannot see the updated data until it receives the watcher message. In other words, the update notification precedes the updated data.
  • Order guarantee. If multiple updates trigger multiple watchers, the order in which the watchers are triggered is consistent with the order of the updates.
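
The one-time-trigger behavior is easy to see in code. The hedged sketch below registers a data watcher through Curator’s `usingWatcher` and re-registers it inside the callback so that later changes are still observed; the path and the `client` instance are illustrative assumptions carried over from the earlier sketches.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.api.CuratorWatcher;
import org.apache.zookeeper.WatchedEvent;

public class WatcherDemo {
    static void watchData(CuratorFramework client, String path) throws Exception {
        CuratorWatcher watcher = new CuratorWatcher() {
            @Override
            public void process(WatchedEvent event) throws Exception {
                // Pushed by the ZooKeeper cluster; no client-side polling required.
                System.out.println("Event " + event.getType() + " on " + event.getPath());
                // One-time trigger: re-register to keep watching future changes.
                client.getData().usingWatcher(this).forPath(path);
            }
        };
        // Initial registration: watch the data of the ZNode at `path`.
        client.getData().usingWatcher(watcher).forPath(path);
    }
}
```

In practice, Curator’s higher-level cache recipes take care of this re-registration automatically, which is one reason it is preferred over hand-rolled watcher code.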

Overview of Message Broadcast Process #

All three types of nodes in a ZooKeeper cluster (Leader, Follower, and Observer) can handle read requests from clients because each node contains the same copy of the data and can return it directly to the client.

For write requests, if the client is connected to a Follower node (or an Observer node), that node forwards the write request to the Leader node. The core process by which the Leader handles a write request is as follows (a small sketch of the majority rule follows the list):

  1. After receiving a write request, the Leader node assigns it a globally unique zxid (a 64-bit, monotonically increasing ID). Comparing zxid values gives all write operations a consistent global order.
  2. The Leader distributes the message with zxid as a proposal to all Follower nodes through a first-in-first-out queue (creating a queue for each Follower node to ensure sequential delivery).
  3. When a Follower node receives a proposal, it first writes the proposal to the local transaction log. After the write transaction succeeds, it sends an ACK response to the Leader node.
  4. After receiving ACK responses from a majority of the Followers, the Leader node sends COMMIT commands to all Follower nodes and executes the commit locally.
  5. When a Follower receives the COMMIT command, it also performs the commit, and the write operation is completed at this point.
  6. Finally, the Follower node returns the corresponding response to the Client’s write request.
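
Steps 4 and 5 hinge on one rule: a proposal is committed once a majority of the voting members (Leader plus Followers, Observers excluded) have ACKed it. The sketch below is not ZooKeeper’s actual implementation, just a simplified illustration of that counting rule for a five-node voting set with hypothetical values.

```java
import java.util.HashSet;
import java.util.Set;

public class MajorityAckDemo {
    public static void main(String[] args) {
        int votingMembers = 5;              // Leader + 4 Followers (Observers do not count)
        int quorum = votingMembers / 2 + 1; // majority = 3

        long zxid = 0x0000000200000007L;    // hypothetical proposal id
        Set<Integer> acks = new HashSet<>();
        acks.add(1);                        // the Leader implicitly acks its own proposal

        // Followers 2 and 3 ack after writing the proposal to their transaction logs.
        acks.add(2);
        acks.add(3);

        if (acks.size() >= quorum) {
            // At this point the Leader would send COMMIT to all Followers
            // and apply the write locally.
            System.out.printf("zxid=0x%x committed with %d/%d acks%n",
                              zxid, acks.size(), votingMembers);
        }
    }
}
```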

The following figure shows the core process of write operations:

[Figure: Core process of a write operation]

Crash Recovery #

In the write request processing flow above, if the Leader node crashes, the entire ZooKeeper cluster may be in one of the following two states:

  1. After receiving ACK responses from a majority of the Follower nodes, the Leader node broadcasts COMMIT commands to all Follower nodes and executes COMMIT locally, and at the same time responds to the connected clients. If the Leader crashes before the remaining servers receive the COMMIT command, the remaining servers will not be able to execute this message.
  2. If the Leader node crashes after generating the proposal, and other Followers have not received the proposal (or only a small number of Follower nodes have received it), the write operation will fail.

After the Leader crashes, ZooKeeper enters the crash recovery mode and re-elects a new Leader node.

ZooKeeper has the following two requirements for the new Leader:

  1. The new Leader must be able to broadcast and commit proposals that have been committed by the original Leader. This requires selecting a node with the largest zxid value as the Leader.
  2. The new Leader must be able to notify the original Leader and the already synchronized Followers to delete proposals that have not been broadcast or only partially broadcast, in order to ensure the consistency of the cluster data.

ZooKeeper uses the ZAB (ZooKeeper Atomic Broadcast) protocol for leader election. A detailed treatment of ZAB is beyond the scope of this lesson, so here we will only sketch the rough flow of a leader election with an example.

For example, there are 5 ZooKeeper nodes in the current cluster, with sids of 1, 2, 3, 4, and 5, and zxids of 10, 10, 9, 9, and 8 respectively. At this time, the node with sid 1 is the Leader node. In fact, zxid contains two parts: the epoch (high 32 bits) and the incrementing counter (low 32 bits). The epoch represents the current leader cycle and increases every time an election is conducted, preventing unnecessary re-election caused by the old Leader of the previous cycle reconnecting to the cluster after network isolation. In this example, we assume that the epochs of all nodes are the same.
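
Because a zxid is just a 64-bit value split into these two halves, the layout can be shown directly in code. The following is only an illustration of the bit arithmetic with made-up values, not code taken from ZooKeeper.

```java
public class ZxidDemo {
    public static void main(String[] args) {
        long epoch = 2L;    // leader cycle (high 32 bits)
        long counter = 10L; // per-epoch transaction counter (low 32 bits)

        long zxid = (epoch << 32) | counter;

        // Decompose the zxid back into its two parts.
        long decodedEpoch   = zxid >>> 32;
        long decodedCounter = zxid & 0xFFFFFFFFL;

        System.out.printf("zxid=0x%016x epoch=%d counter=%d%n",
                          zxid, decodedEpoch, decodedCounter);
    }
}
```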

At a certain moment, the server of node 1 crashes, and the ZooKeeper cluster starts a leader election. Since each remaining node cannot yet see the status of the others (they are all in the LOOKING state), every node first votes for itself. The votes cast by nodes 2, 3, 4, and 5 are therefore (2,10), (3,9), (4,9), and (5,8), where each vote is identified as a (sid, zxid) pair; every node also receives the votes sent by the other nodes.

  • For node 2, it receives votes (3,9), (4,9), and (5,8). After comparison, it finds that its zxid is the largest, so it does not need to change its vote.
  • For node 3, it receives votes (2,10), (4,9), and (5,8). After comparison, it finds that node 2’s zxid is larger than its own, so it needs to change its vote to (2,10) and sends the updated vote to other nodes.
  • For node 4, it is the same, ultimately changing its vote to (2,10).
  • It is the same for node 5, and finally its vote becomes (2,10).

After the second round of voting, each node in the cluster will receive votes from other machines again, and then start to count the votes. If more than half of the nodes vote for the same node, that node becomes the new Leader. Here, node 2 becomes the new Leader.
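
The rule each node applies when comparing votes can be sketched as follows: under the equal-epoch assumption of our example, the vote with the larger zxid wins, and (in real ZooKeeper, where epochs are compared first) a zxid tie is broken by preferring the larger sid. This is a simplified illustration, not ZooKeeper’s actual election code.

```java
public class VoteDemo {
    record Vote(long sid, long zxid) {}

    /** Returns the vote that wins a pairwise comparison (equal epochs assumed). */
    static Vote better(Vote a, Vote b) {
        if (a.zxid() != b.zxid()) {
            return a.zxid() > b.zxid() ? a : b; // larger zxid = more up-to-date data
        }
        return a.sid() > b.sid() ? a : b;       // tie-break on the larger server id
    }

    public static void main(String[] args) {
        Vote current = new Vote(3, 9);   // node 3's initial vote for itself
        Vote received = new Vote(2, 10); // vote received from node 2
        // Node 3 switches its vote to (2,10) because 10 > 9.
        System.out.println("node 3 now votes for sid=" + better(current, received).sid());
    }
}
```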

At this point, the new Leader increments the epoch by 1 and distributes the new epoch to every Follower node. After receiving the new epoch, each Follower returns an ACK to the Leader together with its own maximum zxid and its transaction-log history. The Leader selects the largest zxid among them, brings its own transaction log up to date, and then synchronizes the latest transaction log to all Follower nodes in the cluster. Only after a majority of the Followers have synchronized successfully does this candidate become the official Leader node and start working.

Summary #

In this lesson, we have focused on introducing the core concepts of ZooKeeper and the basic working principle of ZooKeeper clusters:

  • Firstly, we introduced the roles and functions of each node in the ZooKeeper cluster.
  • Then we explained the logical structure of storing data in ZooKeeper and the related characteristics of ZNode.
  • Next, we explained the core process of reading and writing data in ZooKeeper clusters.
  • Finally, we analyzed the crash recovery process of ZooKeeper clusters through examples.

In the next lesson, we will introduce the related content of Apache Curator.