24 How Does a Distributed System with a Registry Discover Services? #

Hello, I’m Tang Yang.

In the previous lesson, I introduced some key points in implementing an RPC framework. With an RPC framework, you can solve the problem of cross-network communication between services, which forms the foundation for microservice transformation.

However, after service decomposition, you need to maintain a larger number of fine-grained services. The first problem you need to address is how the RPC client learns the addresses where the servers are deployed. That is today’s topic: the problem of service registration and discovery.

Service Discovery #

Service registration and discovery is not a new concept; you have almost certainly encountered it in previous projects, even if you did not pay much attention to it. For example, Nginx, as a reverse proxy, needs to know the addresses of the application servers before it can route traffic to them. That, in essence, is service discovery.

So how does Nginx achieve this? The application server addresses are listed in its configuration file.

This is indeed one way to solve the problem. In fact, I did the same thing in earlier projects. At that time, when the project was first split into services, the RPC server addresses were configured in the client code. However, this approach led to several problems:

First, when there is an urgent need to scale up, modifying every client’s configuration and restarting all client processes takes a long time.

Second, if a server fails, you again have to modify every client’s configuration and restart the clients, which is inefficient and rules out automatic recovery.

Finally, there is no way to proactively drain traffic away from an RPC server that is about to be restarted. As a result, during a restart the client may keep sending requests to a server that is shutting down or has not yet finished starting, leading to slow or failed requests.

Therefore, we consider using a service registry to solve these problems.

There are many service registry components available in the industry, such as the classic ZooKeeper, etcd (used by Kubernetes), Alibaba’s Nacos, and Netflix’s Eureka (integrated into Spring Cloud), among others.

These service registries have two basic functions:

First, they provide storage for service addresses.

Second, when the contents of the storage change, they can push the updated information to the clients.

The second function is the main reason we use a service registry. With one in place, we no longer need to modify client configuration and restart clients when we scale up or when a server fails. With a registry component, the RPC communication process looks like this:

[Figure: RPC communication with a service registry - the server registers, the client subscribes and is notified, then calls the server]

From the diagram, you can see a complete service registration and discovery process:

  1. The client establishes a connection with the service registry and tells it which services it is interested in.
  2. When a server registers itself with the registry, the registry pushes the latest registration information to the subscribed clients.
  3. With the service addresses obtained from the registry, the client can call the service.

From this process, you can see that with a service registry, the addition and removal of service nodes are transparent to the client. In addition to avoiding the need to restart clients for dynamic updates to service nodes, this approach also allows for graceful shutdown.
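
To make the flow concrete, here is a minimal sketch of the interaction, assuming a toy in-memory registry; the Registry, register, and subscribe names are illustrative and do not correspond to the API of any of the components mentioned above.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// A toy in-memory registry (hypothetical API, for illustration only).
class Registry {
    private final Map<String, List<String>> addresses = new ConcurrentHashMap<>();
    private final Map<String, List<Consumer<List<String>>>> listeners = new ConcurrentHashMap<>();

    // Step 2: a server registers itself; subscribed clients receive the updated list.
    public void register(String service, String address) {
        addresses.computeIfAbsent(service, s -> new CopyOnWriteArrayList<>()).add(address);
        List<String> current = addresses.get(service);
        listeners.getOrDefault(service, List.of()).forEach(l -> l.accept(current));
    }

    // Step 1: a client tells the registry which service it is interested in.
    public void subscribe(String service, Consumer<List<String>> onChange) {
        listeners.computeIfAbsent(service, s -> new CopyOnWriteArrayList<>()).add(onChange);
        onChange.accept(addresses.getOrDefault(service, List.of())); // push the current list right away
    }
}

public class DiscoveryDemo {
    public static void main(String[] args) {
        Registry registry = new Registry();
        // Step 1: the client subscribes and keeps a local copy of the address list.
        registry.subscribe("user-service", addrs ->
                System.out.println("client now sees user-service at " + addrs));
        // Step 2: a server instance registers; the client is notified automatically.
        registry.register("user-service", "192.168.1.10:8080");
        // Step 3: the client picks an address from its local list and makes the RPC call.
    }
}
```

In a real system the registry is a separate cluster and the notification travels over the network, but the register/subscribe/notify roles are exactly the ones shown above.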

Graceful shutdown is a crucial consideration in system development. If a service is forcefully stopped, requests that have already been sent to the server may be terminated before they can be processed, resulting in request failures and service instability. Therefore, when a service is shutting down, it needs to first stop accepting new requests, process all the pending requests, and then shut down. For example, a message queue processor needs to process all the messages read from the queue before it can exit.

For RPC services, we can first remove the node from the registry’s service list, wait until we observe that no more traffic is reaching it, and only then stop the service. With graceful shutdown, restarting an RPC server has minimal impact on its clients.
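
As a minimal sketch, the shutdown sequence can look like the code below; the deregistration step is passed in as a callback because the actual API depends on the registry component you use, and the counter-based draining is just one simple way to wait for in-flight requests.

```java
import java.util.concurrent.atomic.AtomicInteger;

// A sketch of graceful shutdown for an RPC server; names and structure are illustrative.
class GracefulShutdown {
    private final AtomicInteger inFlight = new AtomicInteger(); // requests currently being processed
    private volatile boolean accepting = true;

    void onRequestStart() {
        if (!accepting) throw new IllegalStateException("server is shutting down");
        inFlight.incrementAndGet();
    }

    void onRequestEnd() {
        inFlight.decrementAndGet();
    }

    // Typically wired into a JVM shutdown hook or called by the deployment script.
    void shutdown(Runnable deregisterFromRegistry) throws InterruptedException {
        deregisterFromRegistry.run();      // 1. remove this node from the registry's service list
        accepting = false;                 // 2. stop accepting new requests
        while (inFlight.get() > 0) {       // 3. wait until in-flight requests are drained
            Thread.sleep(100);
        }
        // 4. only now is it safe to stop the process
    }
}
```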

In this process, service registration and removal are accomplished by the service server actively registering and unregistering with the service registry, which works fine in normal scenarios. However, if a service server fails unexpectedly, such as a power failure or network outage, it cannot communicate with the registry to remove itself from the service list. In this case, the client will continue to send requests to the faulty server, resulting in errors. To mitigate this, we need a mechanism for managing service status.

How to Manage Service Status #

There are generally two approaches to solving the problem I just described.

The first approach is active detection, and the method is as follows:

Each RPC server opens a port, and the registry probes that port at regular intervals (e.g., every 30 seconds) to check whether it is reachable. If the port responds, the service is considered healthy; otherwise it is considered unavailable, and the registry removes it from the service list.

[Figure: active detection - the registry periodically probes a port opened by each RPC server]
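
Such a probe can be as simple as attempting a TCP connection with a timeout, roughly as in the sketch below; the address, port, and interval are illustrative assumptions, not values from any real system.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// A sketch of active detection: the registry tries to open a TCP connection to the
// port each RPC server exposes; failure to connect marks the node as unavailable.
public class PortProbe {
    // Returns true if the health-check port accepts a connection within the timeout.
    static boolean isAlive(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The registry would run this for every registered node, e.g. every 30 seconds.
        boolean healthy = isAlive("192.168.1.10", 12200, 1000); // illustrative address and port
        System.out.println(healthy ? "node is healthy" : "node should be removed from the list");
    }
}
```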

Weibo’s early registry used this method, but two problems later forced us to change it.

The first problem is that every RPC server has to open the same agreed-upon port for the registry to probe. At the time there was no containerization, and multiple services were deployed on one physical machine, so the port a service needed to open might already be occupied, causing the RPC service to fail at startup.

The other problem is that when many RPC server instances are deployed, probing every one of them is expensive and a full probing round takes a long time, so when a service becomes unavailable it can take quite a while for the registry to notice.

Therefore, we later switched to a heartbeat mechanism.

Most registries, such as Eureka and ZooKeeper, also provide a heartbeat mechanism to check whether the connected RPC servers are alive. In my view, it can be implemented as follows:

For each connected RPC server node, the registry records its most recent renewal time. After a server starts and registers with the registry, it sends a heartbeat packet at regular intervals (e.g., every 30 seconds), and on receiving the packet the registry updates that node’s most recent renewal time. The registry also runs a timer that periodically checks the difference between the current time and each node’s most recent renewal time; if the difference reaches a threshold (e.g., 90 seconds), the node is considered unavailable.

[Figure: heartbeat mechanism - RPC servers renew periodically and the registry expires nodes whose renewal time is too old]
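
Below is a minimal sketch of the renewal-time bookkeeping described above, using the 30-second heartbeat and 90-second threshold from the example; a real registry adds persistence, clustering, and client notification on top of this.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// A sketch of the registry-side heartbeat bookkeeping (not any specific registry's code).
public class HeartbeatRegistry {
    private static final long EXPIRE_MILLIS = 90_000;   // threshold: 90 seconds without renewal
    private final Map<String, Long> lastRenewal = new ConcurrentHashMap<>(); // node -> last renewal time

    // Called whenever a heartbeat packet arrives from a node (e.g. every 30 seconds).
    public void renew(String nodeAddress) {
        lastRenewal.put(nodeAddress, System.currentTimeMillis());
    }

    // A timer periodically compares "now" with each node's most recent renewal time.
    public void startExpiryCheck() {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(() -> {
            long now = System.currentTimeMillis();
            lastRenewal.forEach((node, renewedAt) -> {
                if (now - renewedAt > EXPIRE_MILLIS) {
                    lastRenewal.remove(node);
                    System.out.println(node + " expired, removing it from the service list");
                }
            });
        }, 30, 30, TimeUnit.SECONDS);
    }
}
```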

In practical use, the heartbeat mechanism is more widely applicable than active detection. If your own services need liveness checking, a heartbeat mechanism is worth considering.

With the heartbeat mechanism in place, the registry can manage the status of the registered service nodes, which makes it the most important component of your overall system: a problem or a bug in the registry can easily bring the entire cluster down. Let me give you a real example.

In a previous project of mine, the system was deployed in a “hybrid cloud” manner: some nodes ran in a self-built data center and some on cloud servers. Each data center had its own registry, and each registry stored the data of all nodes.

This self-built registry used Redis as its backing store, and the registries in the self-built data center and on the cloud shared the same set of Redis resources. Since the “hybrid cloud” was still in the testing phase, all traffic still went through the self-built data center, and the dedicated line between the self-built data center and the cloud servers had relatively little bandwidth. The deployment structure looked like this:

[Figure: hybrid-cloud deployment - registries in the self-built data center and on the cloud sharing a Redis master-slave setup over a dedicated line]

During testing the system ran stably. However, one day at 5 o’clock in the morning, I suddenly found that all service nodes had been removed: clients could no longer obtain the server address list, and the entire service was down. Investigation showed that the registry deployed on the cloud had deleted every service node, and digging further revealed a bug in our self-built registry. In normal operation, every service node, whether in the self-built data center or on the cloud, registers its address with the registry in its own data center and sends heartbeats to it. The address information and the most recent renewal time are written to the Redis master, and the registry in each data center reads the renewal times from its local Redis slave to determine whether the service nodes are still valid.

Redis master-slave replication also went over the dedicated line. Before the incident, the line’s bandwidth had become saturated, so replication lagged and the renewal times stored in the Redis slave on the cloud side were no longer updated in time. As the replication delay grew, the registry on the cloud side eventually found that, for every node, the gap between the current time and the most recent renewal time exceeded the removal threshold, so it removed all the nodes and caused the outage.

In light of this painful lesson, we added a protection strategy to the registry: once the number of nodes to be removed reaches 40% of the total number of nodes in the service cluster, the registry stops removing nodes and instead sends alerts to the service’s development and operations teams (the threshold percentage is configurable for flexibility).
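
The protection strategy itself boils down to a threshold check before any batch removal, roughly as sketched below; the 40% figure matches the example above, and the alerting call is only a placeholder.

```java
import java.util.List;

// A sketch of the registry-side protection strategy: stop removing nodes once the
// share of nodes about to be removed reaches a threshold (40% in our case).
public class RemovalProtection {
    private static final double MAX_REMOVAL_RATIO = 0.4;

    // Returns the nodes the registry is actually allowed to remove.
    static List<String> filterRemovals(List<String> allNodes, List<String> expiredNodes) {
        if (allNodes.isEmpty()) {
            return List.of();
        }
        double ratio = (double) expiredNodes.size() / allNodes.size();
        if (ratio >= MAX_REMOVAL_RATIO) {
            // Expiring this many nodes at once is more likely a registry or network problem
            // than a real mass failure: keep the list as-is and alert humans instead.
            alertOpsTeam("refusing to remove " + expiredNodes.size() + " of " + allNodes.size() + " nodes");
            return List.of();
        }
        return expiredNodes;
    }

    static void alertOpsTeam(String message) {
        System.out.println("ALERT: " + message); // placeholder for a real alerting channel
    }
}
```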

To my knowledge, Eureka adopts a similar strategy to avoid removing so many nodes that the remaining cluster can no longer handle the traffic. If you are using a distributed consensus component such as ZooKeeper or etcd that has no built-in protection strategy, you can implement the protection logic on the client side instead, for example by having the RPC client ignore a change notification when the proportion of removed nodes exceeds a certain threshold, as sketched below. Adapt this to your specific circumstances.
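
Here is a minimal sketch of that client-side protection, assuming the RPC client keeps a local copy of the address list and the registry pushes full lists; the 40% threshold is again only an example.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// A sketch of client-side protection: ignore a pushed address list that would
// remove too large a share of the nodes the client currently knows about.
public class ClientSideProtection {
    private static final double MAX_REMOVAL_RATIO = 0.4; // example threshold

    static List<String> onRegistryPush(List<String> currentNodes, List<String> pushedNodes) {
        Set<String> pushed = new HashSet<>(pushedNodes);
        long removed = currentNodes.stream().filter(n -> !pushed.contains(n)).count();
        if (!currentNodes.isEmpty()
                && (double) removed / currentNodes.size() >= MAX_REMOVAL_RATIO) {
            // Suspiciously large removal: keep the current list and let humans investigate.
            System.out.println("ignoring registry push that removes " + removed + " nodes");
            return currentNodes;
        }
        return pushedNodes;
    }
}
```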

Furthermore, in real projects we have also run into another important registry problem: the “notification storm.” Think about how many push messages are generated when a single node of a service changes. If the service has 100 callers and 100 nodes, and the registry pushes the full node list to every caller, changing one node results in 100 * 100 = 10,000 node records being pushed. If several service clusters go online or fluctuate at the same time, the registry pushes even more, which can severely consume the machines’ bandwidth. This is what I call a “notification storm.” So how do we solve this problem? You can consider the following approaches:

First, control the size of the service clusters managed by one group of registry nodes. There is no universal standard for this limit; it depends on your business and on the registry you choose, and the main constraint to consider is the peak bandwidth of the registry servers.

Second, you can scale out by adding more registry nodes.

Third, regulate how the registry is used: when only a single node changes, push only that node’s change instead of the full node list.

Finally, if the registry is self-built, you can add protection strategies of your own, such as suspending change notifications once the message volume reaches a certain threshold.
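
To illustrate the third point, here is a sketch of delta notification: only the changed node is pushed, and only to the callers subscribed to the affected service. With 100 callers and 100 nodes, a full-list push sends 100 * 100 = 10,000 node records, while a delta push sends 100.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// A sketch of delta notifications: push only the changed node to the callers that
// subscribed to the affected service, instead of the full node list to everyone.
public class DeltaNotifier {
    record NodeChange(String service, String address, boolean online) {}

    private final Map<String, List<Consumer<NodeChange>>> subscribersByService;

    DeltaNotifier(Map<String, List<Consumer<NodeChange>>> subscribersByService) {
        this.subscribersByService = subscribersByService;
    }

    // With 100 callers and 100 nodes, a full-list push would send 100 * 100 = 10,000
    // node records; pushing only the delta sends 100 records (one per caller).
    void onNodeChanged(NodeChange change) {
        subscribersByService
                .getOrDefault(change.service(), List.of())
                .forEach(subscriber -> subscriber.accept(change));
    }
}
```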

In fact, service registration and discovery, at its core, is one aspect of service governance. Service governance is essentially service management, which involves solving complex problems when multiple service nodes form a cluster. To help you understand, let me make a simple analogy.

You can think of a cluster as a mini city, the roads as the services that make up the cluster, and the vehicles travelling on the roads as the traffic. Service governance is the management of the entire road network in the city.

If you build a new street (i.e., start a new service node), you need to notify all vehicles (i.e., traffic) that there is a new road to take. If you close a street, you also need to notify all vehicles not to use that road. This is the essence of service registration and discovery.

We install monitoring on the roads, tracking the flow of traffic on each road. This is service monitoring.

If a road becomes congested or requires maintenance, it needs to be temporarily closed. The city then coordinates the vehicles to divert traffic onto less congested roads. This is circuit breaking and traffic diversion.

The roads crisscross the city. When a journey is slow, the road you happen to be watching may be perfectly clear from end to end, which means the problem lies elsewhere along the route, and you have to trace the entire route to find out where the hold-up is. This is distributed tracing.

Different roads have different amounts of traffic. To manage this, we need police officers to direct the vehicles and tell them which road is quicker at a given time. This is load balancing.

These issues will be covered in more targeted lessons in the future.

Summary of the Course #

In this lesson, I introduced how the registry center in a microservices architecture implements service registration and discovery, and some pitfalls encountered during implementation. In addition, I also explained the meaning of service governance and some technical points we will discuss later. In this lesson, I want to highlight the following key points:

  • The registry allows us to change the node information of RPC services dynamically, which is essential for dynamic scaling, fast failure recovery, and graceful shutdown.
  • A heartbeat mechanism is a common way to probe service status and is worth considering in your own projects.
  • The nodes managed by the registry need protection strategies, so that removing too many of them does not make the service unavailable.

You can see that although the registry is a simple, easy-to-understand distributed component, it plays a crucial role in the overall architecture and should not be ignored. Its design also embodies some general system design techniques, such as the service status detection and graceful shutdown methods described above. Understanding how the registry works will give you ideas for your future development work.