09 Health Check - If This Node Is Down, Why Do We Keep Sending It Requests? #

Hello, I am He Xiaofeng. In the previous lecture, we looked at the challenges of “service discovery” in large-scale clusters. Service discovery detects changes to the cluster’s IP addresses in real time and maintains the mapping between interfaces and the IP addresses of the service cluster nodes. In real large-scale clusters, what we care about most is eventual consistency. If you remember only one phrase, make it this: “combine push and pull, with pull as the primary mechanism.” Continuing from last time, let’s talk about health checks in RPC.

Because we are dealing with a cluster, the RPC framework picks a specific IP address via routing and load balancing before sending each request. To give that request the best chance of succeeding, we have to make sure the connection behind the selected IP is healthy.

However, as you know, the network between the caller and the service cluster nodes is constantly changing: there can be intermittent disconnections or network device failures. So how do we make sure the selected connection is always usable?

From my perspective, the ideal solution is for the caller to sense changes in node status in real time, so that it can always make the right choice. It is similar to driving a car: a car has many components, and we cannot inspect each one before every drive. Instead, there is a feedback mechanism. If my headlight is broken today, the dashboard warns me; if my tire pressure is low tomorrow, the dashboard warns me again. As the driver, I learn about changes in the status of most key components in real time.

So how should we design this mechanism in an RPC framework? You can pause for a moment and think about how the car example handles it. Back in the RPC world, the professional term for this is service health checks, and that is today’s topic.

Problems Encountered #

Before further discussing service health checks, I would like to share with you a problem I encountered before.

One day, the leader of a business development team in our company rushed over and asked me to help solve a problem. After listening carefully to his description, I learned that the availability of one of their online interfaces was low: out of every ten calls, a few would fail.

After checking the monitoring data, we found that the failures only occurred when requests landed on one specific machine; in other words, one machine in the cluster had a problem. To resolve the incident quickly, I suggested they take this “problematic machine” offline.

For me, though, the matter was not closed. I kept wondering: when that machine could not respond in time, why did the RPC framework keep sending requests to it? If the framework keeps routing requests there, it means that from the caller’s perspective the node still looks perfectly healthy.

Like a detective working a case, I went through the monitoring data and logs from the time of the incident to get to the truth, and found several clues at the scene:

  1. The logs showed that requests were indeed still reaching the problematic machine, because they were full of timeout exceptions for it.
  2. The monitoring showed that some requests to this machine still succeeded, which means the connection between the caller and the service had not been dropped; if it had, the RPC framework would have marked the node “unhealthy” and stopped selecting it for business requests.
  3. Digging deeper into the exception logs, I found that the periodic heartbeats between the caller and the target machine failed intermittently.
  4. The target machine’s own monitoring showed abnormal network metrics: at the time of the problem, its TCP retransmission count was more than ten times the normal level.

Putting these four clues together, I could draw a conclusion: the problematic server had network failures during certain periods but could still handle some requests. In other words, it was half-dead. It had not completely “died” yet - it still had a heartbeat - so the caller assumed it was fine and never removed it from the list of healthy nodes.

By now you can see that taking the machine offline by hand was only the quick fix. Digging deeper, the bigger problem was our health detection mechanism: some services were critically ill, yet we thought they merely had a cold.

Next, let’s take a look at the core logic of service detection.

Logic of Health Detection #

We just mentioned the heartbeat mechanism, and you may wonder why we bother with heartbeats at all. When a service provider goes offline, under normal circumstances we receive an event telling us the connection has been closed, and we could simply hook our handling logic into that event, right? That is indeed how the car example works. But it is not enough here, because an application’s health covers more than the state of its TCP connection: the application itself must also be alive. In many cases the TCP connection is still up while the application is already “dead”.

That is why the industry commonly relies on a heartbeat mechanism. It is not complicated: at a fixed interval, the service consumer asks the service provider, “Hey buddy, are you okay?”, and the provider honestly reports its current status.
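To make the idea concrete, here is a minimal sketch in Java of how a consumer might schedule such periodic heartbeats. The names here (`ProviderConnection`, `HeartbeatTask`, `sendHeartbeat`) and the 30-second interval are illustrative assumptions of mine, not the API of any particular RPC framework.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical abstraction over one consumer-to-provider connection.
interface ProviderConnection {
    boolean sendHeartbeat() throws Exception;     // lightweight "are you okay?" ping
    void recordHeartbeatResult(boolean success);  // feeds the state machine shown later
}

// Minimal sketch: the consumer periodically pings every provider connection.
public class HeartbeatTask implements Runnable {

    private final List<ProviderConnection> connections;

    public HeartbeatTask(List<ProviderConnection> connections) {
        this.connections = connections;
    }

    @Override
    public void run() {
        for (ProviderConnection conn : connections) {
            boolean ok;
            try {
                ok = conn.sendHeartbeat();
            } catch (Exception e) {
                ok = false;  // a timeout or I/O error counts as a failed heartbeat
            }
            conn.recordHeartbeatResult(ok);
        }
    }

    // Schedule heartbeats at a fixed interval (roughly every 30 seconds, as noted in the summary).
    public static ScheduledExecutorService schedule(List<ProviderConnection> connections) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(new HeartbeatTask(connections), 0, 30, TimeUnit.SECONDS);
        return scheduler;
    }
}
```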

Combining this with what we discussed earlier, it is easy to see that a service provider is generally in one of three situations: it is healthy, it is sick, or it does not respond at all. In more formal terms, the three states are:

  1. Healthy status: the connection is established successfully and heartbeats keep succeeding;
  2. Sub-healthy status: the connection is established successfully, but heartbeat requests fail repeatedly;
  3. Dead status: the connection cannot be established.

A node’s status is not fixed; it changes dynamically based on the results of heartbeats and reconnection attempts. The state transition diagram is as follows:

Pay attention to the transition arrows between the states; let me walk through them. At initialization, a node is healthy if the connection is established successfully, otherwise it is dead; there is no in-between sub-healthy state at this stage. Next, if a healthy node fails several consecutive heartbeat requests, it is marked sub-healthy - in other words, the consumer considers it sick.

Once “sick” (sub-healthy), a node that answers several consecutive heartbeats normally transitions back to healthy, meaning it has recovered. If it never recovers, it is judged dead, and after death follow-up actions are needed, such as closing the connection.

Of course, death here is not final; revival is possible. If at some point a dead node reconnects successfully, it is marked healthy again.
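To tie these transitions together, here is a minimal sketch of the state machine in Java. The class and field names, the thresholds, and in particular the rule I use for demoting a sub-healthy node to dead are my own illustrative assumptions rather than a specific framework’s behavior.

```java
// Minimal sketch of the node status state machine described above.
// Thresholds and the sub-healthy-to-dead rule are illustrative assumptions.
public class NodeHealthState {

    public enum NodeStatus { HEALTHY, SUBHEALTHY, DEAD }

    private NodeStatus status;
    private int consecutiveFailures = 0;
    private int consecutiveSuccesses = 0;

    private final int failureThreshold = 3;   // e.g. 3 consecutive heartbeat failures
    private final int recoveryThreshold = 3;  // e.g. 3 consecutive successes to recover

    // Initialization: healthy if the connection was established, otherwise dead.
    public NodeHealthState(boolean connected) {
        this.status = connected ? NodeStatus.HEALTHY : NodeStatus.DEAD;
    }

    // Called after each heartbeat (only meaningful while the connection exists).
    public synchronized void onHeartbeatResult(boolean success) {
        if (status == NodeStatus.DEAD) {
            return; // a dead node only comes back via reconnection
        }
        if (success) {
            consecutiveFailures = 0;
            consecutiveSuccesses++;
            if (status == NodeStatus.SUBHEALTHY && consecutiveSuccesses >= recoveryThreshold) {
                status = NodeStatus.HEALTHY; // recovered from "illness"
            }
        } else {
            consecutiveSuccesses = 0;
            consecutiveFailures++;
            if (status == NodeStatus.HEALTHY && consecutiveFailures >= failureThreshold) {
                status = NodeStatus.SUBHEALTHY; // the node is "sick"
            } else if (status == NodeStatus.SUBHEALTHY && consecutiveFailures >= failureThreshold * 2) {
                status = NodeStatus.DEAD; // never recovered; the connection should be closed elsewhere
            }
        }
    }

    // Called when a reconnect attempt to a dead node succeeds: "revival".
    public synchronized void onReconnected() {
        status = NodeStatus.HEALTHY;
        consecutiveFailures = 0;
        consecutiveSuccesses = 0;
    }

    public synchronized NodeStatus getStatus() {
        return status;
    }
}
```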

That is the whole idea behind node state transitions. You don’t need to memorize it; apart from the “revival” part, it works much like the states of the human body. Once the consumer knows each node’s status through heartbeats, it can prefer nodes from the healthy list when sending requests. And if the healthy list happens to be empty, it can, to improve availability, fall back to picking a node from the sub-healthy list. That is the concrete strategy.
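As a small illustration of that selection strategy, the sketch below (again with made-up types, not a real framework API) filters the healthy list first and falls back to the sub-healthy list only when no healthy node is available; a load balancing algorithm would then pick one node from whatever list is returned.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative node-selection strategy: prefer healthy nodes,
// fall back to sub-healthy ones only when no healthy node is left.
public class NodeSelector {

    public enum NodeStatus { HEALTHY, SUBHEALTHY, DEAD }

    public record Node(String address, NodeStatus status) {}

    public static List<Node> selectCandidates(List<Node> allNodes) {
        List<Node> healthy = allNodes.stream()
                .filter(n -> n.status() == NodeStatus.HEALTHY)
                .collect(Collectors.toList());
        if (!healthy.isEmpty()) {
            return healthy; // load balancing then picks one of these
        }
        // Healthy list is empty: to keep availability up, try the "sick" nodes.
        return allNodes.stream()
                .filter(n -> n.status() == NodeStatus.SUBHEALTHY)
                .collect(Collectors.toList());
    }
}
```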

Specific Solutions #

Now that we understand the logic of health checks, let’s return to the scenario from the beginning and see how to optimize it. You now know that a node only moves from healthy to sub-healthy once the number of consecutive heartbeat failures reaches a threshold, for example 3 (depending on your configuration).

In our scenario, the node’s heartbeats failed only intermittently - sometimes good, sometimes bad - so the consecutive-failure count never reached the threshold, and the caller kept assuming the node was only momentarily “unwell” and would soon recover. How do we solve this? I suggest you pause and think about it for a moment.

Perhaps you will blurt out: adjust the configuration and lower the threshold. Yes, that is the fastest fix, but I have to say it only treats the symptom. First, as mentioned earlier, the network between the caller and the service node keeps changing, and ordinary fluctuations could now trigger false positives. Second, under heavy load the server may not get to heartbeat requests in time, and because the heartbeat interval is short, the caller can quickly rack up consecutive failures and tear down a connection that is actually fine.

Let’s go back to the root cause: the service node had intermittent network problems, so its heartbeats failed intermittently. Right now we judge node status along a single dimension, the heartbeat check. Could we add business requests as a second dimension?

At least, that is the direction I took to solve the problem. But I immediately ran into new trouble:

  1. Interfaces are called at very different rates; some are called hundreds of times per second, others only once every half hour. So we cannot simply use an absolute failure count as the criterion.
  2. Interfaces also differ in response time; some take 1 ms, others 10 s. So we cannot use TPS (transactions per second) as the criterion either.

After discussing it with colleagues, we found the breakthrough: availability. Availability here is the percentage of successful calls within a time window (successful calls / total calls). When a node’s availability drops below a certain percentage, we consider it problematic and move it to the sub-healthy list. This approach works for both high- and low-frequency interfaces and is insensitive to differences in response time.
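Here is a minimal sketch of that idea in Java. The window length, the availability threshold, the minimum sample count, and the class name `AvailabilityWindow` are all assumptions I am making for illustration; the point is simply to record business-call outcomes in a sliding time window and flag the node when the success ratio drops too low.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sliding-window availability check: success rate of business
// calls within a recent time window, used as an extra health dimension.
public class AvailabilityWindow {

    private record CallRecord(long timestampMillis, boolean success) {}

    private final Deque<CallRecord> records = new ArrayDeque<>();
    private final long windowMillis;       // e.g. 60_000: look at the last minute
    private final double minAvailability;  // e.g. 0.95: below this, mark sub-healthy
    private final int minSamples;          // don't judge low-frequency interfaces on 1-2 calls

    public AvailabilityWindow(long windowMillis, double minAvailability, int minSamples) {
        this.windowMillis = windowMillis;
        this.minAvailability = minAvailability;
        this.minSamples = minSamples;
    }

    public synchronized void recordCall(boolean success) {
        records.addLast(new CallRecord(System.currentTimeMillis(), success));
        evictExpired();
    }

    // Returns true when there are enough samples and the success rate is too low.
    public synchronized boolean shouldMarkSubHealthy() {
        evictExpired();
        if (records.size() < minSamples) {
            return false; // not enough evidence, e.g. an interface called once per half hour
        }
        long successes = records.stream().filter(CallRecord::success).count();
        double availability = (double) successes / records.size();
        return availability < minAvailability;
    }

    private void evictExpired() {
        long cutoff = System.currentTimeMillis() - windowMillis;
        while (!records.isEmpty() && records.peekFirst().timestampMillis() < cutoff) {
            records.removeFirst();
        }
    }
}
```

The minimum-sample guard exists precisely because of the low-frequency interfaces mentioned above: one failed call out of two should not be enough to condemn a node.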

Summary #

In this lecture I walked you through a core feature of the RPC framework: health checks. They let us filter problematic nodes out of the connection list, so that we avoid picking a bad node when sending a request and hurting the business. But when designing a health check scheme, we cannot look only at whether the TCP connection is up or whether heartbeats succeed. Since the purpose of health checks is to keep the business available, we can also bring the availability of business requests into the design and maximize the availability of our RPC interfaces.

Under normal circumstances we send a heartbeat roughly every 30 seconds. The interval should not be too short, or it would put noticeable pressure on the service nodes; but if it is too long, we cannot remove problematic nodes promptly.

Scheduled “health checks” are not unique to RPC frameworks; similar “heartbeat probing” mechanisms appear in the design of other distributed systems as well.

For example, an application monitoring system needs to raise alarms for unhealthy application instances so that operations staff can deal with them promptly. As with our RPC example, we cannot rely on port connectivity alone to judge the application’s status: the port can be reachable while the application no longer responds.

So how else can we handle an application that has stopped responding? We can have every application instance expose a “health check” URL. A monitoring program periodically sends an HTTP request to that URL and judges health from the response, which avoids misjudging a process that is up but unresponsive. You see, this is exactly the heartbeat mechanism we discussed earlier.
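As a rough sketch of such a probe, the monitoring program could periodically issue an HTTP GET against that URL and treat anything other than a timely 200 response as unhealthy. The `/health` path and the timeout values below are placeholders of my own.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Illustrative HTTP health probe: the monitoring program calls a
// "health check" URL exposed by each application instance.
public class HttpHealthProbe {

    private final HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2))
            .build();

    // Returns true only when the instance answers 200 within the timeout.
    public boolean isHealthy(String baseUrl) {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/health"))  // placeholder path
                .timeout(Duration.ofSeconds(2))
                .GET()
                .build();
        try {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            return response.statusCode() == 200;
        } catch (Exception e) {
            // Timeout, connection refused, etc. all count as unhealthy.
            return false;
        }
    }

    public static void main(String[] args) {
        HttpHealthProbe probe = new HttpHealthProbe();
        System.out.println(probe.isHealthy("http://127.0.0.1:8080"));
    }
}
```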

But I have one more point to make here. Does adding a heartbeat mechanism solve the problem completely? Certainly not, because the network between the monitoring machine and the target machine can itself fail. If it does, won’t that lead to a misjudgment? You might think the target machine is sick or down, when in fact it is the probing side or the path to it that failed…

In my experience, there is a way to reduce the chance of misjudgment: deploy the monitoring program on multiple machines spread across different racks or even different data centers. Since the probability of all those network paths failing at once is very low, as long as any one instance of the monitoring program can reach the target machine, the target machine can be considered healthy.
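A tiny sketch of that “any vantage point succeeds” rule might look like the following; the `Probe` interface is a hypothetical abstraction (it could be backed by the HTTP probe above) standing in for monitoring instances deployed in different racks or data centers.

```java
import java.util.List;

// Illustrative "multiple vantage points" check: the target is only reported
// as down when probes from every monitoring machine fail.
public class MultiVantageCheck {

    // Hypothetical probe abstraction, e.g. one per monitoring machine.
    public interface Probe {
        boolean canReach(String target);
    }

    public static boolean targetIsUp(List<Probe> probes, String target) {
        for (Probe probe : probes) {
            if (probe.canReach(target)) {
                return true; // one successful vantage point is enough
            }
        }
        return false; // all vantage points failed: the target itself is likely down
    }
}
```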

Reflections after the lesson #

I would like to hear your thoughts after today’s lecture. Do you often run into health-check scenarios in your own work? Feel free to share in the comments what you have been doing, or to critique my solution; I will respond as soon as I can.

Of course, you are also welcome to leave a message and share your thoughts and questions with me. I hope you can share what you have learned today with your friends and invite them to join the discussion. See you in the next class!