
13 Graceful Shutdown - How to avoid business losses caused by service shutdown #

Hello, I’m He Xiaofeng. In the previous lecture, we talked about “exception retry”. In summary, exception retry is a means to ensure the availability of interfaces to the greatest extent possible. However, this strategy can only be used on idempotent interfaces, otherwise retrying may cause data corruption in the application system.

Continuing from yesterday’s content, today we will talk about the shutdown process in RPC.

Why is there a problem with shutdown? #

We know that as the “monolithic application” becomes more complex, we usually split it into separate systems, which is the popular microservices architecture. After splitting the services, there is a need for collaboration, so the RPC framework emerges to solve the communication problem between subsystems.

Let me ask you a very basic question. Why do you think we need to split the system? From my perspective, if I have to give one reason, I think that after splitting, we can iterate the business more conveniently and quickly. So here comes the problem: iterating the business more quickly means that I will frequently update the application system and occasionally have to restart the server, right?

Now, specifically in our RPC system, you need to consider how to ensure that the calling system does not encounter any problems during the server restart process.

To explain this, let’s briefly walk through a typical release process: when a service provider goes live, the deployment system usually restarts its instances. During this process, the provider’s team does not tell the calling parties in advance which machines will be operated on, so the callers cannot cut off traffic beforehand, nor can they predict which machines are about to be restarted. As a result, load balancing may still pick a machine that is in the middle of restarting, requests get routed to it, and the caller fails to receive a correct response.

During service restarts, the calling party may encounter the following situations:

  • Before the request is sent, the target service has already gone offline. For the caller, the connection to that node is broken, so the caller perceives it immediately, removes the node from its healthy node list, and load balancing will no longer select it.
  • While the caller is sending the request, the target service is in the process of shutting down, but the caller is unaware of this and the connection between them has not been broken. The node therefore still sits in the healthy node list, and there is a certain probability that load balancing will select it.

Closing Process #

Of course, there is also a third situation, where the target service is still starting up. How to start gracefully is just as important, and I will cover it in detail in the next lecture. Today, we will focus on the second situation: how, in RPC, to keep the caller from being harmed while the provider is shutting down.

At this point you may think: before restarting a provider machine, can’t we just have that machine removed, in some way, from the “healthy node list” maintained by the callers, so that load balancing can no longer select it? You are absolutely right. But how exactly is that “some way” accomplished?

The crudest method is manual notification: ask the callers to remove the machine to be taken offline by hand. This is primitive and direct, but it makes every provider release cumbersome, and notifying every team that calls my interface before each deployment is a waste of time and adds no value. Obviously, it is not acceptable.

At this point you may also think: RPC has service discovery, right? Isn’t it there precisely to perceive the provider’s status in “real time”? Can’t the provider notify the registry before it shuts down, and let the registry tell the callers to remove the node? The closing process is shown in the following figure:

With this method, you can achieve an automated approach without relying on “human intervention”, but can this ensure seamless online and offline operations?

As shown in the figure above, the entire closing process relies on two RPC calls: the provider notifies the registry that it is going offline, and the registry notifies the service callers to remove the node. The registry notifies the callers asynchronously, and as we mentioned in the lecture on “service discovery”, in large-scale clusters service discovery only guarantees eventual consistency, not real-time delivery. So when the registry receives the provider’s offline signal, it cannot guarantee that the offline event reaches every caller in time, which means service discovery alone cannot give us a seamless application shutdown.

Since we cannot rely heavily on “service discovery” to tell callers to take a machine offline, can the provider notify them itself? In RPC, the connection between caller and provider is a long-lived connection, so the provider can keep a set of caller connections in its own memory; when the service needs to shut down, it notifies each connected caller to take this machine offline. The call chain becomes shorter this way, and for each caller there is only a single RPC, so the notification has a high success rate. In most cases this works fine, and we used to implement it exactly this way, but we still saw occasional call failures in production caused by provider releases.
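To make the idea concrete, here is a minimal sketch of the “provider notifies callers directly” approach. Everything here is an assumption for illustration: CallerConnection and GoOfflineMessage are hypothetical stand-ins for whatever long-connection abstraction your RPC framework actually exposes.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class ConnectionRegistry {

    // Live caller connections, registered when a caller establishes its long connection.
    private final Set<CallerConnection> connections = ConcurrentHashMap.newKeySet();

    public void register(CallerConnection conn)   { connections.add(conn); }
    public void unregister(CallerConnection conn) { connections.remove(conn); }

    // Called once when this provider instance starts shutting down:
    // tell every connected caller to drop this node from its healthy node list.
    public void notifyAllCallersOffline() {
        for (CallerConnection conn : connections) {
            try {
                conn.send(new GoOfflineMessage());
            } catch (Exception e) {
                // Best effort: a failed notification is tolerated here, because the
                // passive "shutdown exception" path described below still protects callers.
            }
        }
    }

    // Hypothetical abstractions for illustration only.
    public interface CallerConnection { void send(Object message); }
    public static class GoOfflineMessage { }
}
```

As the lecture goes on to explain, this active notification is only best effort, so it cannot be the whole story by itself.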

So where exactly was the problem? I analyzed the callers’ request logs together with the logs of the shutdown notifications they received and found a clue: the problematic requests were sent very close to the moment the provider received the shutdown notification, some of them less than 1 ms before it. If we add the network transmission time, the provider must have received those requests while it was already executing its shutdown logic. In other words, when the provider shuts down, it does not correctly handle the new requests that arrive after shutdown has begun.

Graceful Shutdown #

Once we understand the root cause, the problem is easy to solve. After the service provider has started its shutdown process, many objects may already have been destroyed, so it cannot guarantee that it will process any request received after that point. Therefore, we can set up a “blocker” during shutdown that tells the caller: the shutdown process has started and this request cannot be processed here.

If you frequently go to the bank, you may be familiar with this process. When bank tellers are changing shifts or have other matters to attend to, they will put up a sign saying “This Counter is Closed” in front of the window. Even if the people queuing at that counter do not want to, they have to move to another counter to get their business done, because the teller will finish processing the current transaction before officially closing the counter.

Based on this idea, we can handle it as follows: when the service provider is shutting down and receives new requests, it directly returns a specific exception (e.g. ShutdownException) to the caller. This exception tells the caller that “I have received this request, but I am in the process of shutting down and cannot handle it.” After receiving this exception response, the RPC framework removes this node from the healthy node list and automatically retries the request to other nodes. Since this request has not been handled by the service provider, it can be safely retried on other nodes, ensuring no impact on the business.
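Here is a minimal sketch of such a blocker, assuming a hypothetical ShutdownBlocker utility sitting at the head of the provider’s request-processing chain. The exception name ShutdownException follows the example in the text; a real framework would wire this into its own filter or interceptor mechanism.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class ShutdownBlocker {

    // Flipped to true by the shutdown hook before the service objects are destroyed.
    private static final AtomicBoolean SHUTTING_DOWN = new AtomicBoolean(false);

    public static void markShuttingDown() {
        SHUTTING_DOWN.set(true);
    }

    // Invoked for every new request before any business logic runs.
    public static void checkNotShuttingDown() {
        if (SHUTTING_DOWN.get()) {
            // Tells the caller: "I received this request, but this node is closing."
            // The caller's RPC framework can then retry on another healthy node.
            throw new ShutdownException("Provider is shutting down, please retry on another node");
        }
    }

    public static class ShutdownException extends RuntimeException {
        public ShutdownException(String msg) { super(msg); }
    }
}
```

Because the exception is thrown before any business logic runs, the caller can retry on another node without running into the idempotency concerns from the previous lecture.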

However, if we rely only on this passive path, the overall shutdown window has to be longer, because callers that happen to have no business traffic at that moment cannot be informed in time. So we keep the active notification process as well: the active path gives us timeliness, and the passive path covers the cases where an active notification fails to arrive.

At this point, I know you must be wondering how to capture the shutdown event.

In my experience, we can capture the operating system’s process signals to learn about the shutdown event. In Java, we can register a shutdown hook via Runtime.addShutdownHook. When the RPC framework starts, it registers this hook in advance with two handlers: one sets the shutdown flag, and the other safely closes the service objects; when a service object is being closed, it notifies its callers to take this node offline. At the same time, we add a blocker handler to the call chain: when a new request arrives, it checks the shutdown flag and throws the specific exception if the provider is shutting down.
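Putting the pieces together, here is a rough sketch of how such a hook might be registered. It reuses the hypothetical ShutdownBlocker and ConnectionRegistry classes from the earlier sketches, and ProviderServer is another illustrative stand-in for whatever object holds the exported services.

```java
public class GracefulShutdownHook {

    public static void install(ConnectionRegistry connections, ProviderServer server) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            // Handler 1: raise the flag so the blocker starts rejecting new requests.
            ShutdownBlocker.markShuttingDown();

            // Handler 2: actively tell every connected caller to take this node offline,
            // then close the service objects (waiting for in-flight requests, see below).
            connections.notifyAllCallersOffline();
            server.close();
        }, "rpc-graceful-shutdown"));
    }

    // Hypothetical stand-in for the component that manages the provider's service objects.
    public interface ProviderServer { void close(); }
}
```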

At this point, it seems that the problem has been well solved. But careful students may still have questions: will the requests being processed during shutdown be affected?

If the process exits too quickly, these requests will not get a response, and the callers will see exceptions as well. To let the in-flight requests finish as far as possible, we first need a way to identify them. This is like the sign at a parking lot showing the number of free spaces: every time a car enters, the number goes down; every time a car leaves, it goes up. We can apply the same principle and attach a reference counter to the service object: increase the counter before processing a request, and decrease it after the request completes. With this counter we can tell at any moment whether there are still requests in flight.

During the shutdown process, the service object rejects new requests and, based on the reference counter, waits for all in-flight requests to finish before closing. However, some business requests may take a long time or may even hang; to avoid waiting indefinitely and blocking the application from ever exiting, we add a timeout to the whole shutdown hook. If the in-flight requests have not finished when the timeout expires, the application exits anyway. I recommend a timeout of about 10 seconds, which is enough for most requests to complete. The entire process is shown in the following diagram.
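As a sketch of the counting and timed-wait part, the class below keeps an in-flight counter and lets the shutdown hook wait for it to drain, bounded by a deadline. All names and the polling interval are illustrative assumptions rather than any particular framework’s API.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class InFlightRequests {

    private final AtomicInteger inFlight = new AtomicInteger(0);

    // Call before dispatching a request to business code...
    public void enter() { inFlight.incrementAndGet(); }

    // ...and after the response has been written back.
    public void exit()  { inFlight.decrementAndGet(); }

    // Used by the shutdown hook: wait until all in-flight requests finish,
    // but never longer than the given timeout, so the process can always exit.
    public boolean awaitCompletion(long timeout, TimeUnit unit) throws InterruptedException {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (inFlight.get() > 0) {
            if (System.nanoTime() >= deadline) {
                return false;                     // timed out; remaining requests are abandoned
            }
            TimeUnit.MILLISECONDS.sleep(50);      // simple polling keeps the sketch short
        }
        return true;
    }
}
```

In the shutdown hook from the earlier sketch, server.close() would then call something like inFlight.awaitCompletion(10, TimeUnit.SECONDS) before tearing down the service objects.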

Summary #

In RPC, shutdown may not look like part of the main flow, but if we handle it badly it can cause errors on the calling side and force extra operations work on us. A good shutdown process lets the businesses built on our framework release and restart smoothly, without worrying about failures caused by restarts.

In fact, “graceful shutdown” is not unique to RPC; it appears in many frameworks. Take Tomcat, an application container we often use: when Tomcat shuts down, it closes layer by layer from the outside in, first making sure no new requests are accepted and then processing the requests it had already received before the shutdown.

Reflection after class #

Today I only talked about graceful shutdown. In fact, when it comes to application restart and deployment, the application startup process is also involved. So how can we achieve graceful startup and avoid dispatching requests to services that are not ready yet? Please take some time to think about it on your own, and I will explain it to you in detail in the next class.

Of course, I also welcome you to leave a comment and share your thoughts and questions with me. You can also share this article with your friends and invite them to join the learning. See you in the next class!