12 Exception Retry How to Safely and Reliably Retry Within a Set Time

Hello, I am He Xiaofeng. In the previous lecture, I explained how to design adaptive load balancing in an RPC framework. The key point is for the caller to collect metrics for each server node, compute a score for each node from those metrics, and route more traffic to the nodes with higher scores.

Today, let’s continue with the next topic and talk about the exception retry mechanism in the RPC framework.

Why do we need exception retry? #

Let’s consider a scenario. We make an RPC call to a remote service as part of a user login operation. We first validate the username and password, and once validation succeeds, we retrieve the user’s basic information from the remote user service. While retrieving that information, however, a network problem such as a brief interruption may cause the request to fail. In this case, we want the request to succeed whenever possible. So, what should we do in our code?

We need to retry the RPC call. But how should we handle it in our code? Should we catch the exception and then make another call? This approach is obviously not very elegant. This is where we can consider using the retry mechanism provided by the RPC framework.

Retry Mechanism in RPC Framework #

So what is the retry mechanism in an RPC framework?

It’s actually quite simple to understand. When a request initiated by the client fails, the RPC framework itself can retry by resending the request. Users can choose whether or not to enable retries and set the number of retry attempts.

How is this mechanism implemented?

It’s also quite simple. Think back to [Lesson 11]. There we learned that when the client initiates an RPC call, it selects a node through load balancing and sends the request to that node. If sending the message fails, or if an exception message comes back, we can catch the exception, trigger a retry, select a node through load balancing again, and resend the request. The framework counts the retries: once the count reaches the maximum set by the user, a failure exception is returned to the client’s dynamic proxy; otherwise, the retry continues.

In short, the retry mechanism works like this: when the client detects that a request has failed, it catches the exception and triggers a retry. Not every exception qualifies, though. Only exceptions that meet the retry conditions, such as network timeouts or connection errors, are retried.
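The flow described above can be sketched roughly as follows. This is a minimal illustration, not the implementation of any particular RPC framework: the exception classes, the `send` callback, and the round-robin picker that stands in for load balancing are all hypothetical names introduced for this example.

```python
import itertools


class RpcTimeoutError(Exception):
    """Hypothetical framework exception: the request timed out."""


class RpcConnectionError(Exception):
    """Hypothetical framework exception: the connection failed."""


# Only these framework-level exceptions qualify for a retry in this sketch.
RETRYABLE = (RpcTimeoutError, RpcConnectionError)


def call_with_retry(nodes, send, request, max_retries):
    """Send `request`, retrying up to `max_retries` times on retryable errors."""
    # Round-robin stands in for the real load balancer (an assumption).
    picker = itertools.cycle(nodes)
    attempts = 0  # number of failed attempts so far
    while True:
        node = next(picker)
        try:
            return send(node, request)
        except RETRYABLE:
            attempts += 1
            if attempts > max_retries:
                # Retry budget exhausted: surface the failure to the caller.
                raise
```

Any exception outside `RETRYABLE` propagates immediately, which mirrors the rule that only exceptions meeting the retry conditions are retried.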

Now that we understand the retry mechanism in an RPC framework, what issues should users be aware of when using exception retries?

Consider the scenario I mentioned earlier. Suppose a sudden spike in network latency causes the request to time out, but the caller’s request has already been sent to the service provider’s node, or has even arrived there. If the request did reach the node, will the node execute the business logic? Yes, it will.

Then, if a retry is attempted at this time, will the business logic be executed again? Yes, it will.

And if the business logic is not idempotent, such as an insert operation, will triggering a retry cause a problem? Yes, it will.

In summary, when using an RPC framework, we should enable its exception retry feature only after making sure that the business logic of the called service is idempotent, and even then decide case by case. Pay special attention to this, as it is a common pitfall.
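To make the idempotency requirement concrete, here is one common way to make an insert safe to retry: the server remembers the result for each request ID and replays it when the same request arrives again. The lecture does not prescribe this technique; the `UserService` class, its in-memory "table", and the `request_id` parameter are assumptions made purely for illustration.

```python
class UserService:
    """Hypothetical service whose insert is made idempotent by a request ID."""

    def __init__(self):
        self.rows = []       # stands in for a database table
        self.processed = {}  # request_id -> result of the first execution

    def insert_user(self, request_id, name):
        # A retried request carries the same request_id, so replay the
        # stored result instead of inserting a second row.
        if request_id in self.processed:
            return self.processed[request_id]
        self.rows.append(name)
        result = len(self.rows)  # pretend this is the new row's ID
        self.processed[request_id] = result
        return result
```

With this guard in place, a retry of a request that had in fact already been executed leaves a single row rather than a duplicate.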

After the explanation above, I believe you have a clear understanding of the retry mechanism in an RPC framework, which is also the retry mechanism adopted by most RPC frameworks nowadays.

Having come this far, do you think this mechanism is perfect? Have you considered how consecutive retries affect the request timeout? Consider this scenario: the client sets the request timeout to 5 seconds, and the request is retried 3 times in a row, each retry taking 2 seconds. The request ends up taking 6 seconds in total, which exceeds the 5-second timeout. In that case, is the timeout configured by the client still accurate?

How to reliably retry within a set time? #

As I mentioned earlier, consecutive retries can make the timeout unreliable: if the request is retried several times and each attempt takes a long time, the total processing time will eventually exceed the timeout configured by the user.

The most direct way to solve this problem is to reset the request’s timeout after each retry, so that every attempt is allowed only the time that still remains.

When the client initiates an RPC request and an exception triggers a retry, we first check whether the request has already timed out. If it has, we return a timeout exception directly. Otherwise, we reset the request’s timeout and then retry.
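The check-then-reset step can be sketched as below. This sketch interprets "resetting the timeout" as giving each attempt only the time left before an overall deadline, which is an assumption about the intended behavior. The `send(request, attempt_timeout)` callback, the `RpcTimeoutError` class, and the injectable `clock` parameter are hypothetical names used only for this example.

```python
import time


class RpcTimeoutError(Exception):
    """Hypothetical framework exception: the request timed out."""


def call_within_deadline(send, request, timeout, max_retries, clock=time.monotonic):
    """Retry `send` without letting the total time exceed `timeout` seconds."""
    deadline = clock() + timeout
    attempts = 0  # number of failed attempts so far
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            # The overall budget is spent: fail fast instead of retrying.
            raise RpcTimeoutError("request exceeded %.1fs timeout" % timeout)
        try:
            # Each attempt is given only the time that is still left.
            return send(request, remaining)
        except RpcTimeoutError:
            attempts += 1
            if attempts > max_retries:
                raise
```

With the numbers from the earlier scenario (a 5-second timeout and attempts that each take 2 seconds), successive attempts would be given budgets of 5, 3, and 1 seconds, and no further retry would start once the deadline has passed.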

With the problem of repeated retries defeating the timeout solved, is the retry mechanism now completely reliable?

Let’s think further. Suppose the client has configured a retry policy and makes an RPC call. Load balancing selects a node and the request is sent to it, but the node fails to process the request because it is under heavy load. The client triggers a retry and goes through load balancing again, only to select the same problematic node. Does this hurt the effectiveness of the retry?

Of course it does. Therefore, whenever a retry selects a node through load balancing, we should first remove the nodes that failed on earlier attempts from the candidate list, which improves the retry’s chance of success.
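Excluding failed nodes can be sketched like this. It is a simplified illustration under stated assumptions: `pick` stands in for the framework's load balancer (random choice by default), and `RpcConnectionError` and the `send` callback are hypothetical names.

```python
import random


class RpcConnectionError(Exception):
    """Hypothetical framework exception: the node could not serve the request."""


def call_excluding_failed(nodes, send, request, max_retries, pick=random.choice):
    """Retry on a different node each time, skipping nodes that already failed."""
    failed = set()
    attempts = 0  # number of failed attempts so far
    while True:
        # Filter out every node that failed earlier for this request.
        candidates = [n for n in nodes if n not in failed]
        if not candidates:
            raise RpcConnectionError("no healthy node left to retry on")
        node = pick(candidates)  # stands in for the real load balancer
        try:
            return send(node, request)
        except RpcConnectionError:
            failed.add(node)  # never pick this node again for this request
            attempts += 1
            if attempts > max_retries:
                raise
```

The exclusion set in this sketch lives only for the duration of one request, so a node that was briefly overloaded is still eligible for later requests.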

Now let’s recap. Considering that the business logic must be idempotent, timeout needs to be reset, and nodes with problems must be removed, is there any room for optimization in this exception retry mechanism?

As I mentioned earlier, the RPC framework’s exception retry mechanism is triggered when a client sends a request and encounters an exception, but not all exceptions will trigger a retry. Only specific exceptions in the RPC framework, such as connection exceptions and timeout exceptions, will do so.

By default, exceptions thrown back to the client by the server’s business logic are not retried. But consider this situation: the server’s business logic throws an exception to the client, and the business semantics allow the client to call again.

For example, suppose the server’s business logic updates a record in the database. If the update fails, the server throws an exception indicating the failure, and the client can call again to re-run the update. Here, although the exception the client receives is a business exception thrown by the server, it should still be eligible for retry.

So, how can the retry mechanism of the RPC framework be optimized in this situation?

The RPC framework itself cannot know which business exceptions are safe to retry, so we can add a retry exception whitelist to which users add the exceptions they want retried. When a client that has configured a retry policy catches an exception, the framework checks it against this policy: if the exception is one that the RPC framework retries by default, or if its type is in the whitelist, the request is retried.
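A whitelist check of this kind might look like the following sketch. The `RetryPolicy` class, its `should_retry` method, and both exception classes are hypothetical names chosen for this example rather than the API of a real framework.

```python
class RpcTimeoutError(Exception):
    """Hypothetical framework exception that is retryable by default."""


class UpdateFailedError(Exception):
    """Hypothetical business exception raised by the server."""


class RetryPolicy:
    """Decides whether a caught exception should trigger another attempt."""

    def __init__(self, max_retries, whitelist=()):
        self.max_retries = max_retries
        # Framework exceptions are built in; users add business exceptions.
        self.retryable = (RpcTimeoutError,) + tuple(whitelist)

    def should_retry(self, error, retries_done):
        # Retry only while budget remains and the exception type is allowed.
        return retries_done < self.max_retries and isinstance(error, self.retryable)
```

A caller would configure it with something like `RetryPolicy(2, whitelist=[UpdateFailedError])`, after which `UpdateFailedError` is treated the same way as the built-in retryable exceptions, while any other business exception is still returned to the caller without a retry.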

Having worked through these issues one by one, we arrive at a reliable retry mechanism, as shown in the following diagram:

Summary #

Today we explained the retry mechanism of the RPC framework and how to perform safe and reliable retries within a defined time.

When the request initiated by the caller fails, if an exception retry policy is configured, the RPC framework will catch the exception, evaluate it, and if it meets the criteria, perform a retry by resending the request.

To retry safely and reliably within the defined time, we check before each retry whether the request has already timed out. If it has, a timeout exception is returned directly. Otherwise, we reset the timeout for the request, so that multiple retries cannot push its total processing time beyond the user-configured timeout and delay the business flow that depends on it.

When initiating retries and selecting nodes for load balancing, we should exclude the node that had problems before the retry. This can improve the success rate of retries. Additionally, we allow users to configure a whitelist of retryable exceptions, making the exception retry feature of the RPC framework more user-friendly.

Furthermore, when using the retry mechanism of the RPC framework, we must ensure that the business logic of the called service is idempotent. This is crucial when considering whether to use retries.

After-class reflection #

Take a moment to think about this: at which point in the overall RPC call process does exception retry take place?

Feel free to leave a message to share your answer, and also feel free to share this article with your friends and invite them to join the study. See you in the next class!