20 Detailed Analysis the Application of the Timing Wheel in Rpc

20 Detailed Analysis - The application of the timing wheel in RPC #

Hello, I’m He Xiaofeng. In the previous lecture, we learned how to quickly locate problems in a distributed environment. Let’s briefly review the key points. In a distributed environment, both the RPC framework itself and the business logic implementation of the service provider should properly encapsulate exceptions, allowing users to quickly locate problems based on exceptions. In complex distributed systems with multiple departments and complicated dependency relationships, we can also use distributed tracing systems to quickly locate problems.

Now, let’s switch to today’s topic and take a look at the application of the clock wheel in RPC.

What problems do scheduled tasks bring? #

Before discussing the timer wheel, let’s talk about scheduled tasks. I believe that in the process of development, you will often encounter scenarios where scheduled tasks are needed. Many places in RPC frameworks also use scheduled tasks. Let’s take the timeout handling logic of the calling client as an example, and see how RPC frameworks handle timeout requests.

Let’s review [Lesson 17] first. When I explained Future, I mentioned that whether it’s a synchronous or asynchronous call, the client internally always performs asynchronously. Before sending a message to the server, the client creates a Future and stores the mapping between the message identifier and this Future. When the server receives and processes the message, it sends a response message to the client. After receiving the message, the client finds this Future based on the unique identifier in the message and injects the result into this Future.

So, what if the server does not respond to the message in a timely manner? How should the client handle the timeout request?

Yes, we can utilize scheduled tasks. Every time a Future is created, we record the creation time of this Future and its timeout time. We also have a scheduled task to check if the Future reaches the timeout time and has not been processed yet. In that case, we execute the timeout logic for this Future.

How can we implement scheduled tasks?

There is one implementation approach, which is also the simplest one. For each created Future, we start a thread and then sleep. When the timeout time is reached, we trigger the logic to handle the timeout request.

This approach is indeed simple and can be used in some scenarios, but the drawbacks are also obvious. Just like the example I mentioned earlier about Future timeout handling, if we face high-concurrency requests, with tens of thousands of requests being sent per second on a single machine, and the timeout time is set to 5 seconds, how many threads do we need to create for executing the timeout tasks? Over 100,000 threads, this number is really frightening.

Don’t worry, we have another implementation approach. We can use one thread to handle all the scheduled tasks, using the same example of Future timeout handling. Suppose we start a thread that scans all the tasks for handling Future timeout every 100 milliseconds. When we find a Future that has timed out, we execute the task and perform the timeout logic for this Future.

This approach is the most commonly used one, and it solves the problem of having too many threads in the first approach. However, it also has obvious drawbacks.

When facing high-concurrency requests, how many scheduled tasks does the scanning thread need to scan every 100 milliseconds? If the client happens to send 10,000 requests within 1 second, and these 10,000 requests will not time out until 5 seconds later, then the scanning thread will repeatedly scan and iterate through these 10,000 tasks during this 5-second period, needing to scan more than 40 times (scanning once every 100 milliseconds, nearly 50 times within 5 seconds), which wastes a lot of CPU resources.

The problem that scheduled tasks bring is that they make the CPU perform many additional polling and iteration operations, wasting CPU resources. This phenomenon becomes more apparent when there are many scheduled tasks, especially in high-demand scenarios.

What is a timer wheel? #

This problem is not difficult to solve. We just need to find a way to reduce unnecessary scanning operations. For example, if I have a batch of tasks scheduled to execute after 5 seconds, I can start scanning this batch of tasks after 4.9 seconds, which greatly saves CPU resources. This is where the timer wheel mechanism comes in.

Let’s first take a look at the clocks we use in our daily lives.

Clock

It’s very familiar, right? A clock has hour, minute, and second hands. After the second hand completes 60 ticks, the minute hand moves once. After the minute hand completes 60 ticks, the hour hand moves once.

The implementation principle of a timer wheel is based on this concept of clock movement.

Clock Wheel

In the timer wheel mechanism, there are time slots and clock wheels. A time slot is equivalent to a tick on the clock, and a clock wheel is equivalent to a cycle of movement of the second hand or minute hand. We place each task in the corresponding time slot.

The operation mechanism of the timer wheel is similar to the movement of a clock in real life. Every fixed unit of time, it jumps from one time slot to the next, just like the movement of a second hand. The timer wheel can have multiple layers, and the unit time of each slot in the next layer is equal to the time period of the entire cycle of the current wheel. For example, 1 minute is equal to 60 seconds. When the timer wheel completes the movement of all slots in one cycle, it takes out a slot from the next layer of the timer wheel, redistributes the tasks to the current wheel, and the current wheel starts moving from the 0th slot again, just like the start of the first second of the next minute.

To help you understand the operation mechanism of the timer wheel, let’s simulate a scenario. Let’s take a look at this example together.

Let’s assume that our timer wheel has 10 slots, and the cycle of the timer wheel is 1 second. So each slot has a unit time of 100 milliseconds. The next layer of the timer wheel has a cycle of 10 seconds and each slot has a unit time of 1 second. Currently, the timer wheel has just been initialized and is at the 0th jump, currently at the 0th slot.

Scenario

Alright, now we have 3 tasks: Task A (to be executed after 90 milliseconds), Task B (to be executed after 610 milliseconds), and Task C (to be executed after 1 second and 610 milliseconds). We add these 3 tasks to the timer wheel. Task A is placed in the 0th slot, Task B is placed in the 6th slot, and Task C is placed in the 1st slot of the next layer of the timer wheel, as shown in the image below.

Task Placement

When Task A is added to the timer wheel, it is executed immediately because it is placed in the 0th slot, and the current wheel is at the 0th slot (although it hasn’t started moving, its state is at the 0th jump). After 600 milliseconds, the timer wheel has moved 6 jumps, and the current slot is the 6th slot. All tasks in the 6th slot are taken out and executed. After 1 second, the 9th jump of the current timer wheel is completed, and it starts the 0th jump again. At this time, the next layer of the timer wheel moves from the 0th jump to the 1st jump, and Task C is taken out from the 1st slot and placed in the 6th slot of the current timer wheel. After 1 second and 600 milliseconds, Task C is executed.

Task Execution

After watching this scenario, I believe you have a good understanding of the timer wheel mechanism. In this example, the scanning period of the timer wheel is still 100 milliseconds, but the tasks are not excessively scanned. It perfectly solves the problem of CPU resource waste.

This mechanism is not difficult to understand, but implementing it can be challenging and there are many issues to consider. We won’t show the specific code implementation here, as it is a separate and extensive topic. If you are interested, you can search for relevant source code and try to implement it yourself. If you encounter any difficulties, we can discuss them in the comments section.

Application of the Time Wheel in RPC #

From the previous explanation of the Time Wheel, I believe you can see that it is used to execute scheduled tasks. It can be said that whenever we need to deal with timing-related operations in RPC frameworks, we can use the Time Wheel.

So in which functionalities of the RPC framework can we use it?

As I mentioned earlier, we can use the Time Wheel for handling timeouts of client requests. For every request we send, we create a timer task to handle the timeout and put it into the Time Wheel. In high-concurrency and high-traffic scenarios, the Time Wheel only checks the tasks in one time slot at a time, which greatly saves CPU resources.

Timeouts during startup can also be handled using the Time Wheel for both the client and the server. Taking the client as an example, if we want our application to start quickly, such as within 1 minute, we can create a timer task to handle the startup timeout and put it into the Time Wheel when the client starts.

In addition, can you think of any other places in the RPC framework where the Time Wheel can be applied? Another example is periodic heartbeat. The RPC framework can send periodic heartbeat messages from the client to the server to maintain the connection status. We can encapsulate the logic of heartbeat into a heartbeat task and put it into the Time Wheel.

At this point, you may have a question. Heartbeats need to be executed repeatedly, but tasks in the Time Wheel are removed once they are executed. How do we handle this kind of periodic tasks? In the execution logic of the timer task, we can reset the execution time of the task and put it back into the Time Wheel.

Summary #

Today we mainly explained the mechanism of the clock wheel and its application in the RPC framework.

This mechanism effectively solves the problem of creating too many threads in timed tasks because each task creates a thread, as well as the problem of a single thread scanning all timed tasks and wasting CPU by performing additional polling and traversal operations.

The implementation mechanism of the clock wheel simulates a clock in real life by placing each timed task in the corresponding time slot. This reduces the extra traversal operations of other timed tasks when scanning tasks.

In the use of the time wheel, there are some additional considerations:

The shorter the unit time of a time slot, the more accurate the timing of task triggering. For example, if the unit time of a time slot is 10 milliseconds, the timing error of executing timed tasks is within 10 milliseconds. If it is 100 milliseconds, the error is within 100 milliseconds.
The more time slots in the time wheel, the lower the probability of a task being repeatedly scanned, because only tasks in multiple layers of the clock wheel will be repeatedly scanned. For example, if a time wheel has 1000 time slots and the unit time of a time slot is 10 milliseconds, the unit time of a time slot in the next layer of the time wheel will be 10 seconds. Timed tasks exceeding 10 seconds will be placed in the next layer of the time wheel. That means only timed tasks exceeding 10 seconds will be scanned and traversed twice. However, if there are only 10 time slots, tasks exceeding 100 milliseconds will be scanned and traversed twice.

Combining these characteristics, we can customize the cycle and number of time slots for the clock wheel based on the specific business scenario.

In the RPC framework, whenever timed tasks are involved, we can apply the clock wheel, such as the timeout handling of the calling end, the startup timeout of the calling end and the service end, and the periodic heartbeat, and so on.

Reflection #

In addition to the examples I mentioned in the RPC framework, what other implementations do you know that can be applied to the clock wheel?

Feel free to leave a comment and share your answer with me. You are also welcome to share this article with your friends and invite them to join the learning. See you in the next lesson!