16 Asynchronous Mechanism: How to Avoid Blocking in the Single-Threaded Model #

One of the main reasons why Redis is widely used is its support for high-performance access. Because of this, we must pay attention to all factors that may affect the performance of Redis, such as command operations, system configuration, key mechanisms, hardware configuration, etc. We not only need to understand the specific mechanisms and try to avoid performance anomalies, but also need to be prepared in advance to deal with exceptions.

Therefore, starting from this lesson, I will spend 6 lessons introducing the potential factors that affect Redis performance, from the following 5 aspects:

  • Blocking operations within Redis;
  • Impact of CPU cores and NUMA architecture;
  • Key system configuration of Redis;
  • Redis memory fragmentation;
  • Redis buffer.

In this lesson, let’s first learn about the blocking operations within Redis and the methods to deal with them.

In [Chapter 3], we learned that Redis’ network I/O and key-value read/write operations are performed by the main thread. So, if an operation executed on the main thread takes too long, it will block the main thread. However, the main thread handles many kinds of operations: key-value operations for client requests, persistence operations that guarantee reliability, data synchronization during master-slave replication, and so on. Among so many operations, which ones can cause blocking?

Don’t worry, next, I will categorize and analyze these operations for you to identify blocking operations.

What are the blocking points of Redis instances? #

When a Redis instance is running, it needs to interact with many objects, and these different interactions involve different operations. Let’s take a look at the objects involved in the interaction with Redis instances and the operations that occur during the interaction.

  • Client: Network IO, key-value CRUD operations, database operations;
  • Disk: Generating RDB snapshots, recording AOF logs, AOF log rewriting;
  • Master-Slave Nodes: Master generates and transfers RDB files, slave receives RDB files, clears database, loads RDB files;
  • Sharded Cluster Instances: Transfer hash slot information to other instances, data migration.

To help you understand, let me draw a diagram showing the relationship between these 4 types of interacting objects and the specific operations:

Next, let’s analyze one by one, in these interacting objects, which operations can cause blocking.

1. Blocking points when interacting with clients

Network I/O can sometimes be slow, but Redis uses an I/O multiplexing mechanism so that the main thread never sits idle waiting for connections or requests to arrive. Therefore, network I/O is not a factor that causes Redis to block.

Key-value CRUD operations are the main part of the interaction between Redis and clients, and they are the main tasks executed by the Redis main thread. Therefore, operations with high complexity in CRUD operations will definitely block Redis.

So, how to determine if the complexity of an operation is high? There is a basic criterion here, which is to see if the complexity of the operation is O(N).

In Redis, the complexity of collection operations is usually O(N), so we need to be careful when using them. Examples include full-element queries such as HGETALL and SMEMBERS, as well as aggregation and statistical operations on collections such as intersection, union, and difference. These constitute the first blocking point of Redis: full-element query and aggregation operations.

In addition to this, there is also a potential blocking risk in the deletion operation of sets itself. You might think that the deletion operation is simple, just delete the data directly, why would it block the main thread?

In fact, the essence of a deletion operation is to release the memory occupied by the key-value pair. Don’t underestimate this process. To manage memory space more efficiently, when an application releases memory, the operating system inserts the freed memory blocks into a linked list of free memory blocks for later management and reallocation. This step itself takes time and blocks the application that is doing the freeing. So if a large amount of memory is released at once, the time spent operating on the free-memory linked list increases, which blocks the Redis main thread.

So, when will a large amount of memory be released? It is when deleting a large amount of key-value pair data, the most typical case is deleting a set with a large number of elements, also known as bigkey deletion. In order to give you an intuitive impression of the deletion performance of bigkeys, I tested the time consumed during the deletion operation of sets with different numbers of elements, as shown in the table below:

From this table, we can draw three conclusions:

  1. When the number of elements increases from 100,000 to 1 million, the growth rate of deletion time for the 4 major set types increases from 5 times to nearly 20 times;
  2. The larger the set elements, the longer the time it takes to delete;
  3. When deleting a set with 1 million elements, the longest deletion time reached 1.98s (Hash type). Redis response times are generally at the microsecond level, so an operation taking nearly 2s will inevitably block the main thread.

Based on this analysis, it is clear that bigkey deletion is the second blocking point of Redis. Deleting keys has a significant negative impact on instance performance, and it is often overlooked in actual business development, so it must be taken seriously.

Since deleting a large number of key-value pairs is a potential blocking risk, clearing the database (e.g., the FLUSHDB and FLUSHALL operations), which deletes and releases all key-value pairs at once, is inevitably another blocking risk. This is the third blocking point of Redis: clearing the database.

2. Blocking points when interacting with disk

I singled out the interaction between Redis and the disk as a separate category because disk I/O is generally time-consuming and resource-intensive, and therefore needs to be paid special attention to.

Fortunately, Redis developers have long realized that disk I/O can cause blocking, so they designed Redis to generate RDB snapshot files and perform AOF log rewriting operations using subprocesses. As a result, these two operations are executed by subprocesses, and slow disk I/O will not block the main thread.

However, when Redis records AOF logs directly, data is written back to disk according to the configured write-back policy. A synchronous write to disk takes roughly 1-2ms, so if a large number of write operations need to be recorded in the AOF log and synchronously written back, the main thread will be blocked. This is Redis’s fourth blocking point: synchronous writing of AOF logs.

3. Blocking points when interacting between master and slave nodes

In a master-slave cluster, the master node needs to generate an RDB file and transfer it to the slave node. The master node uses subprocesses to complete the creation and transfer of the RDB file during replication, so it does not block the main thread. However, for the slave node, after receiving the RDB file, it needs to use the FLUSHDB command to empty the current database, which just happens to run into the third blocking point we analyzed earlier.

In addition, after emptying the current database, the slave node also needs to load the RDB file into memory. The speed of this process is closely related to the size of the RDB file. The larger the RDB file, the slower the loading process. Therefore, loading the RDB file becomes the fifth blocking point of Redis.

4. Blocking points when interacting between sharded cluster instances

Finally, when deploying a sharded Redis cluster, the hash slot information assigned to each Redis instance needs to be transmitted between different instances. At the same time, when load balancing is required or instances are added or removed, data will be migrated between different instances. However, the amount of information in the hash slots is not large, and data migration is performed progressively, so in general, these two types of operations pose little risk of blocking the main thread of Redis.

However, if you are using the Redis Cluster solution and the migration happens to involve big keys, it will block the main thread because Redis Cluster uses synchronous migration. I will introduce you to the solution to the blocking caused by data migration in different sharded cluster solutions in Lesson 33. Here, you only need to know that when there are no big keys, the instances of the sharded cluster will not block the main thread when interacting.

Alright, now you have an understanding of the key operations in Redis and the blocking operations among them. Let’s summarize the five blocking points we just found:

  • Full set queries and aggregate operations;
  • Deleting big keys;
  • Clearing the database;
  • Synchronous writing of AOF logs;
  • Loading RDB files by slave nodes.

If these operations are executed in the main thread, it will undoubtedly prevent the main thread from serving other requests for a long time. To avoid blocking operations, Redis provides an asynchronous thread mechanism. The so-called asynchronous thread mechanism refers to Redis starting some subthreads and assigning certain tasks to these subthreads to be completed in the background, rather than by the main thread. Using the asynchronous thread mechanism to execute operations can avoid blocking the main thread.

However, the question now arises: Can all of these five blocking operations be executed asynchronously?

Which blocking points can be executed asynchronously? #

Before analyzing the feasibility of executing blocking operations asynchronously, let’s first understand the requirements for operations to be executed asynchronously.

If an operation can be executed asynchronously, it means that it is not on the critical path of the Redis main thread. By “operation on the critical path”, I mean an operation whose data result the client waits for after sending the request to Redis.

This may sound a bit abstract, let me draw a picture to explain.

After the main thread receives Operation 1, because Operation 1 does not need to return specific data to the client, the main thread can hand it over to a background sub-thread to complete, and at the same time, it only needs to return an “OK” result to the client. While the sub-thread is executing Operation 1, the client sends Operation 2 to the Redis instance. At this time, the client needs to use the data result returned by Operation 2. If Operation 2 does not return the result, the client will be stuck in a waiting state.

In this example, Operation 1 is not considered a critical path operation because it does not need to return specific data to the client, so it can be asynchronously executed by a background sub-thread. However, Operation 2 needs to return the result to the client, so it is a critical path operation, and the main thread must complete this operation immediately.
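The decision described above can be sketched in a few lines of Python. This is a toy model (not Redis source code): a “main thread” that offloads non-critical-path work to a background sub-thread via a queue, but executes critical-path reads itself. The names `handle_delete` and `handle_get` are illustrative stand-ins for Operation 1 and Operation 2.

```python
import queue
import threading

tasks = queue.Queue()

def background_worker():
    while True:
        job = tasks.get()
        job()                    # e.g. freeing memory for a deleted key
        tasks.task_done()

threading.Thread(target=background_worker, daemon=True).start()

store = {"k1": "v1"}

def handle_delete(key):
    # Operation 1: the client only needs an acknowledgement, so the
    # actual work can run in the background -> not on the critical path.
    tasks.put(lambda: store.pop(key, None))
    return "OK"                  # reply immediately

def handle_get(key):
    # Operation 2: the client waits for the data itself, so the main
    # thread must produce the result before replying -> critical path.
    return store.get(key)

print(handle_get("k1"))          # -> v1 (computed inline, on the critical path)
print(handle_delete("k1"))       # -> OK (real work deferred to the sub-thread)
tasks.join()                     # wait so we can observe the effect
print(handle_get("k1"))          # -> None (background deletion has completed)
```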

For Redis, read operations are typical critical path operations, because after the client sends a read request, it waits for the data to be returned before continuing its own processing. The first blocking point of Redis, “full collection query and aggregation operations,” consists of read operations, so it cannot be executed asynchronously.

Let’s take a look at delete operations again. Delete operations do not need to return specific data results to the client, so they are not considered critical path operations. The second blocking point we just summarized, “bigkey deletion,” and the third blocking point, “clearing the database,” both involve data deletion and are not on the critical path. Therefore, we can use a background sub-thread to perform delete operations asynchronously.

For the fourth blocking point, “AOF log synchronization writing,” in order to ensure data reliability, the Redis instance needs to guarantee that the operations recorded in the AOF log have been written to disk. Although this operation requires the instance to wait, it does not return specific data results to the client. Therefore, we can also start a sub-thread to perform synchronous writing of the AOF log without making the main thread wait for the write to complete.

Finally, let’s take a look at the blocking point “loading RDB from a replica.” In order for a replica to provide data access services to clients, it must complete loading the RDB file. Therefore, this operation also belongs to the critical path operation, and we must let the main thread of the replica execute it.

For the five blocking points of Redis, except for “full collection query and aggregation operations” and “loading RDB from a replica,” the operations involved in the other three blocking points are not on the critical path. Therefore, we can use Redis’s asynchronous sub-thread mechanism to implement bigkey deletion, clearing the database, and synchronous writing of the AOF log.

So, how does Redis implement the asynchronous sub-thread mechanism specifically?

Asynchronous Subthread Mechanism #

After the Redis main thread is started, it uses the pthread_create function provided by the operating system to create three subthreads, each responsible for asynchronous execution of AOF log writing, key-value pair deletion, and file closing.

The main thread interacts with the subthreads through a task queue implemented as a linked list. When it receives a key-value pair deletion or a database clear operation, the main thread encapsulates the operation into a task, puts it into the task queue, and then returns a completion message to the client, indicating that the deletion has been done.

However, at this point, the deletion has not been executed yet. It is not until the background subthread reads the task from the task queue that it begins to actually delete the key-value pair and free up the corresponding memory space. Therefore, we call this asynchronous deletion lazy free. At this time, the deletion or clear operation does not block the main thread, which avoids performance impact on the main thread.
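The lazy free idea can be sketched like this (a simplified toy model, not the Redis implementation): the main thread unlinks the key from the keyspace right away, but hands the expensive job of releasing the value to a background sub-thread through a task queue. The `unlink` function and `free_queue` names are illustrative.

```python
import queue
import threading

free_queue = queue.Queue()

def lazyfree_worker():
    while True:
        value = free_queue.get()
        del value                # last reference dropped: memory released here
        free_queue.task_done()

threading.Thread(target=lazyfree_worker, daemon=True).start()

db = {"bigkey": ["element"] * 1_000_000}   # a collection-type bigkey

def unlink(key):
    value = db.pop(key, None)    # O(1): detach the key from the keyspace
    if value is not None:
        free_queue.put(value)    # defer the O(N) free to the sub-thread
    return "OK"                  # reply to the client immediately

print(unlink("bigkey"))          # -> OK, the main thread is not blocked
print("bigkey" in db)            # -> False, the key is already invisible
free_queue.join()                # the background thread finishes the free
```

This mirrors why the real UNLINK command can return quickly: removing the key from the keyspace is cheap, and the costly memory release is left to the background.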

Similar to lazy free, when the AOF log is configured with the everysec option, the main thread encapsulates the AOF write log operation into a task and puts it into the task queue as well. When the background subthread reads the task, it starts to write the AOF log on its own, so the main thread does not have to wait for the AOF log to be written all the time.
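A toy model of the everysec idea (not the Redis implementation, whose logic is more involved): the main thread only appends commands to an in-memory buffer, while a background sub-thread periodically writes and fsyncs it, so the main thread never waits on the slow disk write. The interval is shortened from 1s to 0.1s so the example runs quickly.

```python
import os
import tempfile
import threading
import time

aof_path = tempfile.mkstemp(suffix=".aof")[1]
aof_buffer = []
buffer_lock = threading.Lock()
stop_flag = threading.Event()

def aof_flusher():
    while not stop_flag.is_set():
        time.sleep(0.1)                    # shortened from 1s for the demo
        with buffer_lock:
            pending = aof_buffer[:]
            aof_buffer.clear()
        if pending:
            with open(aof_path, "a") as f:
                f.write("".join(pending))
                f.flush()
                os.fsync(f.fileno())       # the slow part, off the main thread

threading.Thread(target=aof_flusher, daemon=True).start()

def record_command(cmd):
    with buffer_lock:
        aof_buffer.append(cmd + "\n")      # cheap append; returns immediately

record_command("SET k1 v1")
record_command("SET k2 v2")
time.sleep(0.3)                            # give the flusher time to run
stop_flag.set()
print(open(aof_path).read())               # both commands have reached disk
```

The trade-off is the same as with everysec in Redis: the main thread stays fast, at the cost of possibly losing up to one interval’s worth of writes on a crash.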

The following diagram shows the mechanism of asynchronous subthread execution in Redis. You can take a look at it to deepen your understanding.

One thing to note here is that the asynchronous key-value pair deletion and database clear operations are provided in Redis 4.0 or later versions. Redis also provides new commands to perform these two operations.

  • Key-value pair deletion: When you need to delete a large number of elements (such as millions or tens of millions) in your collection type, I recommend using the UNLINK command.
  • Clear database: You can add the ASYNC option after the FLUSHDB and FLUSHALL commands to let the background subthread clear the database asynchronously, as shown below:
FLUSHDB ASYNC
FLUSHALL ASYNC

Summary #

In this lesson, we learned about the four major interacting objects during the runtime of a Redis instance: clients, disks, master/slave instances, and sharded cluster instances. Based on these four interacting objects, we identified the five major blocking points that can impact Redis performance, including full collection query and aggregation operations, bigkey deletion, database clearing, AOF log synchronization writing, and loading RDB files from slave instances.

Among these five blocking points, bigkey deletion, database clearing, and AOF log synchronization writing are not critical path operations and can be completed using asynchronous threading mechanisms. Redis creates three sub-threads during runtime, and the main thread interacts with these three sub-threads through a task queue. The sub-threads execute corresponding asynchronous operations based on the specific type of task.

However, asynchronous deletion operations were only introduced in Redis 4.0 and later versions. If you are using a version before 4.0 and encounter a bigkey deletion scenario, I have a suggestion for you: use the SCAN-family command for the corresponding collection type (such as SSCAN, HSCAN, or ZSCAN) to read the data in batches first, and then perform the deletion. Reading and deleting a portion of the data at a time avoids the main-thread blocking caused by deleting a large number of elements at once.

For example, for bigkey deletion in Hash type, you can use the HSCAN command to retrieve a portion of key-value pairs from the hash set (e.g., 200 pairs) and then use HDEL to delete these key-value pairs. By doing so, the deletion pressure is spread over multiple operations, so the time taken for each deletion operation is not too long and will not block the main thread.
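The batched-deletion pattern can be sketched with a plain dict standing in for a live Redis hash (so no server is needed): repeatedly fetch a small batch of fields, HSCAN-style, and delete just that batch, so that no single pass frees too much memory at once.

```python
big_hash = {f"field:{i}": i for i in range(1000)}   # stand-in for a Hash bigkey

def delete_in_batches(h, batch_size=200):
    rounds = 0
    while h:
        batch = list(h.keys())[:batch_size]  # like HSCAN: up to batch_size fields
        for field in batch:                  # like HDEL field1 field2 ...
            del h[field]
        rounds += 1    # the main thread could serve other requests between rounds
    return rounds

print(delete_in_batches(big_hash))  # -> 5 (1000 fields / 200 per batch)
print(len(big_hash))                # -> 0
```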

Finally, I would like to mention that full collection queries and aggregation operations, as well as loading RDB files from slave instances, are on the critical path and cannot be completed using asynchronous operations. For these two blocking points, I also have two suggestions for you:

  • For full collection queries and aggregation operations: Use the SCAN command to read data in batches and perform aggregation calculations on the client side.
  • For loading RDB files from slave instances: Keep the data size of the master instance within the range of 2-4GB to ensure that the RDB file can be loaded at a faster speed.
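The first suggestion can be sketched as follows, with plain Python sets standing in for a live Redis instance and a batching generator approximating SSCAN’s cursor behavior: the client reads each set in small batches and computes the intersection itself, so the server never runs an O(N) SINTER on its main thread.

```python
set_a = {f"user:{i}" for i in range(0, 100)}
set_b = {f"user:{i}" for i in range(50, 150)}

def scan_batches(s, count=10):
    # stand-in for cursor-based SSCAN: yield the set one batch at a time
    items = list(s)
    for i in range(0, len(items), count):
        yield items[i:i + count]

def client_side_intersection(a, b):
    seen = set()
    for batch in scan_batches(a):      # each call is a cheap O(count) operation
        seen.update(batch)
    result = set()
    for batch in scan_batches(b):
        result.update(x for x in batch if x in seen)
    return result

print(len(client_side_intersection(set_a, set_b)))  # -> 50 (user:50 .. user:99)
```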

One Question per Lesson #

As usual, I have a small question for you: Today we learned about operations on the critical path. Do you think Redis write operations (such as SET, HSET, SADD, etc.) are on the critical path?

Feel free to write down your thoughts and answer in the comment section. Let’s discuss and exchange ideas together. If you find today’s content helpful, please feel free to share it with more people. See you in the next lesson.