17 Why CPU Architecture Can Also Affect Redis Performance #

Many people think that the relationship between Redis and the CPU is simple: Redis threads run on the CPU, and if the CPU is fast, Redis can process requests quickly.

However, this understanding is one-sided. A CPU's multi-core architecture and multi-CPU architecture also affect Redis performance. Without understanding the CPU's impact on Redis, you may miss some optimization methods when tuning and fail to realize Redis's full performance potential.

Today, let’s learn about the CPU architectures of mainstream servers and the methods of optimizing Redis performance based on CPU multi-core architecture and multi-CPU architecture.

Mainstream CPU Architectures #

To understand how the CPU affects Redis, we first need to understand CPU architectures.

A CPU processor generally has multiple execution cores, which we call physical cores, and each physical core can run an application program. Each physical core has its own private Level 1 cache, consisting of a Level 1 instruction cache and a Level 1 data cache, as well as a private Level 2 cache.

Here, we mention the concept of private cache for physical cores. It means that the cache space can only be used by the current physical core, and other physical cores cannot access the cache space of this core. Let’s take a look at the architecture of CPU physical cores.

Since L1 and L2 caches are private to each physical core, accessing them has a latency of less than 10 nanoseconds, which is very fast. Therefore, if Redis stores the instructions or data it needs in the L1 and L2 caches, it can access these instructions and data at high speed.

However, the size of the L1 and L2 caches is limited by processor manufacturing technology, generally to the kilobyte (KB) range, so they cannot hold large amounts of data. If the required data is not in the L1 or L2 cache, the application has to fetch it from memory. Memory access latency is generally in the hundreds of nanoseconds, roughly an order of magnitude higher than L1/L2 cache access, which inevitably hurts performance.

Therefore, the physical cores also share a common Level 3 cache (L3 cache). The L3 cache has more storage resources, generally ranging from several megabytes to several tens of megabytes, which allows applications to cache more data. When the required data is not in the L1 or L2 cache, accessing the L3 cache can often avoid a trip to memory.

In addition, in modern mainstream CPU processors, each physical core usually runs two hyper-threads, also known as logical cores. The logical cores of the same physical core share that core's L1 and L2 caches.

To help you understand, I will use a diagram to show the relationship between physical cores and logical cores, as well as the Level 1 and Level 2 caches.

In mainstream servers, a CPU processor usually has 10 to more than 20 physical cores. Moreover, to enhance the processing power of the server, there are usually multiple CPU processors (also known as multiple CPU sockets) installed, each with its own physical cores (including L1 and L2 caches), L3 cache, and connected memory. The different processors are also interconnected through a bus.

The following diagram shows the architecture of multiple CPU sockets, with two sockets shown, each with two physical cores.

On a multi-CPU architecture, an application program can run on different processors. In the diagram above, Redis can run on Socket 1 for a period of time and then be scheduled to run on Socket 2.

However, there is one thing you need to note: If an application program runs on one Socket first and saves data into memory, and then gets scheduled to run on another Socket, when the application program accesses memory again, it needs to access the memory connected to the previous Socket. This type of access is known as remote memory access. Compared to accessing memory directly connected to the Socket, remote memory access increases the latency of the application program.

In a multi-CPU architecture, the latency of accessing the local memory of the Socket where an application program is located is not consistent with the latency of accessing remote memory, so we also call this architecture a Non-Uniform Memory Access (NUMA) architecture.
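If you want to confirm whether your own server uses a NUMA architecture and how its nodes are laid out, the numactl tool (packaged as numactl on most Linux distributions) can print each node's CPUs and local memory size:

numactl --hardware

Knowing this topology up front will matter later, when we decide which cores to bind programs to.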

By now, we have learned about the mainstream CPU multi-core architecture and multi-CPU architecture. Let’s summarize the impact of CPU architecture on the execution of application programs.

  • The access speed of instructions and data in the L1 and L2 caches is very fast, so making full use of the L1 and L2 caches can effectively shorten the execution time of application programs.
  • Under NUMA architecture, if an application program is scheduled from one Socket to another Socket, remote memory access may occur, which directly increases the execution time of the application program.

Next, let’s first understand how CPU multi-core architecture affects the performance of Redis.

Impact of Multiple Cores on Redis Performance #

When an application runs on a CPU core, it relies on state such as the stack pointer and the values of the core's registers; we collectively call this state the runtime information. Meanwhile, the application's most frequently accessed instructions and data are cached in the L1 and L2 caches to speed up execution.

However, in the scenario of a multi-core CPU, once the application needs to run on a new CPU core, the runtime information needs to be reloaded onto the new CPU core. In addition, the L1 and L2 caches of the new CPU core also need to reload data and instructions, which can increase the program’s execution time.

Speaking of this, I would like to share with you a case where I once optimized the performance of Redis in a multi-core CPU environment. I hope this case can help you fully understand the impact of multi-core CPUs on the performance of Redis.

At that time, our project requirement was to optimize the 99th percentile latency of Redis, with a requirement of GET latency less than 300 microseconds and PUT latency less than 500 microseconds.

Some of you may not be familiar with the 99th percentile latency, so let me explain. We sort all request latencies from smallest to largest; the value below which 99% of the requests fall is the 99th percentile latency. For example, with 1,000 requests, if the 990 fastest requests all take less than 1 ms and the 991st-fastest request takes 1 ms, then the 99th percentile latency is 1 ms.
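To make the definition concrete, here is a minimal C sketch of the percentile calculation; it is my illustration, not code from Redis or from our benchmark. Taking the element at index n × p / 100 after sorting is one common convention; serious benchmarking tools often interpolate between neighboring samples instead.

```c
#include <stdio.h>
#include <stdlib.h>

// Comparison callback for qsort: sort latencies in ascending order
static int cmp_double(const void *a, const void *b){
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

// Return the p-th percentile (0 < p < 100) of n latency samples
double percentile(double *lat, size_t n, double p){
    qsort(lat, n, sizeof(double), cmp_double);
    size_t idx = (size_t)(n * p / 100.0);   // e.g., n = 1000, p = 99 -> index 990
    if (idx >= n) idx = n - 1;
    return lat[idx];
}

int main(){
    double lat[1000];
    // Toy data: latencies from 0.100 ms up to 1.099 ms
    for (int i = 0; i < 1000; i++) lat[i] = 0.1 + i * 0.001;
    printf("p99 = %.3f ms\n", percentile(lat, 1000, 99));
    return 0;
}
```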

At the beginning, we used the String data type with a GET/PUT complexity of O(1) for data storage and retrieval. At the same time, we disabled RDB and AOF, and there were no other collection types of data stored in the Redis instance, which also meant there were no bigkey operations, avoiding many situations that could increase latency.

However, even with these measures, when running Redis instances on a server with 24 CPU cores, the 99th percentile latencies of GET and PUT were 504 microseconds and 1175 microseconds, respectively, which were significantly higher than our set targets.

Later, we carefully examined the state indicators of the server CPU during the runtime of the Redis instance and discovered that the context switch count of the CPU was relatively high.

A context switch is a switch of a thread's context, where the context is the thread's runtime information. In a multi-core CPU environment, when a thread that was running on one CPU core is scheduled onto another core, a context switch occurs.
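If you want to observe this on your own system, one option on Linux is pidstat from the sysstat package, which reports per-process context switches; here, the process name redis-server and the 1-second interval are just examples:

pidstat -w -p $(pgrep -x redis-server) 1

The cswch/s and nvcswch/s columns show voluntary and involuntary context switches per second; a persistently high involuntary rate suggests the thread is being preempted or moved between cores.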

When a context switch occurs, the runtime information of the Redis main thread needs to be reloaded onto the other CPU core. Meanwhile, the L1 and L2 caches of the other CPU core do not have the instructions and data that were frequently accessed by the Redis instance before, so these instructions and data need to be reloaded from the L3 cache or even the memory. This reloading process takes time. Moreover, the Redis instance needs to wait for this reloading process to complete before it can start processing requests, which also increases the processing time of some requests.

If the Redis instance is frequently scheduled onto different CPU cores, the impact on request processing time is even greater. On each rescheduling, some requests are delayed by the reloading of runtime information, instructions, and data, making their latency significantly higher than that of other requests. This analysis explains why the 99th percentile latency in our case could not be brought down.

Therefore, we want to avoid Redis constantly being scheduled for execution on different CPU cores. So, we tried to bind the Redis instance to a specific CPU core, ensuring that a Redis instance runs on a fixed CPU core. We can use the taskset command to bind a program to a specific core for execution.

For example, by executing the following command, we bind the Redis instance to core 0, where the “-c” option is used to set the core number to bind:

taskset -c 0 ./redis-server
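If the Redis instance is already running, taskset can also change its affinity in place with the "-p" option; assuming the process is named redis-server, one way to pin it to core 0 is:

taskset -pc 0 $(pgrep -x redis-server)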

After binding, we conducted tests and found that the 99th percentile latencies of GET and PUT in the Redis instance dropped to 260 microseconds and 482 microseconds, respectively, achieving our expected targets.

Let’s take a look at the 99th percentile latencies of Redis before and after binding the core.

Redis 99th percentile latencies before and after binding

It can be seen that in a multi-core CPU environment, binding the Redis instance to a CPU core effectively reduces Redis latency. Binding not only reduces tail (percentile) latencies, but also lowers average latency and improves throughput, thereby enhancing overall Redis performance.

Next, let’s take a look at the impact of multiple CPU architectures, particularly NUMA architecture, on Redis performance.

The Impact of NUMA Architecture on Redis Performance #

In practical applications of Redis, I often see a practice of binding the interrupt handling program of the operating system to CPU cores in order to improve the network performance of Redis. This practice can avoid the frequent scheduling and execution of the interrupt handling program on different cores, effectively improving the network processing performance of Redis.

However, the interrupt program needs to interact with the Redis instance for network data exchange. Once the interrupt program is bound to a core, we need to pay attention to which core the Redis instance is bound to, as this will affect the efficiency of Redis accessing network data.

Let’s first take a look at the data interaction between the Redis instance and the interrupt program: the interrupt handling program reads data from the network card hardware and writes the data into a memory buffer maintained by the operating system kernel. The kernel triggers an event through the epoll mechanism to notify the Redis instance, and then the Redis instance copies the data from the kernel’s memory buffer to its own memory space, as shown in the following diagram:

So, under a NUMA architecture, when the interrupt handling program and the Redis instance are each bound to a CPU core, there is a potential risk: if the cores they are bound to are not on the same CPU socket, the Redis instance has to access memory across CPU sockets when reading network data, which takes more time.

This may sound a bit abstract, so let me explain with the help of a diagram.

As you can see in the diagram, the interrupt handling program is bound to a core on CPU Socket 1, while the Redis instance is bound to CPU Socket 2. At this time, the network data read by the interrupt handling program is stored in the local memory of CPU Socket 1. When the Redis instance needs to access network data, it needs Socket 2 to send memory access commands to Socket 1 through the bus for remote access, which incurs a relatively large time overhead.

In tests we conducted earlier, accessing memory across CPU sockets increased memory access latency by 18% compared with accessing the socket's local memory, which naturally increases the request processing latency of Redis.

Therefore, to avoid Redis accessing network data across CPU sockets, it is best to bind the interrupt program and the Redis instance to the same CPU socket. This way, the Redis instance can directly read network data from local memory, as shown in the following diagram:
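In practice, besides taskset with explicit core numbers, the numactl tool mentioned earlier can pin a process to an entire NUMA node together with that node's local memory. A sketch, assuming the interrupt handler is bound to cores on node 0:

numactl --cpunodebind=0 --membind=0 ./redis-server

This guarantees both CPU and memory locality on that socket, at the cost of restricting the instance's memory allocation to one node.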

However, note that under the NUMA architecture, the numbering of CPU cores does not simply run through all the logical cores of one socket before moving on to the next socket. Instead, the first logical core of every physical core is numbered consecutively, going through the sockets in order, and only then is the second logical core of every physical core numbered, again in the same order.

Let me give you an example. Suppose there are 2 CPU sockets, each socket has 6 physical cores, and each physical core has 2 logical cores, for a total of 24 logical cores. We can use the lscpu command to view the numbering of these cores:

lscpu

Architecture: x86_64
...
NUMA node0 CPU(s): 0-5,12-17
NUMA node1 CPU(s): 6-11,18-23
...

As you can see, the CPU core numbering for NUMA node0 is 0 to 5, 12 to 17. Among them, 0 to 5 are the numbering of the first logical core of the 6 physical cores on node0, and 12 to 17 are the numbering of the second logical core of the corresponding physical cores. The CPU core numbering rule for NUMA node1 is the same as node0.

Therefore, when binding the cores, we must be careful not to assume that the first 12 logical cores on the first socket are numbered from 0 to 11. Otherwise, the interrupt handling program and the Redis instance may be bound to different CPU sockets.

For example, if we bind the interrupt handling program and the Redis instance to CPU cores numbered 1 and 7 respectively, they are still on 2 different CPU sockets, and the Redis instance still needs to access network data across sockets.

Therefore, you must pay attention to the numbering method of CPU cores under the NUMA architecture to avoid binding the wrong cores.
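Rather than memorizing the numbering rule, you can also query the topology directly. On Linux, sysfs exposes each logical core's hyper-thread siblings, and lscpu -p prints a machine-readable CPU-to-core-to-socket mapping:

cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
lscpu -p=CPU,CORE,SOCKET

With the example layout above, the first command would print 0,12, confirming that logical cores 0 and 12 share a physical core.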

Let’s briefly summarize what we have learned just now. In a multi-core CPU scenario, using the taskset command to bind the Redis instance to a core can reduce the overhead of the Redis instance being scheduled and executed on different cores, avoiding high tail latency; in a multi-CPU NUMA architecture, if you bind the interrupt handling program to cores, it is recommended to bind the Redis instance and the interrupt handling program to different cores on the same CPU socket to avoid the time overhead of Redis accessing network data in memory across sockets.

However, “there are two sides to every coin”, and core binding also has certain risks. Next, let’s learn about its potential risks and solutions.

Risks and Solutions of Core Binding #

Redis has a main thread, as well as subprocesses for RDB generation and AOF rewriting (please review [Lesson 4] and [Lesson 5]). In addition, we learned about Redis background threads in [Lesson 16].

When we bind a Redis instance to a single CPU logical core, the subprocesses, background threads, and the Redis main thread all compete for that core's CPU resources. If a subprocess or background thread occupies the CPU, the main thread is blocked, and Redis request latency increases.

To address this issue, I will introduce two solutions: binding one physical core to one Redis instance and optimizing the Redis source code.

Solution 1: Binding one physical core to one Redis instance

When binding cores to Redis instances, instead of binding one instance to one logical core, we should bind it to one physical core, meaning both logical cores of one physical core should be utilized.

Let’s take the previous NUMA example. The CPU core numbers for NUMA node0 are 0 to 5 and 12 to 17, where 0 and 12, 1 and 13, 2 and 14, and so on are the two logical cores of the same physical core. Therefore, when binding cores, we should use two logical cores belonging to the same physical core. For example, the following command binds the Redis instance to logical cores 0 and 12, both of which belong to the first physical core on node0.

taskset -c 0,12 ./redis-server

Compared to binding only one logical core, binding the Redis instance to a physical core allows the main thread, subprocesses, and background threads to share the use of two logical cores, which can alleviate CPU resource competition to some extent. However, since only two logical cores are used, CPU competition between them still exists. If you want to further reduce CPU competition, let me introduce another solution.

Solution 2: Optimizing the Redis source code

This solution involves modifying the Redis source code to bind the subprocesses and background threads to different CPU cores.

If you are not familiar with the Redis source code, don’t worry, as this is a common practice of core binding achieved through programming. After becoming familiar with the source code, you can apply this solution or use it in other core binding scenarios.

Next, I will first introduce the general procedure and then explain which part of the Redis source code this procedure corresponds to.

When implementing core binding through programming, we need to use the following: one data structure cpu_set_t and three functions: CPU_ZERO, CPU_SET, and sched_setaffinity. Let me explain them.

  • The cpu_set_t data structure is a bitmap, with each bit representing one CPU logical core on the server.
  • The CPU_ZERO function takes a cpu_set_t bitmap as its parameter and sets all bits of the bitmap to 0.
  • The CPU_SET function takes a logical core number and a cpu_set_t bitmap as parameters and sets the bit corresponding to that core number to 1.
  • The sched_setaffinity function takes a process/thread ID (0 means the calling thread/process), the size of the bitmap, and a pointer to a cpu_set_t; it checks which bits of the cpu_set_t are 1 and binds the process/thread to the corresponding logical cores.

So, how do we combine these three functions together to achieve core binding in programming? It’s simple. We can follow four steps.

  • Step 1: Create a bitmap variable of the cpu_set_t structure.
  • Step 2: Use the CPU_ZERO function to set all bits of the bitmap in the cpu_set_t structure to 0.
  • Step 3: Use the CPU_SET function to set the bits corresponding to the desired logical core numbers to 1 in the cpu_set_t bitmap.
  • Step 4: Use the sched_setaffinity function to bind the program to the logical cores where the corresponding bits in the bitmap of the cpu_set_t structure are 1.

Now, I will explain specifically how to bind the background threads and subprocesses to different cores by showing examples.

Let’s start with the background threads. To help you better understand core binding through programming, you can take a look at this sample code that binds a thread to a core:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>
#include <stdint.h>

// Thread function: binds the calling thread to the logical core passed in arg
void *worker(void *arg){
    int bind_cpu = (int)(intptr_t)arg; // Logical core number passed by the creator
    cpu_set_t cpuset;                  // Create a bitmap variable
    CPU_ZERO(&cpuset);                 // Set all bits of the bitmap variable to 0
    CPU_SET(bind_cpu, &cpuset);        // Set the bit corresponding to bind_cpu to 1
    // pid 0 means the calling thread: bind it to the cores whose bits are 1
    sched_setaffinity(0, sizeof(cpuset), &cpuset);

    // Actual thread function work
    return NULL;
}

int main(){
    pthread_t pthread1;
    // Bind the created pthread1 to logical core 3
    pthread_create(&pthread1, NULL, worker, (void *)(intptr_t)3);
    pthread_join(pthread1, NULL);
    return 0;
}
```

For Redis, background threads are created in the bioProcessBackgroundJobs function in the bio.c file. This function plays the same role as the worker function in the example above, so we can add the four binding steps there to bind each background thread to a core different from the main thread's.

Similar to binding a thread to a core, when creating a child process using fork, we can also implement the four steps mentioned earlier in the child process code. The sample code is as follows:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(){
   // Use fork to create a child process
   pid_t p = fork();
   if(p < 0){
      printf("fork error\n");
   }
   // Child process code section
   else if(!p){
      cpu_set_t cpuset;    // Create a bitmap variable
      CPU_ZERO(&cpuset);   // Set all bits of the bitmap variable to 0
      CPU_SET(3, &cpuset); // Set the bit for logical core 3 to 1
      sched_setaffinity(0, sizeof(cpuset), &cpuset);  // Bind the child process to logical core 3
      // Actual child process work
      exit(0);
   }
   ...
}
```

For Redis, the child processes responsible for generating RDB and AOF log rewrite are implemented in the following two functions:

  • File: rdb.c, Function: rdbSaveBackground
  • File: aof.c, Function: rewriteAppendOnlyFileBackground

Both of these functions use fork to create child processes, so we can add the four steps of binding the process to a core in the child process code section.

Using the source-code optimization approach, we can not only bind the Redis instance to specific cores to avoid the performance impact of core switching, but also run the child processes, background threads, and main thread on different cores to avoid CPU resource competition among them. Compared with using taskset to bind cores, this approach further reduces the risks that come with core binding.

Summary #

In this lesson, we learned about the impact of CPU architecture on Redis performance. First, we understood the mainstream multi-core CPU architecture and NUMA architecture.

Under a multi-core CPU architecture, if the Redis instance is repeatedly scheduled onto different cores, frequent context switches occur, which increases Redis's execution time and raises the tail latency observed by clients. Therefore, it is recommended to bind the Redis instance to a core at runtime, so that the L1 and L2 caches on that core can be reused, reducing response latency.

To improve Redis network performance, the network interrupt handler is sometimes bound to a specific CPU core as well. In this case, on a NUMA server, if the Redis instance is scheduled onto a different CPU socket from the interrupt handler, it has to access network data across sockets, which degrades Redis performance. Therefore, I suggest binding the Redis instance and the network interrupt handler to different cores on the same CPU socket, which improves Redis runtime performance.

Although CPU binding can help reduce request execution time for Redis, besides the main thread, Redis also has sub-processes for RDB and AOF rewriting, as well as background threads for lazy deletion introduced in version 4.0. When the Redis instance is bound to a logical core, these sub-processes and background threads will compete for CPU resources with the main thread, affecting Redis performance. Therefore, I have two recommendations for you:

  • If you don’t want to modify the Redis code, you can bind Redis instances to physical cores, so that the main thread, sub-processes, and background threads can share two logical cores on a physical core.
  • If you are familiar with the Redis source code, you can add CPU binding operations in the source code to bind the sub-processes and background threads to different cores, avoiding CPU resource competition with the main thread. If you are not familiar with the source code, don’t worry too much: Redis 6.0 adds configuration options for CPU core binding (a configuration sketch follows this list), and I will introduce the latest features of Redis 6.0 in Lesson 38.
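As a quick preview of that feature, Redis 6.0's redis.conf adds dedicated binding options. The option names below are from Redis 6.0; the core numbers are illustrative and assume the NUMA node0 layout from the earlier lscpu example:

server_cpulist 0-5        # Main thread (and I/O threads)
bio_cpulist 12,13         # Background I/O threads (e.g., lazy deletion)
aof_rewrite_cpulist 14    # AOF rewrite child process
bgsave_cpulist 15         # RDB bgsave child process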

Low latency is a goal we pursue constantly with Redis, and multi-core CPUs and NUMA architectures have become mainstream server configurations. I hope you can master the core-binding optimizations and apply them in practice.

One Question per Lesson #

As usual, I have a small question for you.

On a server with 2 CPU Sockets (each socket has 8 physical cores), we deployed a Redis sharded cluster with 8 instances (all instances are master nodes, without master-slave relationships). Now we have two options:

  1. Run 8 instances on the same CPU Socket and bind them to the 8 CPU cores.
  2. Run 4 instances on each of the 2 CPU Sockets and bind them to the corresponding cores on each socket.

If we don’t consider the impact of network data retrieval, which option would you choose?

Feel free to write down your thoughts and answer in the comments. If you find this helpful, you are also welcome to share today’s content with your friends. See you in the next lesson.