17 Basics: How Does the CPU Execute Tasks? #

Hello, I am Shao Yafang.

If you have done performance optimization before, you may have run into questions like these:

  • How can we make the CPU read data faster?
  • Why is it sometimes slower and sometimes faster when performing the same task?
  • My task is relatively important. When there is competition for CPU resources, what can I do to make sure it gets executed first?
  • How does multi-threading ensure data synchronization during parallel read and write operations?

To understand these questions, you need to understand how the CPU executes tasks. Only by understanding the execution logic of the CPU can you better control the execution of your tasks and achieve better performance.

How does the CPU read and write data? #

Let me first show you the architecture of the CPU; only by understanding this architecture can you better understand how the CPU executes instructions. The architecture is shown in the following diagram:

As you can see, in a modern processor a physical CPU usually contains more than one core, shown as Core 0 and Core 1 in the diagram. Each core has its own L1 Cache, which is further divided into a data cache (dCache) and an instruction cache (iCache), corresponding to L1d and L1i in the diagram. The L1 Cache is private to its core and cannot be seen by other cores. The two cores in the same physical CPU share the L2 Cache, which other physical CPUs cannot see, and all physical CPUs share the L3 Cache. This is a typical CPU architecture.
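
If you want to check this topology on your own machine, Linux exposes it through lscpu and sysfs. The exact cache sizes and index numbers below depend on your hardware, so treat this only as an illustration of where to look:

$ lscpu | grep -i cache
$ cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
$ cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list

The first command lists the L1d/L1i/L2/L3 cache sizes, the second prints the cache line size (typically 64 bytes), and the third shows which CPUs share the last-level cache (index3 is usually the L3 cache on x86 machines).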

As you can see, outside the CPU there are also other storage media such as memory (DRAM) and disk; together they form the storage hierarchy of the system architecture, shown as follows:

In this “hierarchy”, the lower you go, the larger the storage capacity, but the slower the access speed. Jeff Dean has studied the access latency of CPUs to the various storage media; you can refer to the latency numbers he compiled, which detail the access latency of each storage level. These are figures we must know for performance optimization. Don’t underestimate them: in some scenarios, the latency differences between these storage levels can lead to very significant performance differences.

Let’s take the access latency of the cache (L1 0.5 ns, L2 10 ns) and of memory (100 ns) as an example: a memory access is roughly 200 times slower than an L1 cache hit. I will give you a practical example to illustrate the impact of such latency differences on performance.

Previously, when I was working on a network tracing system, I submitted a PATCH to the Linux kernel that assigned a number to each connection in order to track TCP connections more conveniently. You can refer to the commit net: init sk_cookie for inet socket for the details of this PATCH. The general idea is that every time a new TCP connection is created (you can review the previous module for the relevant TCP background), the cookie_gen counter in the net namespace is used to generate a cookie, which is assigned to the new connection.

However, after this PATCH was merged, Google engineer Eric Dumazet found that network throughput dropped by about 24% in his SYN Flood tests. Analysis showed that the cause was that all TCP connections in the net namespace shared cookie_gen. In high-concurrency scenarios, a large number of new TCP connections are created in an instant, which turned cookie_gen into a piece of hot data held in the CPU caches. Every time cookie_gen is modified, the cached copies of it become invalid, so when other new connections need this data they have to read it from memory again. As you know, memory latency is much higher than cache latency, and this led to a significant performance degradation. This is a typical instance of False Sharing, that is, the Cache False Sharing problem.

Because this PATCH caused such a serious performance loss in high-concurrency connection establishment scenarios, we decided to revert it. You can see the commit Revert “net: init sk_cookie for inet socket” for details. However, cookie_gen is still useful for network tracing, such as when tracing TCP connections of cgroups using eBPF. So later, an engineer from Facebook moved it out of the net namespace structure and changed it to a global variable.

Because the net namespace is shared by many TCP connections, this structure is very prone to Cache False Sharing problems. Eric Dumazet also addressed a Cache False Sharing problem in this structure: net: reorder ‘struct net’ fields to avoid false sharing.

Next, let’s take a look at what the Cache False Sharing problem is.

As shown in the diagram above, two different threads are running in parallel on two CPUs. Each reads a different piece of data from memory, and the two pieces of data have contiguous physical addresses and lie in the same cache line. Since a CPU reads data from memory into the cache in units of cache lines, this cache line is loaded into the caches of both CPUs. The two threads then each modify their own piece of data, and every such write invalidates the entire cache line. So although the two threads modify different data, because the data lies in the same cache line, whenever a thread on one CPU writes, it invalidates that cache line on the other CPU. The thread on the other CPU then takes a cache miss on its next read or write and has to go back to memory, which greatly reduces performance.

Cache False Sharing problems can be considered performance killers. When writing code, we must pay attention to frequently modified shared data; if necessary, we can place it in a different cache line from other hot data to avoid the False Sharing problem, just as kernel code often does with ____cacheline_aligned.
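
To make this more concrete, here is a minimal user-space sketch (a made-up example, not taken from the kernel) that you can build with gcc -O2 -pthread. It assumes 64-byte cache lines; on a multi-core machine the “shared cache line” phase is typically noticeably slower than the “separate cache lines” phase, although the exact gap depends on the hardware:

#include <pthread.h>
#include <stdio.h>

#define LOOPS 100000000UL

struct shared_line {                 /* a and b fall in the same cache line */
    volatile unsigned long a;
    volatile unsigned long b;
};

struct separate_lines {              /* b is pushed onto its own 64-byte line */
    volatile unsigned long a;
    volatile unsigned long b __attribute__((aligned(64)));
};

static struct shared_line s1;
static struct separate_lines s2;

static void *bump(void *arg)         /* increment one counter LOOPS times */
{
    volatile unsigned long *p = arg;

    for (unsigned long i = 0; i < LOOPS; i++)
        (*p)++;
    return NULL;
}

static void run(volatile unsigned long *a, volatile unsigned long *b, const char *name)
{
    pthread_t ta, tb;

    pthread_create(&ta, NULL, bump, (void *)a);
    pthread_create(&tb, NULL, bump, (void *)b);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    printf("%s done\n", name);       /* time each phase with time(1) or perf stat */
}

int main(void)
{
    run(&s1.a, &s1.b, "shared cache line");
    run(&s2.a, &s2.b, "separate cache lines");
    return 0;
}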

So how do we observe Cache False Sharing problems? You can use the command perf c2c, but this requires support from a newer version of the kernel. However, perf can also observe cache misses, which is helpful in analyzing many performance issues.
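
For example (the sampling duration here is arbitrary), you could record contended cache lines system-wide with perf c2c, or simply count cache misses with perf stat:

$ perf c2c record -a sleep 10
$ perf c2c report
$ perf stat -e cache-references,cache-misses -a sleep 10

perf c2c report sorts the sampled cache lines by contention and shows the offsets within each line that different CPUs are accessing, which is what makes false sharing visible.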

When a CPU writes to its cache, the copies of that cache line in other CPUs’ caches are invalidated; this is what fundamentally guarantees data consistency in multi-core parallel computing. Consistency is a typical problem at the cache level of the storage hierarchy. Now let’s look at a typical problem at the memory level: races in parallel computing, that is, the contention that occurs when two CPUs operate on the same physical memory address at the same time. I will give you some simple examples to illustrate this type of problem.

Taking C language as an example:

struct foo {
    int a;
    int b;
};

In this example code, we define a structure with two members, a and b, which are contiguous in memory. If CPU 0 writes to a while CPU 1 reads b, there is no race because a and b have different addresses. However, since a and b are contiguous in memory, they may fall in the same cache line. To prevent the cache false sharing problem described earlier, we can force b to be aligned to a cache line boundary, as follows:

struct foo {
    int a;
    int b ____cacheline_aligned;
};
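
____cacheline_aligned only exists in the kernel; in user-space C you can request the same layout with the C11 alignas keyword (or the GCC/Clang __attribute__((aligned(64))) attribute). This is a sketch assuming 64-byte cache lines; the actual line size can be read from sysfs:

#include <stdalign.h>

struct foo_user {
    int a;
    alignas(64) int b;    /* b starts on its own cache line */
};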

Next, let’s look at another scenario:

struct foo {
    int a:1;
    int b:1;
};

This example defines two bit fields, a and b, which share the same address but occupy different bits. In this case, if CPU 0 writes to a (a = 1) while CPU 1 writes to b (b = 1), a race occurs: after bus arbitration, the value written later overwrites the value written earlier. This is a typical race condition in Read-Modify-Write (RMW) operations, and such scenarios require synchronization primitives, such as atomic operations.
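
As a sketch of what “using atomic operations” means here: a C bit field itself cannot be made atomic, so a common approach is to keep the flags in an integer and update them with atomic read-modify-write operations. The example below uses the GCC/Clang __atomic builtins; inside the kernel the analogous tools are set_bit() and friends. The names foo_flags, FOO_A, and FOO_B are made up for illustration:

#include <stdint.h>

#define FOO_A (1U << 0)
#define FOO_B (1U << 1)

static uint32_t foo_flags;

/* A plain "foo_flags |= FOO_A" is a load + OR + store; if another CPU does
   "foo_flags |= FOO_B" in between, one of the two updates is lost. */

static inline void foo_set_a(void)
{
    /* atomic read-modify-write: neither update can be lost */
    __atomic_fetch_or(&foo_flags, FOO_A, __ATOMIC_RELAXED);
}

static inline void foo_set_b(void)
{
    __atomic_fetch_or(&foo_flags, FOO_B, __ATOMIC_RELAXED);
}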

Speaking of bit fields, let’s look at a practical case: a patch I contributed to the Linux kernel a while ago, psi: Move PF_MEMSTALL out of task->flags. This patch adds a bit field, in_memstall, to struct task_struct, and here we do not need to consider races when multiple threads operate on this bit field at the same time. Do you know why? I’ll leave this as a post-lesson question for you to think about, and I welcome you to discuss and exchange ideas with me in the comments section. To simplify the question, I’ll give you a hint: if you look at how the bit fields in task_struct are modified, you will find that only current (the currently running thread) can write to them, while other threads can only read them. The PF_* flags in task->flags, however, can be written by other threads, not just current.

The task_struct structure in the Linux kernel is used to represent a thread, and each thread has a unique corresponding task_struct structure. It is also the basic unit for scheduling in the kernel. Let’s continue to see how the CPU selects a thread to execute.

How does the CPU choose which thread to execute? #

As you may know, a system can have a large number of threads running at the same time, possibly more than the number of CPU cores in the system. In that case these tasks need to queue up: each CPU maintains its own runqueue, in which the runnable threads are stored. The structure of this runqueue is roughly as shown in the following diagram:

Each CPU has its own runqueue, and the threads that need to be executed are added to this queue. To ensure that high-priority tasks get to run, the Linux kernel defines different scheduling classes, as shown below:

The priorities of these scheduling classes are as follows: Deadline > Realtime > Fair. When selecting the next task to execute, the Linux kernel follows this order, meaning it first selects a task from dl_rq, then from rt_rq, and finally from cfs_rq. Therefore, real-time tasks are always executed before normal tasks.
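
The selection logic can be sketched roughly like this. This is a simplified, self-contained illustration, not the kernel’s actual code (the real logic lives in pick_next_task() in kernel/sched/core.c); the helpers below are hypothetical stand-ins, one per scheduling class, each returning a runnable task from its own queue or NULL if that queue is empty:

struct rq;                             /* stand-in for the kernel's per-CPU run queue */
struct task;                           /* stand-in for task_struct */

struct task *pick_from_dl_rq(struct rq *rq);
struct task *pick_from_rt_rq(struct rq *rq);
struct task *pick_from_cfs_rq(struct rq *rq);

struct task *pick_next_task_sketch(struct rq *rq)
{
    struct task *p;

    if ((p = pick_from_dl_rq(rq)))     /* Deadline tasks first */
        return p;
    if ((p = pick_from_rt_rq(rq)))     /* then Realtime tasks */
        return p;
    return pick_from_cfs_rq(rq);       /* finally Fair (CFS) tasks */
}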

If you have tasks with a low tolerance for latency, for example in embedded systems, you can consider making them real-time tasks, for instance SCHED_FIFO tasks:

$ chrt -f -p 1 1327
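
In this command, -f selects the SCHED_FIFO policy, 1 is the real-time priority, and 1327 is the PID of the thread being changed (just an example PID). You can check the result afterwards:

$ chrt -p 1327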

By default, user threads are normal threads belonging to the Fair scheduling class and are managed by the CFS scheduler. The goal of the CFS scheduler is fairness between threads: for example, if two threads need to run on one CPU, each is allocated 50% of the CPU time. The ratio of execution time between threads can also be adjusted manually; on Linux, for instance, you can change a process’s nice value to intervene, allowing higher-priority threads to run for a longer share of time. This is the basic idea of the CFS scheduler.
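
For example (reusing the example PID 1327 from above; lowering a nice value requires root privileges), you could give that thread a larger share of CPU time like this:

$ renice -n -5 -p 1327

A lower nice value translates into a higher weight in CFS, so the thread receives a proportionally larger slice of CPU time.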

Alright, that’s all for this lesson.

Class Summary #

Let me summarize the key points of this class:

  • To understand how the CPU executes tasks, you first need to understand the architecture of the CPU.
  • The storage hierarchy of the CPU has a significant impact on the performance of large software systems and needs to be carefully considered during performance tuning.
  • The cache line false sharing issue is a commonly encountered problem in high-concurrency scenarios that you need to be aware of.
  • The number of runnable threads in the system may exceed the number of CPU cores, which causes threads to queue for the CPU and introduces some latency. If your tasks have a low tolerance for latency, you can intervene in Linux’s default scheduling policy through some means.

Homework #

The homework for this lesson is the thought question mentioned earlier: in the psi: Move PF_MEMSTALL out of task->flags PATCH, why don’t we need to consider races when multiple threads operate on the newly added bit field (in_memstall) in parallel? Feel free to discuss this in the comments section.

Thank you for reading, and if you found this lesson helpful, please feel free to share it with your friends. See you in the next lesson.