49 Case Study: High CPU Utilization of Kernel Threads, What Should I Do? #

Hello, I’m Ni Pengfei.

In the previous issue, we discussed the analysis, localization, and optimization methods for intermittent packet loss in the network. Let’s have a brief recap.

Packet loss in the network usually leads to serious performance degradation, especially for TCP, where packet loss usually indicates network congestion and triggers retransmissions, which in turn increase network latency and reduce throughput.

When analyzing packet-loss problems, we usually start from the Linux packet send and receive process and, combining it with the principles of the TCP/IP protocol stack, analyze the problem layer by layer.

In fact, when troubleshooting network problems, we often run into another issue: high CPU usage of kernel threads. For example, in high-concurrency scenarios, the CPU usage of the kernel thread ksoftirqd is usually relatively high. Recalling what we learned in the CPU and network modules, you should know that this is caused by the software interrupts used for network receive and transmit.

To analyze a kernel thread with high CPU usage, such as ksoftirqd, the analysis methods I introduced earlier generally require other performance tools for auxiliary analysis.

Take ksoftirqd as an example. If you suspect a network problem, you can use tools such as sar and tcpdump to analyze the network traffic and further confirm the root cause.
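
For example (purely illustrative commands; replace eth0 with the actual interface name on your machine):

    # Watch per-interface packet and throughput rates once per second
    $ sar -n DEV 1

    # Capture TCP traffic on port 80 to inspect what is being received
    $ tcpdump -i eth0 -n tcp port 80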

Obviously, though, this approach takes more steps in practice and is hardly a quick fix. You are probably eager to know whether there is a simpler way to directly observe the behavior of kernel threads and quickly locate the bottleneck.

Today, I will continue to use ksoftirqd as an example to show you how to analyze performance issues with kernel threads.

Kernel Threads #

Since we are going to discuss performance issues related to kernel threads, let’s first take a look at the common kernel threads.

We know that in Linux, the ancestor of all user-space processes is the process with PID 1, the init process. For example, in most mainstream Linux distributions, the init process is systemd, and all other user-space processes are managed by systemd.

Looking at the various processes in Linux, though, you will find that aside from user-space processes there are also a large number of kernel threads. By rights, kernel threads should start before user-space processes, yet systemd only manages user-space processes. So, who manages the kernel threads?

In fact, during the Linux startup process, there are three special processes, which are the three processes with the smallest PID numbers.

  • Process 0 is the idle process, the very first process created by the system. After creating processes 1 and 2, it turns into the idle task, which runs whenever no other task is runnable on the CPU.

  • Process 1 is the init process, usually systemd, running in user space and responsible for managing other user-space processes.

  • Process 2 is the kthreadd process, running in kernel space and responsible for managing kernel threads.
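
As a quick check (purely illustrative; process 0 will not show up in ps because it is not a regular process), you can confirm the roles of PID 1 and PID 2 on your own machine:

    # PID 1 is the init process (systemd on most distributions) and PID 2 is kthreadd; both have PPID 0
    $ ps -o pid,ppid,comm -p 1,2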

Therefore, to find kernel threads, we only need to start from process 2 and find its descendant processes. For example, you can use the ps command to find the child processes of kthreadd:

    $ ps -f --ppid 2 -p 2
    UID         PID   PPID  C STIME TTY          TIME CMD
    root          2      0  0 12:02 ?        00:00:01 [kthreadd]
    root          9      2  0 12:02 ?        00:00:21 [ksoftirqd/0]
    root         10      2  0 12:02 ?        00:11:47 [rcu_sched]
    root         11      2  0 12:02 ?        00:00:18 [migration/0]
    ...
    root      11094      2  0 14:20 ?        00:00:00 [kworker/1:0-eve]
    root      11647      2  0 14:27 ?        00:00:00 [kworker/0:2-cgr]

From the above output, you can see that the names (CMD) of kernel threads are all enclosed in square brackets (a point we also touched on earlier). So an even simpler method is to search directly for processes whose names are wrapped in square brackets. For example:

    $ ps -ef | grep "\[.*\]"
    root         2     0  0 08:14 ?        00:00:00 [kthreadd]
    root         3     2  0 08:14 ?        00:00:00 [rcu_gp]
    root         4     2  0 08:14 ?        00:00:00 [rcu_par_gp]
    ...

Understanding the basic functions of kernel threads is very helpful for troubleshooting. For example, we have mentioned ksoftirqd in the case of software interrupts. It is a kernel thread used to handle software interrupts, and there is one for each CPU.

If you are aware of this, when encountering a high CPU usage issue with ksoftirqd in the future, you will first suspect it is a software interrupt problem, and then further analyze it from the perspective of software interrupts.
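
For instance (a quick sanity check rather than part of today's case; the softirq types and column layout in /proc/softirqs vary by kernel version), you can list the per-CPU ksoftirqd threads and watch which softirq counters are growing:

    # One ksoftirqd kernel thread per CPU (ksoftirqd/0, ksoftirqd/1, ...)
    $ ps -ef | grep "\[ksoftirqd" | grep -v grep

    # Highlight the softirq counters that change between refreshes (e.g. NET_RX)
    $ watch -d cat /proc/softirqs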

In addition to kthreadd and ksoftirqd mentioned above, there are many other kernel threads that we frequently encounter in performance analysis, for example the following (a quick command to spot them on your own system follows the list):

  • kswapd0: Used for memory reclamation. I have introduced its working principle in the Swap High case.

  • kworker: Used to execute kernel work queues, divided into CPU-bound kworkers (named kworker/CPU:ID) and unbound kworkers (named kworker/uPOOL:ID).

  • migration: Migrates processes between CPUs during active load balancing. There is one migration kernel thread per CPU.

  • jbd2/sda1-8: jbd stands for Journaling Block Device, which provides journaling functionality for file systems to ensure data integrity; sda1-8 in the name indicates the disk partition name and device number. Each disk partition using the ext4 file system has a jbd2 kernel thread.

  • pdflush: Used to write dirty pages in memory (pages that have been modified but not yet written to disk) to disk. (From Linux kernel version 3.10, it has been merged into kworker).
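
A quick way to see which of these threads exist on your own system (just a sketch; the exact names and counts depend on your kernel version, CPU count, and mounted file systems):

    $ ps -ef | grep -E "\[(kswapd|kworker|migration|jbd2|ksoftirqd)" | head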

Understanding these kernel threads that are prone to performance issues helps us quickly locate performance bottlenecks. Next, let’s move on to today’s case.

Case Preparation #

Today’s case is still based on Ubuntu 18.04, which is also applicable to other Linux systems. The environment I used for this case is as follows:

  • Machine configuration: 2 CPUs, 8GB memory.

  • Pre-installed tools: docker, perf, hping3, curl, and so on. For example: apt install docker.io linux-tools-common hping3.

In this case, two virtual machines are involved, and I drew a diagram to represent their relationship.

You need to open two terminals and log in to these two virtual machines separately and install the above-mentioned tools.

Note that all commands below are assumed to be executed as the root user. If you log in to the system as a regular user, please run the command sudo su root to switch to the root user.

If there are any problems during the installation process, you can search online for solutions. If you still can’t solve them, remember to ask me in the comments section.

With that, the preparation work is complete. Next, we will officially enter the operation phase.

Case Study #

After installation, we first run the following command in the first terminal to run a basic Nginx application:

# Run the Nginx service and open port 80 to the outside world
$ docker run -itd --name=nginx -p 80:80 nginx

Then, in the second terminal, use curl to access the port Nginx is listening on and confirm that Nginx has started properly. Assuming 192.168.0.30 is the IP address of the virtual machine where Nginx runs, you should see the following output after running the curl command:

$ curl http://192.168.0.30/
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
...

Next, still in the second terminal, run the hping3 command to simulate Nginx client requests:

# -S sets the TCP SYN flag (synchronize sequence number)
# -p 80 sets the destination port to 80
# -i u10 sends a packet every 10 microseconds
# Note: if you don't see an obvious effect, try lowering u10 to a smaller value such as u5 or even u1
$ hping3 -S -p 80 -i u10 192.168.0.30

Now, let’s go back to the first terminal, and you should notice that the system’s response has significantly slowed down. Let’s run the top command and observe the CPU usage of the system and processes:

$ top
top - 08:31:43 up 17 min,  1 user,  load average: 0.00, 0.00, 0.02
Tasks: 128 total,   1 running,  69 sleeping,   0 stopped,   0 zombie
%Cpu0:  0.3 us,  0.3 sy,  0.0 ni, 66.8 id,  0.3 wa,  0.0 hi, 32.4 si,  0.0 st
%Cpu1:  0.0 us,  0.3 sy,  0.0 ni, 65.2 id,  0.0 wa,  0.0 hi, 34.5 si,  0.0 st
KiB Mem:  8167040 total,  7234236 free,   358976 used,   573828 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  7560460 avail Mem

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
9 root      20   0       0      0      0 S   7.0  0.0   0:00.48 ksoftirqd/0
18 root      20   0       0      0      0 S   6.9  0.0   0:00.56 ksoftirqd/1
2489 root      20   0  876896  38408  21520 S   0.3  0.5   0:01.50 docker-containe
3008 root      20   0   44536   3936   3304 R   0.3  0.0   0:00.09 top
1 root      20   0   78116   9000   6432 S   0.0  0.1   0:11.77 systemd
...

From the output of top, you can see that the soft interrupt (si) usage of both CPUs exceeds 30%, and the processes with the highest CPU usage are exactly the soft interrupt kernel threads ksoftirqd/0 and ksoftirqd/1.

Although we already know the basic function of ksoftirqd and can guess that the high CPU usage is caused by a large amount of network transmission, we don’t know the specific logic it is executing.

There are many ways to observe the behavior of an ordinary process, such as strace, pstack, lsof, and so on. But these tools do not work for kernel threads. For example, if you run pstack on ksoftirqd/0 (process ID 9), or check its call stack via /proc/9/stack, you will get the following output:

$ pstack 9
Could not attach to target 9: Operation not permitted.
detach: No such process


$ cat /proc/9/stack
[<0>] smpboot_thread_fn+0x166/0x170
[<0>] kthread+0x121/0x140
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff

Clearly, pstack reports that it is not allowed to attach to process 9, while the /proc/9/stack approach does return a call stack, but only the one where the thread is currently sleeping (smpboot_thread_fn); it tells us nothing about what ksoftirqd actually does when it is handling soft interrupts.

Are there any other methods to observe the behavior of the kernel thread ksoftirqd?

Since it is a kernel thread, we should naturally use the mechanisms provided by the kernel. Let’s review the CPU performance tools we have used before, and I believe you still remember perf, the built-in kernel performance profiling tool.

perf can sample a specified process or event and can also summarize the entire call chain in the form of a call stack. Let's use perf to analyze process number 9, that is, ksoftirqd/0.

Continue with the command perf record in Terminal 1, and specify process ID 9 to record the behavior of ksoftirqd:

# Record for 30 seconds and exit
$ perf record -a -g -p 9 -- sleep 30

Wait a moment after the above command finishes, and then open perf's summary report with perf report.
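
The command needs no extra options here, since by default it reads the perf.data file that perf record just wrote in the current directory:

$ perf report

In the report, use the up and down arrow keys as well as the Enter key to expand the ksoftirqd entry with the highest proportion, and you will obtain the following call graph: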

From this graph, you can clearly see the call processes that ksoftirqd executes the most. Although you may not be familiar with the kernel source code, through these functions, you can roughly understand its call stack process.

  • net_rx_action and netif_receive_skb indicate that this is receiving network packets (rx means receive).
  • br_handle_frame indicates that the network packets have passed through the bridge (br means bridge).
  • br_nf_pre_routing indicates the execution of netfilter’s PREROUTING on the bridge (nf means netfilter). And we already know that PREROUTING is mainly used to perform DNAT, so we can speculate that DNAT is happening here.
  • br_pass_frame_up indicates that, after bridge processing, the packet is passed up to the bridge device (docker0 here) for further processing, for example being received again on this new interface and going through netfilter filtering rules.

Are our speculations correct? In fact, in our case, we started a container using Docker, and Docker automatically creates a virtual network card for the container, bridges it to the docker0 bridge, and configures NAT rules. This process is shown in the following figure:
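
If you want to verify this setup on the Nginx virtual machine yourself (an optional check, not required for the case; the interface and chain names below are Docker's defaults and may differ in your environment):

# The container's veth interface should show up as attached to the docker0 bridge
$ ip link show master docker0

# Docker's DNAT rule for the published port 80 lives in the nat table's DOCKER chain
$ iptables -t nat -nL DOCKER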

Of course, the call chain in the perf report interface can be expanded further, but unfortunately my screen is not big enough: if I expand more levels, the deepest frames run off the screen. So even though we can see most of the call chain, we cannot guarantee that nothing interesting is hiding in the parts that are cut off.

So, is there a better way to view the information of the entire call stack?

Flame Graph #

To address the shortcomings of perf's summary display, Brendan Gregg invented the Flame Graph, which visualizes the summary results as a vector image in a much more intuitive way. The following is an example of a Flame Graph for MySQL.

(Image from Brendan Gregg’s blog)

This graph resembles a flickering flame, hence the name Flame Graph. To understand the Flame Graph, it is important to differentiate the meanings of the horizontal and vertical axes.

  • The horizontal axis represents the number of samples and the sample ratio. A wider horizontal span for a function indicates a longer execution time. Multiple functions in the same layer are sorted alphabetically.

  • The vertical axis represents the call stack, expanding from bottom to top. In other words, the function below in the stack is the parent function of the function above it. Consequently, the deeper the call stack, the higher the vertical axis.

Additionally, it is worth noting that the colors in the graph do not have any specific meaning; they are simply used to differentiate between different functions.

The Flame Graph is an interactive vector image (SVG), so it also supports some dynamic features. For example, when you hover over a function with the mouse, it automatically displays the number of samples and the sample ratio for that function; when you click on a function, the Flame Graph zooms in on that layer and all the layers above it, allowing you to observe the details of the call stacks at the top.

The aforementioned example of the Flame Graph for MySQL represents CPU utilization. This type of Flame Graph is also known as an on-CPU Flame Graph. Based on the goal of performance analysis, Flame Graphs can be categorized into the following types:

  • On-CPU Flame Graph: Represents the CPU utilization and is used in scenarios with high CPU usage.

  • Off-CPU Flame Graph: Represents the blocking situations where the CPU is waiting for I/O, locks, and other resources.

  • Memory Flame Graph: Represents the allocation and deallocation of memory.

  • Hot/Cold Flame Graph: Combines the on-CPU and off-CPU Flame Graphs to provide a comprehensive display.

  • Differential Flame Graph: Represents the difference between two Flame Graphs, with red indicating growth and blue indicating decay. Differential Flame Graphs are commonly used to compare the performance effects before and after system changes, in different scenarios and periods.

Now that we understand the meaning of the Flame Graph and how to view it, let’s go back to the case and use the Flame Graph to observe the records obtained from perf record.

Flame Graph Analysis #

First, we need to generate the flame graph. Let’s start by downloading a few tools that can convert perf record data into flame graphs. These tools are available on https://github.com/brendangregg/FlameGraph. You can download them by running the following commands:

$ git clone https://github.com/brendangregg/FlameGraph
$ cd FlameGraph

With the scripts downloaded, there are three main steps to generate a flame graph (sketched one by one after this list):

  1. Execute perf script to convert the perf record data into readable sample records.
  2. Execute the stackcollapse-perf.pl script to merge the call stack information.
  3. Execute the flamegraph.pl script to generate the flame graph.
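
Run one step at a time, the process looks roughly like this (a sketch, assuming the perf.data file is at /root/perf.data, that you are inside the FlameGraph directory cloned above, and that the intermediate file names are arbitrary):

# Step 1: convert the binary perf.data into readable sample records
$ perf script -i /root/perf.data > out.perf

# Step 2: fold each sample's call stack into one line per stack
$ ./stackcollapse-perf.pl out.perf > out.folded

# Step 3: render the folded stacks as an interactive SVG flame graph
$ ./flamegraph.pl out.folded > ksoftirqd.svg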

However, on Linux, we can simplify the process by using pipes to execute the three steps in one command. Assuming the path to the file generated by perf record is /root/perf.data, you can directly generate the flame graph by running the following command:

$ perf script -i /root/perf.data | ./stackcollapse-perf.pl --all | ./flamegraph.pl > ksoftirqd.svg

After the command is successfully executed, open ksoftirqd.svg in a browser to view the generated flame graph. It should look like the following image:

ksoftirqd.svg

According to the principles of the flame graph, this graph should be viewed from bottom to top, analyzing the most frequently executed functions along the widest part of the call stack. The result we see here is similar to the earlier perf report, but much more intuitive. The large stack in the middle is the area we need to focus on.

By following the call stack from bottom to top (along the blue arrows in the graph), we can obtain the same results as the earlier perf report:

  • Initially, we have net_rx_action and netif_receive_skb for network packet processing.
  • Next, we have br_handle_frame and br_nf_pre_routing for receiving and executing netfilter hooks in the bridge.
  • Further up, we have br_pass_frame_up and netif_receive_skb, which pass the packet up from the bridge so that it is received and processed again on the bridge device.

However, the calls above ip_forward are too narrow to read here. To see what happens after ip_forward, click on ip_forward to expand the upper part of this call stack:

Expanded Call Stack

This allows us to further examine the behavior after ip_forward, which involves sending the network packets out. Based on this call process and combining it with our earlier knowledge of network packet transmission and the TCP/IP protocol stack, we can identify that network reception, the bridge, and netfilter calls are all important factors contributing to high CPU usage in soft interrupts, and potential bottlenecks affecting network performance.

However, if we recall the process of network packet transmission, we may notice that many steps are missing.

For example, this call stack does not include any TCP-related calls, nor functions related to connection tracking (conntrack). Those paths are in fact represented by the smaller flames elsewhere in the graph. You can click the “Reset Zoom” button in the top-left corner to return to the complete flame graph and examine the call stacks of the other, smaller flames.

Therefore, keep one thing in mind when reading the graph: any vertical slice you look at represents only the call stack of its topmost function, not the complete execution flow of the kernel's network processing.

Additionally, the entire flame graph does not include any time-related factors, so we cannot determine the execution sequence of the various functions horizontally.

At this point, we have identified the most frequently executed function call stack within the ksoftirqd kernel thread, and the functions in this stack are the potential sources of performance bottlenecks. With this information, we have a starting point for further analysis and optimization.

Summary #

You may find today’s case familiar. In fact, this case is exactly the software interrupt case in our CPU module column.

At that time, we started from the perspective of software interrupt CPU usage and used network packet capture to find the source of the bottleneck, confirming that it was caused by the large number of SYN packets sent by the test machine. With today's perf and flame graph methods, we went a step further and identified the hot functions of the software interrupt kernel thread, which points directly at the potential bottlenecks and optimization directions.

In fact, if you encounter abnormal resource usage of kernel threads, many commonly used process-level performance tools may not be helpful. At this time, you can use the perf provided by the kernel to observe their behavior, find hot functions, and further locate performance bottlenecks. Of course, the summary report generated by perf is not intuitive enough, so I also recommend using flame graphs to assist in debugging.

In fact, the flame graph method is also applicable to ordinary processes. For example, when analyzing performance issues in various application scenarios such as Nginx, MySQL, etc., flame graphs can help you locate hot functions and identify potential performance problems more quickly.

Reflection #

Finally, I would like to invite you to discuss the performance issues you have encountered with kernel threads. How do you analyze their root causes? And how do you solve them? You can summarize your thoughts based on my presentation.

Feel free to leave a comment and discuss with me. You are also welcome to share this article with your colleagues and friends. Let’s practice in real situations and make progress through communication.