10 Case Study: High CPU Usage Due to Soft Interrupts, What Should I Do #

Hello, I’m Ni Pengfei.

In the previous article, I talked about the basic principles of soft interrupts. Let’s briefly review.

Interrupts are an asynchronous event handling mechanism used to improve a system's concurrent processing capability. When an interrupt event occurs, it triggers the interrupt handler, which is divided into two parts: the top half and the bottom half.

  • The top half corresponds to hardware interrupts and is used to quickly handle interrupts.

  • The bottom half corresponds to soft interrupts and is used to asynchronously handle unfinished work from the top half.

Soft interrupts in Linux include various types such as network transmit and receive, timers, scheduling, RCU locks, etc. We can check the /proc/softirqs file in the proc file system to observe the running status of soft interrupts.

In Linux, each CPU has a corresponding soft interrupt kernel thread, named ksoftirqd/ followed by the CPU number. When soft interrupt events fire too frequently, these kernel threads cannot process them in time because of high CPU usage, which leads to performance problems such as network latency and scheduling delays.
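
As a quick refresher, you can see these kernel threads directly in the process list; the names appear in square brackets because they are kernel threads, and the PIDs will differ on your machine:

# List the per-CPU soft interrupt kernel threads
$ ps aux | grep -v grep | grep ksoftirqd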

High CPU usage of soft interrupts is also a common performance issue. Today, I will use the case of the most common reverse proxy server, Nginx, to teach you how to analyze this situation.

Case Study #

Your Preparation #

The following case study is based on Ubuntu 18.04, but it also applies to other Linux systems. The environment I used for this case study is as follows:

  • Machine configuration: 2 CPUs, 8 GB RAM.

  • Pre-installed tools: docker, sysstat (which provides sar), hping3, and tcpdump. For example, you can run apt-get install docker.io sysstat hping3 tcpdump.

Here, I have also used three new tools: sar, hping3, and tcpdump. Let me briefly introduce them:

  • sar is a system activity report tool that can be used to view the current activity of the system, as well as configure the storage and reporting of historical statistical data.

  • hping3 is a tool that can construct TCP/IP protocol packets, which can be used for security auditing, firewall testing, etc.

  • tcpdump is a commonly used network packet capture tool, often used to analyze various network problems.

This case study involves two virtual machines, and I have drawn a diagram to represent their relationship.

You can see that one of the virtual machines runs Nginx, which simulates the web server to be analyzed; while the other machine acts as the client of the web server, used to add stress requests to Nginx. The purpose of using two virtual machines is to isolate them from each other and avoid “cross-contamination”.

Next, open two terminals, SSH into each of the two machines, and install the tools mentioned above.

As with previous case studies, all the following commands are assumed to be run as the root user. If you are logged into the system as a regular user, please run the command sudo su root to switch to the root user.

If you encounter any issues during the installation process, I encourage you to search for solutions by yourself. If you cannot solve them, feel free to ask me in the comment section. If you have already installed everything before, you can ignore this.
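
One quick way to confirm that the installation succeeded is to ask each tool for its version; this is only a minimal sanity check, and the exact version strings you see will vary:

# Confirm the tools are installed and on the PATH
$ docker --version
$ sar -V
$ hping3 --version
$ tcpdump --version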

Operation and Analysis #

After the installation is complete, let’s first run the following command in the first terminal to start the case study, which is a basic Nginx application:

# Run the Nginx service and expose port 80 to the outside
$ docker run -itd --name=nginx -p 80:80 nginx

Then, in the second terminal, use curl to access the port that Nginx listens on to confirm that Nginx is started properly. Assuming that 192.168.0.30 is the IP address of the virtual machine where Nginx resides, after running the curl command, you should see the following output:

$ curl http://192.168.0.30/
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
...

Next, in the second terminal, we run the hping3 command to simulate the client request to Nginx:

# -S sets the TCP SYN flag (synchronize sequence number), -p sets the destination port to 80
# The -i u100 indicates sending a network frame every 100 microseconds
# Note: If you don't see obvious effects during the practice, you can try reducing the value to 10 or even 1
$ hping3 -S -p 80 -i u100 192.168.0.30

Now let’s go back to the first terminal. You should have noticed some abnormal behavior. The system response seems to have slowed down significantly. Even typing a few commands in the terminal takes a long time to get a response. What should be done in this situation?

Although I already told you, when we ran the hping3 command, that this is a SYN FLOOD attack, so you would naturally want to analyze the problem from the network perspective, in a real production environment no one will tell you the cause directly.

So, I hope you can temporarily forget about the hping3 simulation of the SYN FLOOD attack, and start analyzing the system resource usage based on the observed problem, gradually finding the root cause of the problem.

So where should we start? We just noticed that even simple shell commands have slowed down significantly, so it is a good idea to check the system's overall resource usage first. Let's run top in the first terminal to see whether there is a CPU bottleneck.

# Press 1 in top to switch to displaying all CPUs
$ top
top - 10:50:58 up 1 days, 22:10,  1 user,  load average: 0.00, 0.00, 0.00
Tasks: 122 total,   1 running,  71 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.0 us,  0.0 sy,  0.0 ni, 96.7 id,  0.0 wa,  0.0 hi,  3.3 si,  0.0 st
%Cpu1  :  0.0 us,  0.0 sy,  0.0 ni, 95.6 id,  0.0 wa,  0.0 hi,  4.4 si,  0.0 st
...

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    7 root      20   0       0      0      0 S   0.3  0.0   0:01.64 ksoftirqd/0
   16 root      20   0       0      0      0 S   0.3  0.0   0:01.97 ksoftirqd/1
 2663 root      20   0  923480  28292  13996 S   0.3  0.3   4:58.66 docker-containe
 3699 root      20   0       0      0      0 I   0.3  0.0   0:00.13 kworker/u4:0
 3708 root      20   0   44572   4176   3512 R   0.3  0.1   0:00.07 top
    1 root      20   0  225384   9136   6724 S   0.0  0.1   0:23.25 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.03 kthreadd
...

Did you notice any abnormal phenomena here? Let’s go through each line starting from the first line:

  • The average load is all 0 and there is only one process in the run queue (1 running).
  • The usage of each CPU is quite low, with the highest CPU1 usage at only 4.4%, which is not high.
  • Looking at the process list, the highest CPU usage process is only 0.3%, which is not high either.

So why is the system responding slowly? Since the values of each metric are not high, let’s take a closer look at the specific meanings of these metrics. After all, even if it is the same metric, it may correspond to different performance issues in different parts and scenarios of the system.

Look carefully at the output of top. Although the two CPUs' usage is only 3.3% and 4.4% respectively, all of it goes to soft interrupts (the si column). And in the process list, the processes with the highest CPU usage are the soft interrupt kernel threads ksoftirqd. It looks like the soft interrupts are the suspect.
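
If you want to double-check where that CPU time is going, mpstat (installed as part of sysstat) reports per-CPU usage with a dedicated %soft column; a quick sketch:

# Print per-CPU usage every second, 5 times; watch the %soft column
$ mpstat -P ALL 1 5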

From the previous section, you know that since soft interrupts may be the problem, the next step is to find out which type of soft interrupt is responsible. Pause and think about which method we used last time to determine the softirq type. That's right: the proc file system again. By looking at the contents of /proc/softirqs, you can see how many soft interrupts of each type have occurred.

But what time period do these counts cover? They are cumulative counts since the system booted, so reading the file once only gives you totals, which by themselves say little about the problem at hand. What we really care about is the rate at which these counts change.

So what tool can we use to observe how a command's output changes over time? I believe you'll remember the watch command from the previous example, which runs a command periodically so you can watch its output. With the -d option, watch also highlights the parts that changed, so you can see at a glance which counters are changing fastest.

For example, in the first terminal, we run the following command:

$ watch -d cat /proc/softirqs
                        CPU0       CPU1
              HI:          0          0
           TIMER:    1083906    2368646
          NET_TX:         53          9
          NET_RX:    1550643    1916776
           BLOCK:          0          0
        IRQ_POLL:          0          0
         TASKLET:     333637       3930
           SCHED:     963675    2293171
         HRTIMER:          0          0
             RCU:    1542111    1590625

By observing the changes in the content of the /proc/softirqs file, you can see that the soft interrupts for TIMER (timer interrupts), NET_RX (network receive), SCHED (kernel scheduling), and RCU (RCU lock) are all constantly changing.

Among them, NET_RX, which represents the soft interrupts for network packet reception, has the fastest rate of change. Other types of soft interrupts, such as TIMER, SCHED, and RCU, are necessary to ensure the normal operation of Linux scheduling, timekeeping, and critical section protection, so it is normal for them to have some changes.
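
If you want to focus on just the NET_RX counters (keeping the CPU header row for context), you can filter the file inside watch, for example:

# Refresh every 2 seconds and highlight changes in the NET_RX row only
$ watch -d 'grep -E "CPU|NET_RX" /proc/softirqs'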

Next, let's continue the analysis from the network-receive soft interrupts. Since they concern network packet reception, our first step should be to observe the system's network receive traffic. You might think of many network tools, but I recommend today's protagonist: sar.

sar can be used to view the system’s network send/receive situation, and one advantage is that it can not only observe the throughput (BPS, bytes per second) of network send/receive, but also the number of frames sent/received per second (PPS, packets per second).

In the first terminal, we run the sar command with the -n DEV option to display the network send/receive report:

# -n DEV indicates displaying the network send/receive report, with data output every 1 second
$ sar -n DEV 1
15:03:46        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil
15:03:47         eth0  12607.00   6304.00    664.86    358.11      0.00      0.00      0.00      0.01
15:03:47      docker0   6302.00  12604.00    270.79    664.66      0.00      0.00      0.00      0.00
15:03:47           lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
15:03:47    veth9f6bbcd   6302.00  12604.00    356.95    664.66      0.00      0.00      0.00      0.05

Let me briefly explain sar's output. From left to right, the columns are:

  • The first column: represents the time of the report.

  • The second column: IFACE represents the network interface card.

  • The third and fourth columns: rxpck/s and txpck/s respectively represent the number of network frames received and sent per second, which is PPS.

  • The fifth and sixth columns: rxkB/s and txkB/s respectively represent the number of kilobytes received and sent per second, which is BPS.

  • The other parameters following are mostly close to 0, and obviously do not have a direct relationship with today's problem. You can ignore them for now.

Now let's take a closer look at the output:

  • For the network card eth0, the number of received network frames per second is quite large, reaching 12,607, while the number of sent network frames is relatively small, only 6,304. The number of kilobytes received per second is only 664 KB, while the number of kilobytes sent is even smaller, only 358 KB.

  • The data for docker0 and veth9f6bbcd is similar to eth0, except that the sent and received data are reversed. This is caused by Linux internal bridge forwarding. For now, you don't need to delve into it; just know that the system forwards the packets received by eth0 to the Nginx service. I will explain the specific working principle in detail in the network section later on.

From this data, have you noticed anything abnormal?

Since we suspect a problem with network receive interrupts, let's focus on eth0: the receive PPS is relatively high, reaching 12,607, while the receive throughput is very low, only 664 KB/s. Intuitively, that means the network frames must be quite small. A quick calculation confirms it: 664 * 1024 / 12607 ≈ 54 bytes, so on average each network frame is only 54 bytes. That is obviously a very small frame, which is what we commonly call the "small packet problem."

So, is there a way to know what kind of network frame this is and where it comes from?

We can use tcpdump to capture packets on eth0. We already know that Nginx is listening on port 80 and the HTTP service it provides is based on the TCP protocol. Therefore, we can specify the TCP protocol and port 80 to capture packets precisely.

Next, let’s run the tcpdump command in the first terminal, specifying the eth0 network card with the -i eth0 option, and specifying the TCP protocol on port 80 with the tcp port 80 option:

# Capture only eth0, -n to not resolve protocol names and hostnames
# tcp port 80 means only capture network frames with TCP protocol and port number 80
$ tcpdump -i eth0 -n tcp port 80
15:11:32.678966 IP 192.168.0.2.18238 > 192.168.0.30.80: Flags [S], seq 458303614, win 512, length 0
...

From the output of tcpdump, you can see that:

  • 192.168.0.2.18238 > 192.168.0.30.80 indicates that the network frame is sent from port 18238 of IP address 192.168.0.2 to port 80 of IP address 192.168.0.30, which means the network frame is sent from the machine running hping3 to the machine where Nginx is located.

  • Flags [S] indicates that this is a SYN packet.

Combining this with the previously discovered phenomenon of PPS exceeding 12,000, we can now confirm that this is a SYN FLOOD attack sent from IP address 192.168.0.2.
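
If you want a rough quantitative confirmation, you could capture a fixed number of packets and count the SYN packets per source address; a small sketch (the exact counts will depend on how hping3 was invoked):

# Capture 1000 SYN packets on port 80 and count them per source IP
$ tcpdump -i eth0 -n -c 1000 'tcp port 80 and tcp[tcpflags] & tcp-syn != 0' | \
    awk '{print $3}' | cut -d. -f1-4 | sort | uniq -c | sort -rn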

At this point, we have completed a full performance diagnosis and analysis. Starting from the observation of the high utilization of soft interrupts in the system, we determined that the soft interrupt type is network receive interrupt by observing the changes in the /proc/softirqs file. Then, by using sar and tcpdump, we confirmed that this is a SYN FLOOD problem.

The simplest solution to SYN FLOOD is to block the source IP from the switch or hardware firewall, so that SYN FLOOD frames will not be sent to the server.
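
If you cannot touch the switch or hardware firewall, a temporary software-level stopgap on the server itself could be an iptables rule that drops traffic from the attacking address; this is only a sketch for this case's source IP, not a general SYN FLOOD defense:

# Drop TCP packets to port 80 from the attacking source IP
# (temporary mitigation; rules added this way do not survive a reboot)
$ iptables -I INPUT -s 192.168.0.2 -p tcp --dport 80 -j DROP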

As for the principle of SYN FLOOD and more resolution strategies, you don’t need to pay too much attention to them for now. We will learn about them in the network chapter later on.

After finishing the case, don’t forget to clean up. Remember to stop the Nginx service that was initially started as well as the hping3 command.

In the first terminal, run the following command to stop Nginx:

# Stop the Nginx service
$ docker rm -f nginx

Then, in the second terminal, press Ctrl+C to stop hping3.

Summary #

High soft interrupt (softirq) CPU usage is a common performance issue. Although there are many softirq types, most of the performance bottlenecks we encounter in real production environments are network send/receive softirqs, and network receive softirqs in particular.

When you encounter such problems, you can use tools such as sar and tcpdump for further analysis. Don't be intimidated by network performance; later on, I will teach you more analysis methods.

Reflections #

Finally, I would like to discuss with you the soft interrupt issues you encountered. What type of soft interrupt problem did you encounter? Was it a small packet problem like the one in this case? How did you analyze their source and solve them? You can summarize your own thoughts and feelings by referring to today’s case. If you have encountered other problems, feel free to leave a message for us to solve together.

Feel free to discuss with me in the comments section, and also feel free to share this article with your colleagues and friends. Let’s practice in real scenarios and improve through communication.