06 Case Study High CPU Usage but Can’t Find the Application #

Hello, I’m Ni Pengfei.

In the previous section, I explained what CPU usage is and showed you how to use tools such as top, vmstat, and pidstat to find the processes responsible for high CPU usage, and then how to use perf top to pinpoint problems in an application’s internal functions. After that, a reader commented that troubleshooting high CPU usage seems fairly easy.

But can we analyze all cases of high CPU usage this way? I believe the answer is no.

Recall from the earlier content that system CPU usage covers not only processes running in user and kernel mode, but also interrupt handling, waiting for I/O, and kernel threads. Therefore, when you see high system-wide CPU usage, you won’t necessarily be able to find a process whose own CPU usage is correspondingly high.

Today, I will use a case study of an Nginx + PHP web service to analyze this situation with you.

Case Study #

Your Preparation #

Today we continue to explore high system CPU usage, so the preparation for this experiment is basically the same as in the previous lesson, except that the Docker images used in this case are different.

This case is still based on Ubuntu 18.04 and is also applicable to other Linux systems. The environment I used for this case is as follows:

  • Machine configuration: 2 CPUs, 8GB memory

  • Pre-installed tools: docker, sysstat, perf, ab, and so on (for example: apt install docker.io sysstat linux-tools-common apache2-utils)

Earlier, we mentioned that ab (apache bench) is a commonly used HTTP performance testing tool; here it again serves as the client of Nginx. Since configuring Nginx and PHP is fairly involved, I have packaged them into two Docker images, so running the two containers gives you the simulated environment.

Note that this case requires two virtual machines, as shown in the figure below:

As you can see, one of them acts as the web server, where we will simulate the performance problem; the other acts as the client of the web server, generating load against it. Using two virtual machines isolates them from each other and avoids “cross-infection”.

Next, we open two terminals, SSH to the two machines respectively, and install the above tools.

Also note that all the following commands are assumed to run as the root user. If you are logged in to the system as a regular user, run sudo su root to switch to the root user.

Up to this point, the preparation work is completed. Next, we officially enter the operational phase.

Tips: The core logic of the PHP application in the case is relatively simple, and you may be able to identify the problem at a glance. However, the source code in the actual production environment is much more complex. Therefore, I still recommend that you do not check the source code before operating in order to avoid preconceived notions. Instead, treat it as a black box for analysis. This way, you can better understand how to analyze the bottleneck locations in the application from the perspective of system resource utilization.

Operations and Analysis #

First, in the first terminal, execute the following command to run Nginx and PHP applications:

$ docker run --name nginx -p 10000:80 -itd feisky/nginx:sp
$ docker run --name phpfpm -itd --network container:nginx feisky/php-fpm:sp

Then, in the second terminal, use curl to access http://[VM1’s IP]:10000 to confirm that Nginx has been started normally. You should be able to see the response “It works!”.

# 192.168.0.10 is the IP address of the first virtual machine
$ curl http://192.168.0.10:10000/
It works!

Next, let’s test the performance of this Nginx service. Run the following ab command in the second terminal. Note that unlike the previous operation, this time we need to test the performance of Nginx with 100 concurrent requests, with a total of 1000 requests.

# Test the performance of Nginx with 100 concurrent requests and a total of 1000 requests
$ ab -c 100 -n 1000 http://192.168.0.10:10000/
This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, 
...
Requests per second:    87.86 [#/sec] (mean)
Time per request:       1138.229 [ms] (mean)
...

From the output of ab, we can see that the average number of requests per second that Nginx can handle is only slightly above 87. It seems that its performance is a bit poor. So, where exactly is the problem? Let’s observe with top and pidstat.

This time, in the second terminal, we change the number of concurrent requests for testing to 5, and set the request duration to 10 minutes (-t 600). In this way, when you use performance analysis tools in the first terminal, the pressure on Nginx will continue.

Continue to run the ab command in the second terminal:

$ ab -c 5 -t 600 http://192.168.0.10:10000/

Then, in the first terminal, run the top command to observe the CPU usage of the system:

$ top
...
%Cpu(s): 80.8 us, 15.1 sy,  0.0 ni,  2.8 id,  0.0 wa,  0.0 hi,  1.3 si,  0.0 st
...

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 6882 root      20   0    8456   5052   3884 S   2.7  0.1   0:04.78 docker-containe
 6947 systemd+  20   0   33104   3716   2340 S   2.7  0.0   0:04.92 nginx
 7494 daemon    20   0  336696  15012   7332 S   2.0  0.2   0:03.55 php-fpm
 7495 daemon    20   0  336696  15160   7480 S   2.0  0.2   0:03.55 php-fpm
10547 daemon    20   0  336696  16200   8520 S   2.0  0.2   0:03.13 php-fpm
10155 daemon    20   0  336696  16200   8520 S   1.7  0.2   0:03.12 php-fpm
10552 daemon    20   0  336696  16200   8520 S   1.7  0.2   0:03.12 php-fpm
15006 root      20   0 1168608  66264  37536 S   1.0  0.8   9:39.51 dockerd
 4323 root      20   0       0      0      0 I   0.3  0.0   0:00.87 kworker/u4:1
...

By observing the process list output by top, we can see that the process with the highest CPU usage is only 2.7%, which doesn’t seem very high.

However, if we look at the system’s overall CPU usage (%Cpu) line, we will notice that the overall CPU usage is relatively high: the user CPU usage (us) has reached 80%, system CPU usage (sy) is at 15.1%, and the idle CPU (id) is only 2.8%.

Why is the user CPU usage so high? Let’s reanalyze the process list to see if there are any suspicious processes:

  • The docker-containerd process is used to run containers, and 2.7% CPU usage seems normal;
  • Nginx and php-fpm are running web services, so it’s not surprising that they consume some CPU, and 2% CPU usage is not considered high;
  • Looking at the rest of the processes, they only have 0.3% CPU usage, which doesn’t seem like it would cause the user CPU usage to reach 80%.

This is strange. Despite the user CPU usage being at 80%, we have analyzed each process in the list and still cannot find a process with high CPU usage. It seems like top is not helpful in this case. Are there other tools to view the CPU usage of processes? Do you remember our old friend pidstat? It can be used to analyze the CPU usage of processes.

Next, let’s run the pidstat command on the first terminal:

# Output a set of data every 1 second (press Ctrl+C to end)
$ pidstat 1
...
    04:36:24      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
    04:36:25        0      6882    1.00    3.00    0.00    0.00    4.00     0  docker-containe
    04:36:25      101      6947    1.00    2.00    0.00    1.00    3.00     1  nginx
    04:36:25        1     14834    1.00    1.00    0.00    1.00    2.00     0  php-fpm
    04:36:25        1     14835    1.00    1.00    0.00    1.00    2.00     0  php-fpm
    04:36:25        1     14845    0.00    2.00    0.00    2.00    2.00     1  php-fpm
    04:36:25        1     14855    0.00    1.00    0.00    1.00    1.00     1  php-fpm
    04:36:25        1     14857    1.00    2.00    0.00    1.00    3.00     0  php-fpm
    04:36:25        0     15006    0.00    1.00    0.00    0.00    1.00     0  dockerd
    04:36:25        0     15801    0.00    1.00    0.00    0.00    1.00     1  pidstat
    04:36:25        1     17084    1.00    0.00    0.00    2.00    1.00     0  stress
    04:36:25        0     31116    0.00    1.00    0.00    0.00    1.00     0  atopacctd
    ...

After observing for a while, you may have noticed that the CPU usage of all the processes is not high. The highest is only 4% for Docker and 3% for Nginx. Even if you add up the CPU usage of all the processes, it is only 21%, which is far from 80%!
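
If you want to verify that “only 21%” figure instead of adding the numbers up by hand, a rough sanity check is to sum the %CPU column of one pidstat sample (an illustrative one-liner, not part of the original steps; the column position is looked up from the header because it shifts between sysstat versions):

```
# Sum the %CPU column of the "Average:" lines from a single 1-second sample
$ pidstat 1 1 | awk '
    /%CPU/      { for (i = 1; i <= NF; i++) if ($i == "%CPU") col = i }
    /^Average:/ && $NF != "Command" && col { sum += $col }
    END         { printf "total %%CPU of listed processes: %.1f\n", sum }'
```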

When I first encountered this kind of problem, I was completely clueless: even though the user CPU usage was already as high as 80%, I could not find the process responsible for it. Think about whether you have run into this situation yourself, and whether you could push the analysis any further.

Later, I realized that a situation like this usually means some key information was overlooked in the earlier analysis. You can pause for a moment here, go back, and recheck the previous steps yourself. Or, let's analyze the output of the top command together again and see whether we can find anything new.

Now, let's go back to the first terminal and run the top command again and observe for a while:

$ top
top - 04:58:24 up 14 days, 15:47,  1 user,  load average: 3.39, 3.82, 2.74
Tasks: 149 total,   6 running,  93 sleeping,   0 stopped,   0 zombie
%Cpu(s): 77.7 us, 19.3 sy,  0.0 ni,  2.0 id,  0.0 wa,  0.0 hi,  1.0 si,  0.0 st
KiB Mem :  8169348 total,  2543916 free,   457976 used,  5167456 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  7363908 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 6947 systemd+  20   0   33104   3764   2340 S   4.0  0.0   0:32.69 nginx
 6882 root      20   0   12108   8360   3884 S   2.0  0.1   0:31.40 docker-containe
15465 daemon    20   0  336696  15256   7576 S   2.0  0.2   0:00.62 php-fpm
15466 daemon    20   0  336696  15196   7516 S   2.0  0.2   0:00.62 php-fpm
15489 daemon    20   0  336696  16200   8520 S   2.0  0.2   0:00.62 php-fpm
 6948 systemd+  20   0   33104   3764   2340 S   1.0  0.0   0:00.95 nginx
15006 root      20   0 1168608  65632  37536 S   1.0  0.8   9:51.09 dockerd
15476 daemon    20   0  336696  16200   8520 S   1.0  0.2   0:00.61 php-fpm
15477 daemon    20   0  336696  16200   8520 S   1.0  0.2   0:00.61 php-fpm
24340 daemon    20   0    8184   1616    536 R   1.0  0.0   0:00.01 stress
24342 daemon    20   0    8196   1580    492 R   1.0  0.0   0:00.01 stress
24344 daemon    20   0    8188   1056    492 R   1.0  0.0   0:00.01 stress
24347 daemon    20   0    8184   1356    540 R   1.0  0.0   0:00.01 stress
...

This time, let's read the top output line by line. Hmm, the Tasks line looks a bit odd: there are 6 processes in the run queue in the Running state (6 running). Isn't that a few too many?

Recalling the parameters of the ab test, the number of concurrent requests is 5. Looking at the process list, there are also 5 instances of php-fpm, plus Nginx, so it doesn't seem strange to have 6 processes running at the same time. But is that really the case?

Looking closely at the process list, this time let's focus on the processes in the Running (R) state. Have you noticed that Nginx and all the php-fpm processes are in the Sleep (S) state, while the processes actually in the Running (R) state are several stress processes? These stress processes look suspicious and deserve further analysis.
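
As a quick cross-check (an illustrative command, not part of the original steps), you can also list only the tasks whose state starts with R directly with ps:

```
# List only processes whose state field begins with R (running or runnable)
$ ps -eo stat,pid,comm | awk '$1 ~ /^R/'
```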

Let's use pidstat to analyze these processes and use the -p option to specify the PID of the processes. First, from the top results above, find the PIDs of these processes. For example, let's randomly pick one, 24344, and use the pidstat command to see its CPU usage:

$ pidstat -p 24344

16:14:55      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command

Strange, there is no output at all. Could it be a problem with the pidstat command? As I mentioned before, **before suspecting that a performance tool is having a problem, it's best to cross-confirm with other tools**. So what tool should we use? ps should be the simplest and easiest to use. Let's run the following command in the terminal to see the status of the process with PID 24344:

# Search all processes for the one with PID 24344
$ ps aux | grep 24344
root      9628  0.0  0.0  14856  1096 pts/0    S+   16:15   0:00 grep --color=auto 24344

Still no output. Now we finally see the problem: the process no longer exists, which is why pidstat shows nothing. And since the process is gone, the performance problem should be gone with it, right? Let's confirm with the top command again:

$ top
...
%Cpu(s): 80.9 us, 14.9 sy,  0.0 ni,  2.8 id,  0.0 wa,  0.0 hi,  1.3 si,  0.0 st
...

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 6882 root      20   0   12108   8360   3884 S   2.7  0.1   0:45.63 docker-containe
 6947 systemd+  20   0   33104   3764   2340 R   2.7  0.0   0:47.79 nginx
 3865 daemon    20   0  336696  15056   7376 S   2.0  0.2   0:00.15 php-fpm
 6779 daemon    20   0    8184   1112    556 R   0.3  0.0   0:00.01 stress

Unfortunately, the result looks much the same as before: user CPU usage is still as high as 80.9%, system CPU usage is close to 15%, only 2.8% of the CPU is idle, and processes such as Nginx and stress are still running.

However, we just confirmed that the stress process no longer exists, so why does it show up as running again? Taking a closer look at the top output, it turns out that the PID of the stress process has changed: the original PID 24344 is gone, and it is now 6779.
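
You can watch this churn directly (an illustrative loop, not part of the original steps) by sampling the stress processes a few times in a row:

```
# Print the stress PIDs once per second; the numbers change from sample to sample
$ for i in $(seq 5); do pgrep -a stress; echo ---; sleep 1; done
```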

What does it mean when a process's PID keeps changing? In my experience, there are two possibilities. The first is that the processes keep crashing and restarting, for example because of segmentation faults or configuration errors; after a process exits, it is automatically restarted by a supervisor or monitoring system.

The second is that these are short-lived processes, that is, commands that the application invokes internally through exec(). They run only briefly before exiting, which makes them hard to catch with tools like top that sample at long intervals (in the case above, we only happened to spot them).
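
To get a feel for the second situation, here is a minimal, purely hypothetical illustration (not the application from this case): the parent keeps the CPU busy, yet the children it spawns are far too short-lived for top to list them individually.

```
# Each iteration fork+execs a new /bin/true, which exits almost immediately;
# top mostly sees the parent shell, while the short-lived children come and go
$ while true; do /bin/true; done
```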

As for stress, as mentioned before, it is a commonly used stress testing tool. Its PID keeps changing, and it seems to be a short-lived process called by other processes. To continue the analysis, we need to find their parent processes.

How do we find the parent process of a process? That's right, we can use pstree to display all process relationships in a tree format:

$ pstree | grep stress
  |-docker-containe-+-php-fpm-+-php-fpm---sh---stress
  |                 |         |-3*[php-fpm---sh---stress---stress]

From here, we can see that stress is a child process called by php-fpm, and there is not just one process but three (in this case). After finding the parent process, we can now dive into the internals of the app for analysis.
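
If you also want the PIDs along the chain, so you can jump straight to the parent php-fpm process, pstree can print them as well (a small variant of the command above; the output will of course differ on your machine):

```
# Show the same tree with PIDs attached to every process name
$ pstree -p | grep stress
```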

First of all, of course, we should take a look at its source code. Run the following command to copy the source code of the example app to the app directory, and then use grep to check if there is any code that calls the stress command:

# Copy the source code of the app to the local machine
$ docker cp phpfpm:/app .

# Use grep to check whether any code calls the stress command
$ grep stress -r app
app/index.php:// fake I/O with stress (via write()/unlink()).
app/index.php:$result = exec("/usr/local/bin/stress -t 1 -d 1 2>&1", $output, $status);

Found it: the stress command is called directly in the app/index.php file.

Let's take a look at the source code of [app/index.php](https://github.com/feiskyer/linux-perf-examples/blob/master/nginx-short-process/app/index.php):

```php
$ cat app/index.php
<?php
// fake I/O with stress (via write()/unlink()).
$result = exec("/usr/local/bin/stress -t 1 -d 1 2>&1", $output, $status);
if (isset($_GET["verbose"]) && $_GET["verbose"]==1 && $status != 0) {
  echo "Server internal error: ";
  print_r($output);
} else {
  echo "It works!";
}
?>
```

As we can see, the source code calls the stress command on every request to simulate I/O pressure. Judging from the comment, stress exercises I/O via write() and unlink(), so this is probably the root cause of the increased system CPU usage.
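
For reference, this is what those options ask stress to do, based on my reading of the stress(1) documentation (not something spelled out in the original text):

```
# -d 1 (--hdd 1):     spawn 1 worker spinning on write()/unlink(), i.e. disk I/O
# -t 1 (--timeout 1): stop the run after 1 second
# 2>&1:               merge stderr into stdout so PHP's exec() also captures errors
$ /usr/local/bin/stress -t 1 -d 1 2>&1
```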

However, stress simulates I/O pressure, while what we saw earlier in top was an increase in user and system CPU, with no significant increase in iowait. So what is going on? Is stress really the cause of the higher CPU usage?

Let’s keep digging. The code shows that adding the verbose=1 parameter to the request lets us see the output of stress. Try running the following command in the second terminal:

$ curl http://192.168.0.10:10000?verbose=1
Server internal error: Array
(
    [0] => stress: info: [19607] dispatching hogs: 0 cpu, 0 io, 0 vm, 1 hdd
    [1] => stress: FAIL: [19608] (563) mkstemp failed: Permission denied
    [2] => stress: FAIL: [19607] (394) <-- worker 19608 returned error 1
    [3] => stress: WARN: [19607] (396) now reaping child worker processes
    [4] => stress: FAIL: [19607] (400) kill error: No such process
    [5] => stress: FAIL: [19607] (451) failed run completed in 0s
)

Look at the error messages “mkstemp failed: Permission denied” and “failed run completed in 0s”: the stress command failed and exited because of a permission problem. So we have found a bug in how the PHP application invokes the external stress command: stress lacks permission to create temporary files.
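
If you want to see that failure without going through PHP, you can try running the same command as the same user inside the container (a hedged check: it assumes the phpfpm container from this case, and that the directory docker exec drops you into is, like php-fpm’s working directory, not writable by the daemon user):

```
# stress -d creates temporary files with mkstemp() in the current directory,
# so it fails with "Permission denied" when that directory is not writable
# by daemon; if the directory happens to be writable, it will succeed instead
$ docker exec -u daemon phpfpm /usr/local/bin/stress -t 1 -d 1
```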

From here, we can speculate that due to the permission error, a large number of stress processes failed to initialize when starting, leading to an increase in user CPU usage.

Having identified the source of the problem, the next step would be to optimize it, right? Not so fast! Since it’s just a speculation, we need to confirm whether our guess is correct and whether there are indeed a large number of stress processes. What tool or metric should we use?

We have already used tools like top, pidstat, pstree, etc., but we didn’t find a large number of stress processes. So, are there any other tools we can use?

Do you remember perf mentioned in the previous article? It can be used to analyze CPU performance events, which is perfect for this case. Run the command perf record -g in the first terminal, and wait for a while (e.g., 15 seconds), then press Ctrl+C to exit. Then, run perf report to view the report:

# Record performance events, wait for about 15 seconds, then press Ctrl+C to exit
$ perf record -g

# View the report
$ perf report
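
By the way, if you would rather not press Ctrl+C, you can bound the recording time instead (an equivalent variant, not part of the original steps):

```
# Record call graphs system-wide (-a) for 15 seconds, then view the report
$ perf record -a -g -- sleep 15
$ perf report
```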

This way, you will see the following performance report:

(Figure: perf report output)

As you can see, stress occupies 77% of all CPU clock events, and the function random() in the stress call stack has the highest proportion. It seems that stress is indeed the main cause of the increased CPU usage. After that, the optimization is straightforward: fix the permission issue and reduce or remove the calls to stress to reduce the system’s CPU pressure.

Of course, in real production environments, the problems are usually more complex than this case. After finding the command line that triggers the bottleneck, you may discover that the called external command is part of the application’s core logic and cannot be easily reduced or removed.

In this case, you’ll need to continue troubleshooting why the called command is causing increased CPU usage or I/O. I’ll dive into detailed analysis of these complex scenarios in the subsequent comprehensive case studies.

Finally, when you are done with the case, don’t forget to clean up the environment by running the following Docker command to stop and remove the two containers used in the case:

$ docker rm -f nginx phpfpm

execsnoop #

In this case, we used tools like top, pidstat, and pstree to analyze the issue of high CPU usage in the system. We found that the increase in CPU usage was caused by the short-lived process stress. However, the entire analysis process was quite complex. Is there a better way to monitor such problems?

execsnoop is a tool specifically designed for short-lived processes. It monitors the exec() behavior of processes in real-time using ftrace and outputs basic information about these processes, including the process PID, parent process PID, command-line arguments, and execution result.
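
execsnoop usually is not pre-installed. There are two common implementations you can use, and which one you pick is up to you (an assumption about your setup, not from the original text): the ftrace-based script in Brendan Gregg’s perf-tools repository, or the eBPF-based tool from the bcc project (packaged on Ubuntu as bpfcc-tools).

```
# Option 1: the ftrace-based script (no build step needed)
$ git clone --depth 1 https://github.com/brendangregg/perf-tools
$ ./perf-tools/execsnoop

# Option 2: the eBPF-based tool from bcc (installed as execsnoop-bpfcc on Ubuntu)
$ apt install bpfcc-tools
$ execsnoop-bpfcc
```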

For example, by using execsnoop to monitor the above case, you can directly obtain the parent process PID of the stress process and its command-line arguments, and you can also discover that a large number of stress processes are constantly being launched:

# Press Ctrl+C to stop
$ execsnoop
PCOMM            PID    PPID   RET ARGS
sh               30394  30393    0
stress           30396  30394    0 /usr/local/bin/stress -t 1 -d 1
sh               30398  30393    0
stress           30399  30398    0 /usr/local/bin/stress -t 1 -d 1
sh               30402  30400    0
stress           30403  30402    0 /usr/local/bin/stress -t 1 -d 1
sh               30405  30393    0
stress           30407  30405    0 /usr/local/bin/stress -t 1 -d 1
...

The ftrace used by execsnoop is a commonly used dynamic tracing technique, generally used to analyze the runtime behavior of the Linux kernel. I will give a detailed introduction to it, and to how to use it, in the upcoming lessons.
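
If you want a first peek at the ftrace interface before those lessons (a quick check that assumes debugfs is mounted at its usual location):

```
# List the tracers provided by your kernel through the ftrace interface
$ cat /sys/kernel/debug/tracing/available_tracers
```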

Summary #

When you run into a CPU usage problem that the usual analysis cannot explain, first consider that short-lived applications may be the cause, such as in the following two situations.

  • First, the application directly calls other binary programs, which usually have a short runtime and are not easily discovered through tools like top.

  • Second, the application itself crashes and restarts continuously, and the resource initialization during the startup process may consume a considerable amount of CPU.

For these types of processes, we can use pstree or execsnoop to find their parent process and then start troubleshooting from the parent process to identify the root cause of the issue.

Reflection #

Finally, I would like to invite you to have a chat about the CPU performance issues you have encountered. Do you have any memorable experiences that you’d like to share with me? Or, in today’s hands-on case, what problems did you encounter and how did you solve them? You can summarize your thoughts based on my discussion.

Feel free to discuss with me in the comments section, and feel free to share this article with your colleagues and friends. Let’s practice in real-world scenarios and progress through communication.