# 08 Case Study: Dealing with Uninterruptible Sleep and Zombie Processes in the System (Part Two)
Hello, I’m Ni Pengfei.
In the previous section, I explained the meaning of Linux process states, as well as the reasons for the occurrence of uninterruptible processes and zombie processes. Let’s have a quick review.
The process states can be viewed with the ps or top command; they include running, idle, uninterruptible sleep, interruptible sleep, zombie, and stopped. Among them, we focused on the uninterruptible state and zombie processes:
- The uninterruptible state generally indicates that a process is interacting with hardware. To protect the consistency of the process's data with the hardware, the system does not allow other processes or interrupts to interrupt it.
- A zombie process is one that has already exited, but whose parent process has not yet reclaimed the resources it occupied.
In the last section, I used a case to show processes in these two states. By analyzing the output of the top command, we discovered two problems:
- First, iowait is too high, raising the system's average load until it reaches the number of CPUs in the system.
- Second, the number of zombie processes keeps increasing, indicating that the application fails to properly clean up its child processes' resources.
I believe you have thought these two problems over carefully. So, what is the truth? Let's continue the analysis along these two threads and find the root cause.
First, please open a terminal and log in to the machine from last time. Then, execute the following command to rerun this case:
```
# Remove the previously started case
$ docker rm -f app
# Rerun the case
$ docker run --privileged --name=app -itd feisky/app:iowait
```
## Analysis of iowait
Let’s first take a look at the issue of iowait increasing.
I believe that when it comes to an increase in iowait, the first thing you would want to do is to check the I/O situation of the system. I usually take this approach as well. So what tool can be used to check the I/O situation of the system?
Here, I recommend using dstat, which is the tool we were required to install in the previous lesson. The advantage of using dstat is that it allows you to simultaneously view the usage of both CPU and I/O resources, making it easier to compare and analyze.
So, let’s run the dstat command in the terminal to observe the usage of CPU and I/O:
```
# Output 10 sets of data every 1 second
$ dstat 1 10
You did not select any stats, using -cdngy by default.
--total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw
  0   0  96   4   0|1219k  408k|   0     0 |   0     0 |  42   885
  0   0   2  98   0|  34M    0 | 198B  790B|   0     0 |  42   138
  0   0   0 100   0|  34M    0 |  66B  342B|   0     0 |  42   135
  0   0  84  16   0|5633k    0 |  66B  342B|   0     0 |  52   177
  0   3  39  58   0|  22M    0 |  66B  342B|   0     0 |  43   144
  0   0   0 100   0|  34M    0 | 200B  450B|   0     0 |  46   147
  0   0   2  98   0|  34M    0 |  66B  342B|   0     0 |  45   134
  0   0   0 100   0|  34M    0 |  66B  342B|   0     0 |  39   131
  0   0  83  17   0|5633k    0 |  66B  342B|   0     0 |  46   168
  0   3  39  59   0|  22M    0 |  66B  342B|   0     0 |  37   134
```
From the output of dstat, we can see that whenever iowait increases (wai), the read requests (read) on the disk are also high. This indicates that the increase in iowait is related to the read requests on the disk, most likely caused by disk reads.
Now, which process is actually reading the disk? If you remember, in the previous lesson, we saw some processes in the uninterruptible state in top, which I find suspicious. Let’s try to analyze them.
Let’s continue running the top command in the same terminal as earlier to observe the processes in the D state:
```
# Observe for a while and press Ctrl+C to exit
$ top
...
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 4340 root      20   0   44676   4048   3432 R   0.3  0.0   0:00.05 top
 4345 root      20   0   37280  33624    860 D   0.3  0.0   0:00.01 app
 4344 root      20   0   37280  33624    860 D   0.3  0.4   0:00.01 app
...
```
From the output of top, we found the PIDs of the processes in the D state: there are two of them, 4344 and 4345.
Next, let's check the disk I/O of these processes. Don't forget which tool to use: to view the resource usage of a specific process, we can turn to our old friend pidstat. This time, remember to add the -d option to output I/O statistics.

Taking 4344 as an example, run the following pidstat command in the terminal, specifying the process ID with -p 4344:
```
# -d shows I/O statistics, -p specifies the process ID; output 3 sets of data every 1 second
$ pidstat -d -p 4344 1 3
06:38:50      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
06:38:51        0      4344      0.00      0.00      0.00       0  app
06:38:52        0      4344      0.00      0.00      0.00       0  app
06:38:53        0      4344      0.00      0.00      0.00       0  app
```
In this output, kB_rd/s is the data read per second (in KB), kB_wr/s is the data written per second (in KB), and iodelay is the I/O delay (in clock ticks). Since all of these values are 0, there is no read or write activity at the moment, which means the issue is not caused by process 4344.
However, if you analyze the process 4345 in the same way, you will find that it also has no disk I/O.
So how do we find out which process is performing disk I/O? Let's continue using pidstat, but this time remove the process ID and observe the I/O usage of all processes.

Run the following pidstat command in the terminal:
```
# Output multiple sets of data every 1 second (here we use 20 sets)
$ pidstat -d 1 20
...
06:48:46      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
06:48:47        0      4615      0.00      0.00      0.00       1  kworker/u4:1
06:48:47        0      6080  32768.00      0.00      0.00     170  app
06:48:47        0      6081  32768.00      0.00      0.00     184  app

06:48:47      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
06:48:48        0      6080      0.00      0.00      0.00     110  app

06:48:48      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
06:48:49        0      6081      0.00      0.00      0.00     191  app

06:48:49      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command

06:48:50      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
06:48:51        0      6082  32768.00      0.00      0.00       0  app
06:48:51        0      6083  32768.00      0.00      0.00       0  app

06:48:51      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
06:48:52        0      6082  32768.00      0.00      0.00     184  app
06:48:52        0      6083  32768.00      0.00      0.00     175  app

06:48:52      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
06:48:53        0      6083      0.00      0.00      0.00     105  app
...
```
After observing for a while, it can be seen that the app process is indeed performing disk reads, and it reads 32 MB of data per second. So it appears that the app is the issue. However, what kind of I/O operation is the app process performing?
Here, let's review the difference between user mode and kernel mode for processes. In order to access the disk, a process must use a system call. Therefore, our focus now is to find out the system calls made by the app process.
strace is the most commonly used tool for tracing process system calls. So, we can take the PID number of the process from the output of pidstat, for example 6082, and then run the strace command in the terminal, specifying the PID number with the -p parameter:
```
$ strace -p 6082
strace: attach: ptrace(PTRACE_SEIZE, 6082): Operation not permitted
```
Here, a strange error occurred: the strace command failed, and the error message says the operation is not permitted. In theory, we are running everything as the root user, so why would there be a permission problem? Take a moment to think about how you would handle this situation.
**In general, when encountering such a problem, I would first check if the process is in a normal state**. For example, continue running the ps command in the terminal and use grep to find the 6082 process we just checked:
```
$ ps aux | grep 6082
root      6082  0.0  0.0      0     0 pts/0    Z+   13:43   0:00 [app] <defunct>
```
Sure enough, process 6082 has become a Zombie process with a Z status. Zombie processes have already exited, so it is not possible to further analyze their system calls. We will discuss how to handle Zombie processes in a moment. For now, let's continue analyzing the issue of iowait.
At this point, you should have noticed that the issue with system iowait continues, but tools like top and pidstat are no longer able to provide more information. This is when we should turn to event-based dynamic tracing tools.
You can use perf top to see if there are any new discoveries. Alternatively, like me, you can run perf record in the terminal for a while (e.g. 15 seconds), then press Ctrl+C to exit, and then run perf report to view the report:
```
# Record for about 15 seconds, then press Ctrl+C
$ perf record -g
$ perf report
```
Next, find the app process we are interested in, press Enter to expand the call stack, and you will get the following call graph:
![](../images/3cda3f93bb164cbb9a09706ed4d8765b.jpg)
In this graph, the swapper is a scheduling process in the kernel, which you can ignore for now.
Let’s look at the other information. It can be observed that the app is indeed reading data through the system call sys_read(). From the functions new_sync_read and blkdev_direct_IO, it can be seen that the process is performing direct reads on the disk, bypassing the system cache. Each read request is directly read from the disk, which explains the increase in iowait that we observed.
It seems that the culprit is the app itself, which is performing direct disk I/O!
The next question is easy to solve. We should analyze the code and find out where the direct read requests are being made. By examining the source code file app.c, you will indeed find that the app is opening the disk with the O_DIRECT option, which bypasses the system cache and directly reads and writes to the disk.
```c
open(disk, O_RDONLY|O_DIRECT|O_LARGEFILE, 0755)
```
Direct disk reads and writes can be a good fit for I/O-intensive applications such as database systems, because they let the application control disk I/O itself. In most other cases, however, it is better to let the system cache optimize disk I/O. In other words, removing the O_DIRECT option should solve the problem.
app-fix1.c is the modified file, and I have also packaged it into an image file. You can run the following command to start it:
```
# First, remove the original application
$ docker rm -f app
# Run the new application
$ docker run --privileged --name=app -itd feisky/app:iowait-fix1
```
Finally, check with top again:
```
$ top
top - 14:59:32 up 19 min,  1 user,  load average: 0.15, 0.07, 0.05
Tasks: 137 total,   1 running,  72 sleeping,   0 stopped,  12 zombie
%Cpu0  :  0.0 us,  1.7 sy,  0.0 ni, 98.0 id,  0.3 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  1.3 sy,  0.0 ni, 98.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
...
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 3084 root      20   0       0      0      0 Z   1.3  0.0   0:00.04 app
 3085 root      20   0       0      0      0 Z   1.3  0.0   0:00.04 app
    1 root      20   0  159848   9120   6724 S   0.0  0.1   0:09.03 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kthreadd
    3 root      20   0       0      0      0 I   0.0  0.0   0:00.40 kworker/0:0
...
```
You will find that iowait has dropped significantly to only 0.3%, indicating that the modification we made earlier has successfully fixed the high iowait issue. Great job! However, don’t forget that the zombie processes are still waiting for you. If you carefully observe the number of zombie processes, you will be disappointed to find that they are still increasing.
## Zombie Processes
Next, let's address the issue of zombie processes. Since zombie processes appear when a parent process does not clean up its children's resources, to fix them at the root we need to find the parent process and solve the problem there.
As mentioned earlier, the simplest way to find a process's parent is to run the pstree command:

```
# -a  show command-line arguments
# -p  show PIDs
# -s  show the parent processes of the specified process
$ pstree -aps 3084
systemd,1
  └─dockerd,15006 -H fd://
      └─docker-containe,15024 --config /var/run/docker/containerd/containerd.toml
          └─docker-containe,3991 -namespace moby -workdir...
              └─app,4009
                  └─(app,3084)
```
After running this command, you will find that the parent of process 3084 is 4009, which is the app application.
So, let's look at the app application's code and check whether it handles child process termination correctly: whether it calls wait() or waitpid(), or registers a handler for the SIGCHLD signal.
Now take a look at app-fix1.c, the source file after the high-iowait fix, and find the section where child processes are created and cleaned up:
```c
int status = 0;

for (;;) {
    for (int i = 0; i < 2; i++) {
        if (fork() == 0) {
            sub_process();
        }
    }
    sleep(5);
}

while (wait(&status) > 0)
    ;
```
Loop statements are prone to subtle errors. Can you spot the problem here? Although this code seemingly calls wait() to wait for the child processes to terminate, it mistakenly places wait() after the infinite for (;;) loop. Since that loop never exits, wait() is never actually executed. Moving it inside the loop fixes the problem.
I have placed the modified code in app-fix2.c, and it has also been packaged as a Docker image. You can start it by running the following command:
```
# First, stop the app that generates zombie processes
$ docker rm -f app
# Then, start the new app
$ docker run --privileged --name=app -itd feisky/app:iowait-fix2
```
After starting it, let's double-check using top:

```
$ top
top - 15:00:44 up 20 min,  1 user,  load average: 0.05, 0.05, 0.04
Tasks: 125 total,   1 running,  72 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.0 us,  1.7 sy,  0.0 ni, 98.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  1.3 sy,  0.0 ni, 98.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
...
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 3198 root      20   0    4376    840    780 S   0.3  0.0   0:00.01 app
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kthreadd
    3 root      20   0       0      0      0 I   0.0  0.0   0:00.41 kworker/0:0
...
```
Alright, the zombie processes (in the Z state) are gone, and iowait is now 0. Finally, all the issues have been resolved.
## Summary
In this case, I used a multi-process example to analyze the situation where the CPU usage of the system waiting for I/O (iowait%) increases.
Although this case resulted in increased iowait due to disk I/O, **high iowait does not necessarily indicate an I/O performance bottleneck. When only I/O-type processes are running in the system, iowait can also be high, but in reality, the disk read/write is far from reaching a performance bottleneck**.
Therefore, when encountering an increase in iowait, it is necessary to first use tools such as dstat and pidstat to confirm whether it is a disk I/O problem, and then identify the processes causing the I/O.
Processes waiting for I/O are generally in an uninterruptible state, so processes in the D state (i.e., uninterruptible state) found using the ps command are usually suspicious. However, in this case, after the I/O operation, the process becomes a zombie process, so it is not possible to directly analyze the system calls of this process using strace.
In this situation, we used the perf tool to analyze the system's CPU clock events and ultimately discovered that the problem was caused by direct I/O. At this point, it is easy to check the corresponding position in the source code for any issues.
Regarding zombie processes, they are relatively easy to troubleshoot. After using pstree to identify the parent process, you can examine the parent process's code and check for wait() / waitpid() calls or the registration of SIGCHLD signal handling functions.
## Reflection
Finally, I would like to invite you to discuss the issues of uninterruptible state processes and zombie processes that you have encountered. How do you analyze their root causes? And how do you solve them? In today's case study, have you made any new discoveries? You can summarize your thoughts based on my narrative.
Feel free to discuss with me in the comments, and feel free to share this article with your colleagues and friends. Let's practice in real-life scenarios and make progress through communication.
![](../images/ba0c334d95d448ef8f80af8385938f6d.jpg)