08 Case Study Dealing with Uninterruptible Sleep and Zombie Processes in the System Part Two #

Hello, I’m Ni Pengfei.

In the previous section, I explained the meaning of Linux process states, as well as the reasons for the occurrence of uninterruptible processes and zombie processes. Let’s have a quick review.

The process states can be viewed using the ps or top command; they include running, idle, uninterruptible sleep, interruptible sleep, zombie, stopped, and so on. Among them, we focused on the uninterruptible state and zombie processes:

  • The uninterruptible state generally indicates that a process is interacting with hardware. To keep the process’s data consistent with the hardware, the system does not allow other processes or interrupts to break in during this time.

  • A zombie process indicates that the process has exited, but its parent process has not reclaimed the resources occupied by the process.

In the last section, I used a case to show processes in these two states. By analyzing the output of the top command, we discovered two problems:

  • First, iowait is too high, causing the system’s average load to rise until it reaches the number of CPUs in the system.

  • Second, the number of zombie processes keeps increasing, indicating that the application program fails to properly clean up the resources of the child processes.

I believe you have thought these two problems over carefully. So, what is really going on? Let’s continue the analysis along these two lines and find the root causes.

First, please open a terminal and log in to the machine from last time. Then, execute the following command to rerun this case:

# Remove the previously started case
$ docker rm -f app
# Rerun the case
$ docker run --privileged --name=app -itd feisky/app:iowait

Analysis of iowait #

Let’s first take a look at the issue of iowait increasing.

I believe that when iowait rises, the first thing you want to do is check the system’s I/O activity. That is usually my approach as well. So which tool can we use to check the system’s I/O?

Here, I recommend using dstat, which is the tool we were required to install in the previous lesson. The advantage of using dstat is that it allows you to simultaneously view the usage of both CPU and I/O resources, making it easier to compare and analyze.

So, let’s run the dstat command in the terminal to observe the usage of CPU and I/O:

# Output 10 sets of data every 1 second
$ dstat 1 10
You did not select any stats, using -cdngy by default.
--total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai stl| read  writ| recv  send|  in   out | int   csw
  0   0  96   4   0|1219k  408k|   0     0 |   0     0 |  42   885
  0   0   2  98   0|  34M    0 | 198B  790B|   0     0 |  42   138
  0   0   0 100   0|  34M    0 |  66B  342B|   0     0 |  42   135
  0   0  84  16   0|5633k    0 |  66B  342B|   0     0 |  52   177
  0   3  39  58   0|  22M    0 |  66B  342B|   0     0 |  43   144
  0   0   0 100   0|  34M    0 | 200B  450B|   0     0 |  46   147
  0   0   2  98   0|  34M    0 |  66B  342B|   0     0 |  45   134
  0   0   0 100   0|  34M    0 |  66B  342B|   0     0 |  39   131
  0   0  83  17   0|5633k    0 |  66B  342B|   0     0 |  46   168
  0   3  39  59   0|  22M    0 |  66B  342B|   0     0 |  37   134

From the output of dstat, we can see that whenever iowait rises (wai), disk reads (read) are also large. This suggests that the increase in iowait is related to disk reads and is most likely caused by them.

Now, which process is actually reading the disk? If you remember, in the previous lesson, we saw some processes in the uninterruptible state in top, which I find suspicious. Let’s try to analyze them.

Let’s continue running the top command in the same terminal as earlier to observe the processes in the D state:

# Observe for a while and press Ctrl+C to exit
$ top
...
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 4340 root      20   0   44676   4048   3432 R   0.3  0.0   0:00.05 top
 4345 root      20   0   37280  33624    860 D   0.3  0.0   0:00.01 app
 4344 root      20   0   37280  33624    860 D   0.3  0.4   0:00.01 app
...

From the output of top, we can find the PIDs of the processes in the D state: there are two of them, 4344 and 4345.

Next, let’s check the disk I/O of these processes. You haven’t forgotten which tool to use, have you? Generally, to view the resource usage of a specific process, we can use our old friend pidstat, but this time remember to add the -d option to output I/O statistics.

For example, let’s take 4344 as an example, we run the following pidstat command in the terminal, specifying the process ID with -p 4344:

# -d displays I/O statistics, -p specifies the process ID, output 3 sets of data every 1 second
$ pidstat -d -p 4344 1 3
06:38:50      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
06:38:51        0      4344      0.00      0.00      0.00       0  app
06:38:52        0      4344      0.00      0.00      0.00       0  app
06:38:53        0      4344      0.00      0.00      0.00       0  app

In this output, kB_rd is the amount of data read per second (in KB), kB_wr is the amount of data written per second (in KB), and iodelay is the block I/O delay (in clock ticks). They are all 0 here, which means there is no read or write at the moment, so the problem is not caused by process 4344.

However, if you analyze the process 4345 in the same way, you will find that it also has no disk I/O.

So how do we know which process is performing disk I/O? Let’s continue using pidstat, but this time let’s remove the process ID and observe the I/O usage of all processes.

Run the following pidstat command in the terminal:

# Output multiple sets of data every 1 second (here we use 20 sets)
$ pidstat -d 1 20
...
06:48:46      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
06:48:47        0      4615      0.00      0.00      0.00       1  kworker/u4:1
06:48:47        0      6080  32768.00      0.00      0.00     170  app
06:48:47        0      6081  32768.00      0.00      0.00     184  app

06:48:47      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
06:48:48        0      6080      0.00      0.00      0.00     110  app

06:48:48      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
06:48:49        0      6081      0.00      0.00      0.00     191  app

06:48:49      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command

06:48:50      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
06:48:51        0      6082  32768.00      0.00      0.00       0  app
06:48:51        0      6083  32768.00      0.00      0.00       0  app

06:48:51      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
06:48:52        0      6082  32768.00      0.00      0.00     184  app
06:48:52        0      6083  32768.00      0.00      0.00     175  app

06:48:52      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
06:48:53        0      6083      0.00      0.00      0.00     105  app
...

After observing for a while, it can be seen that the app process is indeed performing disk reads, and it reads 32 MB of data per second. So it appears that the app is the issue. However, what kind of I/O operation is the app process performing?

Here we need to recall the difference between user mode and kernel mode: to access the disk, a process must go through a system call. So our next focus is to find out which system calls the app process is making.
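
As a quick aside (this is a hedged illustration, not code from the case): even an ordinary buffered read in C ends up as open() and read() system calls, which is exactly the kind of activity a syscall tracer can observe.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    /* "testfile" is a hypothetical path used only for this illustration */
    int fd = open("testfile", O_RDONLY);
    if (fd < 0) {
        perror("open");                       /* open() is itself a system call */
        return 1;
    }
    ssize_t n = read(fd, buf, sizeof(buf));   /* appears as read(...) under strace */
    printf("read %zd bytes\n", n);
    close(fd);
    return 0;
}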

strace is the most commonly used tool for tracing process system calls. So, we can take the PID number of the process from the output of pidstat, for example 6082, and then run the strace command in the terminal, specifying the PID number with the -p parameter:

$ strace -p 6082
strace: attach: ptrace(PTRACE_SEIZE, 6082): Operation not permitted

Here something strange happens: the strace command fails, and the error says the operation is not permitted. In theory, we are running everything as the root user, so why would there be a permission issue? Take a moment to think about how you would handle this situation.

**In general, when encountering such a problem, I would first check if the process is in a normal state**. For example, continue running the ps command in the terminal and use grep to find the 6082 process we just checked:

$ ps aux | grep 6082
root      6082  0.0  0.0      0     0 pts/0    Z+   13:43   0:00 [app] <defunct>

Sure enough, process 6082 has become a zombie process, with a Z status. A zombie process has already exited, so we cannot analyze its system calls any further. We will come back to how to deal with zombie processes shortly; for now, let’s continue analyzing the iowait issue.

At this point, you should have noticed that the issue with system iowait continues, but tools like top and pidstat are no longer able to provide more information. This is when we should turn to event-based dynamic tracing tools.

You can use perf top to see if there are any new discoveries. Alternatively, like me, you can run perf record in the terminal for a while (e.g. 15 seconds), then press Ctrl+C to exit, and then run perf report to view the report:

$ perf record -g
$ perf report

Next, find the app process we are interested in, press Enter to expand the call stack, and you will get the following call graph:

![](../images/3cda3f93bb164cbb9a09706ed4d8765b.jpg)

In this graph, the swapper is a scheduling process in the kernel, which you can ignore for now.

Let’s look at the other information. It can be observed that the app is indeed reading data through the system call sys_read(). From the functions new_sync_read and blkdev_direct_IO, it can be seen that the process is performing direct reads on the disk, bypassing the system cache. Each read request is directly read from the disk, which explains the increase in iowait that we observed.

It seems that the culprit is the app itself, which is performing direct disk I/O!

The next question is easy to solve. We should analyze the code and find out where the direct read requests are being made. By examining the source code file app.c, you will indeed find that the app is opening the disk with the O_DIRECT option, which bypasses the system cache and directly reads and writes to the disk.

open(disk, O_RDONLY|O_DIRECT|O_LARGEFILE, 0755)

Reading and writing the disk directly is friendly to I/O-intensive applications (such as database systems), because it lets the application control disk I/O itself. In most cases, however, it is better to let the system cache optimize disk I/O. In other words, removing the O_DIRECT option will solve the problem.
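
For reference, a minimal sketch of what the change might look like (this is not the actual app-fix1.c, and it assumes the surrounding read loop in app.c stays the same):

/* Sketch only: drop O_DIRECT so reads go through the page cache */
open(disk, O_RDONLY|O_LARGEFILE, 0755)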

app-fix1.c is the modified file, and I have also packaged it into an image file. You can run the following command to start it:

# First, remove the original application
$ docker rm -f app
# Run the new application
$ docker run --privileged --name=app -itd feisky/app:iowait-fix1

Finally, check with top again:

$ top
top - 14:59:32 up 19 min,  1 user,  load average: 0.15, 0.07, 0.05
Tasks: 137 total,   1 running,  72 sleeping,   0 stopped,  12 zombie
%Cpu0  :  0.0 us,  1.7 sy,  0.0 ni, 98.0 id,  0.3 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  1.3 sy,  0.0 ni, 98.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
...

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 3084 root      20   0       0      0      0 Z   1.3  0.0   0:00.04 app
 3085 root      20   0       0      0      0 Z   1.3  0.0   0:00.04 app
    1 root      20   0  159848   9120   6724 S   0.0  0.1   0:09.03 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kthreadd
    3 root      20   0       0      0      0 I   0.0  0.0   0:00.40 kworker/0:0
...

You will find that iowait has dropped significantly to only 0.3%, indicating that the modification we made earlier has successfully fixed the high iowait issue. Great job! However, don’t forget that the zombie processes are still waiting for you. If you carefully observe the number of zombie processes, you will be disappointed to find that they are still increasing.

Zombie Processes #

Next, let’s address the issue of zombie processes. Since zombie processes appear when a parent process fails to clean up its children’s resources, we need to fix the problem at its root: find the parent process and solve it there.

As we mentioned earlier, the simplest way to find the parent process is to run the pstree command:

# -a means show command-line arguments
# -p means show PIDs
# -s means show the parent processes of the specified process
$ pstree -aps 3084
systemd,1
  └─dockerd,15006 -H fd://
      └─docker-containe,15024 --config /var/run/docker/containerd/containerd.toml
          └─docker-containe,3991 -namespace moby -workdir...
              └─app,4009
                  └─(app,3084)

After running this command, you will find that the parent process of process 3084 is 4009, which is the app application.

So, let’s look at the code of the app application and check whether it handles child-process termination correctly: for example, whether it calls wait() or waitpid(), or whether it registers a handler for the SIGCHLD signal.
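
As an aside, if a parent process takes the SIGCHLD route rather than calling wait() inline, a typical handler reaps every exited child with a non-blocking waitpid(). The sketch below is purely illustrative and is not the code used in this case:

#include <signal.h>
#include <sys/wait.h>

/* Illustrative SIGCHLD handler: reap all exited children without blocking */
static void sigchld_handler(int sig)
{
    (void)sig;
    while (waitpid(-1, NULL, WNOHANG) > 0)
        ;
}

/* In main(), register it with: signal(SIGCHLD, sigchld_handler); */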

Now let’s take a look at the source code file app-fix1.c after fixing the high iowait issue, and find the section where the child process is created and cleaned up:

int status = 0;
for (;;) {
  for (int i = 0; i < 2; i++) {
    if (fork() == 0) {
      sub_process();
    }
  }
  sleep(5);
}

while (wait(&status) > 0);

Loops like this are easy to get wrong. Can you spot the problem? Although the code does call wait() to wait for the child processes to terminate, it mistakenly places the call outside the infinite for(;;) loop, so wait() is never actually executed. Moving it inside the loop fixes the problem, as in the sketch below.
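
(A minimal sketch of the fix, assuming app-fix2.c keeps the same overall structure; the actual file may differ in details.)

int status = 0;
for (;;) {
  for (int i = 0; i < 2; i++) {
    if (fork() == 0) {
      sub_process();
    }
  }
  sleep(5);
  /* Reap the exited children inside the loop so they do not linger as
   * zombies; wait() returns -1 once no children are left */
  while (wait(&status) > 0);
}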

I have placed the modified file in app-fix2.c, and it has also been packaged as a Docker image. You can start it by running the following command:

# First, stop the app that generates zombie processes
$ docker rm -f app
# Then, start the new app
$ docker run --privileged --name=app -itd feisky/app:iowait-fix2

After starting it, let’s double-check using top:

$ top
top - 15:00:44 up 20 min,  1 user,  load average: 0.05, 0.05, 0.04
Tasks: 125 total,   1 running,  72 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.0 us,  1.7 sy,  0.0 ni, 98.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  1.3 sy,  0.0 ni, 98.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
...

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 3198 root      20   0    4376    840    780 S   0.3  0.0   0:00.01 app
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kthreadd
    3 root      20   0       0      0      0 I   0.0  0.0   0:00.41 kworker/0:0
...

Alright, the zombie processes (in the Z state) are gone, and iowait is now 0. Finally, all the issues have been resolved.

Summary #

In this case, I used a multi-process example to analyze the situation where the CPU usage of the system waiting for I/O (iowait%) increases.

Although this case resulted in increased iowait due to disk I/O, **high iowait does not necessarily indicate an I/O performance bottleneck. When only I/O-type processes are running in the system, iowait can also be high, but in reality, the disk read/write is far from reaching a performance bottleneck**.

Therefore, when encountering an increase in iowait, it is necessary to first use tools such as dstat and pidstat to confirm whether it is a disk I/O problem, and then identify the processes causing the I/O.

Processes waiting for I/O are generally in an uninterruptible state, so processes in the D state (i.e., uninterruptible state) found using the ps command are usually suspicious. However, in this case, after the I/O operation, the process becomes a zombie process, so it is not possible to directly analyze the system calls of this process using strace.

In this situation, we used the perf tool to analyze the system's CPU clock events and ultimately discovered that the problem was caused by direct I/O. At this point, it is easy to check the corresponding position in the source code for any issues.

Regarding zombie processes, they are relatively easy to troubleshoot. After using pstree to identify the parent process, you can examine the parent process's code and check for wait() / waitpid() calls or the registration of SIGCHLD signal handling functions.

Reflection #

Finally, I would like to invite you to discuss the issues of uninterruptible state processes and zombie processes that you have encountered. How do you analyze their root causes? And how do you solve them? In today's case study, have you made any new discoveries? You can summarize your thoughts based on my narrative.

Feel free to discuss with me in the comments, and feel free to share this article with your colleagues and friends. Let's practice in real-life scenarios and make progress through communication.
