27 Case Study: Why Is My Disk I/O Latency High? #

Hello, I’m Ni Pengfei.

In the previous section, we studied a case where excessive logging caused an I/O performance issue. Let’s briefly review that.

Logs are the most commonly used and effective tools for understanding the internal operation of an application. Logs are generally divided into different levels such as debugging, information, warning, and error.

Typically, in a production environment, only warning-level logging is enabled, which generally does not cause I/O issues. However, when troubleshooting occasional problems, we may need to enable debug logging. After debugging is finished, it is easy to forget to switch the log level back, and the resulting flood of debug logs can then cause I/O performance problems.

You can use iostat to confirm if there is an I/O bottleneck. Then use strace and lsof to locate the application and the path to the log file it is writing to. Finally, adjust the log level through the application’s interface to perfectly solve the I/O problem.

However, if the application does not have the functionality to dynamically adjust the log level, you will need to modify the application configuration and restart it to make the configuration take effect.
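
If the application is under your own control, it helps to build in such a runtime switch up front. Below is a minimal, hypothetical sketch (not the case application) of a Flask endpoint that changes Python’s logging level without a restart; the /loglevel route and its query parameter are illustrative assumptions:

import logging
from flask import Flask, request

app = Flask(__name__)
logging.basicConfig(level=logging.WARNING)  # production default: warning

@app.route("/loglevel", methods=["PUT"])
def set_log_level():
    # e.g. curl -X PUT "http://localhost:80/loglevel?level=warning"
    level = request.args.get("level", "warning").upper()
    logging.getLogger().setLevel(getattr(logging, level, logging.WARNING))
    return "log level set to {}\n".format(level)

With something like this in place, switching back from debug to warning is a single curl call rather than a configuration change plus restart.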

Today, let’s look at a new case. This time, it is a web application based on the Python Flask framework. It provides an API for querying word popularity, but the response speed of the API is not satisfactory.

Many thanks to Dong Guoxing, a senior backend engineer from Ctrip Systems Development Department, for providing today’s case.

Case Preparation #

This case is still based on Ubuntu 18.04 and is also applicable to other Linux systems. The environment I used for this case is as follows:

  • Machine configuration: 2 CPUs, 8GB memory

  • Pre-installation of tools such as docker and sysstat, e.g., apt install docker.io sysstat

To facilitate the running of today’s case, I have packaged it into a Docker image. This way, you only need to run Docker commands to start it.

Today’s case requires two virtual machines, one of which is the target machine for case analysis, running a Flask application with the IP address 192.168.0.10. The other machine serves as the client, requesting word popularity. I have drawn a diagram to represent their relationship, as shown below:

Next, open two terminals and SSH into these two virtual machines respectively, and install the above-mentioned tools on the first virtual machine.

As before, all commands in the case are assumed to be run as the root user. If you log in to the system with a regular user account, please run the command “sudo su root” to switch to the root user.

With this, the preparation work is complete. Let’s now move on to the operational phase.

Friendly reminder: The core logic of the Python application in the case is relatively simple, and you may be able to see the problem at a glance. However, the actual source code in a production environment is much more complicated. Therefore, I still recommend not looking at the source code before performing the operation, to avoid preconceptions, and instead analyze it as a black box. This way, you can better grasp how to analyze the bottleneck location in the application based on system resource usage issues, as well as the approximate position of the bottleneck in the application.

Case Study #

First, we execute the following command in the first terminal to run the target application for this case study:

$ docker run --name=app -p 10000:80 -itd feisky/word-pop

Then, in the second terminal, we run the curl command to access http://192.168.0.10:10000/ and confirm that the case application is running normally. You should see a “hello world” output from curl:

$ curl http://192.168.0.10:10000/
hello world

Next, in the second terminal, we access the word popularity interface of the case application, which is http://192.168.0.10:10000/popularity/word.

$ curl http://192.168.0.10:10000/popularity/word

After waiting a while, you will find that the interface still has not responded. What is going on? Let’s go back to the first terminal and analyze it.

Let’s try running an arbitrary command in the first terminal, for example df, to check file system usage. Strangely, even such a simple command takes quite a while to produce any output.

$ df
Filesystem     1K-blocks    Used Available Use% Mounted on
udev             4073376       0   4073376   0% /dev
tmpfs             816932    1188    815744   1% /run
/dev/sda1       30308240 8713640  21578216  29% /

By using df, we know that the system has enough disk space. So why is the response slow? It seems like we need to observe the specific usage of system resources, such as CPU, memory, and disk I/O.

The approach here is actually similar to the previous case study. We can first use top to observe the CPU and memory usage, and then use iostat to observe the disk I/O.

To keep the curl requests from ending while we analyze, let’s go back to the second terminal, press Ctrl+C to stop the previous curl command, and then put curl into a loop. This time we also prepend the time command so we can see how long each request takes:

$ while true; do time curl http://192.168.0.10:10000/popularity/word; sleep 1; done

Back in the first terminal, we run the top command to observe CPU and memory usage:

$ top
top - 14:27:02 up 10:30,  1 user,  load average: 1.82, 1.26, 0.76
Tasks: 129 total,   1 running,  74 sleeping,   0 stopped,   0 zombie
%Cpu0  :  3.5 us,  2.1 sy,  0.0 ni,  0.0 id, 94.4 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  2.4 us,  0.7 sy,  0.0 ni, 70.4 id, 26.5 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  8169300 total,  3323248 free,   436748 used,  4409304 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  7412556 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
12280 root      20   0  103304  28824   7276 S  14.0  0.4   0:08.77 python
   16 root      20   0       0      0      0 S   0.3  0.0   0:09.22 ksoftirqd/1
1549 root      20   0  236712  24480   9864 S   0.3  0.3   3:31.38 python3

From the output of top, you can see that iowait on both CPUs is very high; on CPU0 in particular it has reached 94%. Meanwhile, there is still 3GB of free memory, so memory seems sufficient.

Looking further down, there is a slightly higher CPU usage for a python process in the processes section, reaching 14%. Although 14% is not a performance bottleneck, it is somewhat suspicious and may be related to the increase in iowait.

So is this python process with PID 12280 our case application?

In the first terminal, we press Ctrl+C to stop the top command, and then execute the following ps command to find the PID of the case application app.py:

$ ps aux | grep app.py
root     12222  0.4  0.2  96064 23452 pts/0    Ss+  14:37   0:00 python /app.py
root     12280 13.9  0.3 102424 27904 pts/0    Sl+  14:37   0:09 /usr/local/bin/python /app.py

From the output of ps, you can see that the process with high CPU usage is indeed our case application. However, let’s not rush to analyze the CPU issue. After all, with an iowait of 94%, the I/O problem is our top priority. Next, in terminal one, run the following iostat command:

  • The -d option displays I/O performance metrics.

  • The -x option displays extended statistics (i.e., all I/O metrics).

$ iostat -d -x 1
Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util 
loop0            0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00 
sda              0.00   71.00      0.00  32912.00     0.00     0.00   0.00   0.00    0.00 18118.31 241.89     0.00   463.55  13.86  98.40 

After seeing the output of iostat again, do you remember the meaning of the performance indicators in this interface? Take a moment to recall them yourself. If you can’t remember, be sure to review the previous content or use man iostat to find out.

Once you understand the meaning of the indicators, pay attention to the output of iostat. You will notice that the I/O utilization of disk sda has reached 98%, close to saturation. Moreover, the response time for write requests is as high as 18 seconds, and the write data per second is 32 MB. Obviously, writing to the disk has encountered a bottleneck.

So how do we know which processes are causing these I/O requests? I believe you remember the pidstat we used in the previous section.

In terminal one, run the following pidstat command to observe the I/O situation of the processes:

$ pidstat -d 1 
14:39:14      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command 
14:39:15        0     12280      0.00 335716.00      0.00       0  python 

From the output of pidstat, we can once again see the results for PID 12280. This indicates that it is indeed the case application that is causing the I/O performance bottleneck.

At this point, you might think this is simple after all. Didn’t we learn about this case in the previous section? It’s just a matter of using strace to confirm if it is writing to a file and then using lsof to find the file corresponding to the file descriptor.

Is it really that simple? Let’s try it. Still in terminal one, execute the following strace command:

$ strace -p 12280 
strace: Process 12280 attached 
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=567708}) = 0 (Timeout) 
stat("/usr/local/lib/python3.7/importlib/_bootstrap.py", {st_mode=S_IFREG|0644, st_size=39278, ...}) = 0 
stat("/usr/local/lib/python3.7/importlib/_bootstrap.py", {st_mode=S_IFREG|0644, st_size=39278, ...}) = 0 

From the strace output, you can see a lot of stat system calls, most of which are for Python files. However, please note that there are no write system calls here.

Since the output of strace is quite extensive, we can use grep to filter out the write calls, for example:

$ strace -p 12280 2>&1 | grep write 

Unfortunately, there is still no output here.

Does this mean there is no performance issue at this point? If you re-execute the top and iostat commands, you will unfortunately find that the performance issue still exists.

So we have to analyze the results of strace, pidstat, and iostat together. Obviously, you should have noticed the contradiction here: iostat has already proven that there is a performance bottleneck in disk I/O, and pidstat has also proven that this bottleneck is caused by process 12280. But when tracing this process with strace, no write system calls were found.

This is strange. Could it be because the programming language used in the case is Python, and Python is an interpreted language, so we can’t find any trace? Or is it because the case is running in Docker? We’ll leave this as a mystery for you to think about.

File writes should produce corresponding write system calls, yet the tools we have used so far show no trace of them. At this point, we should consider switching tools. How else can we find out which files are being written?

Here, I will introduce you to a new tool called filetop. It is part of the bcc software package and is based on the eBPF (extended Berkeley Packet Filters) mechanism in the Linux kernel. It tracks read-write operations on files in the kernel and outputs the thread ID (TID), read-write size, read-write type, and file name.

You don’t need to delve into the working principle of eBPF right now. We will gradually encounter it in the future. For now, just learn how to use it.

As for installing our old friend bcc, you can refer to its GitHub repository at https://github.com/iovisor/bcc. For example, on Ubuntu 16.04 and later, you can install it with the following commands:

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 4052245BD4284CDD 
echo "deb https://repo.iovisor.org/apt/$(lsb_release -cs) $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/iovisor.list 
sudo apt-get update 
sudo apt-get install bcc-tools libbcc-examples linux-headers-$(uname -r)

After installation, all the tools provided by bcc are installed in the directory /usr/share/bcc/tools. Next, let’s use one of these tools to observe file I/O.

First, run the following commands in Terminal 1:

# Switch to the tools directory
$ cd /usr/share/bcc/tools

# -C option means don't clear the screen when outputting new content
$ ./filetop -C

The output of filetop has 8 columns: thread ID, thread command name, read count, write count, read size (in KB), write size (in KB), file type, and file name.

Within this content, you may see many dynamic linked libraries, but that is not our focus and can be ignored for now. Our main focus is on a Python application, so we should pay special attention to anything related to Python.

After observing for a while, you will notice that every once in a while, the Python application with thread ID 514 will write a large number of txt files, followed by reading them in large quantities.

So, which process does thread ID 514 belong to? We can use the ps command to find out. First, stop filetop by pressing Ctrl+C in Terminal 1, and then run the following ps command. The second column of the output is the process ID we are interested in:

$ ps -efT | grep 514
root     12280  514 14626 33 14:47 pts/0    00:00:05 /usr/local/bin/python /app.py

We can see that this thread belongs to process 12280, our case application. We can finally breathe a sigh of relief, but we’re not done yet: filetop only shows file names, not their paths, so we still have to keep digging.

I’ll introduce another useful tool called opensnoop. It is also part of the bcc package and can dynamically trace the open system calls in the kernel. This way, we can find out the paths of these txt files.

Next, run the following opensnoop command in Terminal 1:

$ opensnoop

This time, from the output of opensnoop, you can see that these txt files are located in the /tmp directory. You can also see that it opens many files, numbered sequentially from 0.txt up to 999.txt, far more than we saw with filetop.

Combining filetop and opensnoop, we can go one step further. We can speculate that the case application writes 1000 txt files and then reads their contents back into memory for processing. Let’s check whether there really are 1000 files in this directory:

$ ls /tmp/9046db9e-fe25-11e8-b13f-0242ac110002 | wc -l
ls: cannot access '/tmp/9046db9e-fe25-11e8-b13f-0242ac110002': No such file or directory
0

However, after performing this operation, we find that the directory does not exist anymore. What’s going on? Let’s go back to opensnoop and observe for a while:

$ opensnoop
12280  python              6   0 /tmp/defee970-fe25-11e8-b13f-0242ac110002/261.txt
12280  python              6   0 /tmp/defee970-fe25-11e8-b13f-0242ac110002/840.txt
12280  python              6   0 /tmp/defee970-fe25-11e8-b13f-0242ac110002/136.txt

It turns out the path has changed to another directory. This indicates that these directories are dynamically created by the application and deleted after use.

Putting all the previous analysis together, we can basically conclude that the case application dynamically generates a batch of files to store data temporarily and deletes them after use. Unfortunately, it is precisely these file reads and writes that cause the I/O performance bottleneck and make the whole request so slow.

Of course, we still need to verify this conclusion. The old method applies: look at the source code of the application, app.py:

@app.route("/popularity/<word>") 
def word_popularity(word): 
  dir_path = '/tmp/{}'.format(uuid.uuid1()) 
  count = 0 
  sample_size = 1000 
  
  def save_to_file(file_name, content): 
    with open(file_name, 'w') as f: 
    f.write(content) 

  try: 
    # initial directory firstly 
    os.mkdir(dir_path) 

    # save article to files 
    for i in range(sample_size): 
        file_name = '{}/{}.txt'.format(dir_path, i) 
        article = generate_article() 
        save_to_file(file_name, article) 

    # count word popularity 
    for root, dirs, files in os.walk(dir_path): 
        for file_name in files: 
            with open('{}/{}'.format(dir_path, file_name)) as f: 
                if validate(word, f.read()): 
                    count += 1 
    finally: 
        # clean files 
        shutil.rmtree(dir_path, ignore_errors=True) 

    return jsonify({'popularity': count / sample_size * 100, 'word': word}) 

From the source code, we can see that for every request this application generates a batch of temporary files, reads them back into memory for processing, and finally deletes the entire directory.

Using disk space to process large amounts of data is a common technique. However, in this case the heavy I/O requests drive disk I/O utilization very high.

Solving it is really a matter of algorithmic optimization. For example, when there is enough memory, you can keep all the data in memory, avoiding the I/O-induced performance problem entirely.
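
As a rough illustration only (we have not seen the actual /popular/word implementation), the in-memory variant might look like the sketch below; it assumes the same generate_article() and validate() helpers used in app.py:

@app.route("/popular/<word>")
def word_popularity_in_memory(word):
    count = 0
    sample_size = 1000

    # generate and check the articles entirely in memory,
    # so no temporary files ever touch the disk
    for i in range(sample_size):
        article = generate_article()
        if validate(word, article):
            count += 1

    return jsonify({'popularity': count / sample_size * 100, 'word': word})

The word-counting logic is unchanged; only the detour through /tmp is gone.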

You can test it by accessing http://192.168.0.10:10000/popularity/word and http://192.168.0.10:10000/popular/word in terminal two, and compare the effects before and after:

$ time curl http://192.168.0.10:10000/popularity/word
{ 
  "popularity": 0.0, 
  "word": "word" 
} 
real    2m43.172s 
user    0m0.004s 
sys    0m0.007s

$ time curl http://192.168.0.10:10000/popular/word
{
  "popularity": 0.0,
  "word": "word"
}

real    0m8.810s
user    0m0.010s
sys    0m0.000s 

The new interface can return in just 8 seconds, which is obviously much better than the initial 3 minutes.

Of course, this is only the first step of optimization, and the method is far from perfect; further improvements are possible. In real systems, however, we mostly take a similar approach: first use the simplest method to fix the production problem as quickly as possible, and then keep looking for better optimizations.

Summary #

Today, we analyzed a case of slow response caused by word popularity.

First, we used top and iostat to analyze the CPU and disk usage of the system. We found a disk I/O bottleneck and determined that it was caused by the application in this case.

Next, we attempted to use the same method as in the previous case to observe the system calls of the process using strace, but unfortunately, we did not find any write system calls.

Therefore, we used new tools, such as filetop and opensnoop from the dynamic tracing toolkit bcc, to identify the problem with the application. We discovered that the root cause was a large number of read and write operations on temporary files.

Once the problem was identified, the optimization methods became relatively simple. If there is sufficient memory, the simplest method is to store the data in faster memory, eliminating the disk I/O bottleneck. Furthermore, you can also use various algorithms, such as Trie trees, to further optimize the efficiency of word processing.
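
For instance, if the per-article check boils down to word lookups, a Trie (prefix tree) built over each article’s words keeps repeated lookups cheap. Here is a minimal, generic sketch, not taken from the case application:

class Trie:
    def __init__(self):
        self.children = {}    # maps a character to the child node
        self.is_word = False  # True if a complete word ends at this node

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def contains(self, word):
        node = self
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

# Build the index once per article, then answer many word queries quickly.
trie = Trie()
for w in "the quick brown fox".split():
    trie.insert(w)
print(trie.contains("fox"))   # True
print(trie.contains("word"))  # False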

Reflection #

Finally, I’ll leave you with a question to reflect on, one I already mentioned in the article.

In today’s case, iostat has proven that there is a performance bottleneck in disk I/O, and pidstat also confirms that this bottleneck is caused by process number 12280. However, when tracing this process with strace, no write system calls were found.

What could be the reason for this? Could it be because the programming language used in the case, Python, is interpreted? Or is it because the case is running in Docker?

Here’s a little hint. When you find that the output of a performance tool cannot be explained, it’s best to go back and consider whether any clues have been missed in the analysis or to consult the tool manual to see if certain default options are causing it.

Feel free to discuss this with me in the comments section, and feel free to share this article with your colleagues and friends. Let’s practice in real scenarios and improve through communication.