20 Analysis: How to Analyze High CPU Utilization Issues #

Hello, I am Yafang Shao.

If you are an application developer, you probably know how to analyze your application's logic and have your own set of techniques for optimizing application code to improve system performance. In this lesson, I want to discuss how to expand that boundary, so that you can also analyze the modules outside your own code and the Linux kernel, which to you is practically a black box.

In many cases, performance issues in applications can only be resolved by analyzing kernel behavior, so the kernel exports many metrics for applications to refer to. When an application runs into trouble, we can check which metrics are abnormal and then analyze further. However, these exported metrics do not cover every scenario. You may hit harder problems, such as an application with an obvious performance issue while every system metric looks normal. I believe many people have been frustrated by this. When it happens, is something wrong with the kernel? What exactly is it doing? In this lesson, I will discuss how to analyze this kind of problem.

We know that for application developers, the boundary of an application is the system call, and entering the system call means entering the Linux kernel. Therefore, to expand the boundaries of analyzing problems, you first need to know how to analyze the system call functions used by the application. For kernel developers, the boundary is also the system call, and the application is outside of the system call. If kernel developers want to expand the boundaries of analyzing problems, they also need to know how to use system calls to trace the logic of the application.

How to expand the boundaries of your problem analysis? #

As a kernel developer, my understanding of application logic is not as deep as my understanding of the kernel. However, when application developers come to me for help, knowing nothing about their application logic does not stop me from analyzing the problem, because I know how to trace an application's logic with analysis tools. After a round of tracing, I can form a rough picture of what the application is doing.

One of the tools I most often use to trace application logic is strace. strace analyzes the "boundary" between the application and the kernel: system calls. With strace, we can understand not only the application's execution logic but also the kernel's. So as an application developer, you can use this tool to expand the boundaries of your problem analysis.

strace can trace all of a process's system calls or only specific ones, and it can record how long each call takes. Often, just by looking at the execution times of system calls, we can tell where a service's latency comes from. For example, to trace the system calls of a multi-threaded program, we can run strace like this:

# -T: show the time spent in each system call
# -tt: prefix each line with a microsecond-resolution timestamp
# -ff: follow child threads/processes, writing one output file per task (strace.out.<tid>)
# -p: attach to the process with the given pid; -o: output file prefix
$ strace -T -tt -ff -p pid -o strace.out

However, before using strace to trace a process, I hope you first understand how strace works. That is also a goal of this lesson: not just knowing how to use a tool, but understanding its principle, so that when a problem occurs you can judge whether the tool is even applicable.

Understand the principle of the tool, not just how to use it #

The principle of the strace tool is illustrated in the diagram below (using the command we mentioned earlier as an example):

From the diagram, we can see that strace attaches to the running target process. This is done through the ptrace system call (gdb attaches in a similar way). ptrace's PTRACE_SYSCALL request is what traces the target's system calls. Once the target process is being traced, every time it enters a system call it raises a SIGTRAP and pauses execution. Through that SIGTRAP, the tracer learns that the target has entered a system call and processes the event; the lines you see in strace's output are the result of this processing. After the tracer finishes handling the system call, it resumes the target process, which then runs until the next system call.
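To make the mechanism more concrete, here is a minimal sketch of a tracer built on PTRACE_SYSCALL. It is not how strace itself is implemented (strace handles far more, such as signal delivery, syscall arguments, and multi-threaded tracees); it assumes an x86_64 target whose PID is given on the command line, and error handling is omitted.

/* tiny_tracer.c: minimal PTRACE_SYSCALL sketch (x86_64 only, no error handling) */
#include <stdio.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
        pid_t pid = atoi(argv[1]);              /* PID of the target process */
        int status;
        struct user_regs_struct regs;

        ptrace(PTRACE_ATTACH, pid, NULL, NULL); /* attach; the target stops */
        waitpid(pid, &status, 0);

        for (;;) {
                /* Resume the target until the next syscall entry or exit,
                 * where it stops again and we are notified via waitpid(). */
                ptrace(PTRACE_SYSCALL, pid, NULL, NULL);
                waitpid(pid, &status, 0);
                if (WIFEXITED(status))
                        break;

                /* orig_rax holds the syscall number on x86_64; note this
                 * fires on both entry and exit, so each call prints twice. */
                ptrace(PTRACE_GETREGS, pid, NULL, &regs);
                printf("syscall %lld\n", (long long)regs.orig_rax);
        }

        ptrace(PTRACE_DETACH, pid, NULL, NULL);
        return 0;
}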

You may have noticed that the target process is interrupted every time it makes a system call, and it can only continue after strace finishes processing that call. This introduces noticeable delays into the target process, so I do not recommend using this command in a production environment. If you must use it to trace an issue in production, be sure to have a contingency plan.

Suppose we have traced with strace and found that a thread's latency jitter is caused by a system call that takes a long time. How do we continue from there? This is the point where application developers and operations engineers need to expand their analysis boundary; for kernel developers, it is just where the analysis begins.

Learn to use commonly used analysis tools for kernel developers #

Let's illustrate with a real case. Once, business developers reported that strace showed the pread(2) system call in their service taking a long time, sometimes tens of milliseconds (ms) or even seconds, but they did not know how to analyze it further, so they asked me to help.

Since the problem had been narrowed down to the pread(2) system call, the rest of the analysis is relatively easy for a kernel developer. The most suitable tool for this kind of problem is ftrace: we can use ftrace's function_graph tracer to find out where pread(2) is spending its time.

To trace where pread(2) takes so long, we need to know which kernel function this system call corresponds to. There are two simple ways to find the kernel function behind a system call:

  • Look it up in the include/linux/syscalls.h header of the kernel source (an excerpt is shown below).

You can see that there are several pread-related functions, but since our system is 64-bit, we only need to focus on the 64-bit variants, namely ksys_pread64 and sys_pread64. According to the comments in this header file, the former is used inside the kernel and the latter is what is exported to user space, so in the kernel we need to trace the former. Also note that the corresponding functions may differ between kernel versions; we use the latest kernel code (5.9-rc) as the example here.
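For reference, the relevant declarations in include/linux/syscalls.h look roughly like this in a 5.9-rc tree (check your own kernel version, since they may differ; the two declarations come from different parts of the header):

/* include/linux/syscalls.h (abbreviated excerpt) */
asmlinkage long sys_pread64(unsigned int fd, char __user *buf,
                            size_t count, loff_t pos);

/* Per the comments in this header, kernel code should not call the sys_*()
 * syscalls directly, but use the equivalent ksys_*() helpers instead. */
ssize_t ksys_pread64(unsigned int fd, char __user *buf, size_t count,
                     loff_t pos);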

  • Use the /proc/kallsyms file to search:

$ cat /proc/kallsyms | grep pread64
…
ffffffffa02ef3d0 T ksys_pread64
…

Each line in /proc/kallsyms is a symbol: the first column is the symbol address, the second column is the symbol type, and the third column is the symbol name. In the output above, T means a global symbol in the text (code) section, which is the kind of symbol we can trace. You can refer to the nm man page (man nm) for the meanings of these symbol types.

Next, we will use the function_graph feature of ftrace to trace the ksys_pread64 function and see exactly where the kernel is taking so long. The usage of function_graph is as follows:

# First, set the function to trace
$ echo ksys_pread64 > /sys/kernel/debug/tracing/set_graph_function

# Then set the PID of the thread to trace. If there are multiple threads, each thread needs to be written one by one
$ echo 6577 > /sys/kernel/debug/tracing/set_ftrace_pid
$ echo 6589 >> /sys/kernel/debug/tracing/set_ftrace_pid

# Set function_graph as the current tracer to trace the function call chain
$ echo function_graph > /sys/kernel/debug/tracing/current_tracer

Then we can use /sys/kernel/debug/tracing/trace_pipe to view the output. Here is the time-consuming trace I obtained:

We can see that pread(2) is blocked in the io_schedule() function for 102ms. io_schedule() means that the thread is scheduled away due to I/O blocking and needs to wait for the I/O to complete before it can continue execution. In the function_graph output, we can also see how pread(2) gradually reaches io_schedule. Since the entire process is quite long, I will only paste the key call logic here:

     21)               |            __lock_page_killable() {
     21)   0.073 us    |              page_waitqueue();
     21)               |              __wait_on_bit_lock() {
     21)               |                prepare_to_wait_exclusive() {
     21)   0.186 us    |                  _raw_spin_lock_irqsave();
     21)   0.051 us    |                  _raw_spin_unlock_irqrestore();
     21)   1.339 us    |                }
     21)               |                bit_wait_io() {
     21)               |                  io_schedule() {

We can see that the call to io_schedule() comes from __lock_page_killable(). When pread(2) reads file content from disk into a memory page, the page is locked first and unlocked once the read completes. If the page is already locked by someone else, for example while I/O on it is still in flight, pread(2) has to wait; once the I/O path unlocks the page, pread(2) can go on and copy the file content from it. In our case, the page that pread(2) needed was already locked, so the calling thread waited here. This is perfectly reasonable kernel logic; the real question is why the page stayed locked for so long. That is what we need to figure out next.

Since the thread is blocked in disk I/O, we need to check the disk I/O situation of the system. We can use iostat to observe:

# -d: device report only; -x: extended statistics; -m: values in MB; 1: refresh every second
$ iostat -dxm 1

Here is the output I observed:

Among them, sdb is the disk from which the service's pread(2) reads its files. Under normal circumstances its read and write volumes are small. However, in the output above we can see that the disk utilization (%util) randomly spikes to a very high level, close to 100%, that avgqu-sz is large, meaning many I/O requests are queued, and that w/s is far higher than usual. In other words, a burst of write I/O is causing severe disk queuing and driving the utilization up. From this we can infer that the reason pread(2) takes so long to read from disk is most likely that its reads are being starved by these writes. So the next step is to find out who is issuing the write I/O.

Observing the I/O behavior with iotop, we found that no user threads were performing write I/O; almost all writes were issued by the kernel thread kworker. In other words, the user threads write their data into the Page Cache, and kworker later writes those dirty Page Cache pages back to disk. This brings us to the first module of our course: how to observe Page Cache behavior.

Writing Analysis Tools #

If you are still unsure how to observe Page Cache behavior, I suggest you go back and review the first module of the course carefully; I won't repeat it here. However, I would like to mention a method that was not covered in the Page Cache module, which can be used to determine which files are resident in the Page Cache and how much of each file is cached.

The conventional way is to use fincore and mincore, but both are relatively inefficient. Here is a more efficient approach: write a kernel module that traverses inodes to see what the Page Cache is made of. The module's actual code is fairly long, so I will only sketch the core idea in pseudocode:

iterate_supers                 // traverse all superblocks
  iterate_pagecache_sb         // traverse the inodes of one superblock
    list_for_each_entry(inode, &sb->s_inodes, i_sb_list)
      // record the size of this inode's page cache
      nrpages = inode->i_mapping->nrpages;
      /* Obtain the dentry for this inode and derive the file path from it.
       * Note that the inode may no longer have a dentry (the dentry may
       * have been reclaimed), in which case the file name corresponding
       * to this inode cannot be recovered.
       */
      dentry = dentry_from_inode(inode);   // e.g. via d_find_alias()
      dentry_path_raw(dentry, filename, PATH_MAX);
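To make this a bit more concrete, here is a heavily simplified sketch of what such a module could look like. This is not the author's actual tool: the names are illustrative, locking and inode reference counting are deliberately omitted (a real module must take sb->s_inode_list_lock and pin each inode it inspects), and iterate_supers() may not be exported to loadable modules on your kernel, in which case you would need another way to reach it.

/* pagecache_scan.c: simplified sketch, not production code */
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/dcache.h>
#include <linux/err.h>
#include <linux/limits.h>

static char path_buf[PATH_MAX];

/* Called once per superblock by iterate_supers() */
static void iterate_pagecache_sb(struct super_block *sb, void *arg)
{
        struct inode *inode;

        /* NOTE: a real module must hold sb->s_inode_list_lock here
         * and take references on the inodes/dentries it touches. */
        list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
                unsigned long nrpages = inode->i_mapping->nrpages;
                struct dentry *dentry;
                char *name;

                if (!nrpages)
                        continue;

                /* The dentry may already have been reclaimed; skip if so. */
                dentry = d_find_alias(inode);
                if (!dentry)
                        continue;

                name = dentry_path_raw(dentry, path_buf, PATH_MAX);
                if (!IS_ERR(name))
                        pr_info("%s: %lu pages in page cache\n", name, nrpages);
                dput(dentry);
        }
}

static int __init pagecache_scan_init(void)
{
        iterate_supers(iterate_pagecache_sb, NULL);
        return 0;
}

static void __exit pagecache_scan_exit(void)
{
}

module_init(pagecache_scan_init);
module_exit(pagecache_scan_exit);
MODULE_LICENSE("GPL");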

With this method, you can see not only the files that processes currently have open, but also files that have been closed yet still occupy memory in the Page Cache, so the analysis is more comprehensive.

By examining the contents of the Page Cache this way, we discovered that a few specific files occupied a large amount of memory, and that these files belonged to offline (batch) jobs rather than to the important online business. Because the offline jobs occupied so much of the Page Cache, the working set of the online business was squeezed, so pread(2) frequently missed the Page Cache and had to read from disk. In other words, the online business suffered a large number of page-ins and page-outs.

At this point the solution is clear: limit the amount of Page Cache the offline jobs can use, so that the working set of the online business is protected and excessive refaults are avoided. After this adjustment, the performance jitter in the business no longer occurred.

Do you feel confused about the names I mentioned above? Are you unsure about the relationship between inode and Page Cache? If so, it means you haven’t studied the Page Cache module of our course carefully. I recommend going back and studying it thoroughly.

That’s all for this lesson.

Class Summary #

The content of this lesson is somewhat challenging for application developers and system administrators. I cover this difficult material in the hope that you can expand the boundaries of your problem analysis. For kernel developers, most of it is basic knowledge; if you found it hard to follow, it means your understanding of the kernel is not yet deep enough and you need to spend more time studying it. I have been working on the kernel for many years and still feel my own understanding is not deep enough, so I keep learning and researching, which is what I have been doing.

Let’s now review the key points of this class:

  • strace works at the boundary between applications and the kernel: system calls. If you are an application developer and want to expand the boundaries of your problem analysis, you need to understand how strace works and how to analyze the problems it uncovers.
  • ftrace is a powerful tool for analyzing kernel problems, and you need to master it.
  • For your own specific problems, you sometimes need to write dedicated analysis tools, and to do that well you must be familiar with many kernel details.

Homework #

Regarding the "Writing Analysis Tools" section of this lesson, I have an assignment for you, something I have not had the time and energy to finish myself: implement a feature in sysrq that displays all tasks in the R and D states in the system, to help developers analyze high system load issues.

I have already implemented this feature in our kernel, but when pushing it to the Linux kernel, the maintainer requested that I implement it in a different way. However, due to other commitments at that time, this task was put on hold. If you implement it well, you can submit it to the Linux kernel, and you can also cc me ([email protected]). By the way, when implementing it, you can refer to my previous commit: scheduler: enhancement to show_state_filter and SysRq. Feel free to discuss it with me in the comments.

Finally, thank you for reading. If you found this lesson helpful, please feel free to share it with your friends.