
11 How-Tos: How to Quickly Identify System CPU Bottlenecks #

Hello, I am Ni Pengfei.

In the previous sections, I used several case studies to analyze the most common CPU performance issues. Through these cases, I believe CPU performance analysis no longer feels unfamiliar or intimidating to you; at the very least, you now have a basic analysis mindset and know a number of CPU performance analysis tools.

However, I suspect you may share a confusion I once had: there are so many CPU performance metrics, and just as many CPU performance analysis tools. Once you leave this column and face a real work scenario, which metrics should you observe, and which tools should you choose?

Don’t worry. Today, drawing on my years of performance optimization experience, I will summarize a “fast and accurate” strategy for locating bottlenecks, and show you how to choose metrics and tools in different scenarios and how to identify performance bottlenecks.

CPU Performance Metrics #

Let’s first review the CPU performance metrics. You can take a piece of paper and write them down from memory, or go back over the previous articles and summarize them yourself.

First of all, the easiest one to think of should be CPU utilization, which is also the most common performance metric in real-world environments.

CPU utilization describes the percentage of non-idle time out of the total CPU time. Depending on what the CPU is running, it can be broken down into user CPU, system CPU, I/O-wait CPU, soft-interrupt CPU, hard-interrupt CPU, and so on; a quick way to read these fields with top and mpstat is sketched right after the list.

  • User CPU utilization includes user-mode CPU utilization (user) and low-priority user-mode CPU utilization (nice); together they represent the percentage of time the CPU spends running in user mode. A high user CPU utilization usually indicates busy applications.

  • System CPU utilization represents the percentage of time the CPU spends running in kernel mode (excluding interrupts). A high system CPU utilization indicates a busy kernel.

  • I/O wait CPU utilization, commonly known as iowait, represents the percentage of time the CPU spends waiting for I/O. A high iowait usually indicates a long I/O interaction time between the system and hardware devices.

  • CPU utilization for software and hardware interrupts represents the percentage of time the CPU spends running software interrupt handlers and hardware interrupt handlers, respectively. A high utilization of these indicates that the system has encountered a large number of interrupts.

  • In addition to the above, there are also CPU steal utilization (steal) and guest CPU utilization, which are used in virtualized environments. They represent the percentage of CPU time occupied by other virtual machines and the percentage of CPU time used by guest virtual machines, respectively.
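As a rough illustration, all of these utilization types appear in top’s %Cpu summary line, and mpstat (from the sysstat package) breaks them down per CPU; the exact field layout can vary slightly between versions:

```bash
# One batch-mode snapshot of top; keep only the CPU summary line.
# us/ni = user/nice, sy = system, id = idle, wa = iowait,
# hi/si = hardware/software interrupts, st = steal.
top -b -n 1 | grep '%Cpu'

# Per-CPU breakdown, one sample per second, five samples.
mpstat -P ALL 1 5
```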

The second one that might come to mind is the average load, which refers to the average number of active processes in the system. It reflects the overall load of the system and mainly consists of three values: the average load in the past 1 minute, the past 5 minutes, and the past 15 minutes.

Ideally, the average load should be equal to the number of logical CPUs, indicating that each CPU is fully utilized. If the average load exceeds the number of logical CPUs, it means that the load is relatively heavy.
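For example, a minimal sanity check is to compare the three load averages reported by uptime with the number of logical CPUs:

```bash
# Load averages over the last 1, 5, and 15 minutes.
uptime

# Number of logical CPUs to compare against.
nproc
```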

The third one, which you might not pay much attention to before learning about it, is process context switching, which includes:

  • Voluntary context switches, which happen when a process cannot obtain a resource it needs.

  • Involuntary context switches, which happen when the system forcibly reschedules a process.

Context switching itself is a core function that keeps Linux running normally. However, excessive context switching wastes CPU time on saving and restoring data such as registers, kernel stacks, and virtual memory, which reduces the time actually available for processes to run and can become a performance bottleneck.
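At the system level, vmstat shows how frequent context switches and interrupts are; this is a minimal sketch, and the column names below come from the procps version of vmstat:

```bash
# One sample per second. Watch "cs" (context switches per second),
# "in" (interrupts per second), plus "r" (runnable) and "b" (uninterruptible).
vmstat 1
```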

In addition to the above metrics, there is one more: the CPU cache hit rate. Because CPU speeds have grown much faster than memory speeds, the CPU’s processing speed far outpaces memory access speed, so whenever the CPU accesses memory it inevitably has to wait for the memory to respond. To bridge this huge performance gap, CPU caches (usually multiple levels of them) were introduced.

CPU Caches

As shown in the image above, CPU caches sit between the CPU and memory in terms of speed, and they cache hot memory data. To accommodate the growing amount of hot data, these caches are divided by size into three levels: L1, L2, and L3. L1 and L2 are commonly used in single-core CPUs, while L3 is used in multi-core CPUs.

From L1 to L3, the size of the three-level cache increases, and correspondingly, the performance decreases (although it is still much better than memory). The cache hit rate measures the reuse of CPU cache, and a higher hit rate indicates better performance.
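There is no /proc file that reports the cache hit rate directly, but as a rough sketch you can estimate it from hardware counters with perf stat; event availability depends on the CPU and on perf permissions, and <PID> below is a placeholder:

```bash
# Count last-level cache references and misses for a command;
# hit rate ≈ 1 - cache-misses / cache-references.
perf stat -e cache-references,cache-misses -- ls /usr

# Or sample a running process for about 10 seconds (replace <PID>).
perf stat -e cache-references,cache-misses -p <PID> sleep 10
```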

All of these metrics are very useful and require us to master them proficiently. Therefore, I have summarized them into an image to help you classify and memorize them. You can save and print it out for easy review, or use it as a “metric selection” checklist for CPU performance analysis.

CPU Performance Metrics

Performance Tools #

Having mastered the performance indicators of the CPU, we also need to know how to obtain these indicators, that is, how to use the tools.

Do you remember which tools we used in the previous examples? Let’s review the CPU performance tools together.

First, let’s look at the case of average load. We started with uptime to check the system’s average load; after the load rose, we used mpstat and pidstat to observe the usage of each CPU and each process, and thus identified the process driving the load up, which turned out to be our stress-testing tool.
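In rough outline (mpstat and pidstat come from the sysstat package), that drill-down looked like this:

```bash
# 1. Is the load high compared with the number of CPUs?
uptime

# 2. Which CPUs are busy, and is the time going to user, system, or iowait?
mpstat -P ALL 5 1

# 3. Which processes are using the CPU?
pidstat -u 5 1
```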

Second, let’s look at the case of context switches. We first used vmstat to check the system-wide numbers of context switches and interrupts; then, with pidstat, we observed each process’s voluntary and involuntary context switches; finally, again with pidstat but at the thread level, we observed the context switches of individual threads and found the root cause of the increase: our benchmarking tool, sysbench.
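A sketch of that sequence:

```bash
# System-wide context switches (cs) and interrupts (in), once per second.
vmstat 1

# Per-process voluntary (cswch/s) and involuntary (nvcswch/s) context switches.
pidstat -w 1

# The same, but per thread (-t); this is what exposed the worker threads.
pidstat -wt 1
```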

Third, let’s look at the case of a single process with rising CPU usage. We first used top to check the CPU usage of the system and of each process, and found that the busy process was php-fpm; then we used perf top to examine php-fpm’s call chain and traced the increase to the library function sqrt().
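A hedged sketch of that perf step; selecting a worker PID with pgrep is only illustrative:

```bash
# Sample the on-CPU functions and call chains of one php-fpm worker.
perf top -g -p "$(pgrep -f php-fpm | head -n 1)"
```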

Fourth, let’s look at the case of rising system-wide CPU usage. We first observed the increase with top, but neither top nor pidstat could find a process with high CPU usage. So we went back over top’s output and, starting from the suspicious processes that were in the Running state despite low CPU usage, finally used perf record and perf report to discover that short-lived processes were causing the problem.

In addition, for short-lived processes, I also introduced a specialized tool, execsnoop, which monitors in real time the external commands that processes invoke.
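For reference, a sketch of both approaches; execsnoop comes from the bcc/BPF tools (it may be packaged as execsnoop-bpfcc), and the 15-second window is arbitrary:

```bash
# Record call-graph samples system-wide for 15 seconds, then inspect them.
perf record -g -a -- sleep 15
perf report

# Watch new processes being exec()ed in real time.
execsnoop
```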

Fifth, let’s look at the case of uninterruptible and zombie processes. We first observed the rise in iowait with top and found large numbers of uninterruptible and zombie processes; dstat then showed that the cause was disk reads, and pidstat pointed us to the processes involved. However, when we tried to view those processes’ system calls with strace, it failed. In the end, we used perf to analyze their call chains and found the root cause: direct disk I/O.
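A rough sketch of that sequence; <PID> is a placeholder for the process found in the earlier steps:

```bash
# System-wide view of CPU, disk, and other stats; look for heavy disk reads.
dstat 1

# Per-process disk I/O, to find out who is doing the reads.
pidstat -d 1

# When strace -p <PID> fails (e.g. on an uninterruptible process),
# fall back to sampling its call chains with perf.
perf record -g -p <PID> -- sleep 15
perf report
```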

The last case is about soft interrupts. With top, we observed that the system’s soft-interrupt CPU usage had risen; checking /proc/softirqs, we found several types of soft interrupts whose counts were changing rapidly; the sar command then showed that the problem was a flood of small network packets. Finally, tcpdump revealed the type and source of the network frames, and we determined that the cause was a SYN flood attack.
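A sketch of that case’s commands; the interface name and port are placeholders:

```bash
# Which softirq counters are changing fastest? watch -d highlights the deltas.
watch -d cat /proc/softirqs

# Network throughput: many packets (rxpck/s) but few kilobytes (rxkB/s)
# points to lots of small packets.
sar -n DEV 1

# Capture the frames to see their type and source.
tcpdump -i eth0 -n tcp port 80
```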

By now, you may already be overwhelmed. It turns out that in just a few cases, we have used more than a dozen CPU performance tools, and each tool has different applicable scenarios! How can we distinguish between so many tools? And how should we choose in actual performance analysis?

In my experience, it is useful to understand them from two different perspectives and apply them flexibly.

Apply what you have learned and connect performance metrics with performance tools #

The first dimension is to start from the performance metrics of the CPU. In other words, when you want to view a specific performance metric, you need to know which tools can provide it.

In other words, group and understand the tools according to the metrics they provide. That way, when troubleshooting a performance issue, you will know exactly which tools can give you the metric you want, instead of trying them one by one with no basis.

In fact, I have used this approach multiple times in the previous examples. For example, after using top to discover a high software interrupt CPU usage, naturally the next step is to find out the specific software interrupt types. Where can you observe the running situation of various software interrupts? Of course, it is the /proc/softirqs file in the proc file system.

Next, for example, if the soft interrupt type we discovered is network receive, then we need to keep digging in the direction of network reception: what does the system’s network receive traffic look like, and which tools can show it? In our case, we used dstat.

Although you don’t need to memorize every tool, if you understand which tool corresponds to which metric and what each tool is good at, you will certainly use them more efficiently and flexibly. Here, I have made a table of the tools that provide each CPU performance metric, to help you organize these relationships and reinforce your understanding and memory. Of course, you can also use it as a “metric → tool” reference.

Table of tools providing CPU performance metrics

Now, let’s move on to the second dimension.

The second dimension is to start from the tools. In other words, when you have already installed a certain tool, you need to know what metrics this tool can provide.

This is also very important in practical environments, especially in production environments, because in many cases, you do not have permission to install new tool packages, and can only maximize the use of the tools already installed in the system, which requires you to have sufficient understanding of them.

Each tool generally supports a rich set of options. Don’t worry, though: you don’t need to memorize all of them. You only need to know which tools exist and what they are basically for; when you actually need one, you can look up the details in its manual with the man command.

Similarly, I have also compiled a table of these commonly used tools to help you differentiate and understand them. Naturally, you can use it as a “tool → metric” reference and consult it whenever needed.

Table of common performance tools and their metrics

How to Quickly Analyze CPU Performance Bottlenecks #

I believe that by this point, you are already very familiar with CPU performance metrics and know which tools to use to obtain each type of metric.

Does this mean that every time you encounter CPU performance issues, you have to run all of these tools and analyze all of the CPU performance metrics?

You probably think this kind of exhaustive search sounds silly. But don’t laugh, because that is exactly how I did it at first. Collecting all the metrics and analyzing them together can indeed uncover the system’s potential bottlenecks.

However, this method is really inefficient. It costs time and effort, and faced with such a vast array of metrics you can easily overlook some detail and end up with nothing to show for it. I have been burned by this many times.

Therefore, in a real production environment, we usually want to locate system bottlenecks as quickly as possible, and then optimize performance as quickly as possible, that is, we want to solve performance problems quickly and accurately.

So, is there a method to quickly and accurately identify system bottlenecks? The answer is yes.

Although there are many CPU performance metrics, you need to understand that since they all describe the CPU performance of the system, they are not completely isolated from each other, and many of the metrics are related to each other. To understand the correlation of performance metrics, you need to be familiar with the working principles of each performance metric. That’s why when I introduce each performance metric, I also explain the relevant system principles, hoping that you can remember this point.

For example, when the user CPU utilization is high, we should investigate the user state of the process rather than the kernel state. This is because the user CPU utilization reflects the CPU usage in user state, while the CPU usage in kernel state will only be reflected in the system CPU utilization.

You see, with this basic understanding, we can narrow down the scope of investigation and save time and effort.

Therefore, in order to narrow down the scope of investigation, I usually run several tools that support multiple metrics, such as top, vmstat, and pidstat. Why these three tools? Take a careful look at the following diagram, and you will understand.

In this diagram, I have listed the important CPU metrics provided by top, vmstat, and pidstat, and used dashed lines to show how they are related; each dashed line also points to the next step in the performance analysis.

From this diagram, you can see that these three commands cover almost all of the important CPU performance metrics. For example:

  • From the output of top, you can obtain various CPU utilization rates, as well as information about zombie processes and average load.

  • From the output of vmstat, you can obtain the numbers of context switches and interrupts, as well as the numbers of processes in the running and uninterruptible states.

  • From the output of pidstat, you can obtain each process’s user and system CPU utilization, as well as its voluntary and involuntary context switches.

In addition, many of the metrics output by these three tools are interrelated, which is why I connected them with dashed lines. A few examples will make this easier to see.

In the first example, a process’s user CPU utilization in pidstat rises together with the user CPU utilization reported by top. So when top shows a user CPU problem, you can cross-check with pidstat’s output to see whether a specific process is responsible.

After identifying the processes that cause the performance issues, you need to use process analysis tools to analyze the behavior of the processes, such as using strace to analyze the system call situation, and using perf to analyze the execution of functions at various levels of the call chain.

In the second example, when the average load output by top increases, you can compare it with the number of processes in the running state and uninterruptible state output by vmstat to observe which type of process causes the increase in load.

  • If the number of uninterruptible processes increases, then I/O analysis needs to be done, which means using tools such as dstat or sar to further analyze the I/O situation.

  • If the number of running state processes increases, then you need to go back to top and pidstat to find out what these running state processes are, and then use process analysis tools to further analyze them.

In the final example, when you find that the soft interrupt CPU utilization output by top increases, you can check the changes in various types of soft interrupts in the /proc/softirqs file to determine which type of soft interrupt is causing the problem. For example, if you find that the problem is caused by network receive interrupts, you can continue to use network analysis tools such as sar and tcpdump to analyze the issue.
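Putting this first pass together, a minimal triage sequence (a sketch, not a fixed recipe) might look like this:

```bash
# 1. Overall picture: utilization breakdown, load average, zombie processes.
top

# 2. System-wide context switches, interrupts, runnable/uninterruptible counts.
vmstat 1

# 3. Per-process CPU usage (-u) and context switches (-w), to map the symptom
#    to a specific process before moving on to strace or perf.
pidstat -u -w 1
```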

Note that in this diagram, I have only listed the most essential performance tools and not all of them. This is because, on the one hand, I don’t want to overwhelm you with a long list of tools. It may not be a good thing to be exposed to all or most of the core or niche tools at the beginning of your learning journey. On the other hand, I hope you can focus on the core tools first, as mastering them can solve most problems.

So, you can save this diagram as a mind map for CPU performance analysis. Start with the most essential tools, practice in a real environment using the examples I provided, and master them.

Summary #

Today, I have reviewed the common CPU performance indicators, organized the common CPU performance analysis tools, and finally summarized the approach to quickly analyze CPU performance issues.

Although there are many performance indicators for CPUs and correspondingly many performance analysis tools, once you are familiar with the meanings of various indicators, you will find that they are actually interrelated. By following this line of thought, mastering the common analysis techniques is not difficult.

Reflection #

Due to space limitations, I have only provided a few of the most common cases to help you understand the principles and analysis methods of CPU performance issues. You must have encountered many CPU performance issues that are different from these cases. I would like to chat with you about what unique CPU performance issues you have encountered and how you analyzed their bottlenecks.

Please feel free to discuss with me in the comments section, and you are also welcome to share this article with your colleagues and friends. Let’s practice in real-world scenarios and make progress through communication.