04 How to build a performance analysis decision tree and find the bottleneck evidence chain #

Hello, I’m Gao Lou.

In the previous lesson, I introduced a complete and structured performance analysis process called the RESAR Performance Analysis Seven-Step Method. It can be applied to any performance analysis case. In this analysis process, there are two key techniques and concepts: performance analysis decision trees and performance bottleneck evidence chains. These are also the two important concepts we mentioned in Lesson 02, which are essential throughout the entire performance engineering process.

In today’s lesson, let’s learn how to build performance analysis decision trees step by step and find performance bottleneck evidence chains.

How to Build a Performance Analysis Decision Tree? #

A performance analysis decision tree is used in both performance monitoring design and performance bottleneck analysis, and the decision-tree mindset is indispensable when analyzing bottlenecks. So this is a step I must walk you through. In the analysis cases of later lessons, we will also use the term “performance analysis decision tree” extensively.

First of all, what is a performance analysis decision tree?

A performance analysis decision tree is a complete structured tree diagram that includes all technical components in the system architecture, all modules in the components, and the corresponding counters of the modules.

In this definition, there are three important levels: components, modules, and counters:

In the following courses, I will also frequently use these three key words.

However, although this definition of “performance analysis decision tree” is reasonable, it still feels abstract, like a philosophical statement. IT work is not philosophy, so we need to break it down further.

Building a performance analysis decision tree is a critical step in understanding a system, and overall, it can be divided into 4 steps.

Step 1: List all the components in the system architecture based on the system’s architecture.

In the system we are building in this course, the components of the overall architecture are as follows:

Based on the above figure, we can list all the components of the system as follows:

Step 2: Dive deep into each important module in the components.

Since there are many components in our system, let’s take one important component, the operating system, as an example. The operating system is central to performance analysis because almost every problem is reflected in its counters. For the other components, you can work them out by following the same process I describe here.

Based on the characteristics of the operating system, let’s start by drawing its important modules:

I’ve drawn six modules in the figure, one of which is Swap. Swap exists so that the system can use the disk as a swap area when no memory is available. When Swap is actually in use, it usually indicates a performance problem, so I generally do not recommend relying on Swap in performance projects; we should fix the performance problem rather than fall back on Swap. In a production environment, however, if there is no other choice, we may still have to enable it.

As for the other modules in the figure, they are basically what we must look at in performance analysis.

Step 3: List the counters corresponding to the modules.

When listing the counters, we need to include all of each module’s important counters, and we must not leave out the key first-level ones. If we miss them, some test data may have to be collected all over again.

Now, let’s look at one of the important modules: the CPU. We can see several of its important counters through the top command:

[root@k8s-worker-8 ~]# top
top - 00:38:51 up 28 days,  4:27,  3 users,  load average: 78.07, 62.23, 39.14
Tasks: 275 total,  17 running, 257 sleeping,   1 stopped,   0 zombie
%Cpu0  :  4.2 us, 95.4 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.4 st
%Cpu1  :  1.8 us, 98.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  2.1 us, 97.9 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  1.0 us, 99.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st 

We can see nine counters in the top output: us/sy/ni/id/wa/hi/si/st and load average. The first eight are unquestionably CPU counters. But what about the last one, load average?

Descriptions of load average found through search engines are often vague. Some say that a load average higher than the number of CPUs indicates a high system load; others say that load average is not directly related to the CPU at all. Opinions differ.

Load average is an important CPU performance counter; if we cannot draw a clear conclusion from it when we use it, that is a big problem. So I am going to describe it in detail.

The load average is the average number of runnable and uninterruptible processes in the last 1m/5m/15m.

This statement is accurate, but it is also quite general and not very concrete.

The “runnable” state is relatively easy to understand. From the top output above, we can see the number of tasks in the running state. However, “runnable” is not just that number; it also includes tasks that have everything they need except a CPU to run on. In other words, there is no direct equivalence between the running count in Tasks and the load average value.

Similarly, we can see the number of runnable tasks in vmstat. Under the procs section of vmstat there are two columns, r and b. The r column counts processes that are running or waiting to run, and the man page describes it as follows:

r: The number of runnable processes (running or waiting for run time).

As for the uninterruptible state, the most common case is waiting for I/O, although it also includes things such as memory swapping. Waiting for I/O is reflected in the b column of the procs section in vmstat, which the man page explains as follows:

b: The number of processes in uninterruptible sleep.

So we can see that the load average essentially corresponds to the sum of the r and b columns in the procs section of vmstat, averaged over time.
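To see this relationship for yourself, a minimal check is to watch vmstat and uptime side by side; the interval and count below are only illustrative:

vmstat 2 5      # watch the r (runnable) and b (uninterruptible sleep) columns
uptime          # shows the 1m/5m/15m load average for comparison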

In fact, the CPU has not only us/sy/ni/id/wa/hi/si/st and load average, but also two more counters hidden in mpstat: %guest and %gnice.

[root@k8s-worker-8 ~]# mpstat -P ALL 2
Linux 3.10.0-1127.el7.x86_64 (k8s-worker-8) 	February 15, 2021 	_x86_64_	(4 CPU)

14:00:36  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
14:00:38  all    5.13    0.00    3.21    0.00    0.00    0.26    0.00    0.00    0.00   91.40
14:00:38    0    4.62    0.00    2.56    0.00    0.00    0.00    0.00    0.00    0.00   92.82
14:00:38    1    4.57    0.00    3.05    0.00    0.00    0.00    0.00    0.00    0.00   92.39
14:00:38    2    5.70    0.00    3.63    0.00    0.00    0.00    0.00    0.00    0.00   90.67
14:00:38    3    5.70    0.00    4.66    0.00    0.00    0.00    0.00    0.00    0.00   89.64

From the descriptions below, we can see that it is meaningful to look at %guest and %gnice on the host machine, because they indicate the proportion of CPU consumed by guest virtual machines. If you run multiple virtual machines on a host, you can use these two values to see whether the guests are consuming too much CPU, and then find out which specific virtual machine is the heavy consumer by examining its processes.

%guest Show the percentage of time spent by the CPU or CPUs to run a virtual processor.
%gnice Show the percentage of time spent by the CPU or CPUs to run a niced guest.
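If %guest or %gnice does turn out to be high on a host, a rough way to locate the guilty guest is to look at the hypervisor processes. The sketch below assumes a KVM host whose guests run as qemu processes; adjust the process name for your environment:

pidstat 2 5                                  # per-process CPU usage, sampled every 2 seconds
top -b -n 1 -p "$(pgrep -d, -f qemu)"        # CPU usage of the qemu processes backing the guests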

So, in the Linux operating system, if it is a host machine, we need to look at 11 counters. If it is a virtual machine, we only need to look at 9 counters (the first 9 in the figure):

At this point, we have listed all the counters related to the CPU. As mentioned earlier, we need to list all the corresponding counters for each module in the Linux operating system. Therefore, we also need to find the counters for other modules in a complete manner.

When we have listed all the key primary counters of the Linux operating system, we will have the following chart:

Please note that some of these counters are more critical than others; based on my experience, I have highlighted the important ones in red. Of course, if you understand the operating system well, you can build your own chart from a different perspective and with a different structure. Listing counters is largely a matter of diligence: as long as you are willing to put in the effort, you can list them all.

Step 4: Determine the correlation between the counters.

From the above chart, we can see that although we have listed many counters, the relationships between them are still unclear. During analysis, we determine the direction of a problem from the relevant counters, and sometimes a single counter is not enough to make a judgment, so several counters have to be considered together. That is why it is important to draw the relationships between these counters.

Based on my understanding, I have drawn the relationship between the counters in the Linux operating system as follows:

Drawing too many lines would look messy, so I only drew a few of the most important correlations.

So far, we have completed the drawing of the performance analysis decision tree for the Linux operating system, and all the counters are covered. However, the work is not yet finished because we still need to find suitable monitoring tools to collect real-time data from these counters.

Collecting Real-Time Data from Counters #

Please note that when collecting real-time data from counters, it may not be possible for a single monitoring tool to cover all counters. Therefore, when analyzing, we must be clear about the limitations of the monitoring tool. If a tool cannot monitor all counters, then we must use multiple tools to complement each other. For example, for monitoring the Linux operating system, the most commonly used monitoring tools are Prometheus, Grafana, and Node Exporter, like this:

This is the template we frequently use to monitor the Linux operating system. But does it cover all the data? By comparing, we find that although this template covers most of the performance counters in the Linux operating system, it is not exhaustive; for example, it does not cover network queues or soft and hard memory faults.

Therefore, before using a monitoring tool, we must compare the counters in the performance analysis decision tree with the counters in the monitoring tool. If something is missing, we need to supplement it with other monitoring tools or commands during the analysis.
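For example, the two gaps mentioned above can be filled with ordinary commands. The sketch below is one possible supplement, not a fixed prescription:

cat /proc/net/softnet_stat    # per-CPU network backlog processing and drop counts
ss -lnt                       # Recv-Q/Send-Q of listening sockets
sar -B 2 5                    # paging statistics, including fault/s and majflt/s (soft/hard page faults)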

Please remember: which monitoring tool you choose is not what matters; what matters is whether all the counters are covered. Even without any monitoring platform, we can still watch every counter with plain commands. So I hope you don’t rely too heavily on tools.

At this point, the performance analysis decision tree is still not complete, because there are other technical components in this system’s architecture. Our task is to draw the corresponding performance analysis decision tree branches for those components following the four steps above, ultimately forming one complete, comprehensive diagram.

If we were to draw all the technical components and modules in our system, we would see the following diagram:

I have not expanded the entire diagram, because it would be too large and the leaf-level counters would be unreadable. Rest assured that in the analysis cases of the following lessons, I will show you how to apply this performance analysis decision tree to specific problems.

With that being said, I have fully described the performance analysis decision tree and listed all the steps for you. As the saying goes, “Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime.” Therefore, I hope that after seeing these steps, you can draw a complete performance analysis decision tree for your own project.

I also want to emphasize that when you first organize your performance analysis decision tree, you do not need to list counters at every level, because you may never use all of them over the course of a project, and listing them all is time-consuming. Only after you understand the first-level counters and have determined the direction of a problem do you need to drill into deeper-level counters.

So it is important to understand the meaning of each counter. If we don’t understand what a counter means or how to use it, we won’t know how to analyze with it.

Now that we have the performance analysis decision tree, how do we apply it? Next, I will explain the performance bottleneck evidence chain.

How to find evidence of performance bottlenecks? #

Every time I do training or performance analysis, I always emphasize the importance of “evidence chain” in performance analysis.

Without an evidence chain, performance analysis becomes a series of leaps: in plain terms, guessing based on experience or second-hand information. That kind of leap-based analysis is very error-prone. So when analyzing performance bottlenecks, we must move forward step by step with valid, reasonable evidence.

So what does that look like in practice? Let me explain through an example.

Global Monitoring Analysis #

Before we get into this example, I need to emphasize one point: in performance analysis, monitoring analysis can be divided into two parts, global monitoring analysis and targeted monitoring analysis. Global monitoring analysis looks at the summary counters across the whole architecture; only through global counters can we see the first symptoms of a performance problem.

For example, suppose we want to find which line of code is consuming CPU. At first we don’t know which line it is, but the consumption shows up in the CPU counters as high usage. Global monitoring is used to check whether CPU consumption is high; once we see that it is, we use targeted monitoring analysis to find out exactly which line of code is responsible. That is the role of targeted monitoring analysis.

Therefore, in the process of performance analysis, I usually divide it into two stages: global monitoring analysis and targeted monitoring analysis. Global monitoring analysis can be done using monitoring platforms or commands.
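As a concrete illustration, a command-based global first pass over the operating-system branch of the decision tree might look like the following sketch; the intervals and counts are only illustrative:

vmstat 2 5        # CPU summary, memory, swap, and the procs r/b columns
iostat -x 2 5     # per-device disk utilization and latency
sar -n DEV 2 5    # per-interface network throughput
free -m           # memory and swap totals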

In a project I worked on before, there was a host with 24 CPUs, and I observed the following data during the execution of a scenario:

This is exactly what we discussed earlier: concrete CPU monitoring within the performance analysis decision tree. Our logical next step is to select an appropriate monitoring method for the bottleneck being analyzed, cover the counters that the decision tree says need to be watched, and then refine the analysis from there.
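For reference, per-CPU data like this can be captured with standard commands; here is a minimal sketch, with an illustrative interval:

mpstat -P ALL 2    # per-CPU breakdown of %usr/%sys/%soft/%idle and so on
top                # then press 1 to expand the per-CPU rows, including %si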

In the image above, we can see that %us is not very high on any CPU and %id is not low, so there is still spare capacity. However, the %si (soft interrupt) of CPU 22 alone is as high as 21.4%. Is this soft interrupt usage reasonable or not?

Some might say that it is not too high, so what’s the problem? If we only look at the average value of %si, we may indeed not notice the existence of the problem. But if we carefully examine the more detailed data in the graph, we will come to a different conclusion. This is also why we listed the utilization rate of each CPU.

When an application runs, if the application code consumes a lot of CPU, %us should rise. But what we see above is that %us is not high. In addition, under normal circumstances every CPU should be doing work, so utilization should be roughly balanced across the CPUs.

Yet in this graph, only CPU 22 shows a high %si of 21.4%; the soft interrupts are using just one of the 24 CPUs. This soft interrupt behavior is clearly not reasonable.

Targeted Monitoring Analysis #

Since the soft interrupt behavior is not reasonable, we naturally want to know where these soft interrupts are going and why they land on only one CPU. So we need to look at the soft interrupt data.

In Linux there are several ways to view soft interrupts, and one important source is the /proc/softirqs file, which records soft interrupt counts. In this example, with 24 CPUs the output is quite wide, so I filtered out some of the CPU columns and kept only the data shown in the graph for analysis.

In this image, I have marked CPU 22 and its corresponding module name NET_RX. You may wonder how I found this module so quickly. This is because CPU 22 has the highest usage rate, so it naturally generates many more interrupts than other CPUs.

The counters in /proc/softirqs are cumulative values. For most soft interrupt types, the accumulated counts do not differ much in proportion from CPU to CPU; only the NET_RX row shows a large difference across CPUs. Therefore, we can make the following judgment: CPU 22’s high usage comes from NET_RX.
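A minimal sketch of this targeted check; the watch usage and the grep pattern are only illustrative:

watch -d cat /proc/softirqs          # highlight which softirq counts keep growing, and on which CPUs
grep -E "CPU|NET_RX" /proc/softirqs  # show only the header row and the network-receive row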

By looking at the name “NET_RX,” we can infer that this module is responsible for network data reception. So, what causes the network interrupts to be confined to only one CPU during the process of receiving network data?

We know that network reception relies on queues to buffer data, so we need to check how many network queues there are. We can find this information under the path /sys/class/net/<NIC name>/queues/:

It is easy to see that there is indeed only one RX queue, which means that all network receive data has to go through this single queue. In this case, it is not possible to utilize multiple CPUs; only one CPU can handle the interrupts.
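A sketch of that check, assuming the NIC is named eth0 (substitute your actual interface name):

ls /sys/class/net/eth0/queues/    # a single rx-0 directory means only one receive queue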

At this point, I want to clarify further: since the focus of this lesson is on explaining the logic of analysis, I will not go into a more detailed analysis here. If you want a deeper understanding of how network interrupts work, you can look into the Linux kernel source and examine the net_rx_action function. Ultimately, all soft interrupts are dispatched through the kernel’s __do_softirq function, so when you use the perf top -g command to check CPU hotspots, you can also observe the logic I have described above.
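For reference, that hotspot check is just the following command, run as root on the machine under load:

perf top -g    # sample CPU hotspots with call graphs; heavy softirq handling shows up under __do_softirq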

Now that we know there is only one network receive queue, the solution naturally suggests itself: increase the number of queues so that more CPUs can handle the interrupts. Since these network interrupts move data from the network card toward the TCP layer, more queues means the data can be passed up faster.

Therefore, our solution is to increase the number of queues:

In this example, I increased the number of queues to 8, allowing 8 CPUs to be used for receiving network data. If you want to use more CPUs, you can configure up to 24 queues, since we have 24 CPUs in this case. If you are using a virtual machine, you can make this change by adding a queues parameter to the KVM XML configuration. If you are using a physical machine, you would need to replace the network card with one that supports multiple queues.
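What such a change might look like, sketched under the assumption of a virtio NIC named eth0 inside a KVM guest; your interface name, driver, and queue count will differ:

ethtool -l eth0               # show how many combined channels the NIC supports
ethtool -L eth0 combined 8    # enable 8 combined channels, i.e. 8 queue pairs
# On the KVM host, the guest's interface definition also needs a queues attribute,
# for example <driver name='vhost' queues='8'/> in the libvirt domain XML.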

Now that we have clarified the entire analysis logic, let’s summarize the corresponding evidence chain based on this logic:

According to the logic shown in the image, when we see a high %si (soft interrupt), we go to check the interrupt data. There are actually two files that reflect interrupts: /proc/interrupts and /proc/softirqs. The former mainly records hardware interrupts, while the latter records soft interrupts, so when the symptom is a high %si we can go straight to /proc/softirqs.

Next, we need to find the corresponding module and then understand the implementation principle of this module. Finally, we provide the corresponding solution.

In this way, the complete evidence chain for this problem is obtained.

Through this example, you can see that the evidence chain of a performance bottleneck is actually a record of the complete analysis logic in the specific application process.

Please note that we do not start by looking directly at /proc/softirqs, because that data is too targeted and specific. We must look at the global data first and then narrow down step by step; that is the rational approach. Clearly, the data obtained from global monitoring analysis and from targeted monitoring analysis are different, because they represent different perspectives. You must pay extra attention to this point.

Summary #

In this lesson and the previous one, I have summarized the entire analysis logic into the RESAR Performance Analysis Seven-Step Method, which is the core logic of the performance analysis methodology. Of course, two lessons cannot cover every aspect; performance analysis involves far too many details for that.

Therefore, I have described the most important parts to you. When you apply them in practice, you can follow this thinking process to implement your performance analysis decision tree and performance bottleneck evidence chain. The prerequisite is that you need to understand your system, understand your architecture, and understand the analysis logic I have explained to you.

In this lesson, I have explained two important topics: how to build a performance analysis decision tree and how to find a performance bottleneck evidence chain. These are processes we must go through when analyzing every performance issue. Only by putting the decision tree and the evidence chain into practice can we be invincible in performance analysis.

Homework #

Finally, please take a moment to think about:

  1. How to build your own performance analysis decision tree?
  2. Give an example of a performance analysis case you have done that has a complete chain of evidence.

Feel free to discuss and exchange ideas with me in the comments section. Of course, you are also welcome to share this lesson with the friends around you; their ideas may bring you even greater gains. See you in the next lesson!
