03 In-Depth Analysis of Which Resources Are Prone to Becoming Bottlenecks #

In Lesson 02, we briefly introduced some common entry points for solving performance issues. In this lesson, I will look at the problem from the perspective of computer resources: which system components are prone to becoming performance bottlenecks, and how to determine whether a given component has reached its bottleneck.

The speeds of a computer's components are often severely mismatched. The gap between the CPU and the hard disk, for example, is even greater than that between the hare and the tortoise. According to the bucket theory we mentioned earlier, such a system is bound to have a bottleneck.

When there is a bottleneck in the system, it has a significant negative impact on performance. For example, when the CPU load is particularly high, tasks queue up and cannot be executed in time. The CPU, memory, and I/O components are the ones most prone to becoming bottlenecks, so I will explain these three aspects separately.

CPU #

First, let’s introduce the most important computing component in the computer, the central processing unit (CPU). Regarding the CPU, we can generally:

  • Use the top command to observe the performance of the CPU.
  • Evaluate the queuing of CPU tasks through load.
  • Examine the busyness of the CPU through vmstat.

The specific situations are as follows.

1. Top Command - CPU Performance #

As shown in the figure below, after entering the top command, pressing the 1 key displays the performance indicators and detailed metrics for each CPU core.

Drawing 0.png

CPU usage has multiple dimensions of indicators, which are explained below:

  • us: The percentage of CPU time spent in user mode, i.e., consumed by running application code.
  • sy: The percentage of CPU time spent in system (kernel) mode. It should be checked together with the vmstat command to see whether context switches are frequent.
  • ni: The percentage of CPU time spent on user processes whose priority has been adjusted with nice.
  • wa: The percentage of CPU time spent waiting for I/O devices. It is often used to diagnose I/O problems; a high value may indicate an obvious I/O bottleneck.
  • hi: The percentage of CPU time spent servicing hardware interrupts.
  • si: The percentage of CPU time spent servicing software interrupts.
  • st: The percentage of time the virtual machine waits for the host's CPU, i.e., the impact of the host on its guests. This value rarely changes on regular servers, but often shows up on oversold cloud servers.
  • id: The percentage of idle CPU time.

Generally, we pay more attention to the percentage of idle CPU, which can reflect the overall CPU utilization.
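
If the sysstat package is installed, the mpstat command gives the same per-core breakdown non-interactively, which is convenient in scripts (the interval and count below are examples):

[root@localhost ~]# mpstat -P ALL 1 5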

2. Load - CPU Task Queueing #

To evaluate how CPU tasks queue up, we use the load value. Besides the top command, the uptime command also shows the load; the meaning is the same, displaying the values for the past 1 minute, 5 minutes, and 15 minutes, respectively.
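
For example (the output values here are purely illustrative):

[root@localhost ~]# uptime
 16:41:09 up 10 days,  2:33,  1 user,  load average: 0.18, 0.35, 0.41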

Drawing 1.png

As shown in the above figure, taking a single-core CPU as an example, its resources can be abstracted as a one-way road. Three scenarios can occur:

  • There are only 4 cars on the road, and the cars can pass smoothly without any obstacles. The load is approximately 0.5.
  • There are 8 cars on the road, just enough to safely pass end to end. In this case, the load is approximately 1.
  • There are 12 cars on the road. In addition to the 8 cars on the road, there are 4 cars waiting outside the road, needing to queue. In this case, the load is approximately 1.5.

What does a load of 1 represent? There are quite a few misunderstandings regarding this question.

Many people assume that when the load value reaches 1, the system load has reached its limit. That is true for single-core hardware, but not for multi-core hardware, because the load value scales with the number of CPU cores. For example:

  • A single-core load reaching 1 means that the total load value is approximately 1.
  • For dual-core CPUs, if each core has a load of 1, the total load is approximately 2.
  • For quad-core CPUs, if each core has a load of 1, the total load is approximately 4.

Therefore, if the load reaches 10 on a 16-core machine, your system is still far from its load limit.
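
A practical rule of thumb, then, is to divide the load value by the number of logical CPUs, which you can query like this:

[root@localhost ~]# nproc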

3. vmstat — CPU Busy Level #

To check the CPU busy level, you can use the vmstat command. The following image shows some output information from the vmstat command.

1111.gif

A few columns worth paying attention to are:

  • b: If the system has a load issue, look at the b column (uninterruptible sleep), which counts processes blocked waiting for I/O, often due to heavy disk reads or writes.
  • si/so: The amount of memory swapped in from and out to swap space per second. Swapping has a significant impact on performance and should be monitored closely.
  • cs: The number of context switches per second. If there are too many, consider whether too many processes or threads are running.
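
A typical invocation samples at a fixed interval, for example once per second, five times in total:

[root@localhost ~]# vmstat 1 5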

The number of context switches for a specific process can be obtained from its status file under /proc, as shown in the snippet below (2788 is an example PID):

[root@localhost ~]# cat /proc/2788/status
...
voluntary_ctxt_switches: 93950
nonvoluntary_ctxt_switches: 171204
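
If the sysstat package is installed, pidstat can report the same counters as per-second rates, which is easier to watch over time (2788 is the example PID from above):

[root@localhost ~]# pidstat -w -p 2788 1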

Memory #

To understand the impact of memory on performance, you need to look at how memory is laid out from the operating system's perspective.

Image 1.png

If you write a program, say in C++, and inspect its assembly code, you will find that the memory addresses it uses are not actual physical addresses: the application operates on logical memory. Anyone who has studied computer architecture will be familiar with this concept.

Logical addresses can be mapped onto two kinds of storage: physical memory and virtual memory (swap). The total memory available to the system is the sum of the two. For example, with 4GB of physical memory and an 8GB swap partition, applications have 12GB of memory available in total.
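
You can see both parts, and how much of each is currently used, with the free command:

[root@localhost ~]# free -h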

1. top command #

Drawing 4.png

As shown in the above image, let’s take a look at several memory parameters. The top command provides several columns of data. Pay attention to the three areas highlighted by boxes, explained as follows:

  • VIRT: This refers to virtual memory, which is generally large and doesn’t require much attention.
  • RES: We usually pay attention to the value in this column, as it represents the actual memory usage of a process. When monitoring, we mainly focus on this value.
  • SHR: This refers to shared memory, such as reusable libraries (so files) and so on.

2. CPU Cache #

Because of the large speed gap between the CPU and main memory, the usual remedy is to add a cache in between. In reality, these caches often come in multiple levels, as shown in the diagram below.

Image 2.png

Many knowledge points in Java revolve around multithreading, because when a thread's execution moves between CPU cores, each with its own private cache, synchronization problems arise.

In Java, one of the most typical knowledge points related to CPU caches is the problem of false sharing for cache lines in concurrent programming.

These caches store data in units of cache lines. Even if you modify only a very small piece of data, the entire cache line containing it is invalidated and refreshed. So when multiple threads modify variables that happen to sit in the same cache line, the line is refreshed over and over, and the threads inadvertently hurt each other's performance. This is known as false sharing.

Each core of the CPU is essentially the same. Taking cpu0 as an example, you can use the following commands to view its cache line size, which is generally 64 bytes.

cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
cat /sys/devices/system/cpu/cpu0/cache/index1/coherency_line_size
cat /sys/devices/system/cpu/cpu0/cache/index2/coherency_line_size
cat /sys/devices/system/cpu/cpu0/cache/index3/coherency_line_size

Of course, you can also get the same result from /proc/cpuinfo:

# cat /proc/cpuinfo | grep cache
cache size  : 20480 KB
cache alignment  : 64
cache size  : 20480 KB
cache alignment  : 64
cache size  : 20480 KB
cache alignment  : 64
cache size  : 20480 KB
cache alignment  : 64

In JDK 8 and above, by adding the -XX:-RestrictContended flag (which lifts the default restriction on the annotation), you can use @sun.misc.Contended to pad fields and avoid the false sharing problem. The specifics will be explained in detail in Lesson 12: Parallel Optimization.
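
As a preview, here is a minimal sketch of the effect, assuming a JDK 8 JVM started with -XX:-RestrictContended (on JDK 9+ the annotation moved to jdk.internal.vm.annotation.Contended); the class names and iteration count are illustrative, and the timings will vary by machine:

// A minimal false-sharing sketch, assuming JDK 8 with -XX:-RestrictContended.
import sun.misc.Contended;

public class FalseSharingDemo {

    static class PlainPair {      // a and b will likely share one 64-byte cache line
        volatile long a;
        volatile long b;
    }

    static class PaddedPair {     // @Contended pads each field onto its own cache line
        @Contended volatile long a;
        @Contended volatile long b;
    }

    // Runs the two writers on separate threads and returns elapsed milliseconds.
    static long race(Runnable w1, Runnable w2) throws InterruptedException {
        Thread t1 = new Thread(w1), t2 = new Thread(w2);
        long start = System.nanoTime();
        t1.start(); t2.start();
        t1.join();  t2.join();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        final long N = 100_000_000L;
        PlainPair p = new PlainPair();
        PaddedPair q = new PaddedPair();

        long plain  = race(() -> { for (long i = 0; i < N; i++) p.a++; },
                           () -> { for (long i = 0; i < N; i++) p.b++; });
        long padded = race(() -> { for (long i = 0; i < N; i++) q.a++; },
                           () -> { for (long i = 0; i < N; i++) q.b++; });

        // The padded version is typically much faster on multi-core hardware.
        System.out.println("plain  : " + plain  + " ms");
        System.out.println("padded : " + padded + " ms");
    }
}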

3. HugePage #

Image 3.png

Let's revisit the image mentioned earlier. It includes a TLB component, which is fast but has limited capacity. On an ordinary PC this is not a bottleneck. But on a high-spec machine with a large amount of physical memory, a huge number of page-table mappings are generated, which lowers the CPU's address-lookup efficiency.

The traditional page size is 4KB, which is rather small in the era of large memory. The solution is to increase the page size, for example to 2MB, so that far fewer mapping entries are needed to manage the same amount of memory. This technique of enlarging pages is called HugePage.

Image 4.png

HugePage also comes with side effects, such as increased contention, but on machines with large memory, enabling it can improve performance to a certain extent.
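
You can check the huge page configuration of the current machine through /proc/meminfo, and, assuming the OS has been set up accordingly, ask HotSpot to use huge pages for the heap with the -XX:+UseLargePages flag (the heap sizes and MainClass are placeholders):

[root@localhost ~]# grep Huge /proc/meminfo
java -XX:+UseLargePages -Xms4g -Xmx4g MainClass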

4. Preloading #

In addition, the default behavior of some programs can also affect performance, such as the JVM’s -XX:+AlwaysPreTouch parameter.

By default, although the JVM is configured with parameters such as -Xms and -Xmx to specify the initial and maximum heap sizes, the memory is only allocated when it is actually used. If the AlwaysPreTouch parameter is added, however, the JVM touches and commits all of the heap memory at startup.

This means that although the startup may be slower, the runtime performance will be improved.
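
For example, a hypothetical startup line (app.jar is a placeholder):

java -Xms4g -Xmx4g -XX:+AlwaysPreTouch -jar app.jar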

I/O #

I/O devices are perhaps the slowest components in a computer. This refers not only to hard disks but to all peripheral devices. So how slow is a hard disk? Instead of delving into the implementation details of different devices, let's look directly at their write speeds (the figures are not rigorously benchmarked and are for reference only).

Drawing 8.png

As shown in the graph above, random writes and sequential writes on a regular disk differ significantly, yet even sequential writes are not within the same order of magnitude as memory.

Buffering remains the primary tool for bridging this speed gap, but in extreme cases such as a power failure, the contents of these buffers are easily lost, introducing a lot of uncertainty. Since this topic is lengthy, I will explain it in detail in Lesson 06.

1. iostat #

The most direct indicator of I/O busyness is the wa% (I/O wait) value shown by the top and vmstat commands. If your application writes a large volume of logs, the I/O wait may be very high.

Drawing 9.png

Many students have asked about convenient and useful tools for viewing disk I/O. iostat is one such tool; you can install it via the sysstat package.

Drawing 10.png

The detailed explanation of the metrics in the above image is as follows:

  • %util: We pay close attention to this value. In general, once it exceeds 80%, the I/O load is already very heavy.
  • Device: Indicates which hard disk the row describes. If you have multiple disks, multiple rows are displayed.
  • avgqu-sz: The average length of the request queue, similar to cars waiting at an intersection. Obviously, a smaller value is better.
  • await: The response time, which includes both queue time and service time. As a rule of thumb, it should generally be below 5ms; a value above 10ms means the wait is too long.
  • svctm: The average service time for I/O operations. Recall from Lesson 01 that AVG means average. svctm and await are closely related: if they are close, I/O has almost no waiting and the device is performing well; if await is much higher than svctm, the I/O queue wait is too long, and the applications running on the system will slow down.
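
The metrics above come from iostat's extended mode, typically sampled at an interval, for example once per second:

[root@localhost ~]# iostat -x 1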

2. Zero Copy #

Before data on the hard disk can be sent over the network, it must go through multiple buffer copies and multiple switches between user space and kernel space. If some of these copies can be eliminated, efficiency improves, which is why zero copy was introduced.

Zero copy is a very important performance optimization technique. Common applications like Kafka and Nginx use this technique. Let’s take a look at the difference between with and without zero copy.

(1) Without zero copy

In the traditional way, as shown in the figure below, to send the content of a file via a socket, the following steps are needed:

  • Copy the file content into a kernel-space buffer;
  • Copy the content of the kernel-space buffer into user-space memory, for example when a Java application reads a zip file;
  • User space writes the content back into a kernel-space (socket) buffer;
  • The socket reads the content from the kernel buffer and sends it out.

Figure 5.png (without zero copy)

(2) With zero copy

Zero copy has multiple modes; here we take sendfile as an example. As shown in the figure below, with kernel support, zero copy eliminates one step: the copy from the kernel cache into user space. This saves memory and CPU scheduling time, making it more efficient.

Figure 6.png (with zero copy)
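
In Java, this sendfile-style path is usually reached through FileChannel.transferTo, which is also what Kafka leans on for zero copy. Below is a minimal sketch; the file name, host, and port are placeholders:

// A minimal zero-copy sketch in Java: FileChannel.transferTo lets the kernel
// move file bytes straight to the socket (sendfile on Linux), skipping the
// copy into user space. File name, host, and port are placeholders.
import java.io.FileInputStream;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public class ZeroCopySend {
    public static void main(String[] args) throws IOException {
        try (FileChannel file = new FileInputStream("data.zip").getChannel();
             SocketChannel socket = SocketChannel.open(
                     new InetSocketAddress("127.0.0.1", 9000))) {
            long position = 0;
            long remaining = file.size();
            // transferTo may send fewer bytes than requested, so loop until done.
            while (remaining > 0) {
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}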

Summary #

In this lesson, we learned about the three components that have the greatest impact on performance in a computer: CPU, memory, and I/O. We also delved into some commands for observing their performance, which can help us broadly guess where performance issues may occur.

However, these approaches can only assist in diagnosing performance issues; they cannot pinpoint the true bottleneck by themselves. To do that, we need to investigate more deeply and collect more information.

Finally, here’s a question for you to ponder: If disk speed is so slow, why can Kafka achieve such high throughput when it operates on disk? You can discuss it in the comments section, and in the next lesson, I will explain.

If you have any other questions or doubts in your actual work, feel free to leave a comment, and I will answer them one by one.

In the upcoming Lesson 04, I will introduce a series of more in-depth tools that will help you obtain performance data and get even closer to the “source” of the problem.