
55 How-tos: General Steps in Analyzing Performance Issues #

Hello, I’m Ni Pengfei.

In the previous episode, we learned about the basic approach to application monitoring. Let’s briefly review.

Application monitoring can be divided into two main categories: metric monitoring and log monitoring.

  • Metric monitoring measures performance metrics over time windows, and then processes, stores, and alerts on them through time-series analysis.

  • Log monitoring, on the other hand, provides richer contextual information; it is usually implemented with the ELK stack, which handles log collection, indexing, and visualization.

In complex business scenarios involving many different applications, you can also build a full-link tracing system. This lets you dynamically trace the performance of every component along the call chain and generate a call topology graph of the entire application, which speeds up locating performance issues in complex applications.

However, when you receive an alert from the monitoring system and find a performance bottleneck in system resources or in the application itself, how do you analyze its root cause further? Today, I will walk you through the general steps of performance analysis from two perspectives: system resource bottlenecks and application bottlenecks.

System Resource Bottlenecks #

First, let’s take a look at system resource bottlenecks, which are also the most common performance issues.

In the article on the comprehensive approach to system monitoring, I introduced the USE method, which measures resource bottlenecks in terms of utilization, saturation, and errors. System resources can be divided into hardware resources and software resources.

  • The most common hardware resources include CPU, memory, disk and file systems, and network.

  • The typical software resources include file descriptors, connection tracking, socket buffer size, etc.

This way, when you receive an alert from the monitoring system, you can check it against these resource lists and use the corresponding metrics to pinpoint which resource is the bottleneck.

In fact, the core of the first four modules of this column is learning to analyze the performance issues caused by these resource bottlenecks. Therefore, when you run into a system resource bottleneck, all of the ideas, methods, and tools from those modules apply directly.

Next, I will review the analysis steps for these four aspects: CPU performance, memory performance, disk and file system I/O performance, and network performance.

CPU Performance Analysis #

The first and most common system resource is the CPU. Regarding CPU performance analysis methods, I have already compiled an approach to quickly analyze CPU performance bottlenecks for you in the article How to Quickly Analyze CPU Bottlenecks in a System.

Do you remember this figure? By obtaining CPU performance metrics with common tools such as top, vmstat, pidstat, strace, and perf, and then combining them with an understanding of how processes and CPUs work, you can quickly locate the source of a CPU performance bottleneck.

In fact, the CPU performance metrics reported by tools such as top, pidstat, and vmstat are all sourced from the /proc file system (for example /proc/loadavg, /proc/stat, and /proc/softirqs). All of these metrics should be collected by the monitoring system; although not every one of them needs to trigger alerts, having them on hand accelerates locating and analyzing performance issues.
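
As a quick illustration, you can inspect the raw sources behind those dashboard numbers directly on the host. A minimal sketch (the commands only read kernel-exported counters; nothing here is specific to any particular monitoring product):

```bash
# Load averages over 1, 5, and 15 minutes, plus runnable/total task counts
cat /proc/loadavg

# Cumulative CPU time split into user, nice, system, idle, iowait, irq, softirq, ...
grep '^cpu ' /proc/stat

# Watch soft-interrupt counts change; a fast-growing NET_RX/NET_TX row hints at softirq pressure
watch -d cat /proc/softirqs
```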

For example, when you receive an alert indicating high user CPU usage on the system, you can directly query the monitoring system to identify the process causing the high CPU usage. Then, log in to the Linux server where the process is located and analyze its behavior.

You can use strace to view a summary of the process’s system calls, or use tools like perf to find its hot functions. You can even use dynamic tracing to observe the process’s live execution flow, until the root cause of the bottleneck is found.
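
For instance, once the monitoring system has pointed at a suspect process, the on-host drill-down might look like the sketch below (PID 12345 is a placeholder, and the perf options are just common defaults):

```bash
# Per-process CPU usage (%usr, %system), sampled every second, five times
pidstat -u 1 5

# Summarize the process's system calls: counts, errors, and time per call (press Ctrl-C to stop)
strace -c -p 12345

# Live view of the hottest functions in the process
perf top -p 12345

# Or record call stacks for 30 seconds and browse them afterwards
perf record -g -p 12345 -- sleep 30
perf report
```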

Memory Performance Analysis #

After discussing CPU performance analysis, let’s take a look at the second type of system resource, namely memory. Regarding memory performance analysis methods, I have also compiled a fast analysis approach for you in the article How to Quickly Pinpoint Memory Problems in a System.

The following figure shows a workflow for quickly locating memory bottlenecks. First, use the metrics reported by tools such as free and vmstat to confirm that memory is indeed the bottleneck. Then, depending on the type of memory issue, dig further into memory usage, allocation, leaks, caching, and so on, until you identify the source of the problem.

Just like CPU performance, many performance metrics of memory also come from the /proc file system (such as /proc/meminfo, /proc/slabinfo, etc.), and they should also be monitored by the monitoring system. This way, when you receive a memory alert, you can directly obtain the performance metrics mentioned in the figure above from the monitoring system, thereby expediting the process of locating performance issues.
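
For example, the system-wide numbers behind a memory alert can be pulled straight from these sources; a minimal sketch:

```bash
# Overall memory and swap usage in human-readable units
free -h

# Free memory, buffers, cache, and swap activity, sampled every second, three times
vmstat 1 3

# The raw counters that free and vmstat read from
head -n 20 /proc/meminfo

# Kernel slab cache usage (reading /proc/slabinfo usually requires root)
sudo cat /proc/slabinfo | head -n 15
```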

For example, when you receive an alert for insufficient memory, you can first identify the processes that consume the most memory from the monitoring system. Then, based on the memory usage history of these processes, observe if there are any memory leaks. After determining the most suspicious process, log in to the Linux server where the process is located to analyze the memory space or memory allocation of the process, and finally understand why the process is consuming a large amount of memory.
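
Concretely, the per-process drill-down could look something like this sketch (PID 12345 is a placeholder for the suspect process):

```bash
# Rank processes by resident memory to find the biggest consumers
ps aux --sort=-%mem | head

# Virtual size, resident set, and swap usage of the suspect process
grep -E 'VmSize|VmRSS|VmSwap' /proc/12345/status

# Per-mapping breakdown of its address space (heap, stacks, shared libraries)
pmap -x 12345 | tail -n 20

# Watch whether its resident memory keeps growing, one sample per second, ten times
pidstat -r -p 12345 1 10
```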

Disk and File System I/O Performance Analysis #

Next, let’s look at the third type of system resource, disk and file system I/O. I have also provided a quick analysis approach for disk and file system I/O performance in my article How to Quickly Analyze System I/O Bottlenecks.

Take a look at the following figure. When iostat shows a disk I/O performance bottleneck (such as high I/O utilization, long response times, or a sudden jump in the request queue length), you can use pidstat, vmstat, and similar tools to confirm where the I/O is coming from. Then, depending on the source, further analyze file system and disk utilization, caching, and per-process I/O to pinpoint the root cause of the I/O issue.
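
As a sketch of that first step (the 1-second interval and sample counts are arbitrary):

```bash
# Extended per-device statistics: %util, await (response time), and queue length
iostat -dx 1 3

# Which processes are actually issuing reads and writes (kB/s per process)
pidstat -d 1 5

# System-wide view: the b column (tasks blocked on I/O) and bi/bo (blocks in/out)
vmstat 1 5
```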

Similar to CPU and memory performance, many disk and file system metrics come from the /proc and /sys file systems (such as /proc/diskstats and /sys/block/sda/stat), and naturally they should also be collected by the monitoring system. This way, when you receive an I/O performance alert, you can pull the metrics shown in the figure above directly from the monitoring system, speeding up performance analysis.

For example, when you find that the I/O utilization of a disk is 100%, you can first identify the process with the most I/O from the monitoring system. Then, log in to the Linux server where the process is located and use tools such as strace, lsof, perf, etc., to analyze the I/O behavior of the process. Finally, based on the principles of the application program, identify the reasons for the heavy I/O.
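
For example, once the heaviest I/O process is known (PID 12345 below is a placeholder), the on-host analysis might look like this sketch:

```bash
# Which files, devices, and sockets does the process have open?
lsof -p 12345

# Trace file-related system calls across threads (-f), with per-call latency (-T)
strace -f -T -e trace=read,write,openat,fsync -p 12345

# Sample call stacks around block I/O request events for 10 seconds
perf record -e block:block_rq_issue -ag -- sleep 10
perf report
```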

Network Performance Analysis #

The last type, network performance, actually involves two kinds of resources: network interfaces and kernel resources. In my article Several Thoughts on Network Performance Optimization, I mentioned that network performance analysis should start from the Linux network protocol stack. The figure below shows its basic structure, covering the application layer, the socket interface, the transport layer, the network layer, and the link layer.

To analyze network performance, it is natural to work through these protocol layers, observing performance metrics such as utilization, saturation, and error counts. For example (a command-line sketch follows this list):

  • At the link layer, you can analyze the network interface’s throughput, packet loss, errors, software interrupts, and network function offloads.

  • At the network layer, you can analyze routing, fragmentation, overlay networks, etc.

  • At the transport layer, you can analyze TCP and UDP protocols, and examine metrics such as connection count, throughput, latency, retransmissions, etc.

  • At the application layer, you can analyze application layer protocols (such as HTTP and DNS), request rate (QPS), socket buffer, etc.
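
A minimal command-line sketch of checking each layer (eth0, port numbers, and the target URL are placeholders):

```bash
# Link layer: per-interface throughput, packet rates, errors, and drops
sar -n DEV 1 3
ip -s link show eth0

# Network layer: routing table and IP-level counters (fragmentation, reassembly failures)
ip route show
netstat -s | grep -A 8 '^Ip:'

# Transport layer: socket summary plus TCP retransmission and listen-queue counters
ss -s
netstat -s | grep -iE 'retrans|listen'

# Application layer: end-to-end timing of a single HTTP request
curl -o /dev/null -s -w 'dns:%{time_namelookup}s connect:%{time_connect}s total:%{time_total}s\n' http://example.com/
```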

Similar to the previous types of resources, network performance metrics also come from the kernel, including the /proc file system (such as /proc/net), the network interfaces themselves, and kernel modules such as conntrack. These metrics also need to be collected by the monitoring system. This way, when you receive a network alert, you can query the per-layer performance metrics from the monitoring system and locate performance issues faster.
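
For instance, the kernel-side counters mentioned here can be read directly; a short sketch (the conntrack paths assume the nf_conntrack module is loaded):

```bash
# Protocol counters (IP, ICMP, TCP, UDP) that back most network dashboards
cat /proc/net/snmp
cat /proc/net/netstat

# Connection-tracking usage versus its limit; hitting the limit drops new connections
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# Per-interface byte, packet, error, and drop counters
cat /proc/net/dev
```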

For example, when you receive an alert that the network is unreachable, you can check the packet-loss metrics of each protocol layer in the monitoring system to confirm where the packets are being dropped. Then, still using the monitoring data, confirm whether network bandwidth, buffers, the connection-tracking count, or other hardware and software components have hit a bottleneck. Finally, log in to the affected Linux server and use tools such as netstat, tcpdump, and the bcc toolkit to analyze the traffic being sent and received; combine that with the kernel’s network configuration options and the workings of TCP and other protocols to identify the source of the problem.
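
As a concrete sketch of that last step (eth0 and port 80 are placeholders, and the bcc tool paths vary by distribution and assume the bcc tools are installed):

```bash
# Look for drops, overflows, and failures reported by each protocol layer
netstat -s | grep -iE 'drop|overflow|fail'
nstat -az | grep -iE 'drop|retrans'

# Capture a small traffic sample on the suspect port for offline analysis
sudo tcpdump -i eth0 -nn -c 100 -w sample.pcap 'tcp port 80'

# bcc tools: where the kernel drops packets, and which connections retransmit
sudo /usr/share/bcc/tools/tcpdrop
sudo /usr/share/bcc/tools/tcpretrans
```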

Application Bottlenecks #

In addition to the bottlenecks above that come from system resources, many bottlenecks come directly from the application itself. The most typical application-level symptoms are decreased throughput (concurrent request count), increased error rates, and increased response times.

However, although these symptoms vary, in my opinion they can be grouped into three categories by root cause: resource bottlenecks, dependency service bottlenecks, and bottlenecks within the application itself.

The first type, resource bottlenecks, occurs when hardware and software resources such as CPU, memory, disk and file system I/O, network, or kernel resources are exhausted, constraining how the application can run. For this situation, you can use the methods described above in the section on system resource bottlenecks.

The second type, dependency service bottlenecks, refers to performance issues in services that the application calls directly or indirectly, such as databases, distributed caches, and middleware, which show up as slower response times or higher error rates in the application. This kind of bottleneck is a cross-application performance issue, and a full-link tracing system can quickly pinpoint its root cause.

The last type is performance issues within the application itself, including improper multithreading, deadlocks, and overly complex business algorithms, among others. For this kind of problem, you can use the application’s own metric monitoring and log monitoring, discussed earlier, to observe the execution time of key internal steps and any errors raised during execution, and thereby narrow down the scope of the problem.

However, since these are internal states of the application, detailed performance data is usually not directly visible from the outside. The application therefore needs to expose such metrics during design and development, so that the monitoring system can observe its internal operational state.

If these approaches still do not reveal the bottleneck, you can fall back on the process analysis tools covered in the system resource sections to locate it. For example (a sketch follows this list):

  • You can use strace to observe system calls.

  • Use perf and flame graphs to analyze hot functions.

  • You can even use dynamic tracing techniques to analyze the execution status of processes.
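
As a sketch of the hot-function and dynamic-tracing steps, assuming Brendan Gregg’s FlameGraph scripts have been cloned locally and bpftrace is installed (PID 12345 and the clone path are placeholders):

```bash
# Sample the application's on-CPU call stacks at 99 Hz for 30 seconds
perf record -F 99 -g -p 12345 -- sleep 30

# Fold the stacks and render an interactive SVG flame graph
git clone https://github.com/brendangregg/FlameGraph   # one-time setup
perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > app-flame.svg

# Dynamic tracing with bpftrace: aggregate user-space stacks of the process at 99 Hz
sudo bpftrace -e 'profile:hz:99 /pid == 12345/ { @[ustack] = count(); }'
```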

Of course, system resources and the application itself are inherently interrelated. In practice, many resource bottlenecks are caused by the application’s own behavior: a memory leak in a process can exhaust system memory, and a process that issues too many I/O requests can slow down I/O for the entire system.

Therefore, in many cases, resource bottlenecks and application bottlenecks are actually caused by the same problem, and there is no need for us to analyze them separately.

Conclusion #

Today, I have outlined the general steps for analyzing performance issues from the perspectives of system resource bottlenecks and application bottlenecks.

From the perspective of system resource bottlenecks, the USE method is the most effective approach: analyze the utilization, saturation, and error count of hardware and software resources such as CPU, memory, disk and file system I/O, network, and kernel resource limits. I have also reviewed the analysis methods for these resources in the earlier parts of this column.

From the perspective of application bottlenecks, we can categorize the sources of performance issues into three types: resource bottlenecks, dependency service bottlenecks, and application bottlenecks.

  • Resource bottlenecks are essentially the same as system resource bottlenecks.

  • For dependency service bottlenecks, you can use a full-link tracing system for localization.

  • As for issues within the application itself, you can analyze and locate them through system calls, hot functions, or the application’s own metric monitoring and log monitoring.

It is worth noting that although I have divided bottlenecks into system and application perspectives, in practice the two are complementary and influence each other. The system is the environment the application runs in: system bottlenecks degrade the application’s performance, while poorly designed applications can in turn exhaust system resources. The goal of performance analysis is to combine an understanding of both the application and the operating system to identify the true cause of a problem.

Reflection #

Finally, I’d like to invite you to discuss how you usually analyze and troubleshoot performance issues. Do you have any memorable experiences to share? Feel free to summarize your own approach building on the steps above.

Feel free to discuss with me in the comments section, and also feel free to share this article with your colleagues and friends. Let’s practice in real-world scenarios and make progress through communication.