12 How-Tos: Several Thoughts on CPU Performance Optimization #

Hello, I’m Ni Pengfei.

In the previous section, we reviewed the common CPU performance metrics, surveyed the core CPU observability tools, and summarized an approach for quickly analyzing CPU performance issues. Although there are many CPU performance metrics and corresponding analysis tools, once you understand what these metrics mean, you will find that they are actually interconnected.

By following these relationships, you will discover that mastering these commonly used bottleneck analysis strategies is not difficult.

After identifying the performance bottleneck of the CPU, the next step is optimization, which means finding ways to fully utilize the CPU in order to accomplish more work.

Today, I will talk about the strategies and considerations for optimizing CPU performance issues.

Performance Optimization Methodology #

After going through various performance analysis methods, we finally identified the bottleneck causing the performance issue. But should we immediately start optimizing? Before we get to work, let’s consider these three questions.

  • Firstly, how do we determine if the optimization is effective? And how much performance improvement can we expect after optimization?

  • Secondly, performance issues are often not isolated. If multiple performance issues are happening simultaneously, which one should we prioritize for optimization?

  • Thirdly, there are not always unique methods to improve performance. When there are multiple methods to choose from, which one should we use? Is it always the one that maximizes performance improvement?

If you can easily answer these three questions, then you can start optimizing without further ado.

For example, in the case of the uninterruptible process mentioned earlier, through performance analysis, we found that the high iowait of up to 90% was caused by direct I/O. Can we immediately optimize by “replacing direct I/O with cached I/O”?

According to what was mentioned above, you can first ponder these three points. If you are not sure, let’s figure it out together.

  • For the first question, replacing direct I/O with cached I/O can reduce the iowait from 90% to close to 0, which results in a significant performance improvement.

  • For the second question, we did not find any other performance issues. Direct I/O is the only performance bottleneck, so there is no need to choose which one to optimize.

  • For the third question, cached I/O is the simplest optimization method we have currently, and this optimization does not affect the functionality of the application.

Alright, these three questions are easily answered, so there’s no problem in immediately starting the optimization.

However, in many real-life situations, things are not as simple as the example I provided. Performance evaluation may have multiple metrics, performance issues may occur simultaneously, and optimizing one metric may lead to a decline in performance for other metrics.

So what should we do in the face of these complex situations?

Next, let’s delve into the analysis of these three questions.

How to evaluate the effectiveness of performance optimization? #

Firstly, let’s look at the first question: how to evaluate the effectiveness of performance optimization.

The purpose of solving a performance issue is to obtain a performance improvement. To evaluate that improvement, we need to quantify the system's performance metrics and test them both before and after the optimization, using the change in the metrics to assess the effect. I call this the "three-step performance evaluation" method:

  1. Determine the quantifiable performance metrics.

  2. Test the performance metrics before optimization.

  3. Test the performance metrics after optimization.

Let’s start with the first step. There are many quantifiable performance metrics, such as CPU utilization, throughput of the application, client request latency, etc., that can be used to evaluate performance. So, what metrics should we choose for evaluation?

My suggestion is to not limit yourself to a single-dimensional metric. You should select different metrics from the perspectives of the application and system resources. For example, in the case of a web application:

  • From the perspective of the application, we can use throughput and request latency to evaluate the performance of the application.

  • From the perspective of system resources, we can use CPU utilization to evaluate the CPU usage of the system.

The reason for choosing metrics from these two different dimensions is the complementary relationship between applications and system resources:

  • Good application performance is the ultimate goal and result of performance optimization, and system optimization always serves the application. Therefore, application metrics are needed to evaluate the overall effect of the optimization.

  • System resource usage is the root cause behind application performance. Therefore, system resource metrics are needed to observe and analyze the source of the bottleneck.

As for the next two steps, they are mainly used to compare the performance before and after optimization and present the results more intuitively. If you selected multiple indicators from two different dimensions in the first step, you will need to obtain the specific values of these indicators during performance testing.

Take the previous web application as an example. Corresponding to the metrics above, you can use a tool like ab to test the web application's throughput and response latency. At the same time, you can use performance tools such as vmstat and pidstat to observe the CPU usage of the system and its processes. This way, you obtain metric values for both the application and system resource dimensions.
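To make steps 2 and 3 concrete, here is a minimal Python sketch of a before/after comparison. The two handlers and the request count are made-up stand-ins for a real workload and a real benchmark run (such as ab against a live server); only the shape of the comparison matters.

```python
import time

def measure(func, n_requests=10_000):
    """Call func n_requests times; return (throughput in req/s, mean latency in s)."""
    start = time.perf_counter()
    for _ in range(n_requests):
        func()
    elapsed = time.perf_counter() - start
    return n_requests / elapsed, elapsed / n_requests

# "Before" version: recomputes a value on every request.
def handler_before():
    return sum(i * i for i in range(200))

# "After" version: serves the precomputed value.
_cached = sum(i * i for i in range(200))
def handler_after():
    return _cached

tput_before, lat_before = measure(handler_before)
tput_after, lat_after = measure(handler_after)
print(f"before: {tput_before:,.0f} req/s   after: {tput_after:,.0f} req/s")
```

The same two numbers measured under identical conditions, before and after, are exactly what the comparison in steps 2 and 3 asks for.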

However, there are two important things to note when conducting performance testing.

First, prevent the performance testing tool itself from interfering with the performance of the application. Usually, for web applications, the testing tool and the target application should run on different machines.

For example, in the previous Nginx case, I always emphasized the use of two virtual machines, with one running the Nginx service and the other running the tool simulating the client, in order to avoid this interference.

Second, avoid the impact of changes in the external environment on the evaluation of performance indicators. This requires that the pre- and post-optimized application programs run on machines with the same configuration, and their external dependencies should also be completely consistent.

Take Nginx again as an example: run it on the same machine before and after the optimization, and drive the tests with the same client tool using identical parameters.

How to choose when multiple performance issues exist simultaneously? #

Let’s look at the second question. As mentioned at the beginning, system performance always affects everything, so performance issues are usually not independent. When multiple performance issues occur simultaneously, which one should be optimized first?

In the field of performance testing, there is a widely circulated saying called the “Pareto principle,” which means that 80% of the issues are caused by 20% of the code. Once you find the location of this 20%, you can optimize 80% of the performance. So, what I want to express is that not all performance issues are worth optimizing.

My suggestion is, before diving into optimization, analyze all these performance issues and identify the most important problem that can greatly improve performance. By starting with this problem, not only do you maximize the benefits of performance improvement, but you may also satisfy performance requirements without needing to optimize other problems.

The key is judging which performance issue is the most important. This is actually the core problem that performance analysis has to solve, except that here we analyze multiple issues instead of just one. The idea is still the same as what I described earlier: analyze the issues one by one and find each bottleneck. Once they have all been analyzed, rule out the issues that are merely effects of other issues; the ones that remain are the performance problems to optimize.

If there are still several problems remaining, you need to conduct separate performance testing for each of them. After comparing the different optimization effects, choose the one that significantly improves performance for fixing. This process usually takes more time. Here, I recommend two methods that can simplify this process.

First, if it is found that system resources have reached a bottleneck, such as CPU usage reaching 100%, then the first optimization must be about system resource usage. After optimizing the system resource bottleneck, other issues should be considered.

Second, for different types of indicators, prioritize optimization of the problems caused by bottlenecks with the largest changes in performance indicators. For example, if a bottleneck occurs and the user CPU usage increases by 10%, while the system CPU usage increases by 50%, then you should first optimize the system CPU usage.

How to choose when there are multiple optimization methods? #

Now let’s consider the third question: when there are multiple methods available, which one should be chosen? Does the method that maximizes performance improvement always mean it is the best?

In general, we naturally want to choose the method that can maximize performance improvement, which is also the goal of performance optimization.

However, real situations are more complicated, and performance optimization is not free of cost. It usually increases complexity and reduces maintainability, and it is very likely that while optimizing one metric, you degrade the performance of another.

A typical example is DPDK (Data Plane Development Kit), which I will mention in the network section later. DPDK is a method for optimizing network processing speed. It improves network processing capability by bypassing the kernel network protocol stack.

However, it comes with a typical cost: it requires dedicated CPUs and a certain number of huge pages, and the dedicated CPUs always show 100% usage because of busy polling. So if you only have a few CPU cores, the trade-off may not be worth it.

Therefore, when considering which performance optimization method to choose, you need to consider multiple factors. Remember, don’t think of optimization as a one-step process and try to solve all problems at once. Also, don’t just rely on “copy and paste” optimization methods from other applications without any thought or analysis.

CPU Optimization #

After understanding the three fundamental aspects of performance optimization, let’s now examine how to reduce CPU usage and improve CPU parallel processing from the perspectives of application programs and systems.

Application Program Optimization #

From the perspective of the application, the best way to reduce CPU usage is to eliminate all unnecessary work and keep only the core logic, for example by reducing loop nesting, recursion, and dynamic memory allocations.

In addition to this, application program performance optimization includes various methods, and I have listed some of the most common ones here for you to take note of.

  • Compiler Optimization: Many compilers provide optimization options, and enabling them can help improve performance during the compilation phase. For example, gcc offers the -O2 optimization option, which automatically optimizes the application program’s code when enabled.

  • Algorithm Optimization: Using algorithms with lower complexity can significantly speed up processing. For example, when dealing with large amounts of data, you can use O(n log n) sorting algorithms (such as quicksort and merge sort) instead of O(n^2) ones (such as bubble sort and insertion sort).

  • Asynchronous Processing: Asynchronous processing avoids program blocking while waiting for a certain resource and thus improves the program’s concurrent processing capability. For example, replacing polling with event notifications can avoid CPU consumption caused by polling.

  • Multiple Threads instead of Multiple Processes: As mentioned earlier, compared to process context switching, thread context switching does not switch process address spaces, thus reducing context switching costs.

  • Effective Cache Usage: Frequently accessed data or computational steps can be cached in memory so that they can be directly retrieved from memory in subsequent uses, thereby speeding up program processing.
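Two of the points above, algorithm optimization and effective cache usage, can be illustrated with a short, self-contained Python sketch. The data size and the `expensive` function are arbitrary examples chosen for illustration, not from the original text:

```python
import random
import time
from functools import lru_cache

# Algorithm optimization: compare an O(n^2) bubble sort with the
# built-in O(n log n) Timsort on the same data.
def bubble_sort(items):
    items = list(items)
    for i in range(len(items)):
        for j in range(len(items) - 1 - i):
            if items[j] > items[j + 1]:
                items[j], items[j + 1] = items[j + 1], items[j]
    return items

data = [random.random() for _ in range(2000)]

t0 = time.perf_counter(); slow = bubble_sort(data); t_slow = time.perf_counter() - t0
t0 = time.perf_counter(); fast = sorted(data);      t_fast = time.perf_counter() - t0
assert slow == fast  # identical result, very different CPU cost
print(f"bubble sort: {t_slow:.3f}s   built-in sorted: {t_fast:.3f}s")

# Effective cache usage: memoize an expensive, frequently repeated
# computation so later calls read memory instead of burning CPU.
@lru_cache(maxsize=None)
def expensive(n):
    return sum(i * i for i in range(n))

expensive(100_000)             # first call: computed
expensive(100_000)             # second call: served from the cache
print(expensive.cache_info())  # one hit, one miss
```

The caching half trades a little memory for CPU time, which is the usual shape of this optimization: it pays off only when the same inputs actually recur.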
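The asynchronous-processing point, replacing polling with event notification, can likewise be sketched with the standard selectors module. The pipe here is a hypothetical stand-in for any I/O resource whose data arrives later:

```python
import os
import selectors
import threading
import time

# A pipe stands in for any I/O resource we are waiting on.
r_fd, w_fd = os.pipe()

def writer():
    time.sleep(0.2)            # simulate data that becomes ready later
    os.write(w_fd, b"ready")

threading.Thread(target=writer).start()

# Event notification: instead of a busy loop that repeatedly tries to
# read (polling, which burns CPU), register the descriptor and let the
# kernel wake us only when it becomes readable.
sel = selectors.DefaultSelector()
sel.register(r_fd, selectors.EVENT_READ)
events = sel.select(timeout=5)  # blocks without consuming CPU
data = os.read(r_fd, 16) if events else b""
print(data)
```

While `sel.select()` blocks, the CPU is free for other work; a polling loop over the same descriptor would have kept one core busy the entire 0.2 seconds.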

System Optimization #

From the perspective of the system, optimizing CPU performance means making full use of CPU cache locality to speed up cache access, and controlling processes' CPU usage to reduce interference between processes.

Specifically, there are several common methods for CPU optimization at the system level. Here, I have also listed some of them for your convenience in memorizing and use.

  • CPU Affinity: Binding processes to one or more CPUs can improve CPU cache hit rates and reduce context switching problems caused by cross-CPU scheduling.

  • CPU Isolation: Similar to CPU affinity, CPU isolation further groups CPUs and allocates processes to them through CPU affinity mechanisms. This means that these CPUs are exclusively used by the designated processes, i.e., other processes are not allowed to use these CPUs.

  • Priority Adjustment: Use nice to adjust a process's priority. Note the direction: a higher nice value means a lower scheduling priority, and a lower (or negative) nice value means a higher priority. The numeric meaning of priorities was covered earlier, so review it if you have forgotten. Appropriately lowering the priority of non-core applications and raising that of core applications ensures that the core applications are processed first.

  • Setting Resource Limits for Processes: Using Linux cgroups to set upper limits on CPU usage for processes can prevent system resource exhaustion caused by certain application issues.

  • NUMA (Non-Uniform Memory Access) Optimization: Processors that support NUMA are divided into multiple nodes, with each node having its own local memory space. NUMA optimization aims to make CPUs access local memory as much as possible.

  • Interrupt Load Balancing: Both software interrupts and hardware interrupts can consume a significant amount of CPU resources during their handling. Enabling the irqbalance service or configuring smp_affinity balances interrupt processing across multiple CPUs automatically.
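The CPU affinity and priority adjustment points can be exercised from Python on Linux via the os module. This is a Linux-only illustrative sketch; `taskset` and `renice` are the command-line equivalents:

```python
import os

pid = 0  # 0 means "the calling process"

# CPU affinity (Linux-only): pin the process to a single CPU so its
# working set stays warm in that CPU's caches.
# Command-line equivalent: taskset -pc <cpu> <pid>
old_affinity = os.sched_getaffinity(pid)
target_cpu = min(old_affinity)          # pick one CPU we are allowed on
os.sched_setaffinity(pid, {target_cpu})
print("now allowed on CPUs:", os.sched_getaffinity(pid))

# Priority adjustment: raise our nice value by 5. A HIGHER nice value
# means a LOWER priority; unprivileged processes may only increase it.
# Command-line equivalent: renice -n 5 -p <pid>
os.nice(5)
print("nice value is now:", os.nice(0))  # os.nice(0) reads the current value

# Restore the original affinity.
os.sched_setaffinity(pid, old_affinity)
```

For production use, the long-lived equivalents (systemd `CPUAffinity=`, cgroup cpusets, or taskset in the service's start command) are usually preferable to per-process calls like this.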

Avoid Premature Optimization #

After learning the optimization methods above, I suspect that many people, even before they have found any performance bottleneck, will be itching to apply all kinds of optimizations in actual development.

However, I believe you must have heard of Donald Knuth’s famous quote, “Premature optimization is the root of all evil,” and I strongly agree with this point. Premature optimization is not advisable.

On one hand, optimization brings increased complexity and reduces maintainability. On the other hand, requirements are not static. Optimizations made for the current situation may not be suitable for rapidly changing new requirements. In this way, when new requirements arise, these complex optimizations may actually hinder the development of new functionality.

Therefore, it is best to refine performance gradually and optimize dynamically rather than pursuing a one-step solution. Make sure the current performance requirements are met first; when performance falls short or a bottleneck appears, use performance evaluation to select the most important issues to optimize.

Conclusion #

Today, I have organized the common approaches and methods for optimizing CPU performance. When encountering performance issues, do not rush to optimize, but instead identify the most important issue that can yield the greatest performance improvement. Then, optimize from both the application and system perspectives.

By doing so, not only can you achieve the maximum performance improvement, but it is also likely that you won’t need to optimize other issues as you would already meet the performance requirements.

However, remember to resist the temptation of “optimizing CPU performance to the extreme” because the CPU is not the sole performance factor. In subsequent articles, I will also introduce more performance issues, such as memory, network, I/O, and even architecture design problems.

If you only focus on optimizing a single metric to the extreme without conducting a comprehensive analysis and testing, it may not necessarily result in overall gains.

Reflection #

Due to space limitations, I have only listed a few of the most common CPU performance optimization methods here. Apart from these, there are many other application or system resource-oriented performance optimization methods. I would like to invite you to discuss with me. Do you know any other optimization methods?

Feel free to leave a comment and discuss with me. You are also welcome to share this article with your colleagues and friends. Let’s practice in real-world scenarios and improve through communication.