21 The Process Methodology of Performance Optimization and Summary of Job Interview Experiences #

After the previous lessons, you may find that even though you are familiar with the individual technologies and optimization techniques, it is still difficult to carry out optimization in a real performance scenario. This is because the knowledge in your head is scattered, and working from memory alone only produces one-sided optimizations.

In such cases, I usually keep a detailed outline at hand; during performance optimization it guides me and helps me think comprehensively.

Therefore, today I want to summarize the process methodology of performance optimization, hoping it gives you a starting point when you are analyzing performance and have no idea where to begin.

Performance optimization requires multiple considerations #

Many factors contribute to low application performance: business requirements, architectural design, hardware and software choices, and so on. The main focus of this column is the software side of performance optimization, but we should not forget that other means exist.

Let’s start with an example on the business-requirements side. A reporting service had very slow queries that sometimes even caused memory overflow. Analysis showed that the cause was the large time span of the queries.

By adding a business-level restriction, we narrowed the time span to one month, after which the query speed increased significantly.

Here’s an example on the hardware side. A scheduled task was CPU-intensive and routinely pushed the CPU to its limits. Because of architectural constraints it could not scale horizontally, and a technical evaluation showed that refactoring it to run in a data-sharding mode would take up to a month.

In this case, we could remove the performance bottleneck simply by upgrading the hardware configuration, buying time for the business-side improvement.

The point of these examples is that there are many ways to optimize performance. If a problem can be solved by other means, try to avoid touching the code, and strike a balance among three factors: effectiveness, man-hours, and means.

How to find optimization targets? #

Usually, when focusing on a hardware resource (such as the CPU), we mainly pay attention to the following basic elements.

  • Utilization: generally an instantaneous, sampled value, used to determine whether there are peaks, such as CPU usage.
  • Saturation: generally whether a resource is being used reasonably and can take on more work. If saturation is too high, new requests queue up; likewise, if memory utilization is very low while CPU utilization is very high, a space-for-time trade-off is worth considering.
  • Errors: errors generally appear only in severe situations and require special attention.
  • Associated information: guesses about causes that we verify with further tools. The suspected factors are not necessarily accurate, but they help frame the analysis. For example, a slow system response may well be caused by heavy SWAP usage.

First, we need to identify the target of optimization, looking for hidden bottlenecks in the CPU, memory, network, I/O, and other areas.

1. CPU #

To view CPU usage, use the top command and pay special attention to load and usage. The vmstat command also shows the system’s running status; here we care about context switches and swap usage.

2. Memory #

To check memory usage, use the free command and pay attention to the amount of remaining memory. On Linux, caches and buffers quickly occupy system memory after startup, so we care more about the JVM’s memory.

The RES column in the top command shows the actual physical memory occupied by the process, which is usually larger than the heap memory obtained by the jmap command because it also includes a large amount of off-heap memory space.
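
If you want to see this gap from inside a running JVM, here is a minimal sketch using the standard MemoryMXBean (the printed numbers are illustrative; nothing here is specific to any framework):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class MemoryReport {
    public static void main(String[] args) {
        MemoryMXBean mbean = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = mbean.getHeapMemoryUsage();       // the -Xmx-bounded heap jmap reports on
        MemoryUsage nonHeap = mbean.getNonHeapMemoryUsage(); // metaspace, code cache, etc.
        System.out.printf("heap used: %d MB, committed: %d MB%n",
                heap.getUsed() >> 20, heap.getCommitted() >> 20);
        System.out.printf("non-heap used: %d MB%n", nonHeap.getUsed() >> 20);
        // RES in top can still exceed heap + non-heap: direct ByteBuffers,
        // thread stacks, and other native allocations are not counted here.
    }
}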

3. Network #

iftop can show the connections generating the most network traffic, while the netstat or ss command shows the network connections on the current machine. Some lower-level work may involve MTU-based network optimization.

4. I/O #

By using the iostat command, you can check the usage of disk I/O. If the utilization is too high, you need to find the source of the usage. Similar to iftop, iotop can show the processes that occupy the most I/O, making it easy to find optimization targets.

5. General #

The lsof command can show all the resources associated with the current process; the sysctl command can show the configuration parameters of the current kernel; the dmesg command displays kernel messages. For example, processes killed by the operating system’s oom-killer can be found there.

I have prepared a mind map for you as a reference:

[1.png: mind map of optimization targets]

Common tool collection #

To find problems in the system, we take a trial-and-error approach, using multiple tools and methods to observe the system’s running state.

1. Information Collection #

nmon is a command-line tool that can output overall system performance data and is widely used.

jvisualvm and jmc are both tools for obtaining Java application performance data. Since they are GUI tools, the application must expose a JMX port so they can connect remotely.

2. Monitoring #

Commands like top only help while a problem is happening, but we are often not at the computer when a performance issue occurs, so we need tooling that captures this performance data periodically. With monitoring we obtain historical time series of the metrics, analyze their trends, estimate where bottlenecks lie, and back our analysis with data.

Currently, the most popular combination is Prometheus + Grafana + Telegraf, which together form a powerful monitoring platform.

3. Load Testing Tools #

Sometimes, we need to evaluate the performance of a system under certain concurrency levels, and in this case, we can use load testing tools to put pressure on the system.

wrk is a command-line tool for load testing HTTP endpoints; JMeter is a more professional load testing tool that can also generate test reports. Combining load testing tools with monitoring tools lets us evaluate the system’s current performance accurately.

4. Performance Profiling Tools #

In most cases, we can’t know the specific details of performance bottlenecks just by general performance indicators, so we need some more in-depth tools for tracking.

SkyWalking can analyze call chains in distributed environments and report the execution time of each step. If you don’t have such an environment, the command-line tool Arthas can trace methods and pinpoint the specific slow logic.

jvm-profiling-tools can generate flame graphs to assist in problem analysis. In addition, there are also lower-level performance measurement and tuning tools for operating systems, such as perf and SystemTap. Interested students can study them on their own.

For more on tools, you can revisit “04 | Tool Practice: How to obtain code performance data?” and “05 | Tool Practice: Benchmark testing with JMH, precise measurement of method performance.” I have prepared a mind map for you to refer to.

[2.png: mind map of common tools]

Basic Solutions #

Once the specific performance bottlenecks are identified, we can optimize them accordingly.

1. CPU Issues #

The CPU is the system’s core resource. When it is the bottleneck, many tasks and threads cannot get enough time slices and everything slows down. If memory is plentiful, consider whether more efficient algorithms, or redundant data that trades space for time, can reduce CPU usage.

On Linux, you can easily find the threads with the highest CPU usage with top -Hp <pid> and optimize them specifically.
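
The usual command-line route is to take the busiest TID from top -Hp, convert it to hex, and look it up in the jstack output. If you would rather do the same thing from inside the process, here is a minimal sketch using the standard ThreadMXBean (the output formatting is arbitrary):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class HotThreads {
    public static void main(String[] args) {
        ThreadMXBean tbean = ManagementFactory.getThreadMXBean();
        for (long id : tbean.getAllThreadIds()) {
            long cpuNanos = tbean.getThreadCpuTime(id); // -1 if unsupported or thread is gone
            ThreadInfo info = tbean.getThreadInfo(id);
            if (cpuNanos > 0 && info != null) {
                System.out.printf("%-40s %8d ms%n",
                        info.getThreadName(), cpuNanos / 1_000_000);
            }
        }
    }
}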

Resource usage must be broken down further before targeted optimization is possible.

I once encountered a tricky performance problem where all threads were blocked on a ForkJoin thread pool. Careful investigation showed that the code used parallel streams (parallelStream) to process tasks that waited on time-consuming I/O. By default, however, all parallel streams in a Java process share the common ForkJoinPool, whose parallelism is only about the number of CPU cores minus one. As the number of requests grew, tasks queued up and created a backlog.
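
A common workaround, sketched below, relies on the behavior that a parallel stream started from inside a ForkJoinPool task runs in that pool rather than in the common pool; the pool size of 32 and the fetch method are assumptions for illustration:

import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;

public class IsolatedParallelStream {
    // A dedicated pool sized for blocking I/O, instead of the shared
    // commonPool(), whose default parallelism is roughly CPU cores - 1.
    private static final ForkJoinPool IO_POOL = new ForkJoinPool(32);

    static void process(List<String> urls) throws ExecutionException, InterruptedException {
        // Work submitted to a ForkJoinPool keeps the parallel stream inside
        // that pool, so slow I/O no longer starves other parallelStream()
        // users of the common pool.
        IO_POOL.submit(() ->
                urls.parallelStream().forEach(IsolatedParallelStream::fetch)
        ).get();
    }

    static void fetch(String url) {
        // placeholder for the time-consuming I/O call
    }
}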

2. Memory Issues #

Memory issues usually surface as OutOfMemoryError (OOM); you can refer to “19 | Advanced: Common JVM optimization parameters” for tuning options. If memory is scarce while CPU utilization is low, consider trading time for space instead.

A SWAP partition extends available memory using the hard disk, but it is very slow. High-concurrency applications generally disable SWAP because it easily causes lag.

3. I/O Issues #

The business systems we typically develop put relatively little load on disk I/O but keep network I/O much busier.

When encountering high disk I/O usage, consider whether it is caused by excessive logging. By adjusting the log level or cleaning up unused log code, the pressure on disk I/O can be relieved.

Business systems also perform a large amount of network I/O, such as calling remote services over RPC. We can use NIO to reduce unnecessary waiting, or use parallelism to speed up data retrieval.

Another case is database applications like Elasticsearch (ES), where writing data itself causes heavy disk I/O. In this case, you can increase hardware configuration, such as switching to SSD disks or adding new disks.

Database services themselves provide a large number of parameters for performance tuning. As described in “06 | Case Study: How does a buffer accelerate code?” and “07 | Case Study: Ubiquitous Caching, the Magic Weapon of High-Concurrency Systems,” these configuration parameters mainly affect the behavior of buffers and caches.

In ES, for instance, the segment size and the translog flush rate can be fine-tuned. When large volumes of logs are written to ES, increasing the interval at which the translog is flushed to disk can bring significant performance gains.

4. Network Issues #

The main factor that affects the transfer of data packets over the network is the size of the result set. By removing unnecessary information and enabling reasonable compression, significant performance improvements can be achieved.

It is worth noting that network transmission here concerns not only browsers but also service-to-service communication. For example, in a Spring Boot configuration file, gzip can be enabled with the following parameters:

server:
  compression:
    enabled: true
    min-response-size: 1024
    mime-types: ["text/html","application/json","application/octet-stream"]

However, when this Spring Boot service retrieves data from another service through a Feign interface, the result set is not compressed. By replacing Feign’s underlying HTTP client with OkHttp and relying on OkHttp’s transparent compression (gzip is enabled by default), service-to-service calls can be compressed too. Many students forget this. In one project I optimized, the returned payload shrank from 9 MB to about 300 KB, greatly reducing network transmission and saving about 500 ms.
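
Here is a minimal sketch with plain open Feign and the feign-okhttp module; the ReportApi interface and service URL are made up for illustration, and with Spring Cloud OpenFeign, adding the feign-okhttp dependency and enabling its OkHttp client property achieves the same effect:

import feign.Feign;
import feign.RequestLine;
import feign.okhttp.OkHttpClient;

public class FeignOkHttpExample {

    interface ReportApi {                       // hypothetical remote API
        @RequestLine("GET /reports")
        String reports();
    }

    public static void main(String[] args) {
        // feign-okhttp plugs OkHttp in as Feign's underlying client.
        // OkHttp sends Accept-Encoding: gzip by default and transparently
        // decompresses responses, so large result sets travel compressed.
        ReportApi api = Feign.builder()
                .client(new OkHttpClient())
                .target(ReportApi.class, "http://report-service:8080");
        System.out.println(api.reports());
    }
}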

Another network I/O problem is overly frequent interaction. Merging result sets and using batch processing can significantly improve performance, but this approach has limited applicability and is best suited to asynchronous task processing.

The netstat or lsof command can show how many connections associated with a process are in the TIME_WAIT and CLOSE_WAIT states. The former can be mitigated by tuning kernel parameters; the latter is mostly caused by application bugs.

I have created a mind map for you to reference.

[3.png: mind map of basic solutions]

With the information gathering and initial optimization above, you should now have a detailed picture of the system to be optimized, and it’s time to change the design of some of the existing code.

If the basic solutions above are “system-oriented,” then code-level optimization is “oriented toward specific performance bottlenecks.”

Code Level #

Code-level optimization is the focus of our course. We have spent a lot of time explaining this aspect in the entire “Module 3: Practical Cases and High-Frequency Interview Points” section. In this session, I will briefly summarize.

1. Middleware #

The performance bottleneck in calls between resources lies mainly in their speed difference. The solution is to introduce a middleware layer with buffering, caching, and pooling capabilities, speeding up processing at the cost of some information timeliness.

Buffering lets both sides work at their own pace while staying seamlessly connected; by batching operations, it smooths out the speed difference between the two parties and reduces performance loss.

You can review and revisit “06 | Case Analysis: How Buffers Accelerate Code” for more information.
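
As a reminder of the idea, here is a minimal buffering sketch using the JDK’s BufferedOutputStream (the file name and buffer size are arbitrary): single-byte writes are batched into large chunks before they reach the operating system.

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class BufferedWrite {
    public static void main(String[] args) throws IOException {
        // Without the buffer, each write() could reach the OS as its own
        // system call; the 64 KB buffer batches them together.
        try (OutputStream out = new BufferedOutputStream(
                new FileOutputStream("data.log"), 64 * 1024)) {
            for (int i = 0; i < 1_000_000; i++) {
                out.write('x');
            }
        } // close() flushes whatever is still buffered
    }
}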

Caching is used everywhere in systems and can be divided into in-heap caches and distributed caches; scenarios with demanding performance requirements may even combine several cache levels. Our goal is to raise the cache hit rate as much as possible so that the middleware layer is fully utilized.

You can review and revisit “07 | Case Analysis: Ubiquitous Caching, the Magic Weapon of High-Concurrency Systems” for more information.
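
As an illustration of an in-heap cache, here is a minimal sketch using Guava’s LoadingCache (the size, expiry, and queryFromDatabase stub are assumptions); recordStats() exposes the hit rate we want to maximize:

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import java.util.concurrent.TimeUnit;

public class UserCache {
    // A bounded size plus expiry keeps memory stable, while the loader
    // hides the slow lookup behind a (hopefully) high hit rate.
    private static final LoadingCache<Long, String> CACHE = CacheBuilder.newBuilder()
            .maximumSize(10_000)
            .expireAfterWrite(5, TimeUnit.MINUTES)
            .recordStats() // exposes hitRate() so it can be monitored
            .build(new CacheLoader<Long, String>() {
                @Override
                public String load(Long id) {
                    return queryFromDatabase(id); // hypothetical slow source
                }
            });

    static String queryFromDatabase(Long id) {
        return "user-" + id;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(CACHE.get(42L));
        System.out.println(CACHE.stats().hitRate());
    }
}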

Another form of middleware is centralized resource management, which reduces object-creation costs through pooling. Pooling pays off only when objects are relatively expensive to create; otherwise it just adds complexity to the code.

You can review and revisit “09 | Case Analysis: Application Scenarios for Object Pooling” for more information.
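
Here is a minimal pooling sketch with Apache Commons Pool 2; HeavyParser is a stand-in for any expensively constructed object:

import org.apache.commons.pool2.BasePooledObjectFactory;
import org.apache.commons.pool2.PooledObject;
import org.apache.commons.pool2.impl.DefaultPooledObject;
import org.apache.commons.pool2.impl.GenericObjectPool;

public class ParserPool {

    static class HeavyParser {                 // imagine costly initialization here
        String parse(String s) { return s.trim(); }
    }

    public static void main(String[] args) throws Exception {
        GenericObjectPool<HeavyParser> pool = new GenericObjectPool<>(
                new BasePooledObjectFactory<HeavyParser>() {
                    @Override
                    public HeavyParser create() { return new HeavyParser(); }
                    @Override
                    public PooledObject<HeavyParser> wrap(HeavyParser p) {
                        return new DefaultPooledObject<>(p);
                    }
                });
        pool.setMaxTotal(16);                   // cap the number of pooled instances

        HeavyParser parser = pool.borrowObject();
        try {
            System.out.println(parser.parse("  pooled  "));
        } finally {
            pool.returnObject(parser);          // always give the instance back
        }
    }
}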

2. Resource Synchronization #

In our code, data-consistency requirements are sometimes strict, and we have no choice but to use locks and transactions, whether thread locks, distributed locks, or the optimistic locks suited to read-heavy, write-light scenarios. A few common optimization rules apply (see the sketch after this list).

  • First, divide the conflicting resources into smaller granularities, so that they can be dealt with separately.
  • Second, reduce the time for resource locking and release shared resources as soon as possible.
  • Third, separate read operations from write operations to further reduce the possibility of conflicts.
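
As a sketch of the second and third rules, the JDK’s ReentrantReadWriteLock lets many readers proceed in parallel while writers hold the exclusive lock only for the brief update; the price-table example is made up for illustration:

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class PriceTable {
    private final Map<String, Long> prices = new HashMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    public Long get(String sku) {
        lock.readLock().lock();          // shared: readers do not block each other
        try {
            return prices.get(sku);
        } finally {
            lock.readLock().unlock();    // release the shared resource quickly
        }
    }

    public void put(String sku, long price) {
        lock.writeLock().lock();         // exclusive, held only for the update
        try {
            prices.put(sku, price);
        } finally {
            lock.writeLock().unlock();
        }
    }
}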

Ordinary transactions are easy to implement with Spring’s @Transactional annotation, but business flows usually span multiple heterogeneous resources. Unless absolutely necessary, avoid solving this with distributed transactions; instead, adopt eventual consistency and move mutual exclusion from the resource layer up to the business layer.
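
Here is a heavily simplified sketch of that idea, often called the outbox pattern; OrderRepository and its methods are hypothetical, and only the local database writes sit inside the Spring transaction:

import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class OrderService {

    interface OrderRepository {                          // hypothetical DAO
        void save(long orderId);
        void saveOutboxEvent(long orderId, String type);
    }

    private final OrderRepository repository;

    public OrderService(OrderRepository repository) {
        this.repository = repository;
    }

    @Transactional
    public void placeOrder(long orderId) {
        repository.save(orderId);
        // The pending notification is stored as an "outbox" row in the same
        // local transaction instead of calling the remote resource here.
        repository.saveOutboxEvent(orderId, "ORDER_PLACED");
        // A separate consumer later relays outbox rows to the MQ and retries
        // until success: eventual consistency without a distributed transaction.
    }
}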

3. Organizational Optimization #

Another effective approach is refactoring, changing how our code is organized.

Design patterns make code logic clearer, and during performance optimization they let us locate the code that needs work directly. I have seen a lot of application code in need of tuning where the tangle of object relationships and code organization made it quite difficult to add a middleware layer. In such cases, the first task is to sort out and refactor the code; otherwise further optimization is hard.

Another factor that has a significant impact on programming patterns is asynchrony.

Asynchrony often uses the producer-consumer pattern to avoid the performance loss of synchronous waiting. This programming model is harder, though, and requires plenty of extra work. For example, if we use a message queue (MQ) for asynchrony, we must handle message failure, duplication, dead letters, and other guarantees (changes to the product’s behavior are beyond our scope here).
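
Inside a single process, the same pattern can be sketched with a bounded BlockingQueue (the queue size and task names are arbitrary); the bound is what provides back pressure when consumers fall behind:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class AsyncPipeline {
    public static void main(String[] args) throws InterruptedException {
        // Bounded queue: producers block when consumers fall behind,
        // which applies back pressure instead of exhausting memory.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++) {
                    queue.put("task-" + i);     // fast request thread returns quickly
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    String task = queue.take(); // slow work happens off the hot path
                    System.out.println("processed " + task);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.setDaemon(true);

        producer.start();
        consumer.start();
        producer.join();
        Thread.sleep(200); // give the daemon consumer time to drain (demo only)
    }
}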

4. Underutilized Resources #

Low resource utilization does not automatically mean good code. As developers, we should squeeze out the system’s remaining value and keep every resource working. This matters especially under high concurrency and represents the system’s optimal state under a given load.

Failing to use resources reasonably is a waste. Most business applications are I/O-intensive, for example, and if requests block on I/O, CPU capacity goes to waste; parallelism can then handle more tasks at once and increase throughput. Likewise, if the JVM’s free heap space stays consistently high, we can consider enlarging in-heap caches or buffers.
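
For the I/O-bound case, a minimal sketch with CompletableFuture and an oversized thread pool (the pool size of 64 and slowRemoteCall are assumptions) shows how parallelism turns the total latency into roughly that of the slowest call:

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

public class ParallelFetch {
    public static void main(String[] args) {
        // I/O-bound work: a pool larger than the CPU count keeps the CPU
        // busy while most threads are parked waiting on I/O.
        ExecutorService pool = Executors.newFixedThreadPool(64);

        List<String> ids = List.of("a", "b", "c");
        List<CompletableFuture<String>> futures = ids.stream()
                .map(id -> CompletableFuture.supplyAsync(() -> slowRemoteCall(id), pool))
                .collect(Collectors.toList());

        // Wait for all calls; total latency ~= the slowest call, not the sum.
        List<String> results = futures.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());

        System.out.println(results);
        pool.shutdown();
    }

    static String slowRemoteCall(String id) {
        return "result-" + id; // placeholder for a blocking I/O call
    }
}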

I have created a mind map for your reference.

[4.png: mind map of code-level optimization]

PDCA Methodology #

Performance optimization is an iterative process that requires real-time adjustments based on data feedback. Sometimes, test results may show that some optimization methods are not effective, so it is necessary to roll back to the previous version and find new breakthroughs.

The PDCA (Plan-Do-Check-Act) cycle can structure our performance optimization process. It consists of four steps:

  • P (Plan): identify performance problems, collect performance metrics, set improvement goals, and prepare concrete measures to reach them.
  • D (Do): implement the optimization measures according to the plan.
  • C (Check): check the effectiveness of the optimization in time, and record the experiences and problems encountered along the way.
  • A (Act): promote successful optimizations to more areas, deal with any negative side effects, and turn failed attempts into lessons.

By repeating this cycle, application performance improves step by step. As the diagram below shows, the process of performance optimization can be abstracted in this way.

[6.png: diagram of the PDCA optimization cycle]

Because it is a cycle, the process can be repeated. With sustained effort, the application’s performance will spiral upward and eventually meet our expectations.

Interview Tips #

1. Pay attention to the side effects of “performance optimization” #

In interviews, performance optimization questions are usually embedded in other topics. You need to focus not only on “performance optimization” itself but also on the problems an optimization may create. Once you give the optimization plan the interviewer is looking for, they will follow up on the side effects of that plan. This column has described most of these side effects, and you should master them carefully.

2. Master the basics of “performance optimization” #

In addition, from the above summary, we can see that performance optimization involves many knowledge points. How can you demonstrate your abilities as much as possible within the limited interview time? The key is to build a solid foundation of knowledge and be able to answer questions in detail and accurately.

  • What optimizations have you done on the JVM that have improved performance?
  • Why is optimistic locking commonly used in Internet scenarios?

These two questions are relatively easy to answer because their answers are fairly definite; you just need to explain the specific knowledge points clearly. The more challenging questions, however, are of a different kind.

3. Prepare in advance for divergent and comprehensive questions #

If the above questions are about “specific points”, then the following questions are about a “whole”.

  • What performance optimization work have you done in your projects?
  • How do you guide teams in performance optimization?

If you only describe a single knowledge point, your answer will seem narrow. Instead, you can systematically walk through discovering, resolving, and verifying problems from multiple angles, dwelling on the most important points you know best along the way.

Therefore, I recommend preparing practical cases from your own projects (or reasonable constructions, if you lack hands-on experience) for these two types of questions before the interview. That way, you can respond quickly and impress the interviewer.

Summary #

In this lesson, we tied together the previous lessons. Performance optimization boils down to: identify the optimization target → use tools to gather more performance data → apply the basic, system-oriented solutions → optimize at the code level, with the PDCA cycle supporting the whole process. Application performance improves through this kind of iterative optimization, accumulating gains step by step.

Finally, I also briefly introduced “interview tips” to help you go further in your career path.