01 How to Establish Performance Tuning Standards #

Hello, I’m Liu Chao.

Once, a friend of mine told me that his company’s systems had never undergone performance tuning: after functional testing they simply went live, and no performance problems ever surfaced. So he wondered why so many systems still need performance tuning.

At that time, I responded by saying, “If your company is responsible for the 12306 website in China, try going live without optimizing the system and see what happens.”

If it were you, how would you respond? Today, let’s start from this topic and work through these questions together: Why do we need to do performance tuning? When should we start? And are there standards we can reference when doing it?

Why should we do performance tuning? #

If an online product has not undergone performance testing, it is like a time bomb. You don’t know when it will encounter problems, and you don’t know its limits.

Some performance issues accumulate over time and eventually explode. Many are triggered by fluctuations in traffic, such as a surge of users during a promotional event. It is also possible for a product to go live but never attract heavy traffic, so the time bomb simply never goes off.

Now, let’s say your system is about to run an event, and the product manager or boss tells you that they expect hundreds of thousands of users to access the system. They ask whether the system can handle the pressure of this event. If you are not clear about the performance of your system, you can only nervously respond to the boss, saying that it might be okay.

Therefore, the question of whether or not to do performance tuning is actually easy to answer. After the development of any system, there will always be some performance issues to a certain extent. The first thing we need to do is to expose those problems, for example, through stress testing and simulating potential usage scenarios. Then, we can address these issues through performance tuning.

For example, when you use a certain app to search for a particular piece of information, you have to wait for more than ten seconds. Or, during a rush purchase event, you cannot access the event page. You see, system responsiveness is the most direct factor that reflects system performance.

But what if the system doesn’t have any responsiveness issues online? Does that mean we don’t need to perform performance optimization? Let me tell you another story.

In my previous company, there was a genius in the system development department. He was called a genius because during his one year at the company, he only did one thing: reduce the number of servers by half while actually improving the system’s performance metrics.

Good performance tuning not only enhances system performance but also saves resources for the company. This is the most direct purpose of doing performance tuning.

When to start performance tuning? #

After solving the question of why we need to do performance optimization, a new question arises: When should we start performance tuning the system? Is it better to start as early as possible?

In fact, during the early stages of project development, we don’t need to focus too much on performance optimization. This would only exhaust us with performance optimization tasks, which not only fails to improve system performance, but also affects development progress. It might even have the opposite effect and introduce new problems to the system.

We only need to ensure efficient coding at the code level: for example, reducing disk I/O operations, minimizing the use of contended locks, and choosing efficient algorithms. When we encounter complex business scenarios, we can also make full use of design patterns to organize the business code. For example, product pricing often involves multiple discount and coupon activities, and we can use the decorator pattern to design this logic, as in the sketch below.
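Here is a minimal sketch of that idea in Java. The class names, discount rate, and coupon amount are illustrative assumptions, not taken from the original text; each decorator wraps another calculator and stacks its own discount on top.

```java
import java.math.BigDecimal;

interface PriceCalculator {
    BigDecimal calculate(BigDecimal basePrice);
}

// The innermost component: just returns the base price unchanged.
class BasePrice implements PriceCalculator {
    public BigDecimal calculate(BigDecimal basePrice) {
        return basePrice;
    }
}

// Decorator that applies a percentage discount on top of another calculator.
class PercentageDiscount implements PriceCalculator {
    private final PriceCalculator inner;
    private final BigDecimal rate; // e.g. 0.9 means 10% off

    PercentageDiscount(PriceCalculator inner, BigDecimal rate) {
        this.inner = inner;
        this.rate = rate;
    }

    public BigDecimal calculate(BigDecimal basePrice) {
        return inner.calculate(basePrice).multiply(rate);
    }
}

// Decorator that subtracts a fixed coupon amount, never going below zero.
class CouponDiscount implements PriceCalculator {
    private final PriceCalculator inner;
    private final BigDecimal coupon;

    CouponDiscount(PriceCalculator inner, BigDecimal coupon) {
        this.inner = inner;
        this.coupon = coupon;
    }

    public BigDecimal calculate(BigDecimal basePrice) {
        return inner.calculate(basePrice).subtract(coupon).max(BigDecimal.ZERO);
    }
}

public class PricingDemo {
    public static void main(String[] args) {
        // 10% off first, then a 20-yuan coupon, stacked via decorators.
        PriceCalculator calculator = new CouponDiscount(
                new PercentageDiscount(new BasePrice(), new BigDecimal("0.9")),
                new BigDecimal("20"));
        System.out.println(calculator.calculate(new BigDecimal("299"))); // prints 249.1
    }
}
```

Adding a new promotion type then means adding a new decorator class, without touching the existing pricing code.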

After the system code is completed, we can start performance testing. At this point, the product manager usually provides the expected traffic figures, and we run stress tests against them in the reference environment, using performance analysis and statistics tools to measure the various performance indicators and check whether they fall within the expected range.

After the project is successfully launched, we still need to observe system performance issues based on actual conditions online. This can be done through log monitoring and performance statistics logs. Once an issue is discovered, we need to analyze the logs and promptly fix the problem.

What are the reference factors that can reflect system performance? #

Earlier, we talked about how performance optimization is involved in various stages of project development, and we mentioned performance indicators multiple times. So, what are the performance indicators?

Before we understand the performance indicators, let’s first understand which computer resources can become system performance bottlenecks.

CPU: Some applications are computation-heavy and occupy the CPU for long stretches without interruption, leaving other threads unable to get CPU time and slowing down responses, which causes system performance problems. For example, infinite loops in recursive code, catastrophic backtracking in regular expressions, frequent full GC in the JVM, and the large number of context switches caused by multi-threaded programming can all lead to CPU exhaustion; the regex case is sketched below.
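As a quick illustration of the regex case, the following sketch (my own example, not from the original text) uses nested quantifiers so that a single match attempt can pin a CPU core:

```java
import java.util.regex.Pattern;

public class RegexBacktrackingDemo {
    public static void main(String[] args) {
        // Nested quantifiers force the regex engine to try an exponential
        // number of ways to split the input before it can conclude "no match".
        Pattern bad = Pattern.compile("(a+)+b");
        String input = "a".repeat(30) + "c"; // 30 a's and no 'b' at the end

        long start = System.nanoTime();
        boolean matched = bad.matcher(input).matches();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        // On a typical JVM this single call can keep one core busy for a long time.
        System.out.println("matched=" + matched + ", took " + elapsedMs + " ms");
    }
}
```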

Memory: Java programs generally allocate and manage memory through the JVM, mainly using the JVM heap to store the objects they create. Heap reads and writes are very fast, so memory itself is rarely a read/write bottleneck. However, compared with disk, memory is more expensive and its capacity is limited, so when heap space is occupied by objects that can never be reclaimed, problems such as memory leaks and out-of-memory errors follow.
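A minimal sketch of such a leak, assuming a static collection is misused as an unbounded "cache":

```java
import java.util.ArrayList;
import java.util.List;

public class MemoryLeakDemo {
    // Objects referenced from a static collection are never eligible for GC,
    // so the heap keeps growing until an OutOfMemoryError is thrown.
    private static final List<byte[]> CACHE = new ArrayList<>();

    public static void main(String[] args) {
        while (true) {
            CACHE.add(new byte[1024 * 1024]); // "cache" 1 MB and never evict it
        }
    }
}
```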

Disk I/O: Compared with memory, disks offer much larger storage space, but disk I/O is far slower than memory access. Although SSDs have narrowed the gap considerably, they still cannot match the read and write speed of memory.

Network: Network plays a crucial role in system performance. If you have purchased cloud services, you must have gone through the process of choosing the network bandwidth. If the bandwidth is too low, it can easily become a performance bottleneck for systems that transmit large amounts of data or have high concurrency.

Exceptions: In Java applications, throwing an exception requires constructing the exception object, filling in its stack trace, and then handling it, and this process consumes a noticeable amount of system resources. If exceptions are thrown and handled continuously under high concurrency, system performance suffers significantly.
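A rough illustration of this cost (not a rigorous benchmark, since JIT warm-up and other effects are ignored), comparing repeated throw/catch with a plain pre-check:

```java
public class ExceptionCostDemo {
    public static void main(String[] args) {
        int rounds = 100_000;

        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            try {
                throw new IllegalStateException("boom"); // stack trace filled in every time
            } catch (IllegalStateException ignored) {
            }
        }
        long withThrow = System.nanoTime() - start;

        start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            // Pre-checking the condition avoids constructing the exception at all.
            if (rounds < 0) {
                throw new IllegalStateException("boom");
            }
        }
        long withCheck = System.nanoTime() - start;

        System.out.printf("throw/catch: %d ms, plain check: %d ms%n",
                withThrow / 1_000_000, withCheck / 1_000_000);
    }
}
```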

Database: Most systems use databases, and database operations often involve disk I/O read and write. A large number of database read and write operations can cause disk I/O performance bottlenecks, resulting in delayed database operations. For systems with a large number of database read and write operations, optimizing database performance is crucial to the entire system.

Lock contention: In concurrent programming, we often need multiple threads to share and read/write the same resource. To preserve atomicity (that is, to keep the shared resource from being modified by another thread while one thread is writing to it), we use locks. Using locks can introduce context switches, and each context switch imposes performance overhead on the system. Since JDK 1.6, the JVM has made multiple optimizations to its internal locks to reduce the context switches caused by lock contention, such as biased locking, spin locks, lightweight locks, lock coarsening, and lock elimination. Using and optimizing lock resources well requires a solid grasp of operating system knowledge and Java multi-threading fundamentals, plus project experience accumulated in real scenarios.
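To make contention concrete, here is a small sketch of my own (the thread and iteration counts are arbitrary) that compares a heavily contended synchronized counter with a LongAdder, which spreads updates across internal cells to reduce contention:

```java
import java.util.concurrent.atomic.LongAdder;

public class LockContentionDemo {
    private static long syncCounter = 0;
    private static final Object LOCK = new Object();
    private static final LongAdder ADDER = new LongAdder();

    public static void main(String[] args) throws InterruptedException {
        // Heavily contended synchronized block: threads queue up and trigger context switches.
        long t1 = run(() -> {
            synchronized (LOCK) {
                syncCounter++;
            }
        });
        // LongAdder keeps per-cell counters, so contention is far lower.
        long t2 = run(ADDER::increment);
        System.out.printf("synchronized: %d ms, LongAdder: %d ms%n", t1, t2);
    }

    // Runs the given increment task from 8 threads, 1 million times each, and returns elapsed ms.
    private static long run(Runnable task) throws InterruptedException {
        Thread[] threads = new Thread[8];
        long start = System.nanoTime();
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 1_000_000; j++) {
                    task.run();
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```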

After understanding the above basic content, we can derive the following indicators to measure the performance of a general system.

Response Time #

Response time is one of the important indicators to measure system performance. The shorter the response time, the better the performance. Generally, the response time of an API is measured in milliseconds. In a system, response time can be divided into the following categories from bottom to top:


  • Database response time: The time consumed by database operations is often the most time-consuming in the entire request chain.
  • Server response time: This includes the time consumed by requests distributed by Nginx and the time consumed by server program execution.
  • Network response time: This is the time consumed by network hardware for parsing and other operations during network transmission.
  • Client response time: For ordinary web and app clients, the time consumed is negligible. However, if your client embeds a lot of logical processing, the time consumed may become longer and become a bottleneck for the system.
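To see where the time goes, it helps to measure each layer you control. Below is a minimal server-side timing sketch of my own; queryDatabase is a hypothetical stand-in for a real DAO call, and in production you would log these timings rather than print them:

```java
import java.util.function.Supplier;

public class ResponseTimeDemo {
    // Wraps any call and reports how long it took, in milliseconds.
    static <T> T timed(String label, Supplier<T> call) {
        long start = System.nanoTime();
        try {
            return call.get();
        } finally {
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println(label + " took " + ms + " ms");
        }
    }

    public static void main(String[] args) {
        String result = timed("db query", ResponseTimeDemo::queryDatabase);
        System.out.println(result);
    }

    private static String queryDatabase() {
        try {
            Thread.sleep(120); // simulate a slow database round trip
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "rows...";
    }
}
```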

Throughput #

In testing, we often pay more attention to the TPS (Transactions Per Second) of the system interface because TPS reflects the performance of the interface. The larger the TPS, the better the performance. In the system, we can also divide throughput into two types from the bottom up: disk throughput and network throughput.

Let’s first look at disk throughput. There are two key indicators for disk performance.

One is IOPS (Input/Output Operations Per Second), the number of read and write operations the disk can handle per second. It measures how many I/O requests the system can serve in a unit of time and focuses on random read/write performance, so it matters most for applications with frequent random reads and writes, such as small-file storage (e.g., images), OLTP databases, and mail servers.

The other one is data throughput, which refers to the amount of data that can be successfully transmitted within a unit of time. For applications with a large amount of sequential read and write operations, such as video editing at a TV station or Video On Demand (VOD) services, data throughput is a key indicator.

Next, let’s look at network throughput, which refers to the maximum data rate that a device can accept without frame loss during network transmission. Network throughput is not only related to bandwidth but also closely related to CPU processing power, network cards, firewalls, external interfaces, and I/O. The size of throughput is mainly determined by the processing power of the network card, internal program algorithms, and bandwidth.
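As a simple way to put a number on TPS during a test, the sketch below (my own illustration; handleTransaction merely simulates a 5 ms transaction) counts how many transactions a fixed thread pool completes in ten seconds:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

public class TpsDemo {
    public static void main(String[] args) throws InterruptedException {
        LongAdder completed = new LongAdder();
        ExecutorService pool = Executors.newFixedThreadPool(16);

        long start = System.nanoTime();
        for (int i = 0; i < 16; i++) {
            pool.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    handleTransaction();   // stand-in for one real business transaction
                    completed.increment();
                }
            });
        }

        Thread.sleep(10_000);              // measure for 10 seconds
        pool.shutdownNow();
        pool.awaitTermination(1, TimeUnit.SECONDS);

        double seconds = (System.nanoTime() - start) / 1_000_000_000.0;
        System.out.printf("TPS = %.0f%n", completed.sum() / seconds);
    }

    private static void handleTransaction() {
        try {
            Thread.sleep(5);               // simulate a 5 ms transaction
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```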

Computer resource allocation and utilization #

These are usually represented by CPU usage, memory usage, disk I/O, and network I/O. The parameters are like the staves of a barrel: if any one of them becomes a weak point, or if any allocation is unreasonable, it can have a devastating impact on overall system performance.

Load capacity #

When the system comes under pressure, you can observe whether the curve of the system response time rises smoothly. This indicator can intuitively reflect the maximum load pressure that the system can withstand. For example, when you perform load testing on a system, the system’s response time will increase as the concurrency of requests increases, until the system is unable to handle so many requests and throws a large number of errors, indicating that it has reached its limit.
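In practice you would use a load testing tool such as JMeter for this, but the idea can be sketched in a few lines of Java (callSystemUnderTest is a hypothetical stand-in for a real request): step the concurrency up and watch how the average response time changes.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.LongAdder;

public class LoadRampDemo {
    public static void main(String[] args) throws InterruptedException {
        // Raise the concurrency step by step and check whether response time rises smoothly.
        for (int threads = 10; threads <= 200; threads += 10) {
            System.out.printf("concurrency=%d, avg response=%.1f ms%n",
                    threads, averageResponseMillis(threads));
        }
    }

    private static double averageResponseMillis(int threads) throws InterruptedException {
        LongAdder totalNanos = new LongAdder();
        CountDownLatch done = new CountDownLatch(threads);
        for (int i = 0; i < threads; i++) {
            new Thread(() -> {
                long start = System.nanoTime();
                callSystemUnderTest();
                totalNanos.add(System.nanoTime() - start);
                done.countDown();
            }).start();
        }
        done.await();
        return totalNanos.sum() / (double) threads / 1_000_000;
    }

    private static void callSystemUnderTest() {
        try {
            Thread.sleep(50); // stand-in for a real request to the system under test
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```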

Summary #

Through today’s learning, we know that performance tuning can make the system stable, improve user experience, and even help save resources for large-scale systems.

However, in the early stages of a project, it is not necessary to get involved in performance optimization too early. It is sufficient to ensure efficient coding and sound program design.

After the project code is complete, we can start system testing, using the following performance metrics as the standards for performance tuning: response time, throughput, computer resource allocation and utilization, and load capacity.

Looking back at my own project experience, I have worked on e-commerce systems, payment systems, and game recharge and billing systems, all of which have millions of users and need to withstand various large-scale buying events. Therefore, my requirements for system performance are very strict. In addition to observing the above metrics to determine the performance of the system, it is also necessary to ensure the stability of the system during updates and iterations.

Here, I’d like to offer you an additional method: use the performance metrics of the previous version of the system as a reference baseline, and use automated performance testing to verify whether the performance of the system after an iterative release is abnormal. In this comparison, we look not only at direct metrics such as throughput, response time, and load capacity, but also at changes in indirect metrics such as CPU usage, memory usage, disk I/O, and network I/O.
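A minimal sketch of such a regression gate, assuming the baseline numbers and the 10% tolerance are hypothetical values you would replace with your previous release’s measurements:

```java
import java.util.Map;

public class PerformanceRegressionCheck {
    // Baseline numbers from the previous release; these values are made up for illustration.
    private static final Map<String, Double> BASELINE = Map.of(
            "tps", 1200.0,
            "p99ResponseMs", 180.0,
            "cpuUsagePercent", 65.0);

    // Flag a regression if a metric moves more than 10% in the wrong direction.
    static boolean regressed(String metric, double current, boolean lowerIsBetter) {
        double base = BASELINE.get(metric);
        return lowerIsBetter ? current > base * 1.10 : current < base * 0.90;
    }

    public static void main(String[] args) {
        // Current-release numbers would come from the automated performance test run.
        System.out.println("tps regressed: " + regressed("tps", 1100.0, false));
        System.out.println("p99 regressed: " + regressed("p99ResponseMs", 210.0, true));
        System.out.println("cpu regressed: " + regressed("cpuUsagePercent", 66.0, true));
    }
}
```

Wired into a CI pipeline, a check like this fails the build when a new release quietly degrades a metric, instead of letting the problem surface in production.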