33 When the backend service becomes noticeably slower, discuss your diagnostic approach #

In daily work, applications and systems inevitably run into performance issues. Outside of IT enterprises of a certain scale, or companies specializing in particular performance-sensitive areas, most engineers will never become dedicated performance engineers. Still, a basic grasp of performance knowledge and skills is frequently needed in everyday work and is one of the prerequisites for an engineer to advance. The ability to identify and solve performance issues is also a test of your knowledge, skills, and overall ability.

Today, I want to ask you the question: When backend services show obvious “slowness,” what is your diagnostic approach?

Typical Answer #

First, the problem needs a clearer definition:

  • Did the service become slow suddenly, or was it observed to slow down after running for a long time? Has a similar problem occurred before?
  • What does “slow” mean here? Can it be understood as the system’s response time to requests becoming longer?

Second, clarify the symptoms of the problem; this helps narrow down the specific cause. Here are some directions to consider:

  • The problem may originate from the Java service itself, or the service may simply be affected by other components of the system. As an initial judgment, confirm whether any unexpected program errors have occurred, for example by checking the application’s own error logs.

For distributed systems, many companies have more systematic logging and performance-monitoring infrastructure in place. Some Java diagnostic tools can also be used here; for example, JFR (Java Flight Recorder) can be used to monitor whether a large number of exceptions of certain types are occurring.

If there are exceptions, they may be a breakthrough point.

If not, start by checking system-level resources: are CPU, memory, or other resources heavily occupied by other processes, and is that occupation out of line with the system’s normal operation?

  • Monitor the Java service itself: for example, check the GC logs for frequent Full GC or lengthening Minor GC pauses, use tools such as jstat to obtain memory-usage statistics, and use jstack to check for deadlocks and suspicious thread states (a minimal command sketch follows this list).

  • If the specific problem still cannot be pinned down, profiling the application is another option; however, because it is somewhat intrusive to the running system, it is generally not recommended for production systems unless absolutely necessary.

  • Once a program error or JVM configuration problem has been located, take the corresponding remedial measures and then verify whether the problem is resolved; otherwise, the steps above may need to be repeated.
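
Below is a minimal command sketch for the monitoring step above, assuming a Linux host with the JDK tools on the PATH; `<pid>` is a placeholder for the Java process ID.

    # GC and memory-region statistics: sample every 1000 ms, 10 samples in total.
    jstat -gcutil <pid> 1000 10

    # Thread dump: jstack prints "Found one Java-level deadlock" if a deadlock
    # exists, and also shows per-thread states (RUNNABLE, BLOCKED, WAITING, ...).
    jstack <pid>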

Analysis of the Examination Point #

Today, I have chosen a common and practical performance-related question. My answer consists of two parts.

  • Before giving a direct answer, first pin down a more precise definition of the problem. Sometimes the interviewer does not state it clearly, so confirm your understanding before going into a detailed response.
  • Gradually narrow down the problem domain from different perspectives and at different levels of the system and application, isolating the real cause as much as possible. The specific steps may vary, but the more experience you gain with such problems, the sharper your intuition becomes.

Most engineers may not have had a comprehensive opportunity to diagnose performance issues. If asked about it, there is no need to be overly nervous. You can demonstrate your way of thinking in diagnosing problems and showcase your knowledge and ability to apply it comprehensively. When encountering an unfamiliar problem, being able to systematically determine the troubleshooting steps through communication is also a demonstration of your abilities.

The interviewer may inquire more deeply about a certain aspect of performance diagnosis. Considering the requirements of both work and the interview, I will provide some introductions below. The goal is to give you an overall impression of performance analysis so that when faced with specific domain problems, even if you don’t know the specific details of tools and techniques, you can at least find the direction to explore and query.

  • I will introduce the commonly used performance analysis methodologies in the industry.
  • From system analysis to JVM and application performance analysis, grasp the overall thinking process and key tools. I have already introduced many aspects such as thread states and JVM memory usage in previous columns, and today’s lecture can also be regarded as a summary focusing on performance.

If you are interested in systematic learning, I recommend referring to “Java Performance” by Charlie Hunt or “Java Performance: The Definitive Guide” by Scott Oaks. Additionally, it is better to read the English version to avoid any misunderstandings.

Knowledge Expansion #

First, let’s understand the most widely used performance analysis methodology in the industry.

Depending on the system architecture, there are differences in approaches between distributed systems and large monolithic applications. For example, performance bottlenecks in distributed systems may be more concentrated. Traditionally, performance tuning has mostly focused on optimizing monolithic applications, and this column also focuses on that. Charlie Hunt summarized his methodology into two categories:

  • Top-down approach: Starting from the top level of the application and gradually delving into specific modules or even more technical details, finding possible problems and solutions. This is the most common approach to performance analysis and the choice of most engineers.
  • Bottom-up approach: Starting from low-level hardware like the CPU, identifying problems and optimization opportunities such as Cache-Miss, focusing on instruction-level optimization. This is often a skill that professional performance engineers possess, and it requires professional tools. It is usually done when porting to a new platform or when extreme performance is required.

For example, when porting a big-data application to hardware based on the SPARC architecture, the work involves comparing performance and tapping the hardware’s potential as fully as possible while minimizing changes to the source code.
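
As a rough, hedged illustration of what the bottom-up entry point can look like on Linux (not from the original text; `<pid>` is a placeholder and the available events depend on your CPU and kernel), hardware counters can be sampled with perf:

    # Attach to a running process for 10 seconds and report basic hardware counters,
    # including cache misses, which hint at memory-locality problems.
    perf stat -e cycles,instructions,cache-references,cache-misses -p <pid> sleep 10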

The answer I provided first attempts to eliminate functional errors and then follows the typical top-down analysis approach.

Second, let’s take a look at the common tools and approaches for each stage of the top-down analysis. It’s important to note that specific tools can vary greatly depending on the operating system.

In system performance analysis, CPU, memory, and IO are the main areas of focus.

For CPU, if you are using Linux, you can start by using the top command to view the load condition. The screenshot below is a state I captured.

As you can see, the average load values (over 1, 5, and 15 minutes) are very low, and there is no sign of them rising at the moment. If these values are very high relative to the number of CPU cores (for example, above 50% to 60% of the core count) and the short-term average is higher than the long-term average, the load is heavy; if there is also an increasing trend, extra caution is warranted.
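
For orientation only (the numbers below are made up, not taken from the screenshot), the same three load averages also appear in the output of the uptime command and on the first line of top:

    uptime
    # Illustrative output:  17:32:01 up 12 days,  3:41,  2 users,  load average: 0.12, 0.08, 0.06
    #                       (1-minute, 5-minute, and 15-minute averages, in that order)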

There are many further steps you can take for investigation. For example, in Lecture 18 of this column, I asked how to find the most CPU-consuming Java thread. Here is a brief overview of the steps:

  • Use the top command to obtain the PID of the Java process; the -H option switches to thread mode, and you can combine it with grep for more precise positioning.

    top -H

  • Convert the ID of the busiest thread (shown in the PID column in thread mode) to hexadecimal.

    printf "%x" your_thread_id

  • Finally, use jstack to obtain the thread stacks and find the thread whose nid matches that hexadecimal ID. A consolidated sketch of these steps follows below.
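
Putting the steps together, a hypothetical end-to-end run might look like this (PID 12345 and thread ID 12360 are made-up placeholders):

    top -H -p 12345                            # thread mode, limited to the target Java process
    printf "%x\n" 12360                        # convert the busiest thread ID to hex -> 3048
    jstack 12345 | grep -A 20 'nid=0x3048'     # inspect that thread's stack by its nid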

Of course, there are more general diagnostic directions, such as using vmstat to look at the number of context switches. For example, the following command samples every 1 second, 10 times in total.

    vmstat 1 10

The output is as follows:

![](../images/abd28cb4a771365211e1a01d628213a0-20221127211745-e9gnbqh.png)

If the number of context switches per second (cs, [context switch](https://en.wikipedia.org/wiki/Context_switch)) is very high and much higher than system interrupts (in, system [interrupt](https://en.wikipedia.org/wiki/Interrupt)), it likely indicates that unreasonable thread scheduling is causing the issue. Of course, more specific analysis using tools like [pidstat](https://linux.die.net/man/1/pidstat) is needed for further pinpointing, but I won't go into further details here.
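
Purely for reference, a minimal pidstat invocation might look like the sketch below (assuming the sysstat package is installed; `<pid>` is a placeholder):

    # -w reports voluntary (cswch/s) and involuntary (nvcswch/s) context switches
    # for the given process, sampled every second, five times.
    pidstat -w -p <pid> 1 5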

Apart from CPU, memory and IO are the other main considerations (a brief command sketch follows this list), for example:

  * Use commands like `free` to check memory usage.
  * You can also look further at swap and paging activity. In the output of the `top` command, VIRT is the process's virtual memory size and RES is its resident (physical) memory, so the gap between them gives a rough sense of how much of the address space is not actually resident in RAM. Obviously, the JVM does not want to see heavy swap usage.
  * IO issues may occur with disk IO or network IO. For example, commands like `iostat` can help determine whether disk IO is a bottleneck. I once helped diagnose a performance problem with Java services deployed on machines from a cloud vendor in China; the root cause turned out to be poor IO performance dragging down the whole service, and the fix was to request a machine replacement.
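
The sketch below lists typical starting-point commands for memory and disk IO (an illustration only, assuming the sysstat package is installed):

    free -m            # memory and swap usage, in MiB
    vmstat 1 5         # the si/so columns show swap-in / swap-out activity
    iostat -x 1 5      # extended per-device statistics; watch %util and await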



Speaking of which, if you're very interested in system performance, I recommend referring to the comprehensive diagram provided by [Brendan Gregg](https://www.brendangregg.com/linuxperf.html). What I have introduced here can only be considered as a small portion. However, I still suggest combining it with actual needs to avoid getting lost in the details.

![](../images/93aa8c4516fd2266472ca4eab1b0cc40-20221127211745-0mrzxg2.png)

For **JVM-level performance analysis**, we have already covered a lot:

  * Use tools like JMC, JConsole, etc., for runtime monitoring.
  * Use various tools to perform heap dump analysis at runtime, or to obtain statistical data from various perspectives (e.g., analyzing GC and memory regions with `jstat -gcutil`).
  * Use GC logs and similar sources to diagnose frequent Full GC, Minor GC behavior, reference accumulation, and so on. A brief command sketch follows this list.
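
As a hedged illustration of the last two points (commands assume a HotSpot JDK; `<pid>` and the file names are placeholders, and GC-logging flags differ between JDK 8 and JDK 9+):

    # Heap dump of live objects, for offline analysis in MAT, JMC, etc.
    jmap -dump:live,format=b,file=heap.hprof <pid>
    # or equivalently via jcmd:
    jcmd <pid> GC.heap_dump heap.hprof

    # GC logging is enabled with startup flags, e.g.
    #   JDK 8:  -XX:+PrintGCDetails -Xloggc:gc.log
    #   JDK 9+: -Xlog:gc*:file=gc.log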



There is no one-size-fits-all method here, and specific problems can be very different. It also depends on whether you can make full use of these tools to gradually identify the cause of the problem based on various signs.

For **application** [Profiling](https://en.wikipedia.org/wiki/Profiling_\(computer_programming\)), in simple terms, it is the use of intrusive methods to collect runtime details of a program to identify performance bottlenecks. These details include memory usage, the most frequently called methods, context switch situations, etc.

As noted in the typical answer above, I generally do not recommend profiling production systems; it is mostly done during performance testing. However, when there is a genuine need on a production system, there are options. I recommend using JFR in combination with [JMC](http://www.oracle.com/technetwork/java/javaseproducts/mission-control/java-mission-control-1998576.html) for profiling. JFR collects low-level information inside the HotSpot JVM and has been heavily optimized, so its performance overhead is very low, usually less than **2%**. What's more, such a powerful tool has also been open-sourced by Oracle!

Therefore, JFR/JMC fully possesses the ability to profile production systems, and it has indeed been used in real-world large-scale deployments to quickly identify issues.

Its usage is also very convenient: you do not need to restart the system or add configuration in advance. For example, you can start a JFR recording at runtime and have a recording of a given duration written to a file:

    jcmd <pid> JFR.start duration=120s filename=myrecording.jfr

Then open the ".jfr" file with JMC for analysis. It provides views of hot methods, exceptions, threads, IO, and more, and is very powerful. If you want to learn more details, you can refer to the relevant [guide](https://blog.takipi.com/oracle-java-mission-control-the-ultimate-guide/).
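
A few companion commands can also be handy; the sketch below is illustrative only (the recording number and file name are made up, and exact options can vary between JDK versions):

    jcmd <pid> JFR.check                            # list in-flight recordings and their numbers
    jcmd <pid> JFR.dump name=1 filename=snap.jfr    # snapshot what recording 1 has gathered so far
    jcmd <pid> JFR.stop name=1                      # stop recording 1 explicitly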

Today, starting from a typical performance problem, I systematically walked through the thought process of everyday performance analysis, from the presenting symptoms to concrete system-level and JVM-level analysis, and added extensions from both a methodological and a practical perspective so that you can combine theory with practice. I hope you find it helpful.

Exercise #

Do you have a clear understanding of the topic we discussed today? Today's question is: What are the main ways to obtain data using profiling tools? What are their advantages and disadvantages?

Please write your thoughts on this question in the comments section. I will select thoughtful comments and give you a learning reward coupon. Feel free to discuss with me.

Are your friends also preparing for interviews? You can "invite friends to read" and share today's topic with them. Perhaps you can help them.