04 Tool Implementation: How to Obtain Code Performance Data #

First, let’s answer the question from the previous lesson. If disk speed is slow, why does Kafka still achieve such high throughput when operating on the disk?

The reason is that a disk's slowness comes mainly from seek operations: Kafka's official tests show that a single seek can take as long as 10 ms, and sequential writes can be up to 6,000 times faster than random writes. Kafka therefore writes to disk sequentially.
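To get an intuitive feel for this gap, here is a minimal Java sketch that times sequential writes against writes preceded by a random seek. All class and method names are my own, and the measured ratio depends heavily on the drive and the OS page cache: on an SSD with buffered I/O the gap will be far smaller than 6,000x, so treat this as an illustration, not a benchmark.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.Random;

public class SeqVsRandomWrite {
    static final int BLOCK = 4096;    // 4 KB per write
    static final int BLOCKS = 2048;   // 8 MB in total

    // Append 4 KB blocks one after another: no repositioning between writes
    static long sequentialWrite(Path file) {
        byte[] buf = new byte[BLOCK];
        long start = System.nanoTime();
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            for (int i = 0; i < BLOCKS; i++) {
                raf.write(buf);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return System.nanoTime() - start;
    }

    // Seek to a random block before every write: each write pays for positioning
    static long randomWrite(Path file) {
        byte[] buf = new byte[BLOCK];
        Random rnd = new Random(42);
        long start = System.nanoTime();
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            raf.setLength((long) BLOCK * BLOCKS);
            for (int i = 0; i < BLOCKS; i++) {
                raf.seek((long) rnd.nextInt(BLOCKS) * BLOCK);
                raf.write(buf);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        Path seq = java.nio.file.Files.createTempFile("seq", ".dat");
        Path rnd = java.nio.file.Files.createTempFile("rnd", ".dat");
        System.out.printf("sequential: %d us, random: %d us%n",
                sequentialWrite(seq) / 1000, randomWrite(rnd) / 1000);
    }
}
```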

As we learned in the previous lesson, in order to perform a thorough investigation, we need to collect detailed performance data, including operating system performance data, JVM performance data, and application performance data.

So, how can we obtain this data? In this lesson, I will introduce a series of commonly used performance testing tools.

nmon — Obtain System Performance Data #

In addition to commands such as top and free introduced in the previous lesson, there are monitoring tools that consolidate all of these resource views in one place.

nmon is a long-established Linux performance monitoring tool. It not only has a clean monitoring interface (as shown in the figure below), but can also generate detailed monitoring reports.

[Figure: nmon monitoring interface]

When evaluating application performance, I usually include nmon reports, which make the test results more convincing. You can also try this in your daily work.

The operating system performance metrics introduced in the previous lesson can all be obtained from nmon. It has a wide monitoring range, including CPU, memory, network, disk, file system, NFS, system resources, and other information.

nmon can be downloaded from SourceForge; I have already downloaded it and uploaded it to the repository. For example, on my CentOS 7 system, simply run the matching build:

./nmon_x86_64_centos7

Press c to add the CPU panel; press m to add the memory panel; press n to add the network panel; press d to add the disk panel, and so on.

The following command collects data every 5 seconds, 12 times in total, and records everything within that period to a file. In this run it generated a file named localhost_200623_1633.nmon, which we can download from the server.

./nmon_x86_64_centos7 -f -s 5 -c 12 -m .

Note: After executing the command, the process can be found using the ps command.

[root@localhost nmon16m_helpsystems]# ps -ef| grep nmon
root      2228     1  0 16:33 pts/0    00:00:00 ./nmon_x86_64_centos7 -f -s 5 -c 12 -m .

Use the nmonchart tool (see repository) to generate an HTML file. The following screenshot shows the generated file.

[Figure: nmonchart report]

jvisualvm — Obtain JVM Performance Data #

jvisualvm was originally released as part of the JDK, but starting from Java 9, it has been released as a separate tool. With jvisualvm, you can understand the internal status of an application during runtime. You can connect to local or remote servers to monitor a large amount of performance data.

Through its plugin mechanism, jvisualvm can be extended with even more powerful features. As shown in the figure below, I recommend installing all of the plugins for a better experience.

[Figure: jvisualvm plugin installation]

To monitor a remote application, you need to add JMX parameters to the application being monitored.

-Dcom.sun.management.jmxremote.port=14000
-Dcom.sun.management.jmxremote.authenticate=false 
-Dcom.sun.management.jmxremote.ssl=false

The above configuration opens the JMX connection port 14000 and configures it to connect without SSL security authentication.
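The numbers jvisualvm displays over such a JMX connection come from the JVM's platform MBeans. As a rough sketch of what is exposed (class name is my own; this reads the same beans in-process rather than remotely), consider:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadMXBean;

public class JmxLocalPeek {
    public static void main(String[] args) {
        // Heap usage, as shown on jvisualvm's Monitor tab
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        System.out.printf("heap used: %d, max: %d%n", heap.getUsed(), heap.getMax());

        // Live thread count, as shown on the Threads tab
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        System.out.println("live threads: " + threads.getThreadCount());
    }
}
```

A remote jvisualvm session reads exactly this kind of data, just through the JMX port opened above instead of in-process.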

For performance optimization, we mainly use its sampler. Note that sampling noticeably affects the performance of the program being profiled, so we generally use this feature only in a test environment.

[Figure: jvisualvm CPU performance sampling chart]

For a Java application, besides CPU metrics, memory behavior and garbage collection are performance points that cannot be ignored. With jvisualvm we mainly perform the following three kinds of analysis:

  • CPU analysis: Count the number of method executions and their execution time. This data can be used to analyze which method has a long execution time and becomes the hot spot.
  • Memory analysis: Analyze memory using methods such as memory monitoring and memory snapshots to detect memory leak problems and optimize memory usage.
  • Thread analysis: View thread status changes and detect deadlock situations.
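The deadlock detection mentioned in the last bullet is backed by the JVM's own ThreadMXBean, which jvisualvm queries for you. As a hedged sketch (class and method names are mine), this is how a deadlock can be created and then detected programmatically:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class DeadlockDetect {

    // Start a daemon thread that takes `first`, waits for the other thread, then tries `second`
    static void locker(Object first, Object second, CountDownLatch ready) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                ready.countDown();
                try { ready.await(); } catch (InterruptedException ignored) {}
                synchronized (second) { }   // never reached: the other thread holds it
            }
        });
        t.setDaemon(true);                  // daemon, so the JVM can still exit
        t.start();
    }

    // Build a two-thread lock-ordering deadlock and ask the JVM to detect it
    public static long[] createAndFindDeadlock() {
        Object a = new Object(), b = new Object();
        CountDownLatch ready = new CountDownLatch(2);
        locker(a, b, ready);                // locks a, then wants b
        locker(b, a, ready);                // locks b, then wants a
        try { Thread.sleep(500); } catch (InterruptedException ignored) {}
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        return mx.findDeadlockedThreads();  // null when no deadlock exists
    }

    public static void main(String[] args) {
        long[] ids = createAndFindDeadlock();
        System.out.println("deadlocked threads: " + (ids == null ? 0 : ids.length));
    }
}
```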

JMC — Obtain Detailed Performance Data for Java Applications #

For the commonly used HotSpot JVM, there is an even more powerful tool: JMC (Java Mission Control). JMC integrates a very useful feature called JFR (Java Flight Recorder).

The name Flight Recorder comes from the "black box" used in airplanes to record information for later analysis. Since Java 11, a recording can be driven with the jcmd command, which supports five JFR subcommands: JFR.configure, JFR.check, JFR.start, JFR.dump, and JFR.stop. A typical session runs start, then dump, then stop. For example:

jcmd <pid> JFR.start
jcmd <pid> JFR.dump filename=recording.jfr
jcmd <pid> JFR.stop

JFR is built into the JVM and requires no extra dependencies; it can be used directly. It monitors a large amount of data, such as the lock contention, latency, and blocking mentioned earlier, and can even observe internal JVM activities such as safepoints and JIT compilation.
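Besides jcmd, on JDK 11+ a recording can also be driven from code via the jdk.jfr API. Below is a minimal sketch under that assumption (the class name and the work being recorded are mine); each call is annotated with the jcmd command it mirrors:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import jdk.jfr.Configuration;
import jdk.jfr.Recording;

public class JfrSketch {

    // Start a recording with the JDK's built-in "default" profile,
    // do some work, then dump the recording to a temp file
    public static Path record() {
        try {
            Path out = Files.createTempFile("recording", ".jfr");
            Configuration config = Configuration.getConfiguration("default");
            try (Recording r = new Recording(config)) {
                r.start();                      // like: jcmd <pid> JFR.start
                byte[][] garbage = new byte[100][];
                for (int i = 0; i < 100; i++) {
                    garbage[i] = new byte[1024 * 1024];   // allocation worth recording
                }
                r.stop();                       // like: jcmd <pid> JFR.stop
                r.dump(out);                    // like: jcmd <pid> JFR.dump
            }
            return out;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("recording bytes: " + record().toFile().length());
    }
}
```

The resulting .jfr file can be opened directly in JMC for the analysis views described next.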

JMC integrates the functionality of JFR. Now let’s introduce the use of JMC.

1. Recording #

The image below shows the result of a one-minute recording of Tomcat. You can access the corresponding performance interface from the menu bar on the left.

[Figure: JMC recording result main interface]

By recording data, you can clearly understand the performance data of the operating system resources and the JVM internals during a certain minute.

2. Threads #

By selecting the corresponding thread, you can understand the execution status of the thread, such as Wait, Idle, Block, and other states and timing.

Taking the C2 compiler threads as an example, you can see the hot methods in detail and the code size after method inlining. As shown in the figure below, C2 was running flat out during this recording.

[Figure: JMC recording result - Threads interface]

3. Memory #

Through the memory interface, you can see the memory allocation situation for each time period. This feature is very useful for troubleshooting memory overflow, memory leaks, and other situations.

[Figure: JMC recording result - Memory interface]
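The leak pattern this interface helps uncover can be reproduced crudely in code: a static collection that grows but is never cleared is the classic case, and the resulting heap growth is visible through MemoryMXBean. The class name and sizes below are illustrative only:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.util.ArrayList;
import java.util.List;

public class LeakDemo {
    // A static collection that is filled but never cleared: a classic leak
    static final List<byte[]> CACHE = new ArrayList<>();

    // Add `megabytes` of permanently retained data, return observed heap growth
    public static long leakAndMeasure(int megabytes) {
        MemoryMXBean mx = ManagementFactory.getMemoryMXBean();
        long before = mx.getHeapMemoryUsage().getUsed();
        for (int i = 0; i < megabytes; i++) {
            CACHE.add(new byte[1024 * 1024]);   // retained forever: GC cannot reclaim it
        }
        return mx.getHeapMemoryUsage().getUsed() - before;
    }

    public static void main(String[] args) {
        System.out.println("heap growth: " + leakAndMeasure(50) + " bytes");
    }
}
```

In JMC's Memory interface, a retained collection like this shows up as a baseline that keeps rising across GC cycles.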

4. Locks #

Information about highly contended locks and any deadlocks can be found in the Locks Information interface.

You can observe the specific IDs of some locks and the associated thread information, which can be analyzed in a correlated manner.

[Figure: JMC recording result - Locks Information interface]

5. Files and Sockets #

The Files and Sockets interface can monitor read and write operations on I/O, and the interface is clear at a glance. If your application has heavy I/O operations, such as frequent logging or network read and write operations, you can monitor the corresponding information here and associate it with the execution stack.

[Figure: JMC recording result - Files and Sockets interface]

6. Method Calls #

This is similar to the functionality of jvisualvm. It displays information and rankings of method calls. From here, you can see some high-cost methods and hot methods.

[Figure: JMC recording result - Method Calls]

7. Garbage Collection #

Frequent garbage collection can negatively impact application performance. JFR provides detailed records of garbage collection, including when it occurred, which garbage collector was used, the duration of each garbage collection, and even the reasons behind it. All of this information can be found here.

[Figure: JMC recording result - Garbage Collection]
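Outside of a JFR recording, the same basic GC counters (how many collections ran and for how long) are also exposed through the standard GarbageCollectorMXBean. A small sketch, with the class name being my own:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats {
    // Print per-collector statistics and return the total collection count
    public static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            total += gc.getCollectionCount();
        }
        return total;
    }

    public static void main(String[] args) {
        System.gc();   // request a collection so the counters are likely non-zero
        totalCollections();
    }
}
```

JFR records far richer detail than these counters (pause phases, causes, per-region data), which is why the Garbage Collection interface is the better tool for real analysis.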

8. JIT #

The code compiled by JIT will be executed very quickly, but it requires a compilation process. The compilation interface displays detailed information about the JIT compilation process, including the size of the generated CodeCache and method inlining information.

[Figure: JMC recording result - JIT information]

9. TLAB #

By default, the JVM allocates a buffer area for each thread to accelerate object allocation. This is the concept of TLAB (Thread Local Allocation Buffer). This buffer is placed in the Eden space.

The principle is similar to the ThreadLocal class in the Java language: by avoiding operations on a shared area, it reduces lock contention. The interface shown in the figure below details the allocation behavior.

[Figure: TLAB information in JMC recording results]
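TLAB behavior itself is internal to the JVM, but the per-thread allocation it serves can be observed through the HotSpot-specific com.sun.management.ThreadMXBean extension. This sketch assumes you are on HotSpot (the cast fails on other JVMs) and that thread allocation measurement is enabled, which it is by default; the class name is mine:

```java
import java.lang.management.ManagementFactory;
// HotSpot-specific extension of java.lang.management.ThreadMXBean
import com.sun.management.ThreadMXBean;

public class TlabAllocation {
    // Measure how many bytes the current thread allocates for 1000 small objects
    public static long allocatedByCurrentThread() {
        ThreadMXBean mx = (ThreadMXBean) ManagementFactory.getThreadMXBean();
        long id = Thread.currentThread().getId();
        long before = mx.getThreadAllocatedBytes(id);
        byte[][] chunks = new byte[1000][];
        for (int i = 0; i < 1000; i++) {
            chunks[i] = new byte[1024];   // small objects are served from the TLAB
        }
        return mx.getThreadAllocatedBytes(id) - before;
    }

    public static void main(String[] args) {
        System.out.println("allocated: " + allocatedByCurrentThread() + " bytes");
    }
}
```

JMC's TLAB interface breaks this same allocation down further, showing how much was allocated inside TLABs versus outside them.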

In the subsequent lessons, we will have multiple analysis cases that use this tool.

Arthas — Obtain the Call Chain Time of a Single Request #

Arthas is a Java diagnostic tool that can be used to investigate memory overflows, CPU spikes, high load, and other issues. It can be thought of as a collection of commands such as jstack and jmap.

[Figure: Arthas startup interface]

Arthas supports many commands. Let’s take the trace command as an example.

Sometimes we find that the response time of an interface is very high but cannot pinpoint the cause. In that case, the trace command helps: it records the execution of every method in the call chain, computes the time cost of each node, and finally prints the result as a tree, so many performance problems can be spotted at a glance.

Below is an example of the execution result.

$ trace demo.MathGame run
Press Q or Ctrl+C to abort.
Affect(class-cnt:1 , method-cnt:1) cost in 28 ms.
`---ts=2019-12-04 00:45:08;thread_name=main;id=1;is_daemon=false;priority=5;TCCL=sun.misc.Launcher$AppClassLoader@3d4eac69
    `---[0.617465ms] demo.MathGame:run()
        `---[0.078946ms] demo.MathGame:primeFactors() #24 [throws Exception]

`---ts=2019-12-04 00:45:09;thread_name=main;id=1;is_daemon=false;priority=5;TCCL=sun.misc.Launcher$AppClassLoader@3d4eac69
    `---[1.276874ms] demo.MathGame:run()
        `---[0.03752ms] demo.MathGame:primeFactors() #24 [throws Exception]

In the upcoming lessons, we will also provide examples to demonstrate how to identify the specific causes of problems.

wrk — Obtain Performance Data of Web Interfaces #

wrk (available on GitHub) is an HTTP load testing tool. Like the ab command, it is a command-line tool.

Let’s take a look at its execution result.

Running 30s test @ http://127.0.0.1:8080/index.html
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   635.91us    0.89ms  12.92ms   93.69%
    Req/Sec    56.20k     8.07k   62.00k    86.54%
  22464657 requests in 30.00s, 17.76GB read
Requests/sec: 748868.53
Transfer/sec:    606.33MB

As you can see, wrk reports statistics for common performance metrics, which makes it very useful for load testing web services. wrk also supports Lua scripts that hook into phases such as setup, init, delay, request, and response, allowing more realistic simulation of user requests.

Summary #

In order to obtain more performance data, we introduced the following five tools in this lesson.

  • nmon - Obtain system performance data
  • jvisualvm - Obtain JVM performance data
  • jmc - Obtain detailed performance data of Java applications
  • arthas - Obtain the call chain time of a single request
  • wrk - Obtain performance data of web interfaces

These tools range from the system level to the application level, and from statistical summaries to fine-grained detail. When locating performance issues, you need to use them flexibly: understand the application's overall behavior from a high-level view, then drill into the details to find the bottleneck. This gives you comprehensive command of the application's performance.

These tools help us identify system bottlenecks. But once we optimize the code, how do we evaluate the effect of the optimization? And how can we test code snippets quickly and rigorously? In the next lesson, I will introduce benchmarking with JMH to answer these questions.