35 | Extra Lesson 5: Tools You Must Use to Analyze Java Issues (Part 2) #

In the previous lesson, we introduced some built-in tools in the JDK and the network packet-capturing tool Wireshark to analyze and locate Java program issues. Many students, after watching the lecture, left comments saying “it opened up a new world for me, I never paid attention to the JVM before” and “using JVM tools helped me discover the cause of production OOM”.

Indeed, tools are essential aids that help us delve into the internals of frameworks and components, allowing us to understand their workings and principles. Therefore, we must make good use of them.

Today, I will continue to introduce how to use the JVM heap dump tool, MAT, to analyze OOM issues, as well as how to use the versatile troubleshooting tool, Arthas, to analyze and locate high CPU issues.

Analyzing OOM Issues Using MAT #

The best way to troubleshoot OOM issues and analyze heap memory usage is to analyze heap dumps.

A heap dump contains a complete snapshot of the heap and, since Java 6 Update 14, thread stack information. As we saw in the previous article, tools such as jstat can observe changes in heap memory usage, but they reveal little about how many objects the program holds or which of them are large, so they can only observe the problem, not locate it. A heap dump, by contrast, is like an MRI scan of the patient at a specific moment, allowing for detailed analysis.

An OutOfMemoryError is a serious problem whose root cause must be identified, so production applications are usually started with JVM parameters that capture a heap dump when an OOM occurs:

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=.
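For reference, a full launch command might look like the following sketch; the heap size, dump directory, and app.jar name here are placeholders rather than settings taken from this example:

java -Xmx512m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data/dumps -jar app.jar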

In the previous article, we mentioned the jvisualvm tool, which can also take a heap dump and open it for analysis with one click. However, jvisualvm's heap dump analysis is not very powerful: it can only show a histogram of memory usage by class and cannot effectively trace reference relationships. Therefore, I recommend using Eclipse's Memory Analyzer (MAT) for heap dump analysis; you can download it from the Eclipse MAT project site.

The typical approach for analyzing OOM issues using MAT is as follows:

  1. Use the dominator tree or histogram functionality to identify the types that consume the most memory and analyze the probable cause of the memory leak.
  2. Examine the types that consume the most memory, the detailed object reference lists, and their reference chains to pinpoint the specific points of the memory leak.
  3. Use the object attributes view to see various attribute values and dependencies of objects from the heap dump without having to refer to the source code, helping to understand the program’s logic and parameters.
  4. Optionally, use the thread stack view to determine if the OOM issue is related to excessive threads and even identify the thread that threw the exception at the last moment of OOM.

For example, I have a heap dump file named java_pid29569.hprof obtained after an OOM event. Now I want to analyze the cause of this OOM using MAT’s histogram, dominator tree, thread stack, and OQL (Object Query Language) capabilities.

First, after opening the heap dump file in MAT, we are presented with an overview information interface that shows the total heap size is 437.6MB.

img

Now, what are the objects that occupy this 437.6MB?

As shown in the figure, the second button in the toolbar opens the histogram, which groups objects by type and lists the number of instances and memory usage for each class. We can see that char[] arrays occupy the most memory and there are also many instances of the String class, which suggests that the program may be running out of memory due to excessive string usage (as strings use char[] arrays to store actual data).

img

Let’s continue our analysis to confirm if this is indeed the case.

Right-clicking char[] in the histogram and selecting "List objects -> with incoming references" displays all char[] instances along with their complete reference chains:

img

Expanding one of the char[] arrays at random, the following screenshot shows its reference chain:

img

Next, let’s follow the reference chain in the red box to try to find the sources of these large char[]:

At ①, we can see that these char[] are almost 10,000 characters long, occupying around 20,000 bytes (since char is UTF-16 and each character occupies 2 bytes).

At ②, we can see that the char[] is referenced by the value field of the String, indicating that the char[] comes from a string.

At ③, we can see that the String is referenced by the elementData field of the ArrayList, indicating that these strings are added to an ArrayList.

At ④, we can see that the ArrayList is referenced by the data field of FooService, and the value of this ArrayList in the Retained Heap column is 431MB.

Retained Heap represents the memory occupied by an object itself and the objects associated with it, while Shallow Heap represents the memory occupied by the object itself. For example, in our FooService, the data ArrayList itself occupies only 16 bytes, but all the associated objects occupy a total of 431MB. This indicates that somewhere in the code, string data is continuously being added to this List, leading to an OOM error.
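To make the distinction concrete, here is a minimal sketch (a hypothetical Holder class, not taken from the dump) annotated with how the two metrics differ:

// Hypothetical example illustrating Shallow Heap vs. Retained Heap
class Holder {
    // Shallow heap of a Holder instance: the object header plus one reference field,
    // a few dozen bytes at most.
    // Retained heap of the same instance: the Holder itself, plus the ArrayList, its backing
    // Object[], and every String (with its char[]) reachable only through this Holder.
    // That is exactly why FooService.data shows 431MB in the Retained Heap column.
    java.util.List<String> data = new java.util.ArrayList<>();
}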

The blue box on the left allows us to view the internal properties of each instance. The image shows that FooService has a data property of type ArrayList.

If we want to see the complete content of a string, we can right-click and select Copy->Value to copy the value to the clipboard or save it to a file:

img

Here, we copied 10,000 characters of the letter ‘a’ (as shown in the red box below). In real cases, viewing the actual content of large strings or data is very useful for identifying the source of the data.

img

Based on these observations, we can now reconstruct the actual code.

In fact, when we used the histogram to locate FooService earlier, we took a detour. You can click the third button in the toolbar (shown in the red box in the upper left of the image) to open the dominator tree view (for the concept of dominator trees, see the MAT documentation). This view lists objects in descending order of retained heap, so the objects that retain the most memory appear first.

As we can see, FooService is ranked first, with the path being FooService->ArrayList->Object[]->String->char[] (as shown in the blue box), and there are a total of 21,523 strings (as shown in the green box).

img

In this way, we have located FooService as the source from a memory perspective. So, what logic is FooService executing when the OOM occurs?

To solve this problem, we can click on the fifth button in the toolbar (as shown in the red box in the image) to open the thread view. The first thread we see is the main thread (in the Name column), and when expanded, we can indeed find FooService:

img

The most recently executed methods are at the top of the stack, so the top of the thread stack is the currently executing method, and reading downward line by line reveals the entire call path. Since we want to understand the FooService.oom() method, who calls it, and what it calls internally, we analyze the call stack starting from the FooService.oom() method (shown in the blue box).

Looking down, the green box shows that the oom() method is called by OOMApplication's run method, which in turn is called by SpringApplication.callRunner. Seeing CommandLineRunner in the parameters, you can deduce that OOMApplication implements the CommandLineRunner interface, so it is executed after the Spring Boot application starts.

Starting from FooService and looking up, the Collectors and IntPipeline in the purple box suggest that these strings are generated by Stream operations. Looking further up, we can see that the OutOfMemoryError was thrown during a StringBuilder append operation (in the black box), which means this is the thread that threw the OOM exception. We can also see that the whole program is a Spring Boot application. Is FooService a Spring Bean? Is it a singleton? If we can answer that, we can confirm that FooService's oom() method is being called over and over, causing its internal ArrayList to keep growing.

Click on the fourth button in the toolbar (as indicated by the red box in the figure below) to go to the OQL interface. In this interface, we can use syntax similar to SQL to search for data in the heap dump (you can directly search for “OQL Syntax” in the MAT help menu to see the detailed syntax of OQL).

For example, enter the following statement to search for instances of FooService:

SELECT * FROM org.geekbang.time.commonmistakes.troubleshootingtools.oom.FooService
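OQL also supports WHERE clauses on fields and on built-in attributes such as @retainedHeapSize, so you can filter the result set. For example, a query along these lines would list only the larger strings (treat it as a sketch: the exact field layout of String depends on the JDK that produced the dump):

SELECT * FROM java.lang.String s WHERE s.@retainedHeapSize > 10000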

Running this query shows that there is only one FooService instance. Next, we can use the "List objects" feature to find the objects that reference it:

img

We get the following results:

img

We can see that there are two references in total:

The first reference is from OOMApplication, which we already know.

The second reference is from a ConcurrentHashMap. It can be seen that this HashMap is the singletonObjects field of DefaultListableBeanFactory, which confirms that FooService is a singleton Bean managed by the Spring container.

You can even right-click on this HashMap and select “Java Collections->Hash Entries” to view its contents:

img

This will list all the Beans, and you can further filter them using Regex on the Value. After entering FooService, you can see that there is only one Bean of type FooService with the name fooService:

img

So far, although we haven’t seen the code of the program, we already have a general understanding of the cause of the OOM and the approximate call stack. Let’s compare it with the code again, and it turns out that it is exactly the same as what we have seen:

@SpringBootApplication
public class OOMApplication implements CommandLineRunner {

    @Autowired
    FooService fooService;

    public static void main(String[] args) {
        SpringApplication.run(OOMApplication.class, args);
    }

    @Override
    public void run(String... args) throws Exception {
        // After the program starts, repeatedly call the `FooService.oom()` method
        while (true) {
            fooService.oom();
        }
    }
}

@Component
public class FooService {

    List<String> data = new ArrayList<>();

    public void oom() {
        // Keep adding a 10,000-character string (roughly 20KB as a char[]) to the same ArrayList
        data.add(IntStream.rangeClosed(1, 10_000)
                .mapToObj(__ -> "a")
                .collect(Collectors.joining("")));
    }
}
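Just for completeness, the bug pattern here is unbounded accumulation of per-call data in a field of a singleton bean. A minimal remediation sketch, assuming the string does not actually need to be retained across calls, is simply to keep it local; the process() method below is a hypothetical stand-in for whatever downstream use the data really has:

@Component
public class FooService {

    public void oom() {
        // Build the string locally instead of accumulating it in a field of the singleton bean
        String payload = IntStream.rangeClosed(1, 10_000)
                .mapToObj(__ -> "a")
                .collect(Collectors.joining(""));
        process(payload); // hypothetical downstream consumer
    }

    private void process(String payload) {
        // placeholder for whatever the real code does with the data
    }
}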

At this point, using MAT, we have analyzed the heap dump of an OOM program from the perspectives of the object list, big objects, and thread stacks. With a heap dump, it is almost as if we had both the application's source code and a snapshot of the moment the problem occurred, so OOM problems have nowhere to hide.

Analyzing High CPU Usage with Arthas #

Arthas is an open-source Java diagnostic tool developed by Alibaba. Compared to the diagnostic tools built into JDK, Arthas is more user-friendly and powerful. It allows for easy pinpointing of many issues and provides features like one-click decompilation of classes to view source code, and even hot-fixing of production code. It offers a one-stop solution for quickly locating and fixing problems. Today, I will show you how to use Arthas to identify a high CPU usage problem and learn how to use this tool.

First, download and start Arthas:

curl -O https://alibaba.github.io/arthas/arthas-boot.jar

java -jar arthas-boot.jar

After it starts, select the JVM process you want to troubleshoot, and you will see that Arthas attaches to it successfully:

[INFO] arthas-boot version: 3.1.7
[INFO] Found existing java process, please choose one and hit RETURN.
* [1]: 12707
  [2]: 30724 org.jetbrains.jps.cmdline.Launcher
  [3]: 30725 org.geekbang.time.commonmistakes.troubleshootingtools.highcpu.HighCPUApplication
  [4]: 24312 sun.tools.jconsole.JConsole
  [5]: 26328 org.jetbrains.jps.cmdline.Launcher
  [6]: 24106 org.netbeans.lib.profiler.server.ProfilerServer
3
[INFO] arthas home: /Users/zhuye/.arthas/lib/3.1.7/arthas
[INFO] Try to attach process 30725
[INFO] Attach process 30725 success.
[INFO] arthas-client connect 127.0.0.1 3658
  ,---.  ,------. ,--------.,--.  ,--.  ,---.   ,---.
 /  O  \ |  .--. ''--.  .--'|  '--'  | /  O  \ '   .-'
|  .-.  ||  '--'.'   |  |   |  .--.  ||  .-.  |`.  `-.
|  | |  ||  |\  \    |  |   |  |  |  ||  | |  |.-'    |
`--' `--'`--' '--'   `--'   `--'  `--'`--' `--'`-----'
wiki      https://alibaba.github.io/arthas
tutorials https://alibaba.github.io/arthas/arthas-tutorials
version   3.1.7
pid       30725
time      2020-01-30 15:48:33

By typing the help command, you can see a list of all supported commands. Today we will use the dashboard, thread, jad, watch, and ognl commands to troubleshoot the HighCPUApplication process. You can refer to the official documentation for a complete introduction to these commands.

The dashboard command is used to display overall information about the process, including threads, memory, and GC. Its output is as follows:

img

From the output, we can see that the high CPU usage is not caused by GC, and that eight of the busiest threads belong to ForkJoinPool.commonPool. If you have gone through Extra Lesson 1, you will know that ForkJoinPool.commonPool is the default thread pool used by parallel streams, so the high CPU usage is most likely caused by code inside a parallel stream.
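As a quick reminder of why commonPool shows up here, any parallel stream fans its per-element work out to the ForkJoinPool.commonPool worker threads plus the calling thread. A minimal hypothetical sketch (doTask stands in for whatever work the real application performs per element):

import java.util.stream.IntStream;

public class ParallelDemo {

    public static void main(String[] args) {
        // A parallel stream schedules its per-element work on ForkJoinPool.commonPool
        // worker threads (plus the calling thread), which is why those threads appear
        // at the top of the CPU ranking.
        IntStream.rangeClosed(1, 100)
                .parallel()
                .forEach(ParallelDemo::doTask);
    }

    private static void doTask(int i) {
        // stand-in for the per-element work the real application performs
    }
}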

Next, to view the thread stack of the busiest thread, you can use the thread -n command. Here, let’s look at the top 8 busiest threads:

thread -n 8

Output as follows:

img

As we can see, these threads are all busy performing MD5 operations, which is what is consuming the CPU. To work out which part of the code triggers this, we need to find our own classes in the method stacks and focus on them.

Since the main thread also takes part in processing the ForkJoinPool's tasks, examining the main thread's stack reveals the doTask method of the org.geekbang.time.commonmistakes.troubleshootingtools.highcpu.HighCPUApplication class.

Next, use the jad command to directly decompile the HighCPUApplication class:

jad org.geekbang.time.commonmistakes.troubleshootingtools.highcpu.HighCPUApplication

As can be seen, the call path is main->task()->doTask(). When the doTask method receives an int parameter equal to a certain constant, it will perform 10,000 MD5 operations, which is the source of the CPU consumption. So, what is this magic value?

img
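Since the figure only shows the decompiled output, here is a rough Java approximation of the logic it reveals; the MessageDigest-based hashing and the MAGIC_VALUE placeholder are my assumptions, not the project's actual source:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;
import java.util.stream.IntStream;

// Rough approximation of the behavior described above (not the decompiled source):
// when the input matches the "magic" constant, perform 10,000 MD5 computations.
class DoTaskSketch {

    static final int MAGIC_VALUE = Integer.MIN_VALUE; // placeholder; the real value is unknown at this point

    static void doTask(int id) throws NoSuchAlgorithmException {
        if (id == MAGIC_VALUE) {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            IntStream.rangeClosed(1, 10_000)
                    .forEach(i -> md5.digest(UUID.randomUUID().toString().getBytes(StandardCharsets.UTF_8)));
        }
    }
}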

You might think of simply running jad again to look at the User class. That works here only because this is a demo with deliberately simple logic; in code with complex business logic, the condition is rarely this obvious, and we may instead need to analyze which input parameters make doTask "slow".

At this time, we can use the watch command to observe the method’s input parameters. The following command means that we need to monitor the input parameters of the doTask method that take more than 100 milliseconds, and output the input parameters, expanding to a depth of 2 layers:

watch org.geekbang.time.commonmistakes.troubleshootingtools.highcpu.HighCPUApplication doTask '{params}' '#cost>100' -x 2

As can be seen, the input parameter of every slow doTask invocation is 0, which suggests that the User.ADMIN_ID constant should be 0.

img
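Along the same lines, if you also want to see where the time goes inside those slow invocations, Arthas's trace command accepts the same cost condition, for example (a sketch using the same class and threshold):

trace org.geekbang.time.commonmistakes.troubleshootingtools.highcpu.HighCPUApplication doTask '#cost>100'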

Finally, we use the ognl command to run an expression that reads the User class's ADMIN_ID static field directly and confirms that it is indeed 0:

[arthas@31126]$ ognl '@org.geekbang.time.commonmistakes.troubleshootingtools.highcpu.User@ADMIN_ID'

@Integer[0]

Note that because commands such as monitor, trace, and watch are implemented through bytecode enhancement, they weave aspects into the methods of the specified classes in order to collect statistics and observe data. Therefore, once diagnosis is complete, you should run the shutdown command to restore the classes' original bytecode before exiting Arthas.
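If you want to keep the Arthas session open but undo the enhancement for a particular class, the reset command can also do that, for example:

reset org.geekbang.time.commonmistakes.troubleshootingtools.highcpu.HighCPUApplication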

In this case, we used the Arthas tool to troubleshoot the high CPU problem:

First, the dashboard + thread commands let us locate, within seconds, the threads that consume the most CPU and their method stacks.

Then, the jad command decompiles the relevant code directly, confirming the root cause.

In addition, if it is unclear what parameters the method is being called with, the watch command can observe them, filtering for slow invocations by execution time.

As you can see, locating production problems with Arthas requires neither the original source code nor extra logging to analyze input parameters; a single tool covers the entire process of locating and analyzing a problem.

For application troubleshooting and analysis, besides Alibaba's Arthas, you can also look at Qunar's Bistoury tool. It provides a visual interface, can manage multiple machines, and even offers online breakpoint debugging that simulates the debugging experience of an IDE.

Key Review #

Finally, let me share a case with you.

Once, a developer hit an OOM problem and spent several hours digging through monitoring data, logs, and call chains without locating it. When I received the heap dump file and opened the dominator tree, I immediately spotted the suspect: MyBatis was fetching millions of records in a single query. The thread stack quickly pointed to the method containing the bug, and the corresponding code showed that a null parameter condition caused a full table scan. The whole troubleshooting process took less than five minutes.

From this case, we can see that by using the right tools and methods to analyze problems, we can almost always locate the root cause within a few minutes. Today, I introduced MAT, which is a powerful tool for analyzing Java heap memory issues, and Arthas, which is a powerful tool for quickly locating and analyzing production bugs in Java programs. By making good use of these two tools, we can help ourselves pinpoint production failures in minutes.

Reflection and Discussion #

When introducing the thread pool, we simulated two possible OOM situations: one using Executors.newFixedThreadPool and the other using Executors.newCachedThreadPool. Can you recall the cause of OOM? Assuming you don’t know the cause of OOM and you have obtained the heap dump after encountering these two OOM situations, can you try using MAT to analyze the heap dump and locate the problem?

Arthas also has a powerful hotfix feature. For example, when encountering high CPU problems, we located that the admin user executed MD5 many times, consuming a large amount of CPU resources. At this time, we can directly perform a hotfix on the server by following these steps: use the jad command to decompile the code -> modify the code directly using a text editor like Vim -> use the sc command to find the ClassLoader of the class where the code resides -> use the redefine command to hot update the code. Can you try using this process to directly fix the program (comment out the relevant code in the doTask method)?

In your day-to-day work, what other tools do you use to analyze and troubleshoot Java applications? I’m Zhu Ye. Please feel free to leave a comment in the comment section to share your thoughts. You are also welcome to share today’s content with your friends or colleagues for further discussion.