26 Advanced Tools for Complex Problem-Solving: The Stones of Other Mountains Can Be Used to Polish Jade #

In previous sections, we mentioned many JVM analysis tools. In this section, we will introduce several useful tools that you can use as needed.

OOM Killer #

In previous chapters, we briefly mentioned the OOM Killer (Out Of Memory killer) on Linux systems. When physical memory runs out, the Linux kernel selects the "fattest" process, typically the one consuming the most memory, weighted by its OOM score, and kills it to free memory.

OOM Killer parameter tuning #

Heap out of memory (OOM) in Java refers to the situation where the heap is full and the garbage collector cannot reclaim enough space to allocate new objects.

On the other hand, out of memory (OOM) in the operating system refers to the situation where all the memory on the computer (physical memory + swap space) is fully utilized.

The kernel's behavior in this situation is controlled by the /proc/sys/vm/panic_on_oom parameter. With its usual default value of 0, the kernel invokes the OOM Killer when memory is exhausted and kills the "fattest" process (the so-called rogue process) so that the system may be able to continue running. If the parameter is set to 1 or 2, the kernel panics instead and the machine stops operating normally.
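
To check or change this behavior, here is a minimal sketch (run as root; these are the standard sysctl names exposed by the Linux kernel):

# Check the current behavior (0 = invoke the OOM Killer, 1 or 2 = kernel panic)
cat /proc/sys/vm/panic_on_oom
# Change it at runtime
sysctl -w vm.panic_on_oom=0
# To make it persistent, add the line "vm.panic_on_oom = 0" to /etc/sysctl.conf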

The following per-process parameters can be used to control which processes the OOM Killer may terminate. They are located under /proc/<pid>/ in the proc file system, where <pid> is the process ID.

  • oom_adj: the normal range is -16 to 15; it adjusts the process's OOM score (oom_score), and the higher the score, the more likely the process is to be killed by the OOM Killer. Setting it to -17 exempts the process from the OOM Killer entirely. On recent kernels oom_adj is a legacy alias for oom_score_adj (range -1000 to 1000), which is why writing -17 to oom_adj shows up as -1000 in oom_score_adj in the example below.
  • Note that the proc file system is virtual: when a process exits, its /proc/<pid>/ directory disappears, and so does any adjustment made there.

Example of OOM Killer parameter adjustment #

For example, if the process has pid=12884, the root user can execute the following command:

$ cat /proc/12884/oom_adj
0

# View the final score
$ cat /proc/12884/oom_score
161

$ cat /proc/12884/oom_score_adj 
0

# Protect the process by writing -17 to oom_adj
$ echo -17 > /proc/12884/oom_adj

$ cat /proc/12884/oom_adj
-17

$ cat /proc/12884/oom_score
0
# View the adjusted score
$ cat /proc/12884/oom_score_adj 
-1000

# Make the process a preferred victim by writing 15 to oom_adj
$ echo 15 > /proc/12884/oom_adj

$ cat /proc/12884/oom_adj
15

$ cat /proc/12884/oom_score
1160

$ cat /proc/12884/oom_score_adj 
1000

With oom_adj set to -17, the process will not be killed when the operating system runs low on memory, even if it is the largest memory consumer; other processes will be killed first. Setting it to 15, as in the second half of the example, has the opposite effect and makes the process the preferred victim.

Case study #

We will use the following case study to demonstrate the OOM Killer.

1. Problem description

A Java application exits unexpectedly and frequently; the suspicion is that the Java process is being killed by the operating system.

2. Configuration information

The configuration is as follows:

  • Server: Alibaba Cloud ECS
  • IP address: 192.168.1.52
  • CPU: 4 cores, virtual CPU Intel Xeon E5-2650 2.60GHz
  • Physical Memory: 8GB

3. Available Memory

Memory is tight: four Java processes together occupy 2.1 + 1.7 + 1.7 + 1.3 = 6.8 GB, i.e. most of the 8 GB of physical memory (a quick way to check this is sketched below).
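
A quick way to confirm this kind of situation from a shell (standard Linux commands; output omitted):

# Overall physical memory and swap usage
free -h
# The largest resident processes (RSS is reported in KB)
ps -eo pid,rss,comm --sort -rss | head -n 10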

4. View Logs

Search the Linux system log for OOM Killer entries; the following was found:

$ sudo cat /var/log/messages | grep killer -A 2 -B 2
May 21 09:55:01 web1 systemd: Started Session 500687 of user root.
May 21 09:55:02 web1 systemd: Starting Session 500687 of user root.
May 21 09:55:23 web1 kernel: java invoked oom-killer: gfp_mask=0x201da,order=0,oom_score_adj=0
May 21 09:55:24 web1 kernel: java cpuset=/ mems_allowed=0
May 21 09:55:24 web1 kernel: CPU: 3 PID: 25434 Comm: java Not tainted 3.10.0-514.6.2.el7.x86_64 #1
--
May 21 12:05:01 web1 systemd: Started Session 500843 of user root.
May 21 12:05:01 web1 systemd: Starting Session 500843 of user root.
May 21 12:05:22 web1 kernel: jstatd invoked oom-killer: gfp_mask=0x201da,order=0,oom_score_adj=0
May 21 12:05:22 web1 kernel: jstatd cpuset=/ mems_allowed=0
May 21 12:05:23 web1 kernel: CPU: 2 PID: 10467 Comm: jstatd Not tainted 3.10.0-514.6.2.el7.x86_64 #1

It can be confirmed that the issue is indeed caused by insufficient physical memory.

Note: if the sum of the -Xmx values of all Java processes (plus their non-heap overhead) exceeds the memory available to the system, this kind of situation is likely to occur.

Query the oom_score of all processes in the system:

ps -eo pid,comm,pmem --sort -rss | awk '{"cat /proc/"$1"/oom_score" | getline oom; print $0"\t"oom}'
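
If the awk one-liner feels opaque, the following loop (a sketch using GNU ps and the proc file system) prints the ten largest processes by RSS together with their oom_score:

for pid in $(ps -eo pid --sort -rss --no-headers | head -n 10); do
    printf "%s\t%s\t%s\n" "$pid" "$(cat /proc/$pid/comm)" "$(cat /proc/$pid/oom_score)"
done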

Important reminder:

If you adjust the oom_adj of a process, every child process it creates inherits the value. For example, if the sshd process is protected from the OOM Killer, all SSH sessions spawned from it are protected as well. If too many processes are protected in this way, the OOM Killer may be unable to free enough memory to rescue the system; a quick check is sketched below.
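
A minimal sketch for checking this, assuming pgrep is available: list sshd and its children together with their oom_score_adj values.

# Every sshd-related process and its oom_score_adj (children inherit the parent's value)
for pid in $(pgrep sshd); do
    printf "%s\t%s\n" "$pid" "$(cat /proc/$pid/oom_score_adj)"
done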

Now consider a different scenario: we need to debug and trace a system that is already running in production. What tools should we use? The next two sections introduce two such tools.

BTrace Diagnostic Analysis Tool #

BTrace is a dynamic tracing tool based on the Java language that can be used to assist in problem diagnosis and analysis.

BTrace project page:

https://github.com/btraceio/btrace/

There are some simple introductions on the Wiki page:

BTrace is built on ASM, the Java Attach API, and java.lang.instrument, and it provides a rich set of annotations. With these annotations, BTrace scripts can be written in plain Java for read-only monitoring, without requiring a deep understanding of bytecode manipulation with ASM.

Let’s proceed with the actual operation.

BTrace Download #

Go to the Release page and find the latest compressed package to download:

After downloading, extract the package to use:

(Screenshot: contents of the extracted BTrace package)

As seen, the bin directory contains executable files, and the samples directory contains script examples.

Sample Program #

Let’s start by writing a method with input parameters and a return value. The example is as follows:

package demo.jvm0209;

import java.util.UUID;
import java.util.concurrent.TimeUnit;

public class RandomSample {
    public static void main(String[] args) throws Exception {
        // Call randomHash many times, pausing 2 seconds between calls
        int count = 10000;
        int seed = 0;
        for (int i = 0; i < count; i++) {
            seed = randomHash(seed);
            TimeUnit.SECONDS.sleep(2);
        }
    }

    // Prints the previous seed and returns the hash code of a random UUID
    public static int randomHash(Integer seed) {
        String uuid = UUID.randomUUID().toString();
        int hashCode = uuid.hashCode();
        System.out.println("prev.seed=" + seed);
        return hashCode;
    }
}

This sample program is deliberately simple: it loops many times, repeatedly calling a single method, much like the hot path of a real application.

Then run the program, and you can see some outputs on the console every now and then:

prev.seed=1601831031
...

BTrace provides a command-line client, but it is less convenient than using BTrace from within JVisualVM. For completeness, a command-line invocation is sketched below; after that, we demonstrate the BTrace plugin integrated into JVisualVM.
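
Assuming the current directory is the extracted BTrace package and <PID> stands for the target JVM's process ID (a placeholder), a command-line invocation looks roughly like this:

# Attach to the JVM with process id <PID> and run the script shown later in this section
./bin/btrace <PID> TracingScript.java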

Using BTrace in JVisualVM #

The installation of the BTrace plugin was explained in the previous section “Built-in JDK Graphic Interface Tools”.

Some attentive readers may have noticed that when installing the JVisualVM plugin, there is a plugin called “BTrace Workbench”. After installing this plugin, right-click on the corresponding JVM instance to enter the BTrace operating interface.

1. Installing the BTrace Plugin

Open VisualVM, select the menu “Tools - Plugins (G)”:

(Screenshot: the Tools > Plugins menu in VisualVM)

Then on the plugin installation interface, locate “Available Plugins”:

(Screenshot: the "Available Plugins" tab of the plugin manager)

After checking “BTrace Workbench”, click the “Install (I)” button.

If the plugin is not displayed, please update your JDK to the latest version.

(Screenshot: the plugin installation wizard)

Follow the guidance and prompts to continue the installation.

(Screenshot: the license agreement step)

Accept the agreement and click “Install”.

(Screenshot: the plugin being downloaded and installed)

Wait for the installation to complete:

(Screenshot: the installation completed dialog)

Click the “Finish” button.

2. Using the BTrace Plugin

(Screenshot: opening "BTrace Workbench" from the JVM instance's context menu)

When opened, the default interface looks like this:

(Screenshot: the default BTrace Workbench editor)

You can see that the editor shows what looks like an ordinary Java source file. Referring to the official documentation, we now add our own tracing code to it.

BTrace Script Example #

In the downloaded BTrace project, there are some script examples under the “samples” directory. Following these examples, we can write a simple BTrace script:

import com.sun.btrace.annotations.*;
import static com.sun.btrace.BTraceUtils.*;

@BTrace
public class TracingScript {
    // Fires when any method of any class in the demo.jvm0209 package is entered
    @OnMethod(
            clazz = "/demo.jvm0209.*/",
            method = "/.*/"
    )
    public static void simple(
            @ProbeClassName String probeClass,
            @ProbeMethodName String probeMethod) {
        print("entered " + probeClass);
        println("." + probeMethod);
    }

    // Fires when RandomSample.randomHash returns
    @OnMethod(
            clazz = "demo.jvm0209.RandomSample",
            method = "randomHash",
            location = @Location(Kind.RETURN)
    )
    public static void onMethodReturn(
            @ProbeClassName String probeClass,
            @ProbeMethodName String probeMethod,
            @Duration long duration,
            @Return int returnValue) {
        print(probeClass + "." + probeMethod);
        print(Strings.strcat("(), duration=", duration + "ns;"));
        println(Strings.strcat(" return: ", "" + returnValue));
    }
}

Execution Result #

You can see that simple execution results are output:

(Screenshot: BTrace trace output in the workbench)

You can compare them with the console output of the sample program.

More Examples #

BTrace ships with many sample scripts; with small modifications they can be used for simple monitoring tasks.

Arthas Diagnostic and Analysis Tool #

BTrace is very powerful, but it has practical limitations and can run into a variety of environment issues. Are there better alternatives?

In this section, we will introduce the Arthas Diagnostic and Analysis Tool. The homepage of the Arthas project is:

https://github.com/alibaba/arthas

Introduction to Arthas #

Arthas is an open-source Java diagnostic tool launched by Alibaba, beloved by developers. Why? Here are some reasons:

  • Arthas supports JDK 6 and higher versions.
  • It supports Linux/Mac/Windows operating systems.
  • It uses command-line interactive mode and provides rich tab auto-completion, making it easy to locate and diagnose problems.
  • It provides a WebConsole, so in complex network environments it can also be accessed over HTTP.

When we run into problems like the following and are at a loss, Arthas can help us solve them:
  1. Which jar file does this class load from? Why are there various class-related exceptions?
  2. Why isn’t the code I modified executed? Could it be that I didn’t commit? Did I switch to the wrong branch?
  3. When encountering a problem, can’t we debug it online? Do we have to add logs and redeploy?
  4. When encountering a problem with the data processing of a specific user in the online environment, but unable to debug online, how can we reproduce it offline?
  5. Is there a global perspective to view the running status of the system?
  6. Is there a way to monitor the real-time running status of the JVM?
  7. How to quickly locate the hotspots of an application and generate a flame graph?

The Arthas website provides basic and advanced tutorials, as well as a web-based command-line simulator; walking through them once makes the relevant features easy to understand.

Let’s experience it following the tutorial below.

Download and Installation #

First, let’s install Arthas:

# Prepare the directory
mkdir -p /usr/local/tools/arthas
cd /usr/local/tools/arthas
# Execute the installation script
curl -L https://alibaba.github.io/arthas/install.sh | sh
······

Start in command line:

# Start
./as.sh -h
Error: telnet is not installed. Try to use java -jar arthas-boot.jar
# Install telnet
# yum -y install telnet
# brew install telnet
# Execute again
./as.sh

Start through jar:

# Enter the jar directory
cd ~/.arthas/lib/3.1.7/arthas/
# Start arthas through jar
java -jar arthas-boot.jar

Usage Example #

After starting, the displayed information is roughly as shown in the following figure:

(Screenshot: Arthas listing the local JVM processes available to attach to)

Then we enter the number of the JVM process we want to attach to, for example 1, and press Enter.

(Screenshot: Arthas attached to the selected JVM process)

To exit, enter exit.

Then we enter the help command to view the help. The returned information is roughly as follows:

[arthas@27350]$ help
 NAME DESCRIPTION
 help Display Arthas Help
 keymap Display all the available keymap for the specified connection.
 sc Search all the classes loaded by JVM
 sm Search the method of classes loaded by JVM
 classloader Show classloader info
 jad Decompile class
 getstatic Show the static field of a class
 monitor Monitor method execution statistics, e.g. total/success/failure count, average rt, fail rate, etc.
 stack Display the stack trace for the specified class and method
 thread Display thread info, thread stack
 trace Trace the execution time of specified method invocation.
 watch Display the input/output parameter, return object, and thrown exception of specified method invocation
 tt Time Tunnel
 jvm Display the target JVM information
 ognl Execute ognl expression.
 mc Memory compiler, compiles java files into bytecode and class files in memory.
 redefine Redefine classes. @see Instrumentation#redefineClasses(ClassDefinition...)
 dashboard Overview of target jvm's thread, memory, gc, vm, tomcat info.
 dump Dump class byte array from JVM
 heapdump Heap dump
 options View and change various Arthas options
 cls Clear the screen
 reset Reset all the enhanced classes
 version Display Arthas version
 shutdown Shutdown Arthas server and exit the console
 stop Stop/Shutdown Arthas server and exit the console. Alias for shutdown.
 session Display current session information
 sysprop Display, and change the system properties.
 sysenv Display the system env.
 vmoption Display, and update the vm diagnostic options.
 logger Print logger info, and update the logger level
 history Display command history
 cat Concatenate and print files
 pwd Return working directory name
 mbean Display the mbean information
 grep grep command for pipes.
 profiler Async Profiler. https://github.com/jvm-profiling-tools/async-profiler

Here are the supported commands. To view the help information of a specific command, you can use the help xxx format. For example:

help thread

To view JVM information, enter the command jvm.

(Screenshot: output of the jvm command)

System environment variable sysenv:

(Screenshot: output of the sysenv command)

To view thread information, enter the thread command:

(Screenshot: output of the thread command)

View the information of a specific thread:

[arthas@27350]$ thread 1
"main" Id=1 TIMED_WAITING
    at java.lang.Thread.sleep(Native Method)
    at java.lang.Thread.sleep(Thread.java:340)
    at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:386)
    at demo.jvm0209.RandomSample.main(Unknown Source)

View JVM options vmoption:

(Screenshot: output of the vmoption command)

Some options can also be changed at runtime, for example: vmoption PrintGCDetails true.

Search for classes sc:

(Screenshot: output of the sc command)

Decompile code jad:

(Screenshot: output of the jad command)

Dump heap memory heapdump:

(Screenshot: output of the heapdump command)

Tracking method execution time with trace:

(Screenshot: output of the trace command)

Observing method execution with watch:

(Screenshot: output of the watch command)

As you can see, watch supports conditional expressions, much like conditional breakpoints in a debugger. Compared with general-purpose tools, Arthas is a powerful and convenient integrated environment for JVM analysis. For more features, please refer to the Arthas user documentation.
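
To make the examples above concrete, here are a few representative commands, assuming the RandomSample program from earlier is the attached process (syntax as documented by Arthas; the /tmp path is just an example, and output is omitted):

# Overview of threads, memory, and GC
dashboard
# Find the class, show its details, then decompile it
sc -d demo.jvm0209.RandomSample
jad demo.jvm0209.RandomSample
# Trace call paths and timing of randomHash
trace demo.jvm0209.RandomSample randomHash
# Watch parameters and the return value, expanding objects to depth 2
watch demo.jvm0209.RandomSample randomHash '{params, returnObj}' -x 2
# Same, but only for invocations slower than 1 ms (conditional expression on #cost)
watch demo.jvm0209.RandomSample randomHash '{params, returnObj}' '#cost > 1' -x 2
# Dump the heap to a file
heapdump /tmp/dump.hprof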

Profilers #

Here we introduce profilers.

Compared with the tools introduced earlier, profilers address only part of the problem space, so in this section we only briefly cover their GC-related features.

Note: do not assume that profilers fit every scenario. They are very effective for certain tasks, such as finding CPU hotspots in code, but for other problems they are not necessarily the best choice.

The same applies to GC tuning. When diagnosing latency or throughput problems caused by GC, profilers are not required: the tools mentioned earlier (jstat, or raw and visualized GC logs) detect GC problems faster and more directly. It is also best to avoid profilers when collecting performance data in production, because they add significant overhead and can affect the running system.

If you do need to tune GC, profilers can be useful because they provide clear information about object creation. Looked at from another angle, if GC pauses are not caused by a single undersized memory pool, they are usually caused by creating too many objects. All profilers can track object allocations (allocation profiling) and, based on the recorded allocation traces, tell you which code is responsible for most of the allocations.

Allocation profiling pinpoints the places where large numbers of objects are created. The benefit of using profilers for GC tuning is being able to determine which types of objects occupy the most memory and which threads create the most objects.

Here we will introduce three allocation profilers: hprof, JVisualVM, and AProf. In fact, there are many profilers to choose from, including commercial products and free tools, but their features and applications are generally similar.

hprof #

The hprof profiler ships with the JDK (up to Java 8; the agent was removed in JDK 9), so wherever it is available it is often the first choice.

Introduction to the performance analysis tool - HPROF:

https://github.com/cncounter/translation/blob/master/tiemao2017/20hprof/20_hprof.md

HPROF Reference Documentation:

https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/tooldescr008.html

To run hprof with the program, you need to modify the startup script, similar to this:

java -agentlib:hprof=heap=sites com.yourcompany.YourApplication

When the program exits, the allocation information will be dumped to the java.hprof.txt file in the working directory. Open it with a text editor and search for the keyword “SITES BEGIN”, and you will see:

SITES BEGIN (ordered by live bytes) Tue Dec 8 11:16:15 2015
           percent          live        alloc'ed   stack  class
  rank   self  accum      bytes objs       bytes  objs  trace  name
     1 64.43%  4.43%    8370336 20121   27513408 66138 302116  int[]
     2  3.26% 88.49%     482976 20124    1587696 66154 302104  java.util.ArrayList
     3  1.76% 88.74%     241704 20121    1587312 66138 302115  eu.plumbr.demo.largeheap.ClonableClass0006
   ...

SITES END

In the snippet above, allocation sites are ordered by live bytes. The first line shows that 64.43% of the live bytes belong to int[] arrays created at the allocation site identified as 302116. Searching the file for “TRACE 302116” shows:

TRACE 302116:
eu.plumbr.demo.largeheap.ClonableClass0006.<init>(GeneratorClass.java:11)
sun.reflect.GeneratedConstructorAccessor7.newInstance(<Unknown Source>:Unknown line)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.lang.reflect.Constructor.newInstance(Constructor.java:422)

Now we know that almost two thirds of the live memory consists of int[] arrays created in the constructor of the ClonableClass0006 class, at line 11 of GeneratorClass.java. With this knowledge we can optimize the code to reduce the pressure on GC.
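
For reference, hprof supports other modes besides heap=sites; the following variants (options as listed in the HPROF documentation) write a binary heap dump or sample CPU usage:

# Write a binary heap dump (can be opened with jhat or MAT)
java -agentlib:hprof=heap=dump,format=b,file=snapshot.hprof com.yourcompany.YourApplication
# Sample CPU usage every 10 ms with a stack depth of 8
java -agentlib:hprof=cpu=samples,interval=10,depth=8 com.yourcompany.YourApplication
# Print all supported options
java -agentlib:hprof=help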

Java VisualVM #

In the first part of this chapter, we introduced JVisualVM as a tool for monitoring JVM’s GC behavior. In this section, we will discuss its application in allocation analysis.

JVisualVM connects to a running JVM through a GUI. After connecting to the target JVM:

  1. Open the “Tools” -> “Options” menu, click on the Profiler tab, add a new configuration, select the Profiler Memory, and make sure to check “Record allocations stack traces”.
  2. Check the “Settings” checkbox and modify the preset configuration under the Memory Settings tab.
  3. Click the “Memory” button to start analyzing the memory.
  4. Let the program run for a while to collect enough information about object allocation.
  5. Click the “Snapshot” button below to obtain the collected snapshot information.

After completing the above steps, the results view shows allocation information sorted by the number of objects created per class. The first row shows that the most frequently created objects are int[] arrays; right-clicking a row shows exactly where those objects are allocated.

Compared to hprof, JVisualVM is easier to use: all the allocation information for int[] arrays is shown in one place, which makes it easy to spot cases where the same code location performs many allocations.

AProf #

AProf is a memory allocation profiler developed by Devexperts; like hprof, it is packaged as a Java agent.

To analyze an application using AProf, you need to modify the JVM startup script, like this:

java -javaagent:/path-to/aprof.jar com.yourcompany.YourApplication

After restarting the application, an aprof.txt file will be generated in the working directory. This file is updated every minute and contains information like this:

 ========================================================================================================================
 TOTAL allocation dump for 91,289 ms (0h01m31s)
 Allocated 1,769,670,584 bytes in 24,868,088 objects of 425 classes in 2,127 locations
 ========================================================================================================================

 Top allocation-inducing locations with the data types allocated from them
 ------------------------------------------------------------------------------------------------------------------------
 eu.plumbr.demo.largeheap.ManyTargetsGarbageProducer.newRandomClassObject: 1,423,675,776 (80.44%) bytes in 17,113,721 (68.81%) objects (avg size 83 bytes)
  int[]: 711,322,976 (40.19%) bytes in 1,709,911 (6.87%) objects (avg size 416 bytes)
  char[]: 369,550,816 (20.88%) bytes in 5,132,759 (20.63%) objects (avg size 72 bytes)
  java.lang.reflect.Constructor: 136,800,000 (7.73%) bytes in 1,710,000 (6.87%) objects (avg size 80 bytes)
  java.lang.Object[]: 41,079,872 (2.32%) bytes in 1,710,712 (6.87%) objects (avg size 24 bytes)
  java.lang.String: 41,063,496 (2.32%) bytes in 1,710,979 (6.88%) objects (avg size 24 bytes)
  java.util.ArrayList: 41,050,680 (2.31%) bytes in 1,710,445 (6.87%) objects (avg size 24 bytes)
           ... cut for brevity ...

The above output is sorted by size. As you can see, 80.44% of the bytes and 68.81% of the objects are allocated in the ManyTargetsGarbageProducer.newRandomClassObject() method. Among them, int[] arrays occupy the largest memory size at 40.19%.

Scrolling down, you will find the allocation traces sorted by allocation size:

 Top allocated data types with reverse location traces
 ------------------------------------------------------------------------------------------------------------------------
 int[]: 725,306,304 (40.98%) bytes in 1,954,234 (7.85%) objects (avg size 371 bytes)
  eu.plumbr.demo.largeheap.ClonableClass0006.<init>: 38,357,696 (2.16%) bytes in 92,206 (0.37%) objects (avg size 416 bytes)
   java.lang.reflect.Constructor.newInstance: 38,357,696 (2.16%) bytes in 92,206 (0.37%) objects (avg size 416 bytes)
    eu.plumbr.demo.largeheap.ManyTargetsGarbageProducer.newRandomClassObject: 38,357,280 (2.16%) bytes in 92,205 (0.37%) objects (avg size 416 bytes)
    java.lang.reflect.Constructor.newInstance: 416 (0.00%) bytes in 1 (0.00%) objects (avg size 416 bytes)
 ... cut for brevity ... 

You can see that the int[] allocations keep growing inside the ClonableClass0006 constructor, which is invoked through Constructor.newInstance.

Like other tools, AProf reveals the allocation size and location information, making it easy to find the most memory-consuming parts.

In our opinion, AProf is a very useful allocation profiler because it focuses solely on memory allocation and does it well. The tool is also open source, free, and adds very little overhead.