28 JVM Problem Investigation and Analysis Part 2 - Case Practices #

GC Troubleshooting Case Study #

In this section, we will look at an actual case study.

Let’s assume we have a high-concurrency service that uses the Spring Boot framework. We collect metrics using Micrometer and report them to the Datadog service.

Of course, Micrometer supports reporting data to various monitoring systems, such as AppOptics, Atlas, Datadog, Dynatrace, Elastic, Ganglia, Graphite, Humio, Influx, Instana, JMX, KairosDB, New Relic, Prometheus, SignalFx, Stackdriver, StatsD, Wavefront, and many more.

For more information on Micrometer, you can refer to:

https://micrometer.io/docs
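
For context, in a Spring Boot application this wiring is usually just the micrometer-registry-datadog dependency plus a few properties. Below is a minimal programmatic sketch using Micrometer's public API; the API key is a placeholder, and the 10-second step simply mirrors the reporting interval mentioned later in this section:

import io.micrometer.core.instrument.Clock;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.datadog.DatadogConfig;
import io.micrometer.datadog.DatadogMeterRegistry;
import java.time.Duration;

public class DatadogSetupSketch {
    public static MeterRegistry createRegistry() {
        DatadogConfig config = new DatadogConfig() {
            @Override
            public String apiKey() {
                return "YOUR_DD_API_KEY"; // placeholder, not a real key
            }
            @Override
            public Duration step() {
                return Duration.ofSeconds(10); // publish metrics every 10 seconds
            }
            @Override
            public String get(String key) {
                return null; // fall back to Micrometer defaults for everything else
            }
        };
        return new DatadogMeterRegistry(config, Clock.SYSTEM);
    }
}

Spring Boot binds JVM GC metrics (including the jvm.gc.pause timer that shows up later in this case) automatically; in a plain Java application the JvmGcMetrics binder from micrometer-core provides the same data.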

Symptom Description #

Recently, we discovered through monitoring metrics that the maximum GC pause time of a service node frequently exceeds 400ms.

As shown in the following figure:

85083641.png

From the figure, we can see that the peak GC pause time reached 546ms. The displayed time point is around 09:20:00 on February 4, 2020.

The customer insisted that this be resolved: the service's timeout is 1s, so the maximum GC pause time must not exceed 200ms and the average pause time must stay within 100ms. The issue had already had a significant impact on the customer's trading strategy.

CPU Load #

The CPU usage is shown in the following figure:

58517646.png

From the figure, we can see that the system load is 4.92 while CPU utilization is only around 7%. This figure actually contains an important clue, but we did not spot the problem at the time.

GC Memory Usage #

Next, we investigated the memory usage during this period:

85674124.png

From the figure, we can see that around 09:25, the usage of old_gen dropped significantly, indicating a FullGC event indeed occurred.

However, around 09:20 the old generation usage was still climbing slowly and had not dropped. In other words, the Full GC did not happen in that window, so it cannot be what caused the maximum pause time.

However, these conclusions were reached through post-analysis. At that time, we didn’t fully trust the information provided by the monitoring system. We suspected that a FullGC event triggered a long GC pause.

Why did we have suspicions? Because the Datadog monitoring system reports data every 10 seconds by default. It’s possible that something happened within these 10 seconds but was missed (however, this is unlikely since any reporting failure would be logged in the system).

Analyzing the above figure again, we can see that the memory pool corresponding to the old generation is “ps_old_gen”. Based on our previous learning, we know that “ps” represents the ParallelGC garbage collector.
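
As a quick cross-check, the active collectors can also be listed at runtime through the standard java.lang.management API: under the default ParallelGC they are named "PS Scavenge" and "PS MarkSweep", while G1 reports "G1 Young Generation" and "G1 Old Generation". A minimal sketch:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class ShowCollectors {
    public static void main(String[] args) {
        // Print the name and collection count of every active collector
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + " | collections so far: " + gc.getCollectionCount());
        }
    }
}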

JVM Startup Parameters #

Checking the JVM startup parameters, we found the following:

-Xmx4g -Xms4g

We are using JDK 8, and the startup parameters do not specify any GC. This confirms that this service is using the default ParallelGC garbage collector.

Therefore, we suspected that the problem lay with this garbage collector. ParallelGC optimizes for maximum throughput at the cost of pause time, which in many cases leads to long individual pauses.

Using the G1 Garbage Collector #

What can we do? We are considering switching to G1 because it is stable and performs well in the newer versions of JDK 8.

We then changed the startup parameters to the following:

# This parameter is incorrect and caused the startup failure
-Xmx4g -Xms4g -XX:+UseG1GC -XX:MaxGCPauseMills=50ms

Unfortunately, the startup failed because of a careless mistake: both the parameter name and its value were wrong (the name is misspelled, and the value must be given as a plain number without a unit).

Let’s correct it as follows:

-Xmx4g -Xms4g -XX:+UseG1GC -XX:MaxGCPauseMillis=50

Then, the service successfully started, and we waited for the health check to automatically switch to the new service node. Let’s continue monitoring the metrics.

55932616.png

Let's take a look at the pause time. The GC pause time on each node has dropped and is generally within 50ms, which is in line with our expectations.

Hmm! Is it the end of the story? Far from it.

“Easter Egg” Surprise #

After a while, we discovered a surprise (or maybe a shock), as shown in the following image:

50244521.png

We hit the jackpot, and after running for a while, the maximum GC pause time reached 1300ms.

The situation seems to have worsened.

Continuing to observe, we found that this is not an isolated case:

84108207.png

I was puzzled and wondered whether the metric was being computed incorrectly, for example by summing all the pause times within a 10-second reporting window.

Registering GC Event Listeners #

So I came up with a solution - registering GC event listeners through JMX to directly print out the relevant information.

The key code is as follows:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;

// Register a listener on each garbage collector MXBean
for (GarbageCollectorMXBean mbean : ManagementFactory.getGarbageCollectorMXBeans()) {
    if (!(mbean instanceof NotificationEmitter)) {
        continue; // Skip collectors that do not support notifications
    }
    final NotificationEmitter emitter = (NotificationEmitter) mbean;
    // Build and attach the GC notification listener
    final NotificationListener listener = getNewListener(mbean);
    emitter.addNotificationListener(listener, null, null);
}

Through this approach, we can listen to GC events in the program and summarize or output the relevant information to logs. The specific implementation code is provided in the later chapter “Challenges in the Container Era”.
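
Until then, here is a minimal sketch of what such a getNewListener helper could look like, based on the com.sun.management GC notification API; this is a simplified illustration, not the exact code that produced the log below:

import java.lang.management.GarbageCollectorMXBean;
import javax.management.Notification;
import javax.management.NotificationListener;
import javax.management.openmbean.CompositeData;
import com.sun.management.GarbageCollectionNotificationInfo;
import com.sun.management.GcInfo;

// Hypothetical sketch: print the action, cause, and pause duration of each GC event
private static NotificationListener getNewListener(final GarbageCollectorMXBean mbean) {
    return new NotificationListener() {
        @Override
        public void handleNotification(Notification notification, Object handback) {
            // Only react to GC notifications
            if (!GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION
                    .equals(notification.getType())) {
                return;
            }
            CompositeData cd = (CompositeData) notification.getUserData();
            GarbageCollectionNotificationInfo info = GarbageCollectionNotificationInfo.from(cd);
            GcInfo gcInfo = info.getGcInfo();
            System.out.println(mbean.getName()
                    + " | " + info.getGcAction()
                    + " | cause=" + info.getGcCause()
                    + " | durationMillis=" + gcInfo.getDuration());
        }
    };
}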

After restarting, running for a while, and checking the log information, we saw:

{
"duration": 1869,
"maxPauseMillis": 1869,
"promotedBytes": "139MB",
"gcCause": "G1 Evacuation Pause",
"collectionTime": 27281,
"gcAction": "end of minor GC",
"afterUsage": {
  "G1 Old Gen": "1745MB",
  "Code Cache": "53MB",
  "G1 Survivor Space": "254MB",
  "Compressed Class Space": "9MB",
  "Metaspace": "81MB",
  "G1 Eden Space": "0"
 },
"gcId": 326,
"collectionCount": 326,
"gcName": "G1 Young Generation",
"type": "jvm.gc.pause"
}

The situation is indeed not good.

This time it was confirmed that the event was not a Full GC but a young-generation GC, and its pause time reached 1869ms. That is completely unreasonable, especially since CPU usage was not high.

I looked up a lot of material, hoping to prove that this 1869ms was not the real pause time but merely the difference between the GC event's end and start timestamps.

Printing GC Logs #

Since none of these indirect methods settled the question, the ultimate solution is to print GC logs.

Modify the startup parameters as follows:

-Xmx4g -Xms4g -XX:+UseG1GC -XX:MaxGCPauseMillis=50
-Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps

Restart and hope that this time we can identify the cause of the problem.

84324884.png

After running for a while, we discovered a very long pause time.

Analyzing GC Logs #

Since it doesn’t involve sensitive data, let’s download the GC logs locally for analysis.

To locate the GC event where the pause time was excessively long, the key information is as follows:

Java HotSpot(TM) 64-Bit Server VM (25.162-b12) for linux-amd64 JRE (1.8.0_162-b12),
 built on Dec 19 2017 21:15:48 by "java_re" with gcc 4.3.0 20080428 (Red Hat 4.3.0-8)
Memory: 4k page, physical 144145548k(58207948k free), swap 0k(0k free)
CommandLine flags: 
 -XX:InitialHeapSize=4294967296 -XX:MaxGCPauseMillis=50 -XX:MaxHeapSize=4294967296 
 -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
 -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseG1GC 

2020-02-24T18:02:31.853+0800: 2411.124: [GC pause (G1 Evacuation Pause) (young), 1.8683418 secs]
   [Parallel Time: 1861.0 ms, GC Workers: 48]
      [GC Worker Start (ms): Min: 2411124.3, Avg: 2411125.4, Max: 2411126.2, Diff: 1.9]
      [Ext Root Scanning (ms): Min: 0.0, Avg: 0.3, Max: 2.7, Diff: 2.7, Sum: 16.8]
      [Update RS (ms): Min: 0.0, Avg: 3.6, Max: 6.8, Diff: 6.8, Sum: 172.9]
         [Processed Buffers: Min: 0, Avg: 2.3, Max: 8, Diff: 8, Sum: 111]
      [Scan RS (ms): Min: 0.0, Avg: 0.2, Max: 0.5, Diff: 0.5, Sum: 7.7]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.3]
      [Object Copy (ms): Min: 1851.6, Avg: 1854.6, Max: 1857.4, Diff: 5.8, Sum: 89020.4]
      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.6]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 48]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.3, Max: 0.7, Diff: 0.6, Sum: 14.7]
      [GC Worker Total (ms): Min: 1858.0, Avg: 1859.0, Max: 1860.3, Diff: 2.3, Sum: 89233.3]
      [GC Worker End (ms): Min: 2412984.1, Avg: 2412984.4, Max: 2412984.6, Diff: 0.5]
   [Code Root Fixup: 0.0 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 1.5 ms]
   [Other: 5.8 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 1.7 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 1.1 ms]
      [Humongous Register: 0.1 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 2.3 ms]
   [Eden: 2024.0M(2024.0M)->0.0B(2048.0K) 
    Survivors: 2048.0K->254.0M 
    Heap: 3633.6M(4096.0M)->1999.3M(4096.0M)]
 [Times: user=1.67 sys=14.00, real=1.87 secs] 

The GC events before and after this one looked normal, and no Full GC or concurrent marking cycle was found, but the log did reveal several suspicious points.

  • physical 144145548k(58207948k free): When the JVM started, the physical memory was 137GB, with 55GB of free memory.
  • [Parallel Time: 1861.0 ms, GC Workers: 48]: 48 garbage collector worker threads.

In earlier lessons we learned how to analyze GC logs; let's review the key numbers together.

  • user=1.67: CPU time spent in user mode was 1.67 seconds.
  • sys=14.00: CPU time spent in system calls and system waits was 14 seconds.
  • real=1.87 secs: The actual pause time was 1.87 seconds.
  • Before the GC, Eden held 2GB, the survivor space 2MB, and the heap 3.6GB in total, so the old generation held roughly 1.6GB.
  • After the GC, Eden was empty, the survivor space held 254MB, and the heap 2GB in total, so the old generation held roughly 1.75GB; about 140MB were promoted, consistent with the 139MB promotedBytes the JMX listener reported for this event.

After this analysis, the following conclusions can be drawn:

  • During this young-generation evacuation pause, only about 400MB of objects were copied (254MB into the survivor space plus roughly 140MB promoted), yet it took 1.8 seconds of real time, and the system call and system wait time reached 14 seconds.
  • The JVM saw 137GB of physical memory.
  • It can be deduced that the JVM also saw 72 CPU cores: HotSpot's default heuristic then yields 8 + (72 - 8) * 5/8 = 48 parallel GC worker threads, which matches the "GC Workers: 48" in the log (see the sketch after this list).
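
As a sanity check on that last deduction, here is a tiny sketch (our own illustration, not JVM source code) of the default worker-thread heuristic for non-SPARC machines:

// Approximation of the default ParallelGCThreads heuristic:
// use all processors up to 8, then 5/8 of the processors beyond 8.
static int defaultParallelGcThreads(int cpus) {
    return cpus <= 8 ? cpus : 8 + (cpus - 8) * 5 / 8;
}

// defaultParallelGcThreads(72) == 48  -> matches "GC Workers: 48" in the log
// defaultParallelGcThreads(4)  == 4   -> what we actually want on a 4-core Pod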

Seeing that many GC worker threads raised a red flag, especially since the heap is only 4GB.

With common CPU-to-memory allocation ratios, a service like this would normally run on something like 4 cores with 4GB, or 4 cores with 8GB of memory.

Let’s take a look at the corresponding CPU load monitoring information:

4600411.png

After communicating with the operations team, we found out that the configuration of this node is limited to 4 cores and 8GB.

So, the reason for the excessively long GC pause time has been identified.

  • K8S's resource isolation and the JVM were not coordinated: the JVM saw all 72 CPU cores of the host and therefore defaulted to 8 + (72 - 8) * 5/8 = 48 parallel GC threads, while K8S limited the Pod to 4 CPU cores. Whenever a GC ran, 48 worker threads contended for 4 cores, causing heavy context switching and the long pauses.

The action plan is as follows:

  • Limit the number of parallel GC threads.

It has been proven that printing GC logs is indeed a useful troubleshooting and analysis method.

Limiting the Number of Parallel GC Threads #

Here is the new startup parameter configuration:

 -Xmx4g -Xms4g
 -XX:+UseG1GC -XX:MaxGCPauseMillis=50 -XX:ParallelGCThreads=4
 -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps

Here, -XX:ParallelGCThreads=4 is specified. Why is it configured like this? Let’s take a look at the explanation for this parameter:

-XX:ParallelGCThreads=n

Sets the number of parallel worker threads used during STW (Stop-the-World) phases. If the number of logical processors is 8 or fewer, the default value of n equals the number of logical processors.

If there are more than 8 logical processors, the default value of n is approximately 5/8 of the logical processors, which is reasonable in most cases. On larger SPARC systems, the default is approximately 5/16 of the logical processors.

-XX:ConcGCThreads=n

Sets the number of GC threads for concurrent marking. The default value is approximately one-fourth of ParallelGCThreads.

Generally, there is no need to set the number of concurrent marking threads explicitly; specifying ParallelGCThreads is enough, since ConcGCThreads is derived from it.
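
For completeness, if we did want to pin both values explicitly on this 4-core Pod, the startup parameters might look like the following (ConcGCThreads=1 simply spells out the default of roughly one quarter of ParallelGCThreads):

 -Xmx4g -Xms4g
 -XX:+UseG1GC -XX:MaxGCPauseMillis=50
 -XX:ParallelGCThreads=4 -XX:ConcGCThreads=1
 -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps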

After restarting, let’s take a look at the GC pause time indicator:

81569009.png

The point indicated by the red arrow is the restart time; after it, the pause time mostly stays within 50ms.

Subsequent monitoring found that this parameter indeed solved the problem.

So, are there any other methods? Please follow the next chapter, “Challenges in the Container Era.”

Summary #

Through this case, we can see that JVM problem troubleshooting and performance tuning are mainly based on monitoring data.

As the saying goes: Without quantification, there is no improvement.

Here is a summary of the approaches used here:

  • Metric monitoring
  • Specifying JVM startup memory
  • Specifying garbage collectors
  • Printing and analyzing GC logs

GC and memory are the most common JVM tuning scenarios. Do you still remember the performance dimensions of GC that we introduced at the beginning of the course?

  • Latency: The main factor affecting latency in GC is pause time.
  • Throughput: mainly determined by the share of CPU time left for business threads; the GC overhead consists of pause time plus, under heavy load, the CPU consumed by concurrent GC phases.
  • System capacity: It mainly refers to hardware configuration and service capabilities.

As long as these metrics meet their targets and resource usage stays within a reasonable range, our expectations are met.
