18 Gc Log Reading and Analysis Case Analysis Part 1

18 GC Log Reading and Analysis Case Analysis Part 1 #

The previous section described the basic information and configuration of the GC log.

It should be noted that these parameters are based on JDK 8 configuration.

In versions after JDK 9, there have been some changes in startup parameters, and using the previous parameter configuration may result in startup errors. However, there is no need to worry, as usually corresponding remedies and solutions can be found from the error messages.

For example, in JDK 11, the startup script for printing info-level GC logs is as follows:

# JDK 11 environment, output info-level GC logs
java -Xms512m -Xmx512m
-Xlog:gc*=info:file=gc.log:time:filecount=0
demo.jvm0204.GCLogAnalysis

Starting from JDK 9, you can use the command java -Xlog:help to check the log parameters supported by the current JVM. This article does not provide a detailed introduction, but interested students can refer to JEP 158: Unified JVM Logging and JEP 271: Unified GC Logging.

In addition, JMX technology provides a notification mechanism for GC events. An example program that listens to GC events will be provided in the chapter “Challenges Faced in the Era of Containers”.

However, in many cases, the GC data reported in JMX notification events is not complete; it is only a rough summary.

The GC log is the most reliable and comprehensive source of information for understanding the JVM and garbage collector, as it contains many details. Once again, analyzing the GC log is a valuable skill that can help us better troubleshoot performance issues.

Next, we will analyze and interpret the GC log through practical operation.

Interpretation of Serial GC Log #

Please refer to the previous article, “Introduction to Common GC Algorithms,” for an introduction to the serial garbage collector.

First, in order to enable GC log recording, we use the following JVM startup parameters:

# Please note that there is no line break at the command prompt, this is manually formatted
java -XX:+UseSerialGC
-Xms512m -Xmx512m
-Xloggc:gc.demo.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
demo.jvm0204.GCLogAnalysis

Let us take a look at the garbage collection log of the Serial GC and extract information from it.

Enabling the serial garbage collector, the GC log output after the program execution is similar to the following (manually wrapped for easy reading):

Java HotSpot(TM) 64-Bit Server VM (25.162-b12) ......
Memory: 4k page, physical 16777216k (1551624k free)
CommandLine flags:
  -XX:InitialHeapSize=536870912 -XX:MaxHeapSize=536870912
  -XX:+PrintGC -XX:+PrintGCDateStamps
  -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
  -XX:+UseCompressedClassPointers -XX:+UseCompressedOops
  -XX:+UseSerialGC
2019-12-15T15:18:36.592-0800: 0.420:
  [GC (Allocation Failure)
    2019-12-15T15:18:36.592-0800: 0.420:
    [DefNew: 139776K->17472K(157248K), 0.0364555 secs]
    139776K->47032K(506816K),
    0.0365665 secs]
  [Times: user=0.02 sys=0.01, real=0.03 secs]
......
2019-12-15T15:18:37.081-0800: 0.908:
  [GC (Allocation Failure)
    2019-12-15T15:18:37.081-0800: 0.908:
    [DefNew: 156152K->156152K(157248K), 0.0000331 secs]
    2019-12-15T15:18:37.081-0800: 0.908:
    [Tenured: 299394K->225431K(349568K), 0.0539242 secs]
    455546K->225431K(506816K),
    [Metaspace: 3431K->3431K(1056768K)],
    0.0540948 secs]
  [Times: user=0.05 sys=0.00, real=0.05 secs]

The first line of the log displays the JVM version information. The part between the second line and the first timestamp shows memory pagination, physical memory size, command line parameters, and other information. This part has been explained before and will not be repeated here.

Upon careful observation, we can see that two GC events occurred in this log. One of them cleaned up the young generation, and the other cleaned up the entire heap. Let’s start by analyzing the first young generation GC event.

Minor GC Log Analysis #

The log content corresponding to this young generation GC event:

2019-12-15T15:18:36.592-0800: 0.420:
[GC (Allocation Failure)
 2019-12-15T15:18:36.592-0800: 0.420:
 [DefNew: 139776K->17472K(157248K), 0.0364555 secs]
 139776K->47032K(506816K),
 0.0365665 secs]
[Times: user=0.02 sys=0.01, real=0.03 secs]

From this, we can interpret the following information:

  1. 2019-12-15T15:18:36.592-0800: The timestamp when the GC event started. “-0800” indicates that the current time zone is GMT-8, which is just an identifier for us to intuitively determine the timing of the GC. The “0.420” is the interval of the GC event relative to the JVM startup time, measured in seconds.
  2. GC: Used to distinguish between Minor GC and Full GC. “GC” indicates that this is a minor GC, specifically a young generation GC. “Allocation Failure” indicates the cause of triggering the GC. In this GC event, it is triggered because object allocation failed, and there was no space in the young generation to store the newly generated objects.
  3. DefNew: Indicates the name of the garbage collector. This name indicates a young generation collector that uses a single thread, a mark-copy algorithm, and stops-the-world garbage collector. 139776K->17472K indicates the usage of the young generation before and after garbage collection. (157248K) indicates the total size of the young generation. Further analysis reveals that the young generation utilization rate is 11% after GC.
  4. 139776K->47032K(506816K) indicates the usage of the entire heap memory before and after garbage collection. (506816K) indicates the total available space of the heap memory. Further analysis reveals that the heap memory utilization after GC is 9%.
  5. 0.0365665 secs: The duration of the GC event, measured in seconds.
  6. [Times: user=0.02 sys=0.01, real=0.03 secs]: The duration of this GC event is measured by three parts. The user part represents the CPU time consumed by all GC threads. The sys part represents the time consumed by system calls and system wait events. The real part represents the application pause time. Since the Serial Garbage Collector only uses a single thread, real=user+system, and 0.03 seconds is equal to 30 milliseconds.

Based on experience, this pause time is acceptable for most systems, but it is not ideal for some latency-sensitive systems, such as real-time game services and high-frequency trading businesses. The 30ms pause caused by this may be fatal.

After interpreting this, we can analyze the memory usage and changes of the JVM in the GC event.

Before this garbage collection, the total memory usage of the heap was 139776K, with the young generation using 139776K. From this, we can calculate that the usage of the old generation space before GC was 0. (In fact, this is the first record in the GC log)

These numbers contain more important information:

  • Comparing before and after GC, the usage of the young generation decreased from 139776K to 17472K, a reduction of 122304K.
  • However, the total memory usage of the heap decreased from 139776K to 47032K, a reduction of only 92744K.

From this, we can calculate that the objects promoted from the young generation to the old generation occupy a memory space of “122304K-92744K=29560K”. Of course, another set of numbers can also be used to deduce the usage of the old generation after GC: 47032K-17472K=29560K.

Summary:

From the analysis, students should have noticed that the main focus is on two data points: GC pause time, and memory usage/usage rate after GC.

The diagram of this GC event is shown below:

57974076.png

Full GC Log Analysis #

After analyzing the first GC event, we should have a general pattern in mind. Let’s take a look at the log for another GC event:

2019-12-15T15:18:37.081-0800: 0.908:
  [GC (Allocation Failure)
    2019-12-15T15:18:37.081-0800: 0.908:
    [DefNew: 156152K->156152K(157248K),0.0000331 secs]
    2019-12-15T15:18:37.081-0800: 0.908:
    [Tenured: 299394K->225431K(349568K),0.0539242 secs]
    455546K->225431K(506816K),
    [Metaspace: 3431K->3431K(1056768K)],
    0.0540948 secs]
  [Times: user=0.05 sys=0.00,real=0.05 secs]

From this, we can interpret the following information:

  1. 2019-12-15T15:18:37.081-0800: The start time of the GC event.
  2. [DefNew: 156152K->156152K(157248K),0.0000331 secs]: We have already interpreted this. Due to memory allocation failure, a young generation GC occurred. The same DefNew collector was used for this GC. Note that this garbage collection took 0.0000331 seconds, confirming that the young generation was not heavily processed in this GC event.
  3. Tenured: The name of the garbage collector used for cleaning up the old generation space. “Tenured” indicates the use of a single-threaded STW garbage collector using the “mark-sweep-compact” algorithm. 299394K->225431K(349568K) represents the usage of the old generation before and after GC, as well as the size of the old generation space. 0.0539242 secs is the time taken to clean up the old generation.
  4. 455546K->225431K(506816K): The usage of the entire heap memory before and after GC, as well as the available heap space size.
  5. [Metaspace: 3431K->3431K(1056768K)]: Changes in Metaspace. It can be observed that there was no significant change in Metaspace during this GC process.
  6. [Times: user=0.05 sys=0.00,real=0.05 secs]: The duration of the GC event, divided into user, sys, and real parts. Since the serial garbage collector uses only a single thread, “real=user+system”. The pause time of 50 milliseconds increased by about twice compared to the previous young generation GC. What does this time have to do with? The answer is: GC time has the greatest correlation with the total number of surviving objects after GC.

Further analyzing these data, the usage rate of the old generation after GC is: 225431K/349568K=64%. This ratio is not small, but it does not necessarily indicate a problem since the memory usage decreased after GC. Further observation is needed…

Compared to the young generation GC, a notable difference is that this GC event cleaned up the old generation and Metaspace.

Summary:

For FullGC, we mainly focus on whether the memory usage decreases after GC, and secondly on the pause time. As a rough estimate, after GC, the memory usage of the old generation is about 220MB and it takes 50ms. If the memory is increased by 10 times, the memory usage of the old generation after GC will also increase by 10 times, and the time may be 500ms or even higher, which will have a significant impact on the system. This is one of the reasons why we say that the performance of the serial GC is weak, and it is generally not used on the server side.

The memory change during this GC event can be represented by the following diagram:

839273.png

If there is almost no change in the young generation, what should we do? Because there are actually other GC log records in the context, we can interpret them according to this format.

Parallel GC Log Interpretation #

The parallel garbage collector uses the “mark-copy” algorithm for the young generation and the “mark-sweep-compact” algorithm for the old generation.

Garbage collection in both the young and old generations triggers a stop-the-world (STW) event, pausing all application threads to perform garbage collection. Multiple threads are used during the “mark” and “copy/compact” phases, hence the name “Parallel”.

By executing in parallel with multiple GC threads, the JVM can greatly reduce the GC time on multi-CPU platforms.

The number of GC threads can be specified by the command line parameter -XX:ParallelGCThreads=NNN, with the default value being the number of CPU cores.

The following three sets of command line parameters are equivalent and can be used to specify the parallel garbage collector:

-XX:+UseParallelGC
-XX:+UseParallelOldGC
-XX:+UseParallelGC -XX:+UseParallelOldGC

Example:

# Please note that there is no line break in the command line
java -XX:+UseParallelGC
-Xms512m -Xmx512m
-Xloggc:gc.demo.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
demo.jvm0204.GCLogAnalysis

The parallel garbage collector is suitable for multi-core servers and its main goal is to increase system throughput (i.e., reduce the overall time consumed by GC). To achieve this goal, it utilizes as many CPU resources as possible:

  • During the GC event, all CPU cores clean up garbage in parallel, resulting in relatively shorter pause times.
  • During the interval between two GC events, no GC threads are started, so no system resources are consumed during this time.

On the other hand, because all phases of the parallel GC cannot be interrupted, it is likely to experience long pause times.

“Long pause time” means that once the parallel GC starts, it completes all GC operations at once, resulting in a longer pause time.

If system latency is a critical performance metric, then another garbage collector should be chosen.

Let’s execute the above command line and see what the GC log of the parallel garbage collector looks like:

CommandLine flags:
-XX:InitialHeapSize=536870912 -XX:MaxHeapSize=536870912
-XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
-XX:+UseCompressedClassPointers -XX:+UseCompressedOops
-XX:+UseParallelGC
......
2019-12-18T00:37:47.463-0800: 0.690:
  [GC (Allocation Failure)
    [PSYoungGen: 104179K->14341K(116736K)]
    383933K->341556K(466432K),0.0229343 secs]
  [Times: user=0.04 sys=0.08,real=0.02 secs]
2019-12-18T00:37:47.486-0800: 0.713:
  [Full GC (Ergonomics)
    [PSYoungGen: 14341K->0K(116736K)]
    [ParOldGen: 327214K->242340K(349696K)]
    341556K->242340K(466432K),
    [Metaspace: 3322K->3322K(1056768K)],
  0.0656553 secs]
  [Times: user=0.30 sys=0.02,real=0.07 secs]
......

If the generated GC log and phases are different, please try running the command multiple times, since we are using random numbers.

Analysis of Minor GC Logs #

The previous GC event occurred in the young generation during a Minor GC:

2019-12-18T00:37:47.463-0800: 0.690:
  [GC (Allocation Failure)
    [PSYoungGen: 104179K->14341K(116736K)]
    383933K->341556K(466432K),0.0229343 secs]
  [Times: user=0.04 sys=0.08,real=0.02 secs]

The interpretation is as follows:

  1. 2019-12-18T00:37:47.463-0800: 0.690: The start time of the GC event.
  2. GC: A flag used to distinguish between a Minor GC and a Full GC. In this case, it is a “Minor GC”.
  3. PSYoungGen: The name of the garbage collector. This name indicates the use of parallel “mark-copy” stop-the-world (STW) garbage collector in the young generation. 104179K->14341K(116736K) represents the young generation usage before and after GC, as well as the total size of the young generation. The calculated usage rate of the young generation after GC is 14341K/116736K=12%.
  4. 383933K->341556K(466432K) represents the usage of the entire heap memory before and after GC, as well as the total size of the available heap. The usage rate of the heap memory after GC is 341556K/466432K=73%. This usage rate is not low, and in fact, there has been a Full GC before this Minor GC, although it is not mentioned here.
  5. [Times: user=0.04 sys=0.08,real=0.02 secs]: The duration of the GC event measured in three parts. “user” represents the total CPU time consumed by the GC thread, “sys” represents the time consumed by operating system calls and system wait events, and “real” represents the actual pause time of the application. Since not all operations can be completely parallelized, in Parallel GC, “real” is approximately equal to (user+system)/number of GC threads. On the author’s machine with 8 physical threads, the default number of GC threads is 8. Analyzing this time, it can be observed that if the serial GC were used, the pause time could be around 120 milliseconds, but with parallel GC, it only took 20 milliseconds, significantly improving performance.

From this log, we can calculate the following: before the GC, the total memory usage of the heap was 383933K, with the young generation occupying 104179K. Therefore, we can calculate that the usage of the old generation is 279754K.

After the completion of this GC, the usage of the young generation decreased by 104179K-14341K=89838K, and the total heap memory usage decreased by 383933K-341556K=42377K.

Thus, we can calculate that 89838K-42377K=47461K objects were promoted from the young generation to the old generation. The usage of the old generation is 341556K-14341K=327215K.

The size of the old generation is 466432K-116736K=349696K, and the usage rate is 327215K/349696K=93%, which means it is almost full.

Summary:

For a young generation GC, we can focus on the pause time and whether the memory usage rate after the GC is normal. There is no need to pay special attention to the usage before the GC. As long as the application is running, object allocation and therefore collection in the young generation are inevitable, and the amount of garbage collected will not be small.

The memory change diagram for this GC event is:

8353526.png

Analysis of Full GC Logs #

Let’s take a look at the GC log for cleaning up the entire heap, after previously discussing the GC log for cleaning up the young generation in parallel:

2019-12-18T00:37:47.486-0800: 0.713:
  [Full GC (Ergonomics)
    [PSYoungGen: 14341K->0K(116736K)]
    [ParOldGen: 327214K->242340K(349696K)]
    341556K->242340K(466432K),
    [Metaspace: 3322K->3322K(1056768K)],
  0.0656553 secs]
  [Times: user=0.30 sys=0.02, real=0.07 secs]

Let’s interpret it:

  1. 2019-12-18T00:37:47.486-0800: The start time of the GC event.
  2. Full GC: The indicator of a full GC. Full GC indicates that this GC cleans up both the young and old generations. Ergonomics is the trigger reason for the GC, indicating that the JVM internal environment determines that garbage collection can be performed at this time.
  3. [PSYoungGen: 14341K->0K(116736K)]: Similar to the previous example, the garbage collector for cleaning up the young generation is called “PSYoungGen”, which is a stop-the-world (STW) collector using the “mark-copy” algorithm. The usage of the young generation changes from 14341K to 0K, which is usually the case in a full GC for the young generation.
  4. ParOldGen: The garbage collector type used for cleaning up the old generation space. Here, the garbage collector used is called “ParOldGen,” which is a parallel STW garbage collector that uses the “mark-sweep-compact” algorithm. 327214K->242340K(349696K): The usage of the old generation memory before and after GC, as well as the size of the old generation space. A simple calculation shows that the usage rate of the old generation before GC is 327214K/349696K=93%, and after GC, the usage rate is 242340K/349696K=69%, indicating a considerable amount of memory has been reclaimed. How much memory has been promoted to the old generation is difficult to calculate in a full GC, whereas it is easier to calculate in a minor GC.
  5. 341556K->242340K(466432K): The usage of the heap memory before and after garbage collection, as well as the total capacity of the heap memory. A simple analysis shows that the usage rate of the heap memory before GC is 341556K/466432K=73%, and after GC, the usage rate is 242340K/466432K=52%.
  6. [Metaspace: 3322K->3322K(1056768K)]: As seen before, this provides similar information about the Metaspace. It can be observed that no objects have been reclaimed in the Metaspace in this GC event.
  7. 0.0656553 secs: The duration of the GC event in seconds.
  8. [Times: user=0.30 sys=0.02, real=0.07 secs]: The duration of the GC event, as explained before.

The difference between Full GC and Minor GC is evident. In this GC event, besides handling the young generation, the old generation and Metaspace are also cleaned up.

Summary:

During a Full GC, we pay more attention to whether the usage of the old generation has decreased, and by how much. If the memory does not decrease significantly after Full GC, and the usage rate is still high, it indicates a problem with the system.

The memory change diagram for this GC event is as follows:

85130696.png

Observant readers may notice that this Full GC event and the previous Minor GC event are close to each other in time: 0.690+0.02 secs to 0.713. This is because the usage of the old generation reached 93% after the Minor GC, triggering the Full GC immediately thereafter.

This concludes this section. In the next section, we will continue analyzing the CMS GC log.