19 Gc Log Reading and Analysis Case Analysis Part 2

19 GC Log Reading and Analysis Case Analysis Part 2 #

Understanding CMS GC Logs #

CMS, also known as “Concurrent Mark Sweep Garbage Collector,” is designed to avoid long pauses during garbage collection in the old generation. By default, the number of concurrent threads used by CMS is equal to one-fourth of the CPU cores.

You can specify the CMS garbage collector using the following option:

-XX:+UseConcMarkSweepGC

If CPU resources are limited, CMS may have lower throughput compared to parallel GC. Here is an example:

# Note that there is no line break in the command line for readability.
java -XX:+UseConcMarkSweepGC
-Xms512m
-Xmx512m
-Xloggc:gc.demo.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
demo.jvm0204.GCLogAnalysis

Similar to serial GC and parallel GC analyzed before, let’s start the program and see what the GC logs generated by the CMS algorithm look like:

Java HotSpot(TM) 64-Bit Server VM (25.162-b12) ...
Memory: 4k page, physical 16777216k(1168104k free)

CommandLine flags:
  -XX:InitialHeapSize=536870912 -XX:MaxHeapSize=536870912
  -XX:MaxNewSize=178958336 -XX:MaxTenuringThreshold=6
  -XX:NewSize=178958336 -XX:OldPLABSize=16 -XX:OldSize=357912576
  -XX:+PrintGC -XX:+PrintGCDateStamps
  -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
  -XX:+UseCompressedClassPointers -XX:+UseCompressedOops
  -XX:+UseConcMarkSweepGC -XX:+UseParNewGC

2019-12-22T00:00:31.865-0800: 1.067:
  [GC (Allocation Failure)
    2019-12-22T00:00:31.865-0800: 1.067:
    [ParNew: 136418K->17311K(157248K), 0.0233955 secs]
    442378K->360181K(506816K), 0.0234719 secs]
  [Times: user=0.10 sys=0.02, real=0.02 secs]

2019-12-22T00:00:31.889-0800: 1.091:
  [GC (CMS Initial Mark)
    [1 CMS-initial-mark: 342870K(349568K)]
    363883K(506816K), 0.0002262 secs]
  [Times: user=0.00 sys=0.00, real=0.00 secs]
2019-12-22T00:00:31.889-0800: 1.091:
  [CMS-concurrent-mark-start]
2019-12-22T00:00:31.890-0800: 1.092:
  [CMS-concurrent-mark: 0.001/0.001 secs]
  [Times: user=0.00 sys=0.00, real=0.01 secs]
2019-12-22T00:00:31.891-0800: 1.092:
  [CMS-concurrent-preclean-start]
2019-12-22T00:00:31.891-0800: 1.093:
  [CMS-concurrent-preclean: 0.001/0.001 secs]
  [Times: user=0.00 sys=0.00, real=0.00 secs]
2019-12-22T00:00:31.891-0800: 1.093:
  [CMS-concurrent-abortable-preclean-start]
2019-12-22T00:00:31.891-0800: 1.093:
  [CMS-concurrent-abortable-preclean: 0.000/0.000 secs]
  [Times: user=0.00 sys=0.00, real=0.00 secs]
2019-12-22T00:00:31.891-0800: 1.093:
  [GC (CMS Final Remark)
    [YG occupancy: 26095 K (157248 K)]
    2019-12-22T00:00:31.891-0800: 1.093:
      [Rescan (parallel), 0.0002680 secs]
    2019-12-22T00:00:31.891-0800: 1.093:
      [weak refs processing, 0.0000230 secs]
    2019-12-22T00:00:31.891-0800: 1.093:
      [class unloading, 0.0004008 secs]
    2019-12-22T00:00:31.892-0800: 1.094:
      [scrub symbol table, 0.0006072 secs]
    2019-12-22T00:00:31.893-0800: 1.095:
      [scrub string table, 0.0001769 secs]
      [1 CMS-remark: 342870K(349568K)]
    368965K(506816K), 0.0015928 secs]
  [Times: user=0.01 sys=0.00, real=0.00 secs]
2019-12-22T00:00:31.893-0800: 1.095:
  [CMS-concurrent-sweep-start]
2019-12-22T00:00:31.893-0800: 1.095:
  [CMS-concurrent-sweep: 0.000/0.000 secs]
  [Times: user=0.00 sys=0.00, real=0.00 secs]
2019-12-22T00:00:31.893-0800: 1.095:
  [CMS-concurrent-reset-start]
2019-12-22T00:00:31.894-0800: 1.096:
  [CMS-concurrent-reset: 0.000/0.000 secs]
  [Times: user=0.00 sys=0.00, real=0.00 secs]

This is just an excerpt from the GC logs. Compared to serial GC and parallel GC, the log information for CMS is more complex. This is partly because CMS has more detailed GC steps and partly because detailed GC logs expose more comprehensive and detailed information.

Analysis of Minor GC Logs #

The first few lines of the log are about cleaning the young generation through Minor GC:

2019-12-22T00:00:31.865-0800: 1.067:
  [GC (Allocation Failure)
    2019-12-22T00:00:31.865-0800: 1.067:
    [ParNew: 136418K->17311K(157248K), 0.0233955 secs]
    442378K->360181K(506816K), 0.0234719 secs]
  [Times: user=0.10 sys=0.02, real=0.02 secs]


Let's interpret together:

  1. `2019-12-22T00:00:31.865-0800: 1.067`: The start time of the GC event.
  2. `GC (Allocation Failure)`: A flag to distinguish between Minor GC and Full GC. `GC` indicates a "Minor GC", while `Allocation Failure` indicates the trigger of the GC. In this GC event, it was triggered because the young generation ran out of available space and the allocation of new objects failed.
  3. `[ParNew: 136418K->17311K(157248K), 0.0233955 secs]`: `ParNew` is the name of the garbage collector, which corresponds to the command-line flag `-XX:+UseParNewGC` mentioned in the previous logs. It indicates the "parallel mark-copy" garbage collector used in the young generation, specifically designed to work with the CMS garbage collector, as CMS only handles the old generation. The numbers after it represent the changes in young generation usage before and after the GC, as well as the total size of the young generation. `0.0233955 secs` is the time it took.
  4. `442378K->360181K(506816K), 0.0234719 secs`: Indicates the changes in heap memory usage before and after the GC, as well as the size of the heap memory. The time it took is `0.0234719 secs`, which is almost the same as the time in the ParNew section.
  5. `[Times: user=0.10 sys=0.02, real=0.02 secs]`: The duration of the GC event. `user` is the total CPU time consumed by the GC thread; `sys` is the time spent on system calls and system wait events; `real ~= (user + sys)/number of GC threads`. In this case, my machine has 4 cores and 8 threads, so the ratio is 6 times the actual time because there is always a certain proportion of the processing that cannot be executed in parallel.

Further calculations and analyses reveal that before the GC, the young generation usage was 136418K/157248K = 86%. The heap memory usage was 442378K/506816K = 87%. A rough estimate indicates that the old generation usage was (442378K-136418K)/(506816K-157248K) = 87%. These three ratios were all around 87% before the GC.

What about after the GC? The young generation usage is 17311K ~= 17%, which decreased by 119107K. The heap memory usage is 360181K ~= 71%, which decreased by only 82197K. Subtracting these two values gives us the amount of memory promoted from the young generation to the old generation: 119107 - 82197 = 36910K.

So how large is the old generation space? and what is the old generation usage? To answer these questions, dear reader, please use these numbers and do the calculation.

The memory change diagram for this GC is as follows:

![4438116.png](../images/13640c30-63ac-11ea-a283-8f19fc193c49)

Wow, this number is amazing. The old generation usage is 98%, which is very high. Immediately following is a Full GC log, so let's continue reading.

#### Full GC Log Analysis

In fact, this excerpt of the young generation GC log and the Full GC log are closely connected. We can see this roughly from the time interval, `1.067 + 0.02secs ~ 1.091`.

The CMS log is a completely different format and is quite long because CMS has its own log for each stage of garbage collection in the old generation. To keep it concise, we will introduce the log for each stage in order.

Let's start with the CMS Initial Mark for this Full GC:

2019-12-22T00:00:31.889-0800: 1.091: [GC (CMS Initial Mark) [1 CMS-initial-mark: 342870K(349568K)] 363883K(506816K), 0.0002262 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]

    2019-12-22T00:00:31.891-0800: 1.092:
      [CMS-concurrent-preclean-start]
    2019-12-22T00:00:31.891-0800: 1.093:
      [CMS-concurrent-preclean: 0.001/0.001 secs]
      [Times: user=0.00 sys=0.00,real=0.00 secs]
    2019-12-22T00:00:31.891-0800: 1.093:
      [CMS-concurrent-abortable-preclean-start]
    2019-12-22T00:00:31.891-0800: 1.093:
      [CMS-concurrent-abortable-preclean: 0.000/0.000 secs]
      [Times: user=0.00 sys=0.00,real=0.00 secs]
    2019-12-22T00:00:31.891-0800: 1.093:
      [GC (CMS Final Remark)
        [YG occupancy: 26095 K (157248 K)]
        2019-12-22T00:00:31.891-0800: 1.093:
          [Rescan (parallel), 0.0002680 secs]
        2019-12-22T00:00:31.891-0800: 1.093:
          [weak refs processing, 0.0000230 secs]
        2019-12-22T00:00:31.891-0800: 1.093:
          [class unloading, 0.0004008 secs]
        2019-12-22T00:00:31.892-0800: 1.094:
          [scrub symbol table, 0.0006072 secs]
        2019-12-22T00:00:31.893-0800: 1.095:
          [scrub string table, 0.0001769 secs]
          [1 CMS-remark: 342870K(349568K)]
        368965K(506816K), 0.0015928 secs]
      [Times: user=0.01 sys=0.00,real=0.00 secs]
    2019-12-22T00:00:31.893-0800: 1.095:
      [CMS-concurrent-sweep-start]
    2019-12-22T00:00:31.893-0800: 1.095:
      [CMS-concurrent-sweep: 0.000/0.000 secs]
      [Times: user=0.00 sys=0.00,real=0.00 secs]
    2019-12-22T00:00:31.893-0800: 1.095:
      [CMS-concurrent-reset-start]
    2019-12-22T00:00:31.894-0800: 1.096:
      [CMS-concurrent-reset: 0.000/0.000 secs]
      [Times: user=0.00 sys=0.00,real=0.00 secs]
    

During actual operation, CMS may perform multiple Young Generation (YG) Minor GCs while performing concurrent garbage collection in the Old Generation. In this case, the Full GC log may include multiple Minor GC events.

**Phase 1: Initial Mark**

As mentioned earlier, this phase is accompanied by a STW pause. The goal of the initial mark is to mark all root objects, including objects directly referenced by GC ROOT and objects referenced by all surviving objects in the Young Generation. The latter is important because the Old Generation is collected independently.

Let's look at the log for this phase:
    
    2019-12-22T00:00:31.889-0800: 1.091:
      [GC (CMS Initial Mark)
        [1 CMS-initial-mark: 342870K(349568K)]
        363883K(506816K), 0.0002262 secs]
[Times: user=0.00 sys=0.00,real=0.00 secs]
    
    

简单解读一下：

  1. `GC (CMS Final Remark)`：指示此阶段是最终标记阶段。
  2. `YG occupancy: 26095 K (157248 K)`：表示年轻代的占用情况，包括已使用的堆内存和可用的堆内存。
  3. `Rescan (parallel)`：表示重扫描过程是并行执行的。
  4. `0.0002680 secs`：重扫描的持续时间。
  5. `[Times: user=0.00 sys=0.00,real=0.00 secs]`：最终标记阶段的总时间，包括用户时间、系统时间和实际消耗的时间。



**阶段 6：Concurrent Sweep（并发清理）**


在并发清理阶段，CMS 垃圾收集器将不可达对象的内存空间回收，使其变为可用空间。

看看并发清理阶段的 GC 日志：
    
    
    2019-12-22T00:00:31.891-0800: 1.094:
      [CMS-concurrent-sweep-start]
    2019-12-22T00:00:31.891-0800: 1.094:
      [CMS-concurrent-sweep: 0.000/0.000 secs]
      [Times: user=0.00 sys=0.00,real=-0.00 secs]
    
    

简单解读：

  1. `CMS-concurrent-sweep`：表明这是并发清理阶段的日志，此阶段会清理不可达对象的内存空间。
  2. `0.000/0.000 secs`：此阶段的持续时间，分别是 GC 线程运行时间和实际占用的时间。
  3. `[Times: user=0.00 sys=0.00,real=-0.00 secs]`：Times 部分对并发阶段来说没多少意义，因为程序在并发阶段中持续运行，而时间为负数表示停顿比实际消耗的时间短。



**阶段 7：CMS Reset Free Space（重置空闲空间）**

此阶段是针对 CMS 而言的，目的是将所有的堆内存空间重新标记为空闲空间。

看看重置空闲空间阶段的 GC 日志：
    
    
    2019-12-22T00:00:31.891-0800: 1.094:
      [CMS-concurrent-reset-start]
    2019-12-22T00:00:31.891-0800: 1.094:
      [CMS-concurrent-reset: 0.000/0.000 secs]
    
    

简单解读：

  1. `CMS-concurrent-reset`：表明这是重置空闲空间阶段的日志，此阶段会将所有的堆内存空间重新标记为空闲空间。
  2. `0.000/0.000 secs`：此阶段的持续时间，分别是 GC 线程运行时间和实际占用的时间。


          [weak refs processing，0.0000230 secs]
        2019-12-22T00:00:31.891-0800: 1.093:
          [class unloading，0.0004008 secs]
        2019-12-22T00:00:31.892-0800: 1.094:
          [scrub symbol table，0.0006072 secs]
        2019-12-22T00:00:31.893-0800: 1.095:
          [scrub string table，0.0001769 secs]
          [1 CMS-remark: 342870K(349568K)]
        368965K(506816K)，0.0015928 secs]
      [Times: user=0.01 sys=0.00,real=0.00 secs]
    
    

Let's interpret this:

  1. `CMS Final Remark`: This is the name of this phase, which is the final marking phase. It marks all live objects in the old generation, including references created/modified during the previous concurrent marking phase.
  2. `YG occupancy: 26095 K (157248 K)`: The current usage and total capacity of the young generation.
  3. `[Rescan (parallel)，0.0002680 secs]`: A rescan is performed after the program pauses to complete the marking of live objects. This part is done in parallel and takes 0.0002680 seconds.
  4. `weak refs processing，0.0000230 secs`: The first sub-phase, which processes weak references, lasts for 0.0000230 seconds.
  5. `class unloading，0.0004008 secs`: The second sub-phase, which unloads unused classes, lasts for 0.0004008 seconds.
  6. `scrub symbol table，0.0006072 secs`: The third sub-phase, which cleans the symbol table that holds class-level metadata, lasts for 0.0006072 seconds.
  7. `scrub string table，0.0001769 secs`: The fourth sub-phase, which cleans the string tables associated with inline strings, lasts for 0.0001769 seconds.
  8. `[1 CMS-remark: 342870K(349568K)]`: The usage and total capacity of the old generation after this phase is completed.
  9. `368965K(506816K)，0.0015928 secs`: The usage and total capacity of the entire heap after this phase is completed.
  10. `[Times: user=0.01 sys=0.00,real=0.00 secs]`: The duration of the GC event.



After these 5 marking phases, all live objects in the old generation are marked, and the JVM will proceed to clear all unused objects to reclaim the space in the old generation.

**Phase 6: Concurrent Sweep**

This phase runs concurrently with the application and does not require a stop-the-world pause. Its purpose is to remove unreferenced objects and reclaim the memory space occupied by them.

Let's look at the GC logs for this part:
    
    2019-12-22T00:00:31.893-0800: 1.095:
      [CMS-concurrent-sweep-start]
    2019-12-22T00:00:31.893-0800: 1.095:
      [CMS-concurrent-sweep: 0.000/0.000 secs]
      [Times: user=0.00 sys=0.00,real=0.00 secs]
    

In summary:

  1. `CMS-concurrent-sweep`: This is the name of this phase, "Concurrent Sweep", where unreferenced objects in the old generation, i.e. objects that are no longer in use, are swept and memory is reclaimed.
  2. `0.000/0.000 secs`: The duration and actual time taken for this phase, which is a rounded value accurate to 3 decimal places.
  3. `[Times: user=0.00 sys=0.00,real=0.00 secs]`: The "Times" section doesn't have much relevance for concurrent phases as it is calculated from the start of the concurrent marking phase, during which not only the concurrent marking threads but also the program threads are running.
**Phase 7: Concurrent Reset**

This phase is executed concurrently with application threads to reset the internal data structures related to the CMS algorithm, so that they can be directly used in the next GC cycle.

The corresponding log is:

2019-12-22T00:00:31.893-0800: 1.095: [CMS-concurrent-reset-start] 2019-12-22T00:00:31.894-0800: 1.096: [CMS-concurrent-reset: 0.000/0.000 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]


Explanation:

1. `CMS-concurrent-reset`: The name of this phase, "Concurrent Reset", indicates that the internal data structures of the CMS algorithm are being reset to prepare for the next GC cycle.
2. `0.000/0.000 secs`: The duration and actual time taken for this phase.
3. `[Times: user=0.00 sys=0.00, real=0.00 secs]`: The "Times" section is not significant for concurrent phases because it is calculated from the start of concurrent marking, and during this time, both the GC thread and the application are running.

Now the question is, how much memory is used in the old generation after CMS? Unfortunately, this cannot be determined here and can only be analyzed through the later Minor GC logs.

For example, in this run, the later GC log is as follows:

2019-12-22T00:00:31.921-0800: 1.123: [GC (Allocation Failure) 2019-12-22T00:00:31.921-0800: 1.123: [ParNew: 153242K->16777K(157248K), 0.0070050 secs] 445134K->335501K(506816K), 0.0070758 secs] [Times: user=0.05 sys=0.00, real=0.00 secs]


Referring to the analysis method used in the previous Young Generation GC log, we can deduce that after the CMS Full GC, the old generation usage should be: 445134K-153242K=291892K, and the total capacity of the old generation is 506816K-157248K=349568K. Therefore, the percentage of old generation usage after the Full GC is 291892K/349568K=83%.

This is a relatively high percentage, indicating a problem. Generally, it means that the allocated memory is not enough, as we only specified a maximum heap memory of 512MB.

As usual, let's have a demonstration of memory usage before and after GC:

![image](../images/7fad0940-63ad-11ea-9c08-6f91e6eaabb6)

In short, the CMS garbage collector has made significant efforts to reduce pause time. A large portion of the GC threads run concurrently with application threads, eliminating the need to pause application threads, resulting in fewer pauses in general. However, CMS also has some disadvantages, the biggest of which is the fragmentation problem in the old generation. In certain cases, GC can have unpredictable pause times, especially when the heap memory is large.

> Here's a tip for learning about CMS: Refer to the diagrams in the previous phases and try drawing them yourself.

That concludes this section. In the next section, we will continue with the analysis of G1 logs.