14 Common Gc Algorithms Parallel, Cms, G1

14 Common GC Algorithms- Parallel, CMS, G1 #

After learning about the concepts of GC algorithms, we will now introduce the specific implementations of these algorithms in the JVM. First of all, it is important to remember that most JVMs require the use of two different GC algorithms - one for cleaning the young generation and another for cleaning the old generation.

We can choose from various built-in algorithms in the JVM. If the garbage collection algorithm is not explicitly specified through parameters, the default implementation for the corresponding JDK version will be used. This chapter will provide a detailed introduction to the implementation principles of various algorithms.

Serial GC #

Serial GC uses the mark-copy algorithm for the young generation and the mark-sweep-compact algorithm for the old generation.

Both of these are single-threaded garbage collectors and cannot perform parallel processing, so they will trigger a full pause (stop-the-world), stopping all application threads.

Therefore, this GC algorithm cannot fully utilize multi-core CPUs. Regardless of the number of CPU cores, the JVM can only use a single core for garbage collection.

To enable this collector, you only need to specify a JVM startup parameter that is effective for both the young generation and the old generation:

-XX:+UseSerialGC

This option is only suitable for JVMs with a few hundred MB of heap memory and is more useful for single-core CPUs.

For server applications, it is not recommended to use this GC algorithm unless it is necessary to limit the resources used by the JVM. Most server applications are deployed on multi-core platforms, and choosing Serial GC means artificially limiting the use of system resources, resulting in resource idle time and the inability to increase the throughput of business processing using the additional CPU resources.

We will provide detailed explanations on the log content of the Serial GC in the later chapter “GC Log Interpretation and Analysis”.

Parallel GC #

Parallel garbage collectors use the mark-copy algorithm for the young generation and the mark-sweep-compact algorithm for the old generation. Garbage collection in both the young and old generations will trigger a stop-the-world (STW) event, pausing all application threads for garbage collection. Both of them use multiple threads during the “mark and copy/compact” phases, hence the name “Parallel”. By executing in parallel, the GC time is greatly reduced.

The number of GC threads can be specified by the command-line parameter -XX:ParallelGCThreads=NNN, and its default value is the number of CPU cores. Any of the following sets of command-line parameters can be used to specify the parallel GC:

-XX:+UseParallelGC
-XX:+UseParallelOldGC
-XX:+UseParallelGC -XX:+UseParallelOldGC

Parallel garbage collectors are suitable for multi-core servers and their main goal is to increase throughput. As it effectively uses system resources, it can achieve higher throughput:

During GC, all CPU cores are used to clean up garbage in parallel, resulting in shorter total pause time.
In the interval between two GC cycles, no GC threads are running, so no system resources are consumed.

On the other hand, because all phases of this GC cannot be interrupted, it is prone to long pauses. If the main goal of the system is to minimize pause time/latency rather than maximize overall throughput, other combinations of garbage collectors should be chosen.

Note: Long pauses refer to the fact that once this GC is started, it completes all operations in one go, so the pause time for a single pause may be longer.

CMS Garbage Collector #

The official name of the CMS GC is “Mostly Concurrent Mark and Sweep Garbage Collector”. It uses the parallel STW (stop-the-world) approach with the mark-copy algorithm for the young generation and mainly uses the concurrent mark-sweep algorithm for the old generation.

The design goal of the CMS GC is to avoid long pauses during garbage collection in the old generation, mainly achieved through two means:

Firstly, the old generation is not compacted, but managed using free-lists to reclaim memory space.
Secondly, most of the work in the mark-and-sweep phase is performed concurrently with application threads. This means that there is no obvious application thread pause during these phases. However, it still competes with application threads for CPU time. By default, CMS uses a number of concurrent threads equal to 1/4 of the CPU cores.

The CMS garbage collector can be specified using the following option:

-XX:+UseConcMarkSweepGC

If the server has a multi-core CPU and the main tuning goal is to reduce system latency caused by GC pauses, using CMS is a wise choice. By reducing the pause time for each GC, the user experience of the system can be improved in many cases. Since CPU resources are often consumed by GC, CMS GC may have lower throughput than parallel GC in situations where CPU resources are limited (for most systems, the difference in throughput and latency should not be significant).

In practice, when performing concurrent garbage collection in the old generation, there may be multiple minor GC events in conjunction. In this case, the full GC log will contain multiple minor GC events, as described earlier. Now let’s take a look at the various phases of CMS GC.

Phase 1: Initial Mark #

This stage is accompanied by the suspension of STW. The goal of the initial marking phase is to mark all root objects, including objects directly referenced by root objects, as well as objects referenced by all live objects in the young generation (the old generation is collected separately).

Why doesn’t CMS collect the young generation? Didn’t we just finish a minor GC earlier? Collecting the young generation again probably won’t have much effect.

Take a look at the diagram:

Stage 2: Concurrent Mark #

In this stage, CMS GC traverses the old generation and marks all live objects starting from the root objects found in the “Initial Mark” phase. The “Concurrent Mark” phase runs concurrently with the application and does not require a pause. It is important to note that not all live objects in the old generation are marked in this phase, as object reference relationships may change during the marking process.

In the above diagram, a reference of the “currently processed object” is disconnected by the application thread, indicating a change in the object relationship (we will discuss how to handle this later).

Stage 3: Concurrent Preclean #

This stage is also executed concurrently with the application thread and does not require a pause.

Because the “Concurrent Mark” phase runs concurrently with the application, some reference relationships may have changed. If a reference relationship changes during the concurrent marking process, the JVM marks the region where the change occurred as “dirty” using “cards.” This is known as “Card Marking.”

In the precleaning phase, these dirty objects are counted and the objects they reference are also marked. After this phase is completed, the cards used for marking are cleared.

In addition, this phase also performs necessary detail processing and prepares for the Final Remark phase.

Stage 4: Concurrent Abortable Preclean #

This stage does not pause the application thread. The goal of this stage is to do as much work as possible before the STW Final Remark phase. The duration of this phase depends on various factors, as it performs the same tasks in a loop until a certain exit condition is met (such as iteration count, useful workload, consumed system time, etc.).

This stage can significantly impact the duration of the STW pause and has many important configuration options and failure modes.

Stage 5: Final Remark #

The Final Remark phase is the second (and final) STW pause in the GC event.

The goal of this phase is to complete the marking of all live objects in the old generation. Because the previous precleaning phase is executed concurrently, the GC thread may not keep up with the rate of modifications by the application. Therefore, an STW pause is needed to handle various complex situations.

Typically, CMS tries to execute the Final Remark phase when the young generation is as empty as possible to avoid triggering consecutive STW events.

After the five marking phases are completed, all live objects in the old generation are marked, and the GC will clear all unused objects to reclaim space in the old generation.

Stage 6: Concurrent Sweep #

This stage is executed concurrently with the application thread and does not require an STW pause. During this phase, the JVM deletes and recovers the memory space occupied by unused objects.

Stage 7: Concurrent Reset #

This stage is executed concurrently with the application thread and resets the internal data related to the CMS algorithm in preparation for the next GC cycle.

In conclusion, the CMS garbage collector has done a lot of complex and useful work to reduce pause time. The concurrent threads used for garbage collection can run simultaneously with the application threads without pausing them. Of course, CMS also has some drawbacks, with the biggest issue being memory fragmentation in the old generation (due to lack of compaction), and in some cases, GC can cause unpredictable pause times, especially when the heap memory is large.

G1 Garbage Collector #

G1 stands for Garbage-First, meaning it prioritizes the collection of regions with the most garbage.

The main design goal of G1 GC is to make STW pauses predictable and configurable.

In fact, G1 GC is a soft real-time garbage collector that allows you to set specific performance goals. For example, you can specify that within any xx-millisecond time range, STW pauses should not exceed yy milliseconds. For example, you can set a goal of no more than 5 milliseconds of pause time within any 1 second. G1 GC will strive to achieve this goal (there is a high probability of achieving it, but it is not completely guaranteed).

Characteristics of G1 GC #

To achieve the goal of predictable pause time, G1 GC has some unique implementations.

Firstly, the heap is no longer divided into young and old generations, but into multiple (usually 2048) small heap regions that can store objects. Each small region may be designated as an Eden region, a Survivor region, or an Old region at different times. Logically, all Eden and Survivor regions together form the young generation, and all Old regions together form the old generation, as shown in the following diagram:

With this division, G1 does not need to collect the entire heap space every time, but instead performs incremental processing: only a part of memory blocks, called the collection set, is processed during each GC pause. Each GC pause will collect all memory blocks in the young generation, but generally only includes a portion of the old generation, as indicated by the regions marked with checkmarks in the diagram.

Another innovation of G1 is to estimate the total number of live objects in each small heap block during the concurrent phase. The principle of building the garbage collection set is: the small block with the most garbage will be collected first. This is also the origin of the name G1.

Specify the G1 garbage collector through the following options:

-XX:+UseG1GC -XX:MaxGCPauseMillis=50

Commonly used parameters for G1 GC #

-XX:+UseG1GC: Enable G1 GC. JDK 7 and JDK 8 require explicit application to start G1 GC.
-XX:G1NewSizePercent: Initial young generation size as a percentage of the entire Java Heap. The default value is 5%.
-XX:G1MaxNewSizePercent: Maximum young generation size as a percentage of the entire Java Heap. The default value is 60%.
-XX:G1HeapRegionSize: Set the size of each region in MB. It must be one of the values 1, 2, 4, 8, 16, 32. The default is 1/2000 of the heap memory. If this value is set to a larger size, large objects can enter the region.
-XX:ConcGCThreads: The number of GC threads performing garbage collection in parallel with the Java application. The default is 1/4 of the Java threads. Reducing this value may improve parallel collection efficiency and increase system throughput. If this value is too low, insufficient threads participating in the garbage collection process may lead to longer parallel collection time.
-XX:+InitiatingHeapOccupancyPercent (abbreviated as IHOP): The threshold at which the G1 internal parallel collection cycle starts. The default value is 45% of the Java Heap. This can be understood as when the old generation usage is greater than or equal to 45%, the JVM will start garbage collection. This value is very important as it determines when to start parallel collection of the old generation.
-XX:G1HeapWastePercent: The minimum memory size at which G1 stops garbage collection. The default is 5% of the heap size. GC collects objects in all regions, but if it drops to 5%, it stops collecting. This means that all garbage does not need to be processed in each collection, and a small amount can be left for processing next time, thus reducing the time consumption per collection.
-XX:G1MixedGCCountTarget: Sets the number of mixed GCs to be performed after the parallel cycle. The default value is 8. The collection time of the old generation regions is usually longer than the collection time of the young generation. Therefore, if there are a lot of mixed collectors, G1 can extend the collection time of the old generation.
-XX:+G1PrintRegionLivenessInfo: This parameter needs to be started in conjunction with -XX:+UnlockDiagnosticVMOptions to print the live object information in each region of the JVM’s diagnostic information.
-XX:G1ReservePercent: G1 reserves some space for promotion between generations. The default value is 10% of the heap space. Because a large number of recycles occur in the young generation (shorter lifetime), if your application has a large heap space and many large objects survive, you need to reserve some memory here.
-XX:+G1SummarizeRSetStats: This is also a VM diagnostic information. If enabled, it will print a detailed summary of the RSets when the VM exits. If the -XX:G1SummaryRSetStatsPeriod parameter is enabled, RSets information will be printed periodically.
-XX:+G1TraceConcRefinement: This is also a VM diagnostic information. If enabled, the logs of the parallel collection phase will be printed in detail.
-XX:+GCTimeRatio: As we know, some stages of GC require Stop-the-World, i.e. stopping application threads. This parameter calculates the ratio of time spent on Java application threads to time spent on GC threads. The default is 9, consistent with the proportion of young generation memory allocation. The main purpose of this parameter is to allow users to control the time spent on applications. The calculation formula used in G1 is 100/(1+GCTimeRatio). So if the parameter is set to 9, at most 10% of the time will be spent on GC work. The default value for Parallel GC is 99, indicating that 1% of the time is used for GC work. This is because Parallel GC spans the entire GC, while G1 divides it based on regions and does not require a global scan of the entire heap.
-XX:+UseStringDeduplication: Enable the deduplication of Java String objects manually. This parameter was added after JDK8u20 and is mainly used to avoid duplicate memory allocation for identical strings, saving the use of regions.
-XX:MaxGCPauseMills: The expected pause time for each GC operation performed by G1, in milliseconds. The default value is 200ms. G1 will try to control pauses within this range by adjusting the operation time of each GC. If a GC takes longer than 50ms, for example 100ms, the system will automatically adjust the GC behavior later to float around 50ms.

The most important parameters in this list are:

-XX:+UseG1GC: Enable G1 GC.
-XX:+InitiatingHeapOccupancyPercent: Determine when G1 GC occurs.
-XX:MaxGCPauseMills: Expected pause time for each GC. For example, if we set it to 50, G1 GC will adjust the pause time of each GC operation to aim for pauses around 50ms. If a GC takes longer than 50ms, for example 100ms, the system will dynamically adjust the GC behavior to float around 50ms.

Evacuation Pause of Young Generation Mode Transfer #

From the previous analysis, it can be seen that G1 GC continuously adjusts its garbage collection strategy and behavior based on the running conditions. In the initial start of the application, G1 has not collected sufficient information, and it is in the initial fully-young mode. When the young generation space is full, the application threads are paused, and the surviving objects in the young generation are copied to the survivor space. If there is no survivor space, some idle memory blocks are arbitrarily selected as the survivor space.

The process of copying is called evacuation, which is the same working principle as other young generation collectors introduced earlier.

Concurrent Marking #

At the same time, we can also see that many concepts of G1 GC are built on the basis of CMS, so the following content requires a certain understanding of CMS.

The concurrent marking process of G1 is basically the same as CMS. The concurrent marking of G1 captures a snapshot of all live objects at the beginning of the marking phase using the “Snapshot-At-The-Beginning” approach. This snapshot is taken even if some objects become garbage during the marking process. Based on the survival information of objects, the survival status of each small heap block can be constructed to facilitate efficient selection by the garbage collector.

This information will be used to perform garbage collection in the old generation region in the next phase.

There are two cases that can be executed entirely in parallel:

If it is determined during the marking phase that there are no live objects in a small heap block, and it only contains garbage.
During the STW evacuation pause, old generation small heap blocks are found that contain both garbage and live objects.

When the overall usage ratio of the heap memory reaches a certain threshold, concurrent marking is triggered. The default threshold is 45%, but it can also be set through the JVM parameter “InitiatingHeapOccupancyPercent”. Like CMS, concurrent marking in G1 also consists of multiple phases, some of which are fully concurrent, while others pause the application threads.

Phase 1: Initial Mark

This phase marks all objects directly reachable from the GC root objects. In CMS, this requires a STW pause, but in G1, this is usually done during the evacuation pause, so the overhead is small.

Phase 2: Root Region Scan

This phase marks all live objects reachable from the “root regions”. Root regions include non-empty regions and regions that must be collected during the marking process.

Because migrating objects during the concurrent marking process can cause many problems, this phase must be completed before the next evacuation pause. If an evacuation pause must be initiated, it will first require the root region scan to be interrupted and wait for it to complete before continuing the scan. In the current implementation, the root regions are the live small heap blocks, including the young generation small heap blocks that will definitely be cleaned up in the next evacuation pause.

Phase 3: Concurrent Mark

This phase is very similar to the concurrent marking phase of CMS: it traverses the object graph and marks the objects that can be accessed in a special bitmap. To ensure the accuracy of the snapshot at the beginning of marking, all application threads concurrently update the object graph with references, and G1 requires discarding outdated references that were referenced for marking purposes in the earlier stages.

Stage 4: Remark

Similar to CMS, this is a STW pause (because it is not a concurrent phase) to complete the marking process.

The G1 collector briefly stops the application threads, stops writing concurrent update information, processes a small amount of information, and marks all surviving objects that were not marked at the beginning of the concurrent marking. This stage also performs some additional cleanup, such as reference processing or class unloading.

Stage 5: Cleanup

The final cleanup stage prepares for the upcoming evacuation phase, counts all surviving objects in small heap regions, and sorts the small heap regions to improve GC efficiency. This stage also performs all necessary housekeeping activities for the next marking: maintaining the internal state of concurrent marking.

All small heap regions that do not contain any surviving objects are reclaimed in this stage. Some tasks are concurrent, such as the reclamation of empty regions and most of the liveliness calculation. This stage also requires a short STW pause to complete the tasks without interference from the application threads.

Evacuation Pause (mixed) #

After the concurrent marking is completed, G1 performs a mixed collection, which involves not only cleaning up the young generation but also including some old generation regions in the collection set. The mixed mode evacuation pause does not necessarily immediately follow the concurrent marking stage. The startup time of the mixed mode is influenced by many rules and historical data. For example, if a lot of small heap regions can be freed concurrently in the old generation, there is no need to start the mixed mode. Therefore, there may be multiple young mode evacuation pauses between concurrent marking and mixed mode evacuation pauses.

The specific size and order of the old generation regions added to the collection set are determined based on various rules. This includes specified soft real-time performance indicators, liveliness, GC efficiency data collected during the concurrent marking, and some configurable JVM options. The process of mixed collection is largely similar to the earlier fully-young GC.

Introduction to Remembered Sets

Remembered sets are used to support the independent reclamation of different small heap regions.

For example, when reclaiming small heap regions A, B, and C, we need to know if there are references from region D or E pointing to them in order to determine their liveness. However, traversing the entire heap would take a considerable amount of time, which contradicts the purpose of incremental collection, so some optimization methods must be adopted. Similar to the “card” method used to support garbage collection in the young generation in other GC algorithms, G1 uses Remembered Sets.

As shown in the following figure, each small heap region has a Remembered Set that lists all references pointing to it from outside. These references are treated as additional GC roots. Note that objects in the old generation that are determined to be garbage during the concurrent marking process will be ignored even if there are external references pointing to them, because in this case, the referrers are also garbage (such as references between garbage objects or cyclic references).

The subsequent actions are the same as other garbage collectors: multiple GC threads determine which objects are live and which are garbage in parallel:

Finally, the surviving objects are evacuated to survivor regions and new small heap regions are created if necessary. Now, the empty small heap regions are freed and can be used to store new objects.

Summary of Experience in Choosing GC #

Through the study of this section, you should have gained some understanding of the G1 garbage collector. Of course, for the sake of conciseness, we have omitted many implementation details, such as how to handle “humongous objects”.

Overall, G1 is the most advanced production-ready garbage collector in HotSpot JVM before JDK 11. It is important to note that the main focus of HotSpot engineers is to continuously improve G1. In updated versions of JDK, more powerful features and optimizations will be introduced.

As we can see, G1, as a replacement for CMS, solves various difficult problems in CMS, including the predictability of pause times and the elimination of heap fragmentation. For systems that are highly sensitive to single-task latency and have no CPU resource limitations, G1 can be said to be the best choice in HotSpot, especially in the latest versions of JVM. However, this latency optimization is not without cost: due to additional write barriers and daemon threads, G1 incurs greater overhead. If the system prioritizes throughput or CPU usage remains consistently at 100%, and if the pause time of a single GC is not a concern, then CMS is a better choice.

In summary, G1 is suitable for scenarios that require large memory and low latency.

The only feasible way to choose the correct GC algorithm is to try it out and identify any areas that are not reasonable. Here are some general guidelines:

If the system prioritizes throughput and all CPU resources are used to process the business logic as much as possible, use Parallel GC.
If the system prioritizes low latency and aims for short GC times, use CMS GC.
If the system has a large heap size and wants to control the average GC time overall, use G1 GC.

Considering the memory size:

Generally, if it is larger than 4G, G1 has a higher cost-effectiveness.
Generally, if it exceeds 8G, such as 16G-64G of memory, it is highly recommended to use G1 GC.

Lastly, let’s discuss a question that many developers often overlook, and is frequently asked in interviews at major companies:

What is the default GC in JDK 8?

Many people may think it is CMS or even G1, but neither is correct.

The answer is: the default GC strategy in JDK 8 is Parallel GC.

Note that G1 became the default GC strategy in JDK 9, and the combination of ParNew + SerialOld is not supported.

The next section will introduce new GC strategies in higher JDK versions (ZGC, Shenandoah, etc.).