
30 GC Difficult Situation Problem Investigation and Analysis Part 2 #

Weak, Soft, and Phantom References #

Another type of issue that affects garbage collection (GC) is the presence of non-strong references in the program. While such references often help avoid an OutOfMemoryError, heavy use of them can severely impact GC and degrade system performance.

Disadvantages of Weak References #

Firstly, weakly referenced objects are collected eagerly. When the garbage collector finds a weakly reachable object (one that is reachable only through weak references), it places the weak reference into the corresponding ReferenceQueue and the object becomes eligible for finalization. The reference queue can then be polled and the necessary cleanup performed. A typical example is clearing out cache keys that are no longer referenced.

Of course, at this point we can still assign a new strong reference to the object, so just before finalization and disposal the GC re-checks whether the object can really be safely collected. Because of this, garbage collecting a weakly referenced object spans multiple GC cycles.
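This lifecycle (weak reference cleared, then enqueued, then polled for cleanup) can be sketched with a small program. This is a minimal sketch, not code from this chapter's example; the retry loop exists because System.gc() is only a hint:

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

public class WeakRefQueueDemo {

    // Returns true once the weak reference has been cleared and enqueued.
    static boolean demo() {
        ReferenceQueue<byte[]> queue = new ReferenceQueue<>();
        // The byte[] is reachable only through the weak reference.
        WeakReference<byte[]> ref = new WeakReference<>(new byte[1024], queue);
        try {
            Reference<? extends byte[]> enqueued = null;
            for (int i = 0; i < 50 && enqueued == null; i++) {
                System.gc();                  // a hint, hence the retry loop
                enqueued = queue.remove(100); // blocks up to 100 ms
            }
            // By the time the reference is enqueued, the referent is cleared.
            return enqueued == ref && ref.get() == null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("weak reference cleared and enqueued: " + demo());
    }
}
```

Once the reference is enqueued, the referent has already been cleared, which is exactly how cache implementations use the queue to purge stale entries.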

In practice, weak references are used in many scenarios. Most caching frameworks are implemented based on weak references, so even if the business code does not directly use weak references, they are likely to be present in the program.

Secondly, soft references are even harder for the garbage collector to collect compared to weak references. There is no exact timing for collecting soft references; it is left to the JVM to decide. Usually, soft references are only collected as a last resort when available memory is running out. This means that there may be more frequent Full GCs, and the pause time may be longer than expected due to the abundance of surviving objects in the old generation.
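The contrast with soft references can be seen in another minimal sketch (again hypothetical, not this chapter's example program): with plenty of free heap, an ordinary GC leaves a freshly created soft reference untouched, which is why softly reachable objects can pile up in the old generation.

```java
import java.lang.ref.SoftReference;

public class SoftRefDemo {

    // Softly referenced objects survive ordinary GCs and are reclaimed
    // only under memory pressure (the exact policy is up to the JVM).
    static boolean survivesOrdinaryGc() {
        SoftReference<byte[]> ref = new SoftReference<>(new byte[1024]);
        System.gc();  // a hint; with ample free heap this does not clear it
        return ref.get() != null;
    }

    public static void main(String[] args) {
        System.out.println("soft ref survives ordinary GC: " + survivesOrdinaryGc());
    }
}
```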

Lastly, when using phantom references, you must manage memory manually to decide when the referenced objects can be safely reclaimed. This looks straightforward on the surface, but it is not. The javadoc states:

In order to ensure that a reclaimable object remains so, the referent of a phantom reference may not be retrieved: The get method of a phantom reference always returns null.


Surprisingly, many developers overlook the next paragraph (this is the key):

Unlike soft and weak references, phantom references are not automatically cleared by the garbage collector as they are enqueued. An object that is reachable via phantom references will remain so until all such references are cleared or themselves become unreachable.


This means that we must manually call the clear() method to clear a phantom reference; otherwise an OutOfMemoryError may occur and crash the JVM. The reason to use phantom references at all is that they are the only standard way to track, in code, when an object becomes unreachable. Unlike soft and weak references, we cannot "resurrect" a phantom-reachable object.
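A minimal sketch of this contract (hypothetical code, not the chapter's example program): the enqueued phantom reference never exposes its referent through get(), and on JDK 8 the referent's memory is released only after clear() is called or the reference itself becomes unreachable.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;

public class PhantomCleanupDemo {

    // Returns true once the phantom reference is enqueued and cleared.
    static boolean demo() {
        ReferenceQueue<byte[]> queue = new ReferenceQueue<>();
        PhantomReference<byte[]> ref =
                new PhantomReference<>(new byte[1024 * 1024], queue);
        try {
            Reference<? extends byte[]> enqueued = null;
            for (int i = 0; i < 50 && enqueued == null; i++) {
                System.gc();                  // a hint only; retry loop
                enqueued = queue.remove(100);
            }
            if (enqueued == null) {
                return false;  // collector never enqueued the reference
            }
            // get() always returns null for phantom references...
            boolean getIsNull = enqueued.get() == null;
            // ...and on JDK 8 the referent stays in memory until we
            // clear() the reference (or the reference itself dies).
            enqueued.clear();
            return getIsNull && enqueued == ref;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("phantom reference handled: " + demo());
    }
}
```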

Example #

Let’s take a look at an example of weak references, where a large number of objects are created and collected during a Minor GC. Similar to before, we can modify the promotion threshold. You can use the following JVM parameters:

-Xmx24m -XX:NewSize=16m -XX:MaxTenuringThreshold=1

The GC log in this case will look like this:

2.330: [GC (Allocation Failure)  20933K->8229K(22528K), 0.0033848 secs]
2.335: [GC (Allocation Failure)  20517K->7813K(22528K), 0.0022426 secs]
2.339: [GC (Allocation Failure)  20101K->7429K(22528K), 0.0010920 secs]
2.341: [GC (Allocation Failure)  19717K->9157K(22528K), 0.0056285 secs]
2.348: [GC (Allocation Failure)  21445K->8997K(22528K), 0.0041313 secs]
2.354: [GC (Allocation Failure)  21285K->8581K(22528K), 0.0033737 secs]
2.359: [GC (Allocation Failure)  20869K->8197K(22528K), 0.0023407 secs]
2.362: [GC (Allocation Failure)  20485K->7845K(22528K), 0.0011553 secs]
2.365: [GC (Allocation Failure)  20133K->9501K(22528K), 0.0060705 secs]
2.371: [Full GC (Ergonomics)     9501K->2987K(22528K), 0.0171452 secs]

As you can see, there are very few Full GCs. However, if we use weak references to refer to the created objects, using the JVM parameter -Dweak.refs=true, the situation will change significantly. There are many reasons to use weak references, such as when an object is used as a key in a weak hash map. In any case, using weak references can lead to the following scenario:
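The example program itself is not listed in this chapter, but a hypothetical reconstruction might look like the following. The weak.refs property name is taken from the text; everything else (class name, buffer sizes, iteration counts) is an assumption:

```java
import java.lang.ref.WeakReference;
import java.util.ArrayDeque;
import java.util.Deque;

public class WeakRefPressure {
    private static final boolean WEAK_REFS = Boolean.getBoolean("weak.refs");

    // Allocate short-lived buffers; when -Dweak.refs=true, also track each
    // one with a WeakReference. Clearing those references takes an extra GC
    // cycle, during which the buffers can be promoted to the old generation.
    static long churn(int iterations) {
        Deque<WeakReference<byte[]>> refs = new ArrayDeque<>();
        long allocated = 0;
        for (int i = 0; i < iterations; i++) {
            byte[] buffer = new byte[4 * 1024];
            allocated += buffer.length;
            if (WEAK_REFS) {
                refs.add(new WeakReference<>(buffer));
                if (refs.size() > 10_000) {
                    refs.poll();  // bound the number of live references
                }
            }
        }
        return allocated;
    }

    public static void main(String[] args) {
        System.out.println("allocated bytes: " + churn(100_000));
    }
}
```

Run with the JVM parameters from the text (for example -Xmx24m -XX:NewSize=16m -XX:MaxTenuringThreshold=1, with or without -Dweak.refs=true) to compare the GC logs.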

2.059: [Full GC (Ergonomics)  20365K->19611K(22528K), 0.0654090 secs]
2.125: [Full GC (Ergonomics)  20365K->19711K(22528K), 0.0707499 secs]
2.196: [Full GC (Ergonomics)  20365K->19798K(22528K), 0.0717052 secs]
2.268: [Full GC (Ergonomics)  20365K->19873K(22528K), 0.0686290 secs]
2.337: [Full GC (Ergonomics)  20365K->19939K(22528K), 0.0702009 secs]
2.407: [Full GC (Ergonomics)  20365K->19995K(22528K), 0.0694095 secs]

As you can see, there are multiple Full GCs occurring, and the GC time has increased by an order of magnitude compared to the previous example!

This is another example of premature promotion, but this time the situation is trickier: the root cause lies in the weak references. Once these dying objects are tracked by weak references, clearing the references takes an extra GC cycle, during which the objects survive long enough to be promoted to the old generation, where they then require proper cleanup.

As before, the simplest solution is to increase the size of the young generation, for example, using the JVM parameters -Xmx64m -XX:NewSize=32m:

2.328: [GC (Allocation Failure)  38940K->13596K(61440K), 0.0012818 secs]
2.332: [GC (Allocation Failure)  38172K->14812K(61440K), 0.0060333 secs]
2.341: [GC (Allocation Failure)  39388K->13948K(61440K), 0.0029427 secs]
2.347: [GC (Allocation Failure)  38524K->15228K(61440K), 0.0101199 secs]
2.361: [GC (Allocation Failure)  39804K->14428K(61440K), 0.0040940 secs]
2.368: [GC (Allocation Failure)  39004K->13532K(61440K), 0.0012451 secs]

At this time, the objects are garbage collected during Minor GC.

The worst case involves soft references, as in this soft reference example program: unless the program is about to throw an OutOfMemoryError, softly referenced objects will not be garbage collected. In the example program, replacing the weak references with soft references immediately produces many more Full GC events:

2.162: [Full GC (Ergonomics)  31561K->12865K(61440K), 0.0181392 secs]
2.184: [GC (Allocation Failure)  37441K->17585K(61440K), 0.0024479 secs]
2.189: [GC (Allocation Failure)  42161K->27033K(61440K), 0.0061485 secs]
2.195: [Full GC (Ergonomics)  27033K->14385K(61440K), 0.0228773 secs]
2.221: [GC (Allocation Failure)  38961K->20633K(61440K), 0.0030729 secs]
2.227: [GC (Allocation Failure)  45209K->31609K(61440K), 0.0069772 secs]
2.234: [Full GC (Ergonomics)  31609K->15905K(61440K), 0.0257689 secs]

Most interestingly, the phantom reference example behaves much like the weak reference example under the same JVM parameters. In fact, it shows significantly fewer Full GC pauses, because the two reference types are finalized differently.

If phantom reference clearing is disabled by adding the JVM startup parameter -Dno.ref.clearing=true, you will see:

4.180: [Full GC (Ergonomics)  57343K->57087K(61440K), 0.0879851 secs]
4.269: [Full GC (Ergonomics)  57089K->57088K(61440K), 0.0973912 secs]
4.366: [Full GC (Ergonomics)  57091K->57089K(61440K), 0.0948099 secs]

The main thread quickly throws an exception:

java.lang.OutOfMemoryError: Java heap space

When using phantom references, be cautious and clean up phantom reachable objects promptly; otherwise an OutOfMemoryError is likely to occur.

Trust our experience: if exceptions in the reference-queue handling thread are not caught, the system will soon go down.

Effects of Non-Strong References #

It is recommended to use the JVM parameter -XX:+PrintReferenceGC to see the impact of the various reference types on GC. Starting the weak reference example with this parameter produces:

2.173: [Full GC (Ergonomics)
        2.234: [SoftReference, 0 refs, 0.0000151 secs]
        2.234: [WeakReference, 2648 refs, 0.0001714 secs]
        2.234: [FinalReference, 1 refs, 0.0000037 secs]
        2.234: [PhantomReference, 0 refs, 0 refs, 0.0000039 secs]
        2.234: [JNI Weak Reference, 0.0000027 secs]
        [PSYoungGen: 9216K->8676K(10752K)]
        [ParOldGen: 12115K->12115K(12288K)] 21331K->20792K(23040K),
        [Metaspace: 3725K->3725K(1056768K)], 0.0766685 secs]
        [Times: user=0.49 sys=0.01, real=0.08 secs]
2.250: [Full GC (Ergonomics)
        2.307: [SoftReference, 0 refs, 0.0000173 secs]
        2.307: [WeakReference, 2298 refs, 0.0001535 secs]
        2.307: [FinalReference, 3 refs, 0.0000043 secs]
        2.307: [PhantomReference, 0 refs, 0 refs, 0.0000042 secs]
        2.307: [JNI Weak Reference, 0.0000029 secs]
        [PSYoungGen: 9215K->8747K(10752K)]
        [ParOldGen: 12115K->12115K(12288K)] 21331K->20863K(23040K),
        [Metaspace: 3725K->3725K(1056768K)], 0.0734832 secs]
        [Times: user=0.52 sys=0.01, real=0.07 secs]
2.323: [Full GC (Ergonomics)
        2.383: [SoftReference, 0 refs, 0.0000161 secs]
        2.383: [WeakReference, 1981 refs, 0.0001292 secs]
        2.383: [FinalReference, 16 refs, 0.0000049 secs]
        2.383: [PhantomReference, 0 refs, 0 refs, 0.0000040 secs]
        2.383: [JNI Weak Reference, 0.0000027 secs]
        [PSYoungGen: 9216K->8809K(10752K)]
        [ParOldGen: 12115K->12115K(12288K)] 21331K->20925K(23040K),
        [Metaspace: 3725K->3725K(1056768K)], 0.0738414 secs]
        [Times: user=0.52 sys=0.01, real=0.08 secs]

Only after determining how the GC affects the throughput and latency of the application, should you spend time analyzing and reviewing this part of the log. In most cases, the number of references cleaned up by each GC is very small, often zero.

If the GC spends a lot of time cleaning up these references or cleans up a large number of references, further observation and analysis are required.

Solution #

If the program is indeed misusing or abusing weak, soft, or phantom references, the implementation logic of the program generally needs to be modified. Since every system is different, it is difficult to provide universal guidance, but there are some common practices:

  • Weak references: If the usage of a certain memory pool increases and causes performance issues, increase the size of this memory pool (you may also need to increase the maximum capacity of the heap memory). As seen in the example, increasing the size of the heap memory and the size of the young generation can alleviate the symptoms.
  • Soft references: If you determine that soft references are the root cause of the problem, the only solution is to modify the source code of the program to change the internal implementation logic.
  • Phantom references: Ensure that the clear method of the phantom reference is called in the program. It is easy to overlook certain phantom references in programming, or the clearing speed may not keep up with the production speed, or the thread clearing the reference queue might fail. These situations can put a lot of pressure on the GC, and ultimately may cause an OutOfMemoryError.
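The last bullet can be made concrete with a sketch of a reference-queue cleanup thread. This is hypothetical code illustrating the practices above: always call clear(), and never let an uncaught exception kill the thread; the resource-release step is a placeholder.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;

public class RefQueueCleaner implements Runnable {
    private final ReferenceQueue<?> queue;

    public RefQueueCleaner(ReferenceQueue<?> queue) {
        this.queue = queue;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                Reference<?> ref = queue.remove();  // blocks until enqueued
                // ... release the external resource associated with ref ...
                ref.clear();  // on JDK 8, phantom referents stay alive without this
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();  // exit the loop
            } catch (RuntimeException e) {
                // Log and continue: an uncaught exception here stops cleanup
                // and eventually leads to OutOfMemoryError.
                System.err.println("cleanup failed: " + e);
            }
        }
    }

    // Convenience: start the cleaner as a daemon thread.
    public static Thread start(ReferenceQueue<?> queue) {
        Thread t = new Thread(new RefQueueCleaner(queue), "ref-cleaner");
        t.setDaemon(true);
        t.start();
        return t;
    }
}
```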

Other Performance Issue Scenarios #

The previous section described the most common GC performance issues. This section discusses some less common issues that can still cause system failures.

RMI and GC #

If the system provides or consumes RMI services, the JVM periodically performs a Full GC to ensure that locally unused objects do not keep occupying space on the other end. Even if your code does not publish RMI services, third-party libraries or toolkits may still open RMI endpoints. The most common culprit is JMX, which uses RMI to publish data when a remote client connects.

The problem is that this introduces many unnecessary periodic Full GCs. Checking old-generation usage typically shows no memory pressure and plenty of free space, yet the Full GC is still triggered and pauses all application threads.

This periodic invocation of System.gc() to clean up remote references is performed by the sun.rmi.transport.ObjectTable class, which calls sun.misc.GC.requestLatency(long gcInterval).

For many applications, it is simply unnecessary and even detrimental to performance. To disable this periodic GC behavior, the following JVM parameters can be used:

java -Dsun.rmi.dgc.server.gcInterval=9223372036854775807
    -Dsun.rmi.dgc.client.gcInterval=9223372036854775807
    com.yourcompany.YourApplication

This sets the interval between these periodic System.gc() calls to Long.MAX_VALUE milliseconds, which a running system will in practice never reach.

// ObjectTable.class
private static final long gcInterval =
((Long) AccessController.doPrivileged(
    new GetLongAction("sun.rmi.dgc.server.gcInterval", 3600000L)
)).longValue();

As you can see, the default value is 3600000L, which triggers a Full GC every hour.

Another option is to specify the JVM parameter -XX:+DisableExplicitGC, which ignores explicit System.gc() calls. However, we strongly discourage this approach, because we cannot be sure it has no hidden side effects, such as a third-party library that genuinely needs the explicit call.

JVMTI Tagging and GC #

If a Java Agent is specified at program startup using -javaagent, the Agent can use JVMTI tagging to tag objects in the heap. If a large number of objects are tagged, it is likely to cause GC performance issues, resulting in increased latency and decreased throughput.

The problem occurs in native code, where JvmtiTagMap::do_weak_oops traverses all tagged objects and performs some time-consuming operations during each GC. What’s worse is that this operation is performed serially.

If there are a large number of tags, it means that a large part of the work during GC is executed in a single thread, and the GC pause time may increase by an order of magnitude.

To check whether a Java agent's tagging is increasing GC pause times, you can use the diagnostic parameter -XX:+TraceJVMTIObjectTagging. Once tracing is enabled, you can estimate how much native memory the tag map consumes and how long each traversal takes.

If you are not the author of the agent, you generally cannot solve such problems yourself: beyond reporting a bug, there is little you can do. If this happens, it is recommended to advise the vendor to clean up unnecessary tags. (We once found that an APM vendor's agent occasionally caused JVM OOM crashes in production environments.)

Huge Object Allocation (Humongous Allocations) #

If the G1 garbage collection algorithm is used, a class of GC performance problems is caused by huge (humongous) objects.

Note: In G1, a huge object is an object that occupies more than 50% of a heap region.

Frequent creation of huge objects will undoubtedly cause GC performance problems. Let’s take a look at how G1 handles it:

  • If a region contains a huge object, the space left in that region after the huge object is never allocated. If there are many huge objects, this unused space causes memory fragmentation.
  • G1 does not optimize the collection of huge objects. This was a particularly thorny problem before JDK 8u40: in versions prior to Java 1.8u40, the region containing a huge object could only be reclaimed during a Full GC. In newer HotSpot JVMs, humongous regions are freed in the cleanup phase that follows marking, so the impact of this problem is greatly reduced.
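To see humongous allocation in action, here is a minimal sketch, assuming 2 MB regions (the real region size depends on the heap configuration). A byte[1024 * 1024] carries a 16-byte array header, so its allocation request is 1048592 bytes, just over the 50%-of-region threshold of 1 MB:

```java
public class HumongousDemo {

    // Repeatedly allocate arrays whose total size (payload + 16-byte
    // header) exceeds half of a 2 MB G1 region, making each allocation
    // humongous; keep a few alive so regions stay occupied.
    static int allocate(int count) {
        byte[][] keep = new byte[16][];
        for (int i = 0; i < count; i++) {
            keep[i % keep.length] = new byte[1024 * 1024];
        }
        return keep.length;
    }

    public static void main(String[] args) {
        System.out.println("live humongous candidates: " + allocate(256));
    }
}
```

Running this class with G1 and GC logging enabled (as shown next) should produce "G1 Humongous Allocation" entries in the log.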

To monitor the existence of huge objects, you can open the GC log with the following command:

    java -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
        -XX:+PrintReferenceGC -XX:+UseG1GC
        -XX:+PrintAdaptiveSizePolicy -Xmx128m
        MyClass

The GC log may contain the following section:

     0.106: [G1Ergonomics (Concurrent Cycles)
            request concurrent cycle initiation,
            reason: occupancy higher than threshold,
            occupancy: 60817408 bytes,
            allocation request: 1048592 bytes,
            threshold: 60397965 bytes (45.00 %),
            source: concurrent humongous allocation]
     0.106: [G1Ergonomics (Concurrent Cycles)
            request concurrent cycle initiation,
            reason: requested by GC cause,
            GC cause: G1 Humongous Allocation]
     0.106: [G1Ergonomics (Concurrent Cycles)
            initiate concurrent cycle,
            reason: concurrent cycle initiation requested]
     0.106: [GC pause (G1 Humongous Allocation)
            (young) (initial-mark)
            0.106: [G1Ergonomics (CSet Construction)
                start choosing CSet,
                _pending_cards: 0,
                predicted base
                time: 10.00 ms,
                remaining time: 190.00 ms,
                target pause time: 200.00 ms]

This log is evidence that huge objects are indeed being created by the program. You can see that "G1 Humongous Allocation" is the reason for the GC pause. The allocation request: 1048592 bytes shows that the program tried to allocate an object of 1048592 bytes, which is 16 bytes larger than half of the 2 MB region size (1 MB), the humongous threshold.

The first solution is to change the region size so that most objects no longer exceed 50% of a region and thus no longer require humongous allocation. The G1 region size is computed from the heap size at startup, but you can override the default with the parameter -XX:G1HeapRegionSize=XX. The specified region size must be between 1 MB and 32 MB and must be a power of two, so the only valid values are 1m, 2m, 4m, 8m, 16m, and 32m.

This approach also has side effects. Increasing the size of the region indirectly reduces the number of regions, so use it with caution. It is best to conduct some tests to see if it improves throughput and latency.

A better approach is to limit the size of objects in the program. We can use memory analysis tools at runtime to display information about huge objects and the stack trace information when they are allocated.

Summary #

The JVM is a general-purpose platform that runs all kinds of applications and exposes hundreds of startup parameters, many of which affect GC and performance. Consequently, there are many ways to tune GC performance.

But we always need to remind ourselves that there is no silver bullet that can satisfy all performance tuning indicators.

What we need to do is understand the typical trouble spots and master the common troubleshooting and analysis methods and tools.

When encountering similar problems, we should understand the underlying principles and how the JVM and GC work, skillfully apply the various tools, observe the symptoms, collect useful metrics, and perform qualitative and quantitative analysis to find the bottleneck. We can then design, implement, and verify improvements that raise the performance and stability of the application system.