17 Don’t Assume that Automatic Garbage Collection Won’t Cause OOM #

Today, I want to talk about the fact that even with an “automatic transmission,” it is still possible to run into OOM (Out of Memory) errors.

Here, “automatic transmission” is my nickname for Java’s automatic garbage collector. After many years of development, Java’s garbage collectors have indeed become very mature. With automatic garbage collection, in most cases we can focus on business logic when writing programs, without worrying much about object allocation and deallocation, and OOM errors generally do not occur.

However, memory space is still limited, and Java’s major memory areas are still susceptible to OOM errors. Accordingly, the common types of OOM errors in Java programs can be divided into heap memory OOM, stack OOM, metaspace OOM, direct memory OOM, etc. Almost every type of OOM error can be simulated with a few lines of code, and there are also many resources available that demonstrate the allocation of oversized or unlimited objects in the heap, metaspace, and direct memory, as well as attempts to create an infinite number of threads or perform unlimited recursive method calls.

It is worth noting, however, that business code does not usually do such things. So today, from the perspective of being aware of how memory is allocated, I will walk through several cases that illustrate pitfalls in business code that can lead to OOM: a lack of awareness of how many objects are being allocated, improper resource usage, and failure to limit the amount of data held in a cache.

In the previous lesson on threads, we already saw two OOM scenarios: heap OOM caused by using an unbounded queue, and OOM caused by a thread pool without a maximum thread count that kept creating threads without limit. Next, let’s look at what other oversights can lead to OOM when writing business code.

OOM caused by too many duplicate objects #

The first case I want to share is as follows. A project caches the complete user data set in memory so that user searches can return user information directly from the cache. Now, to improve the user experience, we need to automatically suggest complete usernames as the user types part of a name (also known as autocomplete).

In Lesson 10, when talking about collections, I mentioned that for such fast retrieval requirements, it is best to use a Map to implement it, which is much faster than searching directly from a List.

To implement this functionality, we need a HashMap to store the user data, where the key is a username prefix (the index term) and the value is the list of matching users. For example, if there are two users “aa” and “ab”, then there are three keys: “a”, “aa”, and “ab”. When the user types the letter “a”, the corresponding value list returns all users whose names start with “a”, namely “aa” and “ab”.

In the code, we save 10,000 test users to the database, each with a randomly generated name composed of the letters “a” to “j”. We then store every prefix of each username, from the first character up to the full name, as a key in the cache; the value is a List of UserDTO objects holding the information of all users whose names share that prefix:

// Autocomplete index, where Key is the partial username input by the user, and Value is the corresponding user data
private ConcurrentHashMap<String, List<UserDTO>> autoCompleteIndex = new ConcurrentHashMap<>();

@Autowired
private UserRepository userRepository;

@PostConstruct
public void wrong() {

    // Save 10,000 users with randomly generated names to the database
    userRepository.saveAll(LongStream.rangeClosed(1, 10000).mapToObj(i -> new UserEntity(i, randomName())).collect(Collectors.toList()));

    // Load all users from the database
    userRepository.findAll().forEach(userEntity -> {

        int len = userEntity.getName().length();
        // For each user, index their username with the first N characters, where N can be 1 to 6
        for (int i = 0; i < len; i++) {
            String key = userEntity.getName().substring(0, i + 1);
            autoCompleteIndex.computeIfAbsent(key, s -> new ArrayList<>())
                .add(new UserDTO(userEntity.getName()));
        }
    });
    log.info("autoCompleteIndex size:{} count:{}", autoCompleteIndex.size(),
                autoCompleteIndex.entrySet().stream().map(item -> item.getValue().size()).reduce(0, Integer::sum));
}
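
For completeness, looking up autocomplete suggestions against this index is then a single map access. Below is a minimal sketch (the suggest method is hypothetical and not part of the original project code):

// Hypothetical lookup: return all cached users whose names start with the typed prefix
public List<UserDTO> suggest(String prefix) {
    return autoCompleteIndex.getOrDefault(prefix, Collections.emptyList());
}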

For each UserDTO object, in addition to the username, we also attach about 10KB of mock payload data to simulate the rest of the user information:

@Data
public class UserDTO {

    private String name;

    @EqualsAndHashCode.Exclude
    private String payload;

    public UserDTO(String name) {
        this.name = name;
        this.payload = IntStream.rangeClosed(1, 10_000)
                .mapToObj(__ -> "a")
                .collect(Collectors.joining(""));
    }
}

After running the program, the log outputs the following:

[11:11:22.982] [main] [INFO ] [.t.c.o.d.UsernameAutoCompleteService:37  ] - autoCompleteIndex size:26838 count:60000

As we can see, there are 26838 index keys in total (all the 1- to 6-character prefixes of the usernames), and the Lists stored as values hold 10,000 users * 6 = 60,000 UserDTO objects in total.

Using the memory analysis tool MAT to open the heap dump, we can see that the 60,000 UserDTO objects occupy about 1.2GB of memory:

[Figure: MAT analysis of the heap dump, showing the 60,000 UserDTO objects retaining about 1.2GB of memory]

At this point, we realize that although there are only 10,000 real users, because we use every prefix of the username as an index key, the cache contains 26838 keys and 60,000 user information entries. If usernames were not 6 characters but 10 or 20, the cache could hold 100,000 or even 200,000 entries, which would inevitably cause a heap OOM error.

If we try to increase the maximum length of the username and restart the program, we will see something like this error:

[17:30:29.858] [main] [ERROR] [ringframework.boot.SpringApplication:826 ] - Application run failed
org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'usernameAutoCompleteService': Invocation of init method failed; nested exception is java.lang.OutOfMemoryError: Java heap space

We may naturally assume that if there are 10,000 users in the database, there should be only 10,000 UserDTO objects in memory. However, the implementation creates a new UserDTO object for every key and adds it to the cache, so every entry in memory is a distinct object. In real projects, user information is often cached incrementally as users type rather than fully loaded at startup, so the problem may not surface as early as it does in this example.

Knowing the cause, the solution is simple: first add all the UserDTO objects to a HashSet. Because UserDTO’s equality is based only on the name (the payload field is excluded from equals and hashCode), duplicate usernames are filtered out, and the HashSet ends up holding fewer than 10,000 UserDTO objects.

With the HashSet holding all the distinct UserDTO objects, we can then build the autoCompleteIndex HashMap for autocomplete directly from it. This way, the different prefix keys of the same username (for example, the keys “a”, “ab”, and “abc” for the username “abc”) all point to the same UserDTO object:

@PostConstruct
public void right() {
    ...
    HashSet<UserDTO> cache = userRepository.findAll().stream()
        .map(item -> new UserDTO(item.getName()))
        .collect(Collectors.toCollection(HashSet::new));
    cache.stream().forEach(userDTO -> {
        int len = userDTO.getName().length();
        for (int i = 0; i < len; i++) {
            String key = userDTO.getName().substring(0, i + 1);
            autoCompleteIndex.computeIfAbsent(key, s -> new ArrayList<>())
                .add(userDTO);
        }
    });
    ...
}

After analyzing the heap memory again, we can see that there are only 9945 UserDTO objects, occupying less than 200MB in total. This is the result we truly want.

[Figure: MAT analysis after the fix, showing only 9945 UserDTO objects occupying less than 200MB in total]

With the fixed program, not only is there just one copy of each distinct UserDTO shared across all of its prefix keys, reducing the object count to one-sixth of the original, but the HashSet also deduplicates users with the same name, saving memory a second time.

The point to note is that even when we know the total amount of data, we can easily overlook the fact that each piece of data may exist as multiple copies in memory. I encountered another case where a back-end program needed to load a large amount of data from the database for export. The data occupied only 100MB in the database, but a 1GB JVM heap was not enough to complete the export.

The reason is this. Loading the 100MB of raw data into Java data structures already takes about 200MB of heap. That data ends up being held twice by JDBC, MyBatis, and the other framework layers, and is then copied another two times during conversion to the domain model and the DTO, so memory usage reaches 200MB * 4 = 800MB.

Therefore, when doing capacity evaluation, we cannot assume that one piece of data in storage corresponds to only one copy of that data in program memory.

Using WeakHashMap does not mean no OOM #

For the fast-retrieval scenario in the previous section, some of you may consider using a WeakHashMap as the cache container to prevent a large amount of data from accumulating in the cache and causing OOM.

The characteristic of WeakHashMap is that the Key is a weak reference within the hash table. When there is no strong reference pointing to this Key, the Entry will be garbage collected. Even if we keep adding data to WeakHashMap indefinitely, as long as the Key is no longer in use, there will be no OOM.

Speaking of strong references and weak references, let’s review the relationship between reference types and garbage collection in Java:

  • The garbage collector never collects objects that are strongly referenced;
  • Objects that are only softly referenced are collected only when memory is running low;
  • Objects that are only weakly referenced are collected as soon as the garbage collector finds them.

WeakHashMap leverages this feature of weak references.
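
To make the difference concrete, here is a minimal sketch of weak reference behavior (note that System.gc() is only a hint, so the exact collection timing is not guaranteed by the JVM):

import java.lang.ref.WeakReference;

public class ReferenceDemo {
    public static void main(String[] args) {
        Object strong = new Object();                                    // strongly reachable: never collected
        WeakReference<Object> weak = new WeakReference<>(new Object());  // its referent is only weakly reachable

        System.gc(); // only a hint to the JVM; collection timing is not guaranteed

        System.out.println(strong);     // still prints the object: strong references keep it alive
        System.out.println(weak.get()); // very likely prints null: weakly reachable objects are cleared by GC
    }
}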

However, the second case I want to share is a recent one in which WeakHashMap was used and OOM still occurred. Let’s set aside for now whether WeakHashMap is an appropriate cache container at all and analyze the OOM problem first.

We declare a WeakHashMap with User keys and UserProfile values as the user data cache, add 2 million entries to it, and then use a ScheduledThreadPoolExecutor to start a scheduled task that logs the number of entries in the cache every second:

private Map<User, UserProfile> cache = new WeakHashMap<>();

@GetMapping("wrong")
public void wrong() {

    String userName = "zhuye";

    // Output the number of entries in the cache every 1 second
    Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(
            () -> log.info("cache size:{}", cache.size()), 1, 1, TimeUnit.SECONDS);

    LongStream.rangeClosed(1, 2000000).forEach(i -> {
        User user = new User(userName + i);
        cache.put(user, new UserProfile(user, "location" + i));
    });
}

After executing the program, the log is as follows:

[10:30:28.509] [pool-3-thread-1] [INFO ] [t.c.o.demo3.WeakHashMapOOMController:29  ] - cache size:2000000
[10:30:29.507] [pool-3-thread-1] [INFO ] [t.c.o.demo3.WeakHashMapOOMController:29  ] - cache size:2000000
[10:30:30.509] [pool-3-thread-1] [INFO ] [t.c.o.demo3.WeakHashMapOOMController:29  ] - cache size:2000000

As you can see, the cache size stays at 2 million, even after manually triggering GC through jvisualvm. This indicates that these entries cannot be garbage collected. If we change 2 million to 10 million, the following OOM errors appear in the log:

Exception in thread "http-nio-45678-exec-1" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "Catalina-utility-2" java.lang.OutOfMemoryError: GC overhead limit exceeded

Let’s analyze this problem. After dumping the heap, it can be seen that there are 2 million UserProfile and User objects in the heap:

img

The following is the definition of the User and UserProfile classes. It is worth noting that the Key of WeakHashMap is a User object, and its Value is UserProfile, which holds a reference to User:

@Data
@AllArgsConstructor
@NoArgsConstructor
class User {

    private String name;
}

@Data
@AllArgsConstructor
@NoArgsConstructor
public class UserProfile {

    private User user;

    private String location;

}

Yes, this is where the problem lies. Analyzing the source code of WeakHashMap, you will find the biggest difference between WeakHashMap and HashMap is the implementation of the Entry object. Let’s temporarily ignore the implementation of HashMap and take a look at the Entry object:

private static class Entry<K,V> extends WeakReference<Object> ...

/**
 * Creates new entry.
 */
Entry(Object key, V value,
      ReferenceQueue<Object> queue,
      int hash, Entry<K,V> next) {
        super(key, queue);
        this.value = value;
        this.hash  = hash;
        this.next  = next;
}

The Entry class extends WeakReference, and the Entry constructor calls super(key, queue), the parent constructor. Here, key is the key we passed to the put method, and queue is a ReferenceQueue. If you are familiar with Java references, you will know that when the referent of a weak reference is garbage collected, the reference object is placed into this queue.

Let’s see how the entries corresponding to the references placed in this queue are cleaned up:

public V get(Object key) {
    Object k = maskNull(key);
    int h = hash(k);
    Entry<K,V>[] tab = getTable();
    int index = indexFor(h, tab.length);
    Entry<K,V> e = tab[index];
    while (e != null) {
        if (e.hash == h && eq(k, e.get()))
            return e.value;
        e = e.next;
    }
    return null;
}

private Entry<K,V>[] getTable() {
    expungeStaleEntries();
    return table;
}

/**
 * Expunges stale entries from the table.
 */
private void expungeStaleEntries() {
    for (Object x; (x = queue.poll()) != null; ) {
        synchronized (queue) {
            @SuppressWarnings("unchecked")
            Entry<K,V> e = (Entry<K,V>) x;
            int i = indexFor(e.hash, table.length);

            Entry<K,V> prev = table[i];
            Entry<K,V> p = prev;
            while (p != null) {
                Entry<K,V> next = p.next;
                if (p == e) {
                    if (prev == e)
                        table[i] = next;
                    else
                        prev.next = next;
                    // Must not null out e.next;
                    // stale entries may be in use by a HashIterator
                    e.value = null; // Help GC
                    size--;
                    break;
                }
                prev = p;
                p = next;
            }
        }
    }
}

From the source code, we can see that every time get, put, size, and similar methods are called, the references whose keys have already been garbage collected are polled from the queue and the corresponding Entry objects are removed from the table. Let’s go over the logic once more:

When an object is put into the map, its key is wrapped in a weak reference (the Entry itself).

When a garbage collection runs and the key is no longer strongly referenced anywhere, the weak reference is cleared and placed into the queue.

When get and other methods are called, the queue is drained and the Entry objects holding those keys and their values are removed from the table.

Although the key of a WeakHashMap is held through a weak reference, the value holds a strong reference to the key object. The value in turn is strongly referenced by the Entry, and the Entry by the WeakHashMap itself, so the key never becomes weakly reachable and can never be garbage collected. The solution is to make the value weakly referenced as well, by wrapping UserProfile in a WeakReference:

private Map<User, WeakReference<UserProfile>> cache2 = new WeakHashMap<>();

@GetMapping("right")
public void right() {

    String userName = "zhuye";

    // Output the number of entries in the cache every 1 second
    Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(
            () -> log.info("cache size:{}", cache2.size()), 1, 1, TimeUnit.SECONDS);
    LongStream.rangeClosed(1, 2000000).forEach(i -> {
        User user = new User(userName + i);
        // This time, we use a weak reference to wrap UserProfile
        cache2.put(user, new WeakReference<>(new UserProfile(user, "location" + i)));
    });
}

After running the program again, the log shows that the cache size is no longer fixed at 2 million but keeps decreasing, and after manually triggering a garbage collection, all the entries are reclaimed:

[10:40:05.792] [pool-3-thread-1] [INFO ] [t.c.o.demo3.WeakHashMapOOMController:40  ] - cache size:1367402
[10:40:05.795] [pool-3-thread-1] [INFO ] [t.c.o.demo3.WeakHashMapOOMController:40  ] - cache size:1367846
[10:40:06.773] [pool-3-thread-1] [INFO ] [t.c.o.demo3.WeakHashMapOOMController:40  ] - cache size:549551
...
[10:40:20.742] [pool-3-thread-1] [INFO ] [t.c.o.demo3.WeakHashMapOOMController:40  ] - cache size:549551
[10:40:22.862] [pool-3-thread-1] [INFO ] [t.c.o.demo3.WeakHashMapOOMController:40  ] - cache size:547937
[10:40:22.865] [pool-3-thread-1] [INFO ] [t.c.o.demo3.WeakHashMapOOMController:40  ] - cache size:542134
// After manually triggering GC
[10:40:23.779] [pool-3-thread-1] [INFO ] [t.c.o.demo3.WeakHashMapOOMController:40  ] - cache size:0

Of course, another solution is to have the value, UserProfile, no longer reference the key, but assign a new User object to UserProfile:

@GetMapping("right2")
public void right2() {

    String userName = "zhuye";
    ...
    User user = new User(userName + i);
    cache.put(user, new UserProfile(new User(user.getName()), "location" + i));
}

In addition, ConcurrentReferenceHashMap, provided by Spring, can also be used as a cache based on weak or soft references. It wraps both keys and values in soft or weak references, which also avoids the problem of data not being released because of mutual references. Compared with WeakHashMap, ConcurrentReferenceHashMap not only performs better but is also thread-safe. You can experiment with it yourself.
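
As a starting point for that experiment, here is a minimal sketch of using Spring’s ConcurrentReferenceHashMap as the cache in the example above (SOFT is the default reference type; the initial capacity of 16 is arbitrary):

import java.util.Map;
import org.springframework.util.ConcurrentReferenceHashMap;
import org.springframework.util.ConcurrentReferenceHashMap.ReferenceType;

// Both keys and values are wrapped in soft references, so entries can be
// reclaimed when the JVM runs low on memory instead of piling up until OOM
Map<User, UserProfile> cache = new ConcurrentReferenceHashMap<>(16, ReferenceType.SOFT);
cache.put(new User("zhuye1"), new UserProfile(new User("zhuye1"), "location1"));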

Improper Tomcat parameter configuration leads to OOM #

Let’s take a look at another case. One time, an operations colleague reported that an application would become unresponsive during periods of heavy traffic, and the logs showed numerous OOM exceptions:

[13:18:17.597] [http-nio-45678-exec-70] [ERROR] [ache.coyote.http11.Http11NioProtocol:175 ] - Failed to complete processing of a request
java.lang.OutOfMemoryError: Java heap space

In response, I asked the operations colleague to generate a heap dump in the production environment. After opening the dump file with MAT, we immediately identified the cause of the OOM: there was an allocation of nearly 1.7GB of byte arrays, while the maximum heap memory for the JVM process was only configured as 2GB:

[Figure: MAT overview showing nearly 1.7GB of byte arrays in a heap capped at 2GB]

By examining the references, we discovered that a large number of them came from Tomcat’s worker threads. Most of these threads had allocated two buffers of around 10MB each, and with approximately 100 worker threads, the memory was fully consumed. The first red box in the screenshot is the Http11InputBuffer, whose buffer is 10008192 bytes; the second is the Http11OutputBuffer’s buffer, which occupies exactly 10000000 bytes:

[Figure: MAT reference analysis showing Tomcat worker threads each holding a 10008192-byte Http11InputBuffer buffer and a 10000000-byte Http11OutputBuffer buffer]

Let’s first understand why the Http11InputBuffer consumes so much memory. Looking at the init method of the Http11InputBuffer class, we can see that it allocates a buffer whose size is headerBufferSize plus the size of the socket read buffer:

void init(SocketWrapperBase<?> socketWrapper) {

    wrapper = socketWrapper;
    wrapper.setAppReadBufHandler(this);
    int bufLength = headerBufferSize +
            wrapper.getSocketBufferHandler().getReadBuffer().capacity();
    if (byteBuffer == null || byteBuffer.capacity() < bufLength) {
        byteBuffer = ByteBuffer.allocate(bufLength);
        byteBuffer.position(0).limit(0);
    }
}

The Tomcat documentation mentions that the socket read buffer (the readBuffer above) is 8192 bytes by default. Clearly, the problem lies with the headerBufferSize:

[Figure: Tomcat configuration documentation describing the default 8192-byte socket read buffer]

Tracing the initialization of the Http11InputBuffer back to the Http11Processor class, we can see that maxHttpHeaderSize is passed in as the headerBufferSize:

inputBuffer = new Http11InputBuffer(request, protocol.getMaxHttpHeaderSize(),
        protocol.getRejectIllegalHeaderName(), httpParser);

But why does the Http11OutputBuffer’s buffer occupy exactly 10000000 bytes? Looking at the constructor of Http11OutputBuffer, we can see that it directly allocates a headerBuffer of headerBufferSize bytes:

protected Http11OutputBuffer(Response response, int headerBufferSize) {
    ...
    headerBuffer = ByteBuffer.allocate(headerBufferSize);
}

This leads us to the conclusion: the application had configured Tomcat’s max HTTP header size as 10000000, so each request consumed about 20MB of memory for its request and response buffers, which caused OOM under high concurrency.

As expected, upon reviewing the project’s code, we found this configuration in the configuration file:

server.max-http-header-size=10000000

Looking through the commit history, we can see that the developer encountered the following exception at the time:

java.lang.IllegalArgumentException: Request header is too large

In response, the developer searched online for a solution and casually set server.max-http-header-size to an extremely large value, hoping to prevent similar errors from ever recurring. That change, however, ended up causing a much bigger problem. After changing the parameter to a more reasonable value of 20000 and running a load test, all of the application’s metrics became stable.
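
For reference, the corrected setting in the same configuration file would look like this (20000 bytes matches the value used in the load test above; choose a value that comfortably covers the largest legitimate header you expect):

server.max-http-header-size=20000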

This case reminds us that configuration parameters should be set according to actual needs, leaving perhaps 2 to 5 times the required capacity as headroom. Capacity-related parameters often represent resource allocations: setting them to excessively large values wastes resources and leads to OOM under high concurrency.

Key Review #

Today, I shared with you the issue of OOM from the perspective of memory allocation awareness. Generally speaking, there are several possibilities for OOM in Java programs.

First, our program genuinely needs more memory than the upper limit configured for the JVM. Whether because of an unreasonable implementation or because various frameworks duplicate, process, and convert the data, the same piece of data does not necessarily occupy just one slot in memory. For business logic that uses a lot of memory, such as caching, file upload and download, and export logic, capacity evaluation may require actually taking a heap dump and measuring, rather than making simple assumptions.

Second, memory leaks can occur when objects that we consider to be unused are not eventually garbage collected. GC does not reclaim objects with strong references. We may often define containers as caches in our programs, but if the data in the container keeps growing without limit, be careful as it may eventually lead to OOM. Using WeakHashMap is a good way to solve this problem, but it’s worth noting that if the strongly referenced Value references the Key, the Entry cannot be reclaimed.

Third, unreasonable resource demand configuration may not be a problem when the business volume is small, but it may quickly consume memory when the business volume becomes large. For example, arbitrarily configuring the max-http-header-size parameter in Tomcat will cause one request to consume too much memory, resulting in OOM when the number of requests is large. When configuring parameters, we need to realize that many limiting parameters restrict the use of underlying resources, and resources are always limited, so parameters should be set reasonably based on actual needs.

Finally, there is no need to be overly nervous when you encounter OOM. We can roughly locate which memory area the OOM occurred in from the exception message in the error log, combined with command-line tools such as jstat to observe memory usage and with the program’s GC log. In fact, 90% of the OOMs we encounter are heap OOMs, and it is usually easy to locate the problem by taking a heap dump of the JVM process or by using the jmap command to rank objects by memory usage.
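
As a minimal sketch of that first pass (standard JDK 8 command-line tools; replace <pid> with the id of the Java process):

# Print GC and heap occupancy statistics once per second
jstat -gcutil <pid> 1000
# Print a class histogram of live objects, with instance counts and shallow sizes
jmap -histo:live <pid>
# Dump the live heap to a file for offline analysis in MAT
jmap -dump:live,format=b,file=heap.hprof <pid>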

Here, I suggest configuring the JVM parameters of production systems to enable detailed GC logs, so that you can observe the garbage collector’s behavior, and to enable HeapDumpOnOutOfMemoryError, so that the scene of the first OOM is automatically captured in a heap dump. For JDK 8, you can configure it like this:

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=. -XX:+PrintGCDateStamps -XX:+PrintGCDetails -Xloggc:gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M

I have put the code used today on GitHub, and you can click on this link to view it.

Reflection and Discussion #

ConcurrentReferenceHashMap in Spring supports two types of references for both keys and values: soft references and weak references. Which type do you think is more suitable for caching?

When we need to execute some expressions dynamically, we can use the Groovy dynamic language: instantiate a GroovyShell class and then call the evaluate method to dynamically execute the script. The problem with this approach is that it generates a large number of classes repeatedly, increasing the GC burden on the Metaspace area and possibly causing an OOM. Do you know how to avoid this problem?

Have you encountered any cases of OOM or memory leaks? I’m Zhu Ye, feel free to leave me a comment in the comment section and share your thoughts. You are also welcome to share today’s content with your friends or colleagues for further discussion.