
13 High-performance memory management in Netty revisited - Part 1 #

Netty, as a high-performance network framework, needs to handle massive amounts of byte data. By default, Netty provides pooled memory allocation: after the memory is used, it is returned to the memory pool for reuse. A high-performance memory management mechanism is therefore essential for Netty. In the previous lesson, we introduced the basic principles of native jemalloc, which served as the inspiration for Netty's high-performance memory management. Like jemalloc, Netty needs to solve two core problems:

  • How to efficiently allocate and release memory in single-threaded or multi-threaded scenarios?
  • How to reduce memory fragmentation and improve memory utilization?

With these two classic problems in mind, we will begin our study of Netty’s memory management.

Introduction to Memory Size Specifications #

Netty classifies memory blocks by size, and different size classes use different allocation strategies. Netty's memory size classes are shown in the following figure.

13-1.png

In the figure above, "Tiny" covers memory blocks from 0 to 512 bytes, "Small" covers 512 bytes to 8KB, "Normal" covers 8KB to 16MB, and "Huge" covers anything larger than 16MB. In Netty, SizeClass is defined as an enumeration that describes the first three classes shown in the figure: Tiny, Small, and Normal. "Huge" is not explicitly defined in the code; when more than 16MB is requested, the allocation falls into the "Huge" scenario, and Netty uses non-pooled memory allocation directly.
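For reference, here is a condensed look at how this appears in PoolArena of Netty 4.1 (before the size-class rework in later 4.1.x releases). The isTiny and isTinyOrSmall checks test a normalized capacity with bit masks, where subpageOverflowMask is ~(pageSize - 1):

// Condensed from PoolArena (Netty 4.1): the SizeClass enum and the checks that
// route a normalized capacity to its size class. Huge has no enum constant.
enum SizeClass {
    Tiny,
    Small,
    Normal
}

// capacity < 512B => Tiny (0xFFFFFE00 masks all bits at or above 512)
static boolean isTiny(int normCapacity) {
    return (normCapacity & 0xFFFFFE00) == 0;
}

// capacity < pageSize (8KB by default) => Tiny or Small
boolean isTinyOrSmall(int normCapacity) {
    return (normCapacity & subpageOverflowMask) == 0;
}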

Within each region, Netty also defines more granular memory allocation units called Chunk, Page, and Subpage. We will introduce each of them in more detail.

A Chunk is the unit of memory requested by Netty from the operating system. All memory allocation operations in Netty are based on Chunks. A Chunk can be understood as a collection of Pages, with each Chunk having a default size of 16MB.

A Page is the unit a Chunk uses to manage its memory. A Page in Netty is 8KB, not to be confused with the operating system's memory page (typically 4KB on Linux). For example, to allocate 64KB of memory, we would take 8 Pages from a Chunk.

A Subpage is responsible for memory allocation within a Page. If the requested memory is much smaller than a Page, allocating a whole Page would waste significant memory. Therefore, a Page is divided into multiple identical sub-blocks, and these sub-blocks are the Subpages. Following the Tiny and Small size classes, Subpage sizes fall into two cases. In the Tiny scenario, the smallest allocation unit is 16 bytes, increasing in 16-byte steps from 16 bytes up to 496 bytes. In the Small scenario, there are four possible sizes: 512 bytes, 1024 bytes, 2048 bytes, and 4096 bytes. The size of a Subpage is not fixed; it depends on the size of the buffer requested by the user. For example, when allocating 1KB of memory, Netty divides a Page into 8 Subpages, each 1KB in size.
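To make the arithmetic concrete, here is a minimal sketch of the sizing rules just described. The class and helper names are hypothetical, not part of Netty:

// Hypothetical sketch of the Subpage sizing rules described above, not Netty's actual code.
public class SubpageSizing {
    static final int PAGE_SIZE = 8192; // 8KB Page

    static int normalizeSubpageSize(int reqCapacity) {
        if (reqCapacity < 512) {
            // Tiny: round up to the next multiple of 16 (16B .. 496B)
            return (reqCapacity & 15) == 0 ? reqCapacity : (reqCapacity & ~15) + 16;
        }
        // Small: round up to the next power of two (512B, 1024B, 2048B, 4096B)
        int size = 512;
        while (size < reqCapacity) {
            size <<= 1;
        }
        return size;
    }

    public static void main(String[] args) {
        int elemSize = normalizeSubpageSize(1000); // -> 1024
        int maxNumElems = PAGE_SIZE / elemSize;    // -> 8 Subpages per Page
        System.out.println(elemSize + "B x " + maxNumElems);
    }
}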

Now that we understand the different granularity of memory allocation units in Netty, let’s take a look at how jemalloc is implemented in Netty.

Netty Memory Pool Architecture Design #

The Netty memory pool can be seen as a Java version of the jemalloc implementation, combined with various JVM features for optimization. As shown in the following figure, let’s take a macro view of the overall layout of the Netty memory pool.

13-2.png

Based on the memory pool model shown in the above figure, Netty abstracts some core components such as PoolArena, PoolChunk, PoolChunkList, PoolSubpage, PoolThreadCache, and MemoryRegionCache. We can see that some of these concepts are similar to jemalloc. Let’s introduce each of them one by one.

PoolArena #

Netty draws inspiration from the Arena design in jemalloc and uses a fixed number of Arenas for memory allocation; the default number of Arenas is related to the number of CPU cores. Creating multiple Arenas alleviates resource contention and thereby improves allocation efficiency. When a thread requests memory for the first time, it picks a fixed Arena from the Arena array in round-robin fashion and interacts only with that Arena for the rest of its lifetime. Each thread caches its Arena reference locally to improve access efficiency.
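Below is a minimal sketch of that thread-to-Arena binding idea. It is illustrative only (the ArenaPicker class is hypothetical): Netty actually binds the Arena through a FastThreadLocal-backed PoolThreadCache and, in recent versions, prefers the least-used Arena rather than pure round-robin.

import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of thread-to-Arena binding, not Netty's actual code.
final class ArenaPicker<A> {
    private final A[] arenas;                         // fixed-size Arena array
    private final AtomicInteger counter = new AtomicInteger();

    // Each thread resolves its Arena once and reuses it for its whole lifetime.
    private final ThreadLocal<A> threadArena =
            ThreadLocal.withInitial(() ->
                    arenas[Math.floorMod(counter.getAndIncrement(), arenas.length)]); // round-robin

    ArenaPicker(A[] arenas) {
        this.arenas = arenas;
    }

    A current() {
        return threadArena.get();
    }
}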

According to the type of memory allocation, ByteBuf can be divided into Heap and Direct. Similarly, the abstract class PoolArena provides two subclasses: HeapArena and DirectArena. Let’s take a look at the data structure of PoolArena, as shown in the following figure.

13-3.png

The data structure of PoolArena contains two PoolSubpage arrays and six PoolChunkLists. The two arrays store Tiny and Small memory blocks, respectively; the six PoolChunkLists store Chunks grouped by utilization rate and are connected to one another as a doubly linked list.

Previously, we introduced Netty's memory size classes. PoolArena provides implementations for both the Subpage and the Chunk allocation scenarios: PoolSubpage is used to allocate memory blocks smaller than 8KB, while PoolChunkList is used to allocate memory blocks of 8KB and larger.

Following the Tiny and Small size classes, the PoolSubpages are organized into two arrays: tinySubpagePools and smallSubpagePools. As mentioned in the Subpage introduction, in the Tiny scenario the smallest memory unit is 16 bytes, increasing in 16-byte steps from 16 bytes to 496 bytes; in the Small scenario there are four possible sizes: 512 bytes, 1024 bytes, 2048 bytes, and 4096 bytes. Each memory-unit granularity is managed by its own PoolSubpage list. For example, to allocate 20 bytes, Netty rounds the request up to 32 bytes and allocates from a PoolSubpage node that manages 32-byte blocks.
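The bucket lookup itself is a simple bit operation. Condensed from PoolArena in Netty 4.1: tinyIdx maps a Tiny size to its tinySubpagePools index (each index i manages blocks of 16 × i bytes), and smallIdx maps the four Small sizes to indices 0 through 3:

// Condensed from PoolArena (Netty 4.1): mapping a normalized size to its bucket.
static int tinyIdx(int normCapacity) {
    return normCapacity >>> 4;          // e.g. 32B -> tinySubpagePools[2]
}

static int smallIdx(int normCapacity) {
    int tableIdx = 0;
    int i = normCapacity >>> 10;        // 512B -> 0, 1KB -> 1, 2KB -> 2, 4KB -> 3
    while (i != 0) {
        i >>>= 1;
        tableIdx++;
    }
    return tableIdx;
}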

PoolChunkList is used for memory allocation in the Chunk scenario. PoolArena initializes six PoolChunkLists: qInit, q000, q025, q050, q075, and q100. This mirrors the run-queue idea in jemalloc; the lists represent different memory utilization ranges, as follows:

  • qInit, for Chunks with a memory utilization rate of 0% to 25%.
  • q000, for Chunks with a memory utilization rate of 1% to 50%.
  • q025, for Chunks with a memory utilization rate of 25% to 75%.
  • q050, for Chunks with a memory utilization rate of 50% to 100%.
  • q075, for Chunks with a memory utilization rate of 75% to 100%.
  • q100, for Chunks with a memory utilization rate of 100%.

The six PoolChunkLists, except for qInit, form a doubly linked list among themselves, as shown in the figure below.

13-4.png
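For reference, this is how the six lists are created and wired together, condensed from PoolArena's constructor in Netty 4.1 (the exact constructor parameters vary slightly across 4.1.x versions):

q100  = new PoolChunkList<T>(this, null, 100, Integer.MAX_VALUE, chunkSize);
q075  = new PoolChunkList<T>(this, q100, 75, 100, chunkSize);
q050  = new PoolChunkList<T>(this, q075, 50, 100, chunkSize);
q025  = new PoolChunkList<T>(this, q050, 25, 75, chunkSize);
q000  = new PoolChunkList<T>(this, q025, 1, 50, chunkSize);
qInit = new PoolChunkList<T>(this, q000, Integer.MIN_VALUE, 25, chunkSize);

q100.prevList(q075);
q075.prevList(q050);
q050.prevList(q025);
q025.prevList(q000);
q000.prevList(null);   // a fully freed chunk in q000 is reclaimed
qInit.prevList(qInit); // qInit points back to itself, so its chunks are never reclaimed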

As a Chunk's memory utilization changes, Netty re-evaluates it and places it into the matching PoolChunkList, so a PoolChunk moves between different PoolChunkLists over its lifetime.

When I first studied PoolChunkList, I wondered why qInit and q000 need to be two separate lists instead of being merged into one. In fact, each has its own purpose.

qInit stores newly created PoolChunks: when the first memory allocation happens, there is no PoolChunk available in any PoolChunkList, so a new PoolChunk is created and added to the qInit list. Even if a PoolChunk in qInit becomes completely free, it is not reclaimed, which avoids repeating the PoolChunk initialization work.

q000 stores PoolChunks with a memory utilization rate of 1% to 50%. Once the memory of a PoolChunk in q000 is fully released, the PoolChunk is removed from the linked list and its memory is reclaimed.

One point to note: when allocating memory larger than 8KB, the PoolChunkLists are tried in the order q050 -> q025 -> q000 -> qInit -> q075. The code that traverses the PoolChunkLists looking for a PoolChunk that can serve the allocation is as follows:

private void allocateNormal(PooledByteBuf<T> buf, int reqCapacity, int normCapacity) {
    // Try the existing chunks first, in the order q050 -> q025 -> q000 -> qInit -> q075
    if (q050.allocate(buf, reqCapacity, normCapacity) || q025.allocate(buf, reqCapacity, normCapacity) ||
        q000.allocate(buf, reqCapacity, normCapacity) || qInit.allocate(buf, reqCapacity, normCapacity) ||
        q075.allocate(buf, reqCapacity, normCapacity)) {
        return;
    }

    // No existing chunk can satisfy the request: create a new chunk and add it to qInit
    PoolChunk<T> c = newChunk(pageSize, maxOrder, pageShifts, chunkSize);
    boolean success = c.allocate(buf, reqCapacity, normCapacity);
    assert success;
    qInit.add(c);
}

You may wonder why q050 is tried first instead of starting from q000.

This is a deliberate trade-off. Under frequent allocation, starting from q000 would subject most PoolChunks to constant creation and destruction, hurting allocation performance. Starting from q050 keeps PoolChunk utilization at a middle level and reduces the probability of PoolChunks being reclaimed, striking a balance for performance.

PoolArena is a very important part of Netty’s memory allocation, and we have spent a lot of time explaining it. This will be helpful for understanding the implementation principles of memory allocation later on.

PoolChunkList #

PoolChunkList is responsible for managing the life cycle of multiple PoolChunks. PoolChunks with similar utilization rates are stored in the same PoolChunkList, and the PoolChunks within a list are connected as a doubly linked list. Because a PoolChunk frequently needs to be removed from one PoolChunkList and moved to another, a doubly linked list supports these operations with low time complexity. The structure of PoolChunkList is shown in the following figure.

13-5.png

Each PoolChunkList has upper and lower limits for memory utilization: minUsage and maxUsage. After a PoolChunk performs memory allocation, if its utilization rate exceeds maxUsage, the PoolChunk will be removed from the current PoolChunkList and moved to the next PoolChunkList. Similarly, when the memory in a PoolChunk is released and its utilization rate is below minUsage, the PoolChunk will be removed from the current PoolChunkList and moved to the previous PoolChunkList.
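A condensed view of that movement on the allocation path, from PoolChunkList.allocate in Netty 4.1:

boolean allocate(PooledByteBuf<T> buf, int reqCapacity, int normCapacity) {
    if (head == null || normCapacity > maxCapacity) {
        return false; // this list cannot satisfy the request
    }
    for (PoolChunk<T> cur = head; cur != null; cur = cur.next) {
        if (cur.allocate(buf, reqCapacity, normCapacity)) {
            if (cur.usage() >= maxUsage) {
                remove(cur);       // utilization crossed maxUsage:
                nextList.add(cur); // move the chunk to the next list
            }
            return true;
        }
    }
    return false;
}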

Looking back at the six PoolChunkLists Netty initializes, their utilization thresholds deliberately overlap, as shown in the following figure. Because PoolChunks move continually between PoolChunkLists, if the thresholds of adjacent lists met exactly, say 1% to 50% and 50% to 75%, a PoolChunk whose utilization hovered around 50% would bounce back and forth between the two lists, degrading performance.

13-6.png

PoolChunk #

Netty's memory allocation and reclamation are based on PoolChunk. PoolChunk is where the actual memory data is stored. The default size of each PoolChunk is 16MB. First, let's take a look at the definition of the PoolChunk data structure:

final class PoolChunk<T> implements PoolChunkMetric {
    final PoolArena<T> arena;
    final T memory; // The underlying memory that stores the actual data
    private final byte[] memoryMap; // Allocation state of each node in the complete binary tree; array size 4096
    private final byte[] depthMap; // Depth of each node in the complete binary tree; array size 4096
    private final PoolSubpage<T>[] subpages; // The 2048 8KB memory blocks managed by this PoolChunk
    private int freeBytes; // Remaining free memory in bytes
    PoolChunkList<T> parent;
    PoolChunk<T> prev;
    PoolChunk<T> next;
    // Other code omitted
}

A PoolChunk can be understood as a collection of Pages, where a Page is just an abstract concept: in Netty, a Page refers to a sub-memory block managed by a PoolChunk, and each sub-block is represented by a PoolSubpage. Netty uses the buddy algorithm to divide a PoolChunk into 2048 Pages, which form the leaves of a complete binary tree in which the memory of all child nodes is managed by their parent node, as shown in the following diagram.

13-7.png

Based on the structure of PoolChunk, let’s introduce some important properties of PoolChunk:

depthMap stores the depth of each node. For example, depthMap[1025] = 10 means that node 1025 sits at depth 10 in the tree.

memoryMap records the allocation state of the nodes in the binary tree. Initially, memoryMap holds the same values as depthMap. As nodes are allocated, an allocated node is marked unusable (its value is set to maxOrder + 1, i.e. 12 by default), and the values of its ancestors are updated recursively: a parent's value is always the minimum of its two children.
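The parent update is a short loop that walks from the allocated node up to the root; condensed from PoolChunk in Netty 4.1:

// Condensed from PoolChunk (Netty 4.1): after a node is allocated, walk up the
// tree and set each parent to the minimum of its two children.
private void updateParentsAlloc(int id) {
    while (id > 1) {
        int parentId = id >>> 1;   // parent index in the complete binary tree
        byte val1 = value(id);     // value(i) reads memoryMap[i]
        byte val2 = value(id ^ 1); // id ^ 1 is the sibling node
        byte val = val1 < val2 ? val1 : val2;
        setValue(parentId, val);
        id = parentId;
    }
}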

subpages corresponds to Page0, Page1, Page2, …, Page2047 in the diagram. Netty has no dedicated Page class; Pages are managed directly through PoolSubpage. When less than 8KB is allocated, a Page node in the PoolChunk is split into smaller memory blocks, which are likewise managed by PoolSubpage. As the diagram shows, for a small allocation Netty first locates the corresponding PoolArena, then uses the computed index into the tinySubpagePools or smallSubpagePools array; if the PoolSubpage linked list at that index contains no usable node, a new PoolSubpage is created and added to the list.

PoolSubpage #

By now, we should have a better understanding of PoolSubpage. In the scenario of allocating small memory, where the allocated memory size is less than one page (8K), PoolSubpage is used for management. Let’s first take a look at the definition of PoolSubpage:

final class PoolSubpage<T> implements PoolSubpageMetric {
    final PoolChunk<T> chunk;
    private final int memoryMapIdx; // Index of this Subpage's Page node in the complete binary tree
    private final int runOffset; // Offset of this PoolSubpage within the memory of its PoolChunk
    private final long[] bitmap; // Records the allocation status of each small memory block
    // Doubly linked list pointers connecting the elements of tinySubpagePools or smallSubpagePools in PoolArena
    PoolSubpage<T> prev;
    PoolSubpage<T> next;
    int elemSize; // Size of each small memory block
    private int maxNumElems; // Maximum number of small memory blocks: 8KB / elemSize
    private int numAvail; // Number of memory blocks still available for allocation
    // Other code omitted
}


The meaning of each PoolSubpage attribute is explained in the comments above, so I won't repeat them here. I will only point out two key details:

The first is how PoolSubpage records the usage status of its memory blocks. PoolSubpage uses a bitmap to record whether each sub-block is in use; the value of each bit is 0 or 1, as shown in the following diagram.

13-8.png
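Here is a simplified sketch of how a free bit can be located and claimed in such a bitmap. It is illustrative only; Netty's actual PoolSubpage.getNextAvail is more elaborate (it caches the next available index, for instance):

// Simplified illustration of bitmap-based allocation, not Netty's actual code.
// Returns the index of a newly claimed element, or -1 if the Subpage is full.
private int allocateBit() {
    for (int i = 0; i < bitmapLength; i++) {          // one long covers 64 elements
        long bits = bitmap[i];
        if (~bits != 0) {                             // at least one 0 bit in this word
            for (int j = 0; j < 64; j++) {
                if ((bits & (1L << j)) == 0) {
                    int bitIdx = (i << 6) + j;        // global element index
                    if (bitIdx >= maxNumElems) {
                        return -1;                    // only padding bits beyond the last element remain
                    }
                    bitmap[i] |= 1L << j;             // mark the element as used
                    return bitIdx;
                }
            }
        }
    }
    return -1;
}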

The second one is how PoolSubpage is associated with PoolArena.

As we have learned before, when PoolArena is created, it initializes two arrays of PoolSubpage: tinySubpagePools and smallSubpagePools. The size of the arrays is 32 and 4, respectively.

Suppose we now need to allocate 20B of memory. The request is rounded up to 32B, a Page node is taken from the 11th level of the complete binary tree and divided into 8KB / 32B = 256 small blocks, and the resulting PoolSubpage node is linked behind the head node of tinySubpagePools[2] in the owning PoolArena (index 2, because each index i of tinySubpagePools manages blocks of 16 × i bytes), forming a doubly linked list, as shown in the following diagram.

13-9.png

The next time a 32B allocation is requested, Netty checks whether the node after the head of tinySubpagePools[2] in PoolArena holds an available PoolSubpage; if so, it allocates from that PoolSubpage directly, which improves allocation efficiency. Other memory sizes follow the same principle.

PoolThreadCache & MemoryRegionCache #

As the name suggests, PoolThreadCache corresponds to the local thread cache in jemalloc. How is PoolThreadCache used? What types of data can it cache?

When memory is released, Netty, like jemalloc, does not immediately return it to the PoolChunk; instead, it caches the block in PoolThreadCache so that the next request for the same size can be served from the cache directly. PoolThreadCache caches all three size classes, Tiny, Small, and Normal, and distinguishes heap memory from direct memory, as shown in the source definition of PoolThreadCache:

final class PoolThreadCache {
    final PoolArena<byte[]> heapArena;
    final PoolArena<ByteBuffer> directArena;

    private final MemoryRegionCache<byte[]>[] tinySubPageHeapCaches;
    private final MemoryRegionCache<byte[]>[] smallSubPageHeapCaches;
    private final MemoryRegionCache<ByteBuffer>[] tinySubPageDirectCaches;
    private final MemoryRegionCache<ByteBuffer>[] smallSubPageDirectCaches;
    private final MemoryRegionCache<byte[]>[] normalHeapCaches;
    private final MemoryRegionCache<ByteBuffer>[] normalDirectCaches;
    // Other code omitted
}
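The release path makes this cache-first behavior explicit. Condensed from PoolArena.free in Netty 4.1 (parameter lists vary slightly across versions):

void free(PoolChunk<T> chunk, long handle, int normCapacity, PoolThreadCache cache) {
    if (chunk.unpooled) {
        destroyChunk(chunk); // Huge allocations are unpooled and destroyed directly
    } else {
        SizeClass sizeClass = sizeClass(normCapacity);
        if (cache != null && cache.add(this, chunk, handle, normCapacity, sizeClass)) {
            return; // cached in PoolThreadCache for reuse by the same thread
        }
        freeChunk(chunk, handle, sizeClass);
    }
}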

PoolThreadCache has an important data structure called MemoryRegionCache. MemoryRegionCache has three important attributes: queue, sizeClass, and size. The following diagram shows the range of attribute values for different memory sizes.

13-10.png

MemoryRegionCache is essentially a queue: when memory is released, the block is added to the queue; the next time memory of the same size is requested, an idle block is taken from the queue and reused.
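Its skeleton looks roughly as follows, condensed from the MemoryRegionCache inner class of PoolThreadCache in Netty 4.1:

// Condensed from PoolThreadCache.MemoryRegionCache (Netty 4.1).
private abstract static class MemoryRegionCache<T> {
    private final int size;                 // capacity of each cached memory block
    private final Queue<Entry<T>> queue;    // cached (chunk, handle) pairs
    private final SizeClass sizeClass;      // Tiny / Small / Normal

    // On release: offer the block to the queue instead of returning it to the chunk.
    public final boolean add(PoolChunk<T> chunk, long handle) {
        Entry<T> entry = newEntry(chunk, handle);
        return queue.offer(entry);
    }

    // On allocation: reuse a cached block of the same size if one is available.
    public final boolean allocate(PooledByteBuf<T> buf, int reqCapacity) {
        Entry<T> entry = queue.poll();
        if (entry == null) {
            return false;
        }
        initBuf(entry.chunk, entry.handle, buf, reqCapacity);
        entry.recycle();
        return true;
    }
    // Other code omitted
}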

PoolThreadCache maintains a separate MemoryRegionCache for each memory size; in the diagram below, each node corresponds to one MemoryRegionCache. For example, the Tiny scenario has 32 memory sizes and therefore 32 MemoryRegionCache instances, which is why the tinySubPageHeapCaches, smallSubPageHeapCaches, and normalHeapCaches arrays in the PoolThreadCache source have lengths 32, 4, and 3, respectively.

13-11.png

At this point, the core components of memory management in Netty have been introduced. I recommend revisiting the core concepts of jemalloc and making a simple comparison with Netty. This will help clarify your understanding.

Summary #

Different paths converge on the same knowledge: once you understand jemalloc, memory management in Netty is not so difficult. Most of its ideas are consistent with jemalloc, which is why laying a solid foundation is so important. In the next lesson, we will continue to explore the implementation principles of memory allocation and deallocation in Netty.