14 High-performance memory management in Netty revisited - Part 2 #

In the previous lesson, we learned how Netty classifies memory by size specification and examined the core components of its memory management. In today’s lesson, we will continue with the implementation principles of memory allocation and deallocation in Netty. With the foundation from the previous lesson in place, the material ahead should be easier to absorb.

In this lesson, we will focus on analyzing in detail the process of memory allocation and deallocation in different scenarios in Netty, so that you can have a clearer understanding of the overall design of Netty’s memory pool.

Implementation principles of memory allocation #

Netty has two components responsible for memory allocation: PoolArena and PoolThreadCache. A PoolArena is shared by multiple threads, and each thread is bound to a fixed PoolArena; PoolThreadCache is a private cache space owned by each individual thread, as shown in the diagram below.

[Figure: multiple threads share PoolArenas, while each thread has its own private PoolThreadCache]

In the previous lesson, we introduced PoolChunk, PoolSubpage, and PoolChunkList, which are concepts used inside PoolArena. The memory unit managed by PoolArena is the PoolChunk, and each PoolChunk is divided into 2048 Pages of 8KB each. When the requested memory is 8KB or larger, PoolChunk allocates in units of Pages; when the requested size is smaller than 8KB, a PoolSubpage takes over for finer-grained allocation.

After the allocated memory in PoolArena is released, it is not immediately returned to PoolChunk. Instead, it is cached in the local private cache PoolThreadCache. When the next memory allocation is performed, it will first search for a matching memory block from PoolThreadCache.

From this, we can see that Netty adopts different allocation strategies for different memory specifications. We will analyze the following three scenarios one by one:

  • Memory of 8KB or larger, allocated at the Page level within PoolChunk.
  • Memory smaller than 8KB, managed by PoolSubpage.
  • Memory smaller than 8KB, served from the thread-local cache PoolThreadCache to improve allocation efficiency (see the allocator-level sketch after this list).
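
To make these paths concrete, here is a minimal sketch that requests one buffer per size class through PooledByteBufAllocator, Netty’s pooled allocator API; the concrete sizes (256B, 4KB, 16KB) are illustrative choices, not values mandated by Netty.

import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;

public class AllocationPathsDemo {
    public static void main(String[] args) {
        PooledByteBufAllocator alloc = PooledByteBufAllocator.DEFAULT;

        ByteBuf tiny   = alloc.directBuffer(256);       // < 512B: Tiny, managed by PoolSubpage
        ByteBuf small  = alloc.directBuffer(4 * 1024);  // 512B to under 8KB: Small, managed by PoolSubpage
        ByteBuf normal = alloc.directBuffer(16 * 1024); // 8KB and above: Normal, Page-level in PoolChunk

        // Releasing returns the blocks to the pool; Tiny/Small/Normal blocks may be
        // cached in the calling thread's PoolThreadCache for fast reuse.
        tiny.release();
        small.release();
        normal.release();
    }
}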

Page-level memory allocation in PoolChunk #

Each PoolChunk has a default size of 16MB. PoolChunk manages multiple Pages using the buddy allocation algorithm. Each PoolChunk is divided into 2048 Pages, and this is implemented using a complete binary tree. Let’s review the binary tree structure of PoolChunk together, as shown in the following diagram.

[Figure: PoolChunk’s complete binary tree, with 2048 leaf Pages at depth 11]
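
Netty records this tree in a byte array called memoryMap, where each index is a node id and the stored value starts out as the node’s depth. Below is a minimal initialization sketch, simplified from PoolChunk’s constructor and assuming the default maxOrder of 11 (2048 leaves); memoryMap[id] == depth(id) means the node is fully free, a larger value means part of its capacity is taken, and maxOrder + 1 (12) marks it as unusable.

int maxOrder = 11;
byte[] memoryMap = new byte[1 << (maxOrder + 1)]; // 4096 slots; index 0 is unused
int memoryMapIndex = 1;
for (int d = 0; d <= maxOrder; d++) {             // walk down the tree level by level
    int depth = 1 << d;                           // number of nodes at this level
    for (int p = 0; p < depth; p++) {
        memoryMap[memoryMapIndex++] = (byte) d;   // node ids 2^d .. 2^(d+1)-1 sit at depth d
    }
}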

Suppose the user needs to sequentially allocate memory of sizes 8KB, 16KB, and 8KB. Using this example, we will describe in detail how PoolChunk allocates memory at the Page level, to help you understand the principles of the buddy allocation algorithm.

First, let’s look at the source code of allocateRun, which implements this allocation logic, as shown below. PoolChunk allocates Pages in three main steps: calculate the tree depth that matches the requested size; find an available node at that depth; and, if allocation succeeds, subtract the allocated size from the remaining free space.

private long allocateRun(int normCapacity) {
    // Calculate the depth in the binary tree based on the requested size
    int d = maxOrder - (log2(normCapacity) - pageShifts);
    // Find an available node at that depth
    int id = allocateNode(d);
    if (id < 0) {
        return id;
    }
    // Subtract the allocated size from the remaining free space
    freeBytes -= runLength(id);
    return id;
}

Combining PoolChunk’s binary tree structure with the allocateRun source code, let’s walk through the simulated example:

When 8K of memory is allocated the first time, the tree depth is calculated as d = maxOrder - (log2(normCapacity) - pageShifts) = 11 - (13 - 13) = 11, where maxOrder is the maximum depth of the tree, normCapacity is 8K, and pageShifts defaults to 13, because allocateRun is only used when the requested size is at least 2^13 = 8K. Netty then searches level 11 for an available Page and finds node 2048, so Page[0] is allocated, and memoryMap[2048] = 12 is assigned to mark the node as no longer available. The parent nodes’ values are then updated recursively, each parent taking the minimum of its two children’s values: memoryMap[1024] = 11, memoryMap[512] = 10, and so on up to memoryMap[1] = 1. The updated binary tree is shown in the following figure.

[Figure: binary tree state after the first 8K allocation (node 2048 marked unusable)]
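
The recursive parent update described above is compact in code. Below is a sketch simplified from PoolChunk’s updateParentsAlloc; because the tree is stored in an array, the parent of node id is id >>> 1 and its sibling is id ^ 1.

private void updateParentsAlloc(int id) {
    while (id > 1) {
        int parentId = id >>> 1;                         // parent in the array-based tree
        byte val1 = memoryMap[id];                       // this node's value
        byte val2 = memoryMap[id ^ 1];                   // sibling's value
        memoryMap[parentId] = val1 < val2 ? val1 : val2; // parent = min(children)
        id = parentId;
    }
}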

When 16K is allocated the second time, the required node depth is calculated as 10. Node 1024 already has an 8K block allocated beneath it and no longer qualifies, so the search continues to node 1025. Node 1025 has not been used and satisfies the allocation, so it is marked unusable with memoryMap[1025] = 12, which implicitly reserves both of its child Pages 2050 and 2051. The parents’ values are updated recursively again, and the updated binary tree is shown in the following figure.

[Figure: binary tree state after the 16K allocation (node 1025 marked unusable)]

When 8K is allocated the third time, the search again starts from level 11 of the tree. Node 2048 is already in use, but node 2049 is still available, so memoryMap[2049] = 12 is assigned and the parents’ values are updated recursively: memoryMap[1024] = 12, memoryMap[512] = 12, and so on until memoryMap[1] = 1. The final binary tree is shown in the following figure.

[Figure: binary tree state after the third allocation, 8K (node 2049 marked unusable)]

So far, the memory allocation at the Page level in PoolChunk has been explained. It can be seen that the buddy algorithm maximizes the contiguity of allocated memory addresses and effectively reduces memory fragmentation.

Subpage-level memory allocation in PoolSubpage #

To improve memory utilization, PoolChunk does not hand out whole Pages for requests smaller than 8K; instead, a Page is split into smaller memory blocks managed by a PoolSubpage.
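
A PoolSubpage tracks which of those smaller blocks are in use with a long[] bitmap, one bit per block. The sketch below illustrates the mechanism under simplified assumptions (an 8K Page cut into 512-byte elements); it mirrors the idea, not Netty’s exact code.

public class SubpageBitmapSketch {
    static final int PAGE_SIZE = 8192, ELEM_SIZE = 512;
    static final int MAX_ELEMS = PAGE_SIZE / ELEM_SIZE;      // 16 blocks in this example
    static long[] bitmap = new long[(MAX_ELEMS + 63) / 64];  // one bit per block

    // Find the first free block, mark it used, and return its index (-1 if full).
    static int allocateElement() {
        for (int i = 0; i < MAX_ELEMS; i++) {
            if ((bitmap[i >> 6] & (1L << (i & 63))) == 0) {
                bitmap[i >> 6] |= 1L << (i & 63);            // mark block i as in use
                return i;
            }
        }
        return -1;
    }

    // Clear the bit so block i can be handed out again.
    static void freeElement(int i) {
        bitmap[i >> 6] &= ~(1L << (i & 63));
    }

    public static void main(String[] args) {
        int a = allocateElement();             // -> 0
        allocateElement();                     // -> 1
        freeElement(a);
        System.out.println(allocateElement()); // prints 0: the freed block is reused
    }
}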

First, let’s look at how a PoolSubpage is created. Because the requested memory is smaller than 8K, the allocation enters allocateSubpage:

    private long allocateSubpage(int normCapacity) {
        // Find the head node of the subpage pool in PoolArena that matches this memory size
        PoolSubpage<T> head = arena.findSubpagePoolHead(normCapacity);
        int d = maxOrder; // Search from the bottommost level of the complete binary tree, since the allocation is smaller than 8K
        synchronized (head) {
            int id = allocateNode(d); // Find an available leaf node (Page) in the complete binary tree
            if (id < 0) {
                return id;
            }
            final PoolSubpage<T>[] subpages = this.subpages; // Records which Pages have been converted to Subpages
            final int pageSize = this.pageSize;
            freeBytes -= pageSize;
            int subpageIdx = subpageIdx(id); // Convert the pageId to a subpageIdx, e.g. pageId=2048 maps to subpageIdx=0
            PoolSubpage<T> subpage = subpages[subpageIdx];
            if (subpage == null) {
                // First use of this Page: create the PoolSubpage and register it
                subpage = new PoolSubpage<T>(head, this, id, runOffset(id), pageSize, normCapacity);
                subpages[subpageIdx] = subpage;
            } else {
                subpage.init(head, normCapacity);
            }
            return subpage.allocate(); // Allocate one element from the subpage's bitmap
        }
    }
Next, let’s look at the entry point on the PoolArena side. For a Tiny or Small request, PoolArena’s allocate method first tries the thread-local cache and only falls back to the subpage pools when the cache misses (the elided parts are summarized in comments):

private void allocate(PoolThreadCache cache, PooledByteBuf<T> buf, final int reqCapacity) {
    final int normCapacity = normalizeCapacity(reqCapacity);
    if (isTinyOrSmall(normCapacity)) { // capacity < pageSize
        int tableIdx;
        PoolSubpage<T>[] table;
        boolean tiny = isTiny(normCapacity);
        if (tiny) { // < 512
            if (cache.allocateTiny(this, buf, reqCapacity, normCapacity)) {
                return; // Served from the thread-local cache
            }
            tableIdx = tinyIdx(normCapacity);
            table = tinySubpagePools;
        } else {
            if (cache.allocateSmall(this, buf, reqCapacity, normCapacity)) {
                return; // Served from the thread-local cache
            }
            tableIdx = smallIdx(normCapacity);
            table = smallSubpagePools;
        }
        // ... take the matching PoolSubpage from table[tableIdx] and allocate
        // from it under the head node's lock ...
        return;
    }
    // ... Normal requests try the cache and then allocateNormal;
    // Huge requests are allocated directly, bypassing the pool ...
}
Netty’s memory allocation and deallocation strategy is built around the memory pool, with pooled requests classified into three size classes: Tiny for blocks smaller than 512 bytes, Small for blocks from 512 bytes up to (but not including) 8KB, and Normal for blocks from 8KB up to the Chunk size.

When allocating memory, Netty first checks whether the requested size falls into the Tiny or Small class. If so, it tries to allocate from the PoolThreadCache, the thread-local cache, and falls back to the PoolArena if that fails.

Normal requests, from 8KB up to the default Chunk size of 16MB, follow a similar path: the PoolThreadCache is tried first, with the PoolArena as the fallback.

Requests larger than the default Chunk size of 16MB are classified as Huge; they bypass the memory pool entirely and are allocated directly, unpooled.

Deallocation mirrors this strategy: freed blocks are cached in the PoolThreadCache for reuse. The PoolThreadCache counts its allocations and triggers a trim operation once a threshold is reached (8192 allocations by default); the trim releases cached blocks that have seen little use, so the cache does not waste memory.

Overall, Netty’s memory management strategy aims to speed up allocation and deallocation while reducing memory fragmentation.
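
For reference, the size-class checks used above are simple bit tests. The following is a simplified sketch of PoolArena’s checks, assuming the default pageSize of 8192:

static final int PAGE_SIZE = 8192;
static final int SUBPAGE_OVERFLOW_MASK = ~(PAGE_SIZE - 1);

static boolean isTiny(int normCapacity) {
    return (normCapacity & 0xFFFFFE00) == 0;            // true when normCapacity < 512
}

static boolean isTinyOrSmall(int normCapacity) {
    return (normCapacity & SUBPAGE_OVERFLOW_MASK) == 0; // true when normCapacity < pageSize
}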

The trim operation ultimately calls the free method of MemoryRegionCache, which dequeues and releases up to max cached entries:

private int free(int max, boolean finalizer) {
    int numFreed = 0;
    for (; numFreed < max; numFreed++) {
        Entry<T> entry = queue.poll();
        if (entry != null) {
            freeEntry(entry, finalizer);
        } else {
            // all cleared
            return numFreed;
        }
    }
    return numFreed;
}

How heavily a MemoryRegionCache is used is measured by size - allocations, where size is the fixed queue capacity of the MemoryRegionCache (512 by default for the Tiny type) and allocations counts how many times the cache has served an allocation since the last trim. When the number of allocate calls is smaller than size, the cached memory blocks are considered infrequently used, and the excess blocks are dequeued and released one by one.
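
Putting the two together, the trim logic itself is only a few lines. The sketch below is simplified from MemoryRegionCache; size and allocations are fields of that class, and it reuses the free method shown above.

public final void trim() {
    int free = size - allocations; // entries beyond what was actually used since last trim
    allocations = 0;               // reset the counter for the next round
    if (free > 0) {
        free(free, false);         // dequeue and release up to 'free' idle entries
    }
}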

In addition, Netty also recycles all the memory of the thread when the thread exits. PoolThreadCache overrides the finalize() method and executes the cache recycling logic before destruction. The corresponding source code is as follows:

@Override
protected void finalize() throws Throwable {
    try {
        super.finalize();
    } finally {
        free(true);
    }
}

void free(boolean finalizer) {
    if (freed.compareAndSet(false, true)) {
        int numFreed = free(tinySubPageDirectCaches, finalizer) +
                free(smallSubPageDirectCaches, finalizer) +
                free(normalDirectCaches, finalizer) +
                free(tinySubPageHeapCaches, finalizer) +
                free(smallSubPageHeapCaches, finalizer) +
                free(normalHeapCaches, finalizer);
        if (numFreed > 0 && logger.isDebugEnabled()) {
            logger.debug("Freed {} thread-local buffer(s) from thread: {}", numFreed,
                    Thread.currentThread().getName());
        }
        if (directArena != null) {
            directArena.numThreadCaches.getAndDecrement();
        }
        if (heapArena != null) {
            heapArena.numThreadCaches.getAndDecrement();
        }
    }
}

When a thread is destroyed, PoolThreadCache releases the memory held in every MemoryRegionCache one by one. The core logic of this free method is the same as the release path used by the trim-based reorganization described earlier; interested readers can follow it in the source code.

With that, we have finished analyzing the allocation and release principles of the entire Netty memory pool. The ingenious design ideas and implementation details in the source code are well worth studying.

Summary #

Finally, let’s summarize the design ideas of the Netty memory pool:

  • Memory is managed in four size classes: Tiny, Small, Normal, and Huge. PoolChunk is responsible for allocations of 8K and above at the Page level, while PoolSubpage manages allocations smaller than 8K. Requests larger than 16M bypass the memory pool and are allocated directly.
  • A thread-local cache mechanism, PoolThreadCache, improves the concurrency performance of memory allocation. Tiny, Small, and Normal requests all first try to allocate from PoolThreadCache.
  • PoolChunk uses the buddy algorithm to manage Pages, implemented as a complete binary tree; it is the core of the entire memory pool.
  • After a certain number of calls to PoolThreadCache’s allocate() method, the usage frequency of the cached memory blocks is checked, and blocks with low usage are released.
  • When a thread exits, Netty reclaims all memory associated with that thread.

Introducing jemalloc-style memory pool management into Netty was a breakthrough that further improved Netty’s performance. This way of thinking applies not only to Netty but to many caching scenarios. I hope these excellent design ideas help you and find their way into your practical work.