12 Stones from Other Mountains: The Basic Principles of the High-Performance Memory Allocator jemalloc #

In the previous lesson we introduced ByteBuf, the powerful buffer utility used throughout Netty. But how are these ByteBuf objects allocated and managed in Netty? Over the next few lessons we will analyze Netty’s high-performance memory management. Compared with the previous chapters this material is a little harder to digest, but don’t worry: Netty’s memory management was not built in a day, and it draws heavily on the jemalloc memory allocator. Today we will first cover the basic principles of jemalloc to lay the foundation for the upcoming lessons.

Background knowledge #

jemalloc is a new-generation memory allocator introduced by Jason Evans in the FreeBSD project. It is a general-purpose implementation of malloc, intended to replace it, that focuses on reducing memory fragmentation and improving allocation efficiency in high-concurrency scenarios. jemalloc is widely used and can be found in well-known products and programming languages such as Firefox, Redis, Rust, and Netty. For details, see Jason Evans’s paper “A Scalable Concurrent malloc Implementation for FreeBSD”.

In addition to jemalloc, there are also some famous memory allocator implementations in the industry, such as ptmalloc and tcmalloc. Let’s make a simple comparison of these three memory allocators:

ptmalloc is the memory allocator that ships with glibc. As the standard implementation it has the best compatibility; “pt” stands for “per thread”. ptmalloc does put a lot of effort into multi-threaded performance, but because of that design memory cannot be shared between threads: each thread can only use its own memory independently, which wastes a considerable amount of memory.

tcmalloc comes from Google; its full name is “thread-caching malloc”, and its defining feature is, as the name suggests, the thread cache. tcmalloc is well known and is used in products such as Chrome and Safari. It allocates a local cache for each thread: small-object allocations are served directly from the thread’s local cache, while for large-object allocations tcmalloc uses spin locks to reduce lock contention between threads.

jemalloc borrows the best design ideas from tcmalloc, so the two have a lot in common architecturally, including the thread cache. However, jemalloc is more complex in design than either ptmalloc or tcmalloc: it divides allocations into three size categories, Small, Large, and Huge, and records a large amount of metadata, so it occupies slightly more space than tcmalloc. In return, jemalloc produces less memory fragmentation than tcmalloc when allocating large blocks. Internally, jemalloc uses red-black trees to manage memory blocks and pages, which keeps index lookups for Huge objects at logarithmic time.

Although these memory allocators emphasize different things, their core goals are the same:

  • Efficient memory allocation and deallocation to improve performance in single-threaded or multi-threaded scenarios.
  • Reduce memory fragmentation, including internal fragmentation and external fragmentation, and improve the effective utilization of memory.

Here we run into another concept: what exactly is memory fragmentation? Physical memory in Linux is divided into 4 KB memory pages (Pages), and physical memory is allocated and reclaimed in units of Pages. Fragmentation that arises inside a Page is called internal fragmentation, and fragmentation between Pages is called external fragmentation.

First, let’s talk about internal fragmentation. Since memory is allocated based on Pages, even if we only need a small amount of memory, the operating system will allocate at least a 4K-sized Page. Only a portion of the bytes within a single Page are used, and the remaining bytes form internal fragmentation, as shown in the following image.

[Figure: internal fragmentation, where only part of a 4 KB Page is actually used]
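
To make the waste concrete: a request is always rounded up to whole Pages, and every byte past the requested amount is internal fragmentation. Here is a minimal Java sketch of that arithmetic, assuming the 4 KB page size mentioned above (the class and method names are just for illustration):

```java
public class InternalFragmentation {
    static final int PAGE_SIZE = 4096; // assume 4 KB pages, as in the text

    // Round a request up to whole pages and report the wasted bytes.
    static void show(int requestedBytes) {
        int pages = (requestedBytes + PAGE_SIZE - 1) / PAGE_SIZE;
        int allocated = pages * PAGE_SIZE;
        System.out.printf("requested=%d allocated=%d internal fragmentation=%d%n",
                requestedBytes, allocated, allocated - requestedBytes);
    }

    public static void main(String[] args) {
        show(100);    // requested=100  allocated=4096 internal fragmentation=3996
        show(4097);   // requested=4097 allocated=8192 internal fragmentation=4095
    }
}
```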

External fragmentation is the opposite of internal fragmentation and appears when larger memory blocks are allocated. When we need a large block, the operating system can only satisfy the request with contiguous Pages. As the program keeps running, these Pages are repeatedly reclaimed and reallocated, leaving small free blocks scattered between Pages; this is external fragmentation, as shown in the following image.

[Figure: external fragmentation, small free blocks scattered between allocated Pages]

In the above, we introduced some background knowledge about memory allocators. They are essential tools for operating systems and high-performance components. If you are interested in memory management, it is highly recommended to learn jemalloc and tcmalloc.

Common Memory Allocation Algorithms #

Before diving into the implementation details of jemalloc, let’s first understand the most commonly used memory allocation algorithms: Dynamic Memory Allocation (DMA), Buddy Algorithm, and Slab Algorithm. This will greatly benefit our understanding of jemalloc.

Dynamic Memory Allocation #

Dynamic Memory Allocation, also known as heap memory allocation and abbreviated DMA here, means that the operating system allocates memory at runtime according to the program’s needs, and the amount allocated is exactly the amount requested. In most scenarios the required size is only known while the program is running; pre-allocating a fixed amount is hard to size correctly, since too large wastes space and too small cannot be used.

DMA allocates memory on demand from a contiguous block of memory. Metadata is recorded for each allocated block, and free memory is tracked in a free block list, which makes it easy to search for an available block at allocation time. There are three commonly used search strategies, and a small code sketch of all three follows the descriptions below.

The first one is the First Fit algorithm. The free blocks are kept in a doubly linked list sorted by ascending address. First Fit scans this list from the beginning, takes the first free block that is large enough, carves the requested amount out of it, and leaves the remainder in the free block list. As shown in the following diagram, the requests P1 and P2 can both be satisfied from Memory Block A. Because allocation always starts from the lowest address, the low-address end is allocated over and over and accumulates many small free blocks.

[Figure: First Fit, where P1 and P2 are both allocated from Memory Block A]

The second one is the Next Fit algorithm, which is a variation of the First Fit algorithm. Instead of starting from the beginning of the list every time, the Next Fit algorithm starts searching for free blocks from the next free block after the previously found one. As shown in the following diagram, when allocating memory for P1 in Memory Block A, it continues to search for available blocks downwards, eventually allocating memory in Memory Block B for P2. Compared to the First Fit algorithm, the Next Fit algorithm provides a more uniform distribution of free blocks and improves search efficiency. However, this results in fewer large free blocks in the free block list.

[Figure: Next Fit, where P1 is allocated from Memory Block A and P2 from Memory Block B]

The third one is the Best Fit algorithm. The free block list is connected in a doubly linked list format, with free blocks sorted in increasing order of size. Each time, the Best Fit algorithm starts searching from the beginning of the free block list, so the first free block that meets the allocation requirements is the optimal solution. As shown in the following diagram, after allocating memory for the P1 request in Memory Block A, the free block list is re-sorted based on block size, and then it searches for a free block that meets the requirements for the P2 request. This algorithm has a higher space utilization, but it also leaves behind many small free blocks that are difficult to utilize. Since the list needs to be sorted after each allocation, there is a performance cost associated with the Best Fit algorithm.

[Figure: Best Fit, where the free block list is re-sorted by size after the P1 allocation]
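
The three strategies differ only in how they walk the free block list. The following Java sketch shows just the selection step of each strategy; the block layout and class names are illustrative, and a real allocator would additionally split the chosen block and coalesce neighbors:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// A toy free-block list illustrating the three search strategies above.
public class FreeListDemo {
    record Block(int address, int size) {}

    private final List<Block> freeList = new ArrayList<>();
    private int nextFitCursor = 0; // remembered position for Next Fit

    FreeListDemo(List<Block> initial) { freeList.addAll(initial); }

    // First Fit: scan from the head and take the first block that is large enough.
    Block firstFit(int size) {
        for (Block b : freeList) {
            if (b.size() >= size) return b;
        }
        return null;
    }

    // Next Fit: resume scanning after the block found last time, wrapping around.
    Block nextFit(int size) {
        for (int i = 0; i < freeList.size(); i++) {
            int idx = (nextFitCursor + i) % freeList.size();
            Block b = freeList.get(idx);
            if (b.size() >= size) {
                nextFitCursor = (idx + 1) % freeList.size();
                return b;
            }
        }
        return null;
    }

    // Best Fit: keep the list ordered by size and take the smallest block that fits.
    // (Sorting here mutates the demo list; a real Best Fit keeps a separate size-ordered structure.)
    Block bestFit(int size) {
        freeList.sort(Comparator.comparingInt(Block::size));
        for (Block b : freeList) {
            if (b.size() >= size) return b;
        }
        return null;
    }

    public static void main(String[] args) {
        FreeListDemo demo = new FreeListDemo(List.of(
                new Block(0, 16), new Block(64, 8), new Block(128, 32)));
        System.out.println(demo.firstFit(8)); // Block[address=0, size=16]
        System.out.println(demo.bestFit(8));  // Block[address=64, size=8]
    }
}
```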

Buddy Algorithm #

The Buddy Algorithm is a classic memory allocation algorithm that uses a segregated-fit design. It divides physical memory into groups of memory blocks whose sizes are powers of two, and each request is rounded up to the nearest power of two before being served. For example, a request for a 10 KB block is served with a 16 KB block.

The Buddy Algorithm is relatively complex, so let’s explain its allocation process with the help of the following diagram.

[Figure: the Buddy Algorithm’s 11 free lists, holding blocks of 2^0 to 2^10 contiguous Pages]

The Buddy Algorithm divides memory into 11 groups of memory block collections, each of which represents a power of 2 in size. Each group of memory block collections is linked in a doubly linked list. The size of each memory block in the list is 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 consecutive pages, respectively. For example, the nodes in the first group list represent contiguous blocks of 2^0 pages, the nodes in the second group list represent contiguous blocks of 2^1 pages, and so on.

Let’s consider allocating a memory block of size 10K and see how the Buddy Algorithm works:

  1. First, find the list that holds contiguous blocks of 2^2 pages: 10 KB needs three 4 KB Pages, which rounds up to 2^2 = 4 Pages (16 KB), so we look at array index 2.
  2. Check whether the 2^2 linked list contains a free memory block. If it does, the allocation succeeds.
  3. If the 2^2 linked list has no free block, continue searching upwards in the array and locate the linked list at index 3, whose nodes each hold 2^3 contiguous Pages.
  4. If the 2^3 linked list has a free block, take it out and split it into two 2^2-sized blocks: one is handed to the process, and the other is linked back into the 2^2 linked list. (The order computation for step 1 is sketched below.)
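
Under the 4 KB page size used above, step 1 boils down to rounding the request up to the next power of two of Pages and picking that order (array index). A small illustrative Java sketch; the class and method names are hypothetical, not part of any real allocator:

```java
public class BuddyOrderSketch {
    static final int PAGE_SIZE = 4096; // 4 KB pages, as assumed in the text
    static final int MAX_ORDER = 10;   // 11 free lists: blocks of 2^0 .. 2^10 pages

    // Smallest order k such that 2^k pages can hold the requested bytes.
    static int orderFor(int requestedBytes) {
        int pages = (requestedBytes + PAGE_SIZE - 1) / PAGE_SIZE;
        int order = 0;
        while ((1 << order) < pages) order++;
        if (order > MAX_ORDER) throw new IllegalArgumentException("request exceeds the largest buddy block");
        return order;
    }

    public static void main(String[] args) {
        // A 10 KB request needs 3 pages, rounded up to 2^2 = 4 pages (16 KB),
        // so the search starts at the order-2 free list and, if that list is empty,
        // walks upward and splits a larger block in half until an order-2 block exists.
        System.out.println(orderFor(10 * 1024)); // prints 2
    }
}
```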

That is the allocation process of the buddy algorithm; what happens when memory is freed? When a process returns a block, the allocator checks whether the block’s buddy is also free. The buddy is the adjacent block of the same size, where the lower of the two addresses is aligned to twice the block size, so that the pair can merge into a naturally aligned larger block. If the buddy is free, the two blocks are merged into a larger one and the same check is repeated for the merged block. This continues until the buddy is not free, and the resulting block is returned to the list that matches its size. Not every release triggers a merge, since frequent merging wastes CPU: when the number of blocks in a list is below a certain threshold, merging is skipped.
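
For naturally aligned power-of-two blocks, finding the buddy during this merge check is cheap: the buddy’s address is the block’s address XOR the block size. A minimal sketch under that alignment assumption, again with illustrative names:

```java
public class BuddyMergeSketch {
    // Flipping the bit that corresponds to the block size yields the buddy:
    // buddy = address XOR size. Two buddies of size S merge into a block of
    // size 2S whose lower address is aligned to 2S.
    static long buddyOf(long address, long blockSize) {
        return address ^ blockSize;
    }

    public static void main(String[] args) {
        long size = 16 * 1024;                                        // a 16 KB block
        System.out.println(Long.toHexString(buddyOf(0x4000, size))); // 0    -> merges into [0x0000, 0x8000)
        System.out.println(Long.toHexString(buddyOf(0x8000, size))); // c000 -> merges into [0x8000, 0x10000)
    }
}
```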

From this, it can be seen that the buddy algorithm effectively reduces external fragmentation but may cause significant internal fragmentation, with the most extreme case resulting in 50% memory fragmentation.

Slab Algorithm #

Since the buddy algorithm uses a page as the smallest management unit, it is not suitable for small memory allocation scenarios. Allocating a whole page each time would be very wasteful. Therefore, the slab algorithm was developed. The slab algorithm optimizes the buddy algorithm for small memory scenarios and adopts a memory pool solution to solve the problem of internal fragmentation.

The Linux kernel uses the slab algorithm because the kernel frequently allocates small chunks of memory. The slab algorithm provides a cache mechanism that stores kernel objects: when the kernel needs memory, the request can usually be served straight from the cache. In addition, the slab algorithm supports caching initialized, general-purpose objects, avoiding the overhead of initializing the same kind of object over and over. The following figure is a structural diagram of the slab algorithm; its implementation is very complex, and this article only gives a high-level view.

[Figure: Slab algorithm structure with cache_chain, kmem_cache, and the slab lists]

The slab algorithm maintains Slab sets of different sizes. At the top level is the cache_chain, which links together a set of kmem_cache nodes. Each kmem_cache manages an object pool of one fixed size: a block of memory is allocated in advance and divided into equally sized slots, the blocks are never merged, and a bitmap records which slots are in use.

The kmem_cache contains three Slab linked lists: slab_full for fully allocated Slabs, slab_partial for partially allocated Slabs, and slab_empty for completely free Slabs. These three lists drive allocation and release. Each Slab in the lists consists of one or more contiguous Pages and stores multiple objects. The slab algorithm manages memory per object type and groups objects of the same type together: on allocation, a suitable slot is taken from the Slab lists; on release, the object is not discarded but kept in the cache, and the most recently released slot is reused first when the same kind of object is allocated again.

A single Slab can move between different linked lists. For example, when a Slab is fully allocated, it will move from slab_partial to slab_full. When an object in a Slab is released, it will move from slab_full back to slab_partial. If all objects in a Slab are released, it will move from slab_partial to slab_empty.
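
The caching idea can be illustrated with a tiny object pool in Java. This is only a sketch of the reuse behaviour described above, not the kernel’s data structures; there are no pages, bitmaps, or full/partial/empty lists here:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

// A toy object cache in the spirit of a slab: objects of one fixed type are
// pre-created, handed out on allocation, and put back on release instead of
// being destroyed.
public class SlabLikeCache<T> {
    private final Deque<T> freeObjects = new ArrayDeque<>();
    private final Supplier<T> constructor;

    public SlabLikeCache(Supplier<T> constructor, int preallocate) {
        this.constructor = constructor;
        for (int i = 0; i < preallocate; i++) {
            freeObjects.push(constructor.get()); // objects are initialized once, up front
        }
    }

    public T allocate() {
        // Reuse the most recently released object if one is cached.
        T obj = freeObjects.poll();
        return obj != null ? obj : constructor.get();
    }

    public void release(T obj) {
        freeObjects.push(obj); // keep the object cached instead of discarding it
    }
}
```

For example, a SlabLikeCache<StringBuilder> built with StringBuilder::new would keep handing back previously released builders instead of constructing new ones each time.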

So far, the three most commonly used memory allocation algorithms have been introduced. Excellent memory allocation algorithms seek a balance between performance and memory utilization. The protagonist of today’s article, jemalloc, is a very typical example.

jemalloc Architecture Design #

After understanding the commonly used memory allocation algorithms, it will be relatively easy to understand the architecture design of jemalloc. The following figure is the architecture diagram of jemalloc, and we will learn about its core design principles together.

[Figure: jemalloc architecture with arenas, bins, chunks, runs, regions, and tcaches]

The diagram above involves several core concepts of jemalloc, such as arena, bin, chunk, run, region, and tcache. We will introduce them one by one below.

Arena is the most important concept in jemalloc. Memory is managed by a fixed number of arenas; each user thread is bound to an arena, and threads pick an available arena in round-robin fashion for their allocations. To reduce lock contention between threads, four arenas are created per CPU by default. A bin manages memory units of one size class, and the size classes covered by the bins grow step by step. Because jemalloc completes small allocations with the Slab algorithm, the bins end up holding memory blocks of these different size classes.

A chunk is the data structure that manages the user’s memory blocks. Chunks manage memory in units of Pages, with a default size of 4 MB, i.e. 1024 contiguous 4 KB Pages. Each chunk can serve many small allocations, but in the case of a large memory allocation it can only be allocated once.

run is actually a memory area within a chunk, and each bin manages runs of the same type. Memory allocation is completed by manipulating runs. The specific size of a run is determined by different bins. For example, a bin with a size of 8 bytes corresponds to a run with only one page, from which 8-byte blocks can be allocated.

A region is one of the small memory blocks inside a run: each run is divided into several equally sized regions, and small allocations are handed out region by region.

A tcache is a per-thread private cache used in the Small and Large allocation scenarios. Each tcache is bound to one arena, and the tcache itself also has a bin array, called tbin; unlike the bins in an arena, a tbin has no concept of a run. The tcache requests batches of memory from its arena, and allocations are tried against the tcache first to avoid lock contention; only if that fails does the allocation fall back to the arena’s runs.

Now let’s summarize the relationship between these concepts in jemalloc:

  • Memory is managed by a certain number of arenas, and threads are evenly distributed among arenas.
  • Each arena contains a bin array, with each bin managing memory blocks of different sizes.
  • Each arena is divided into several chunks, and each chunk contains several runs. Each run consists of consecutive pages and is the actual object for memory allocation.
  • Each run is divided into a certain number of regions. In small memory allocation scenarios, a region corresponds to one user allocation.
  • Each tcache corresponds to an arena, and tcache contains multiple types of bins.
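
The relationships above can be written down as a purely structural Java sketch. The class and field names mirror the jemalloc concepts, but everything here is illustrative; the real implementation is in C and carries far more metadata:

```java
import java.util.List;

// A structural sketch of the jemalloc concepts listed above (illustrative only).
class Region {}                                       // one small user allocation

class Run {
    int regionSize;                                   // fixed region size, set by the owning bin
    List<Region> regions;                             // a run is split into equally sized regions
}

class Chunk {
    static final int DEFAULT_SIZE = 4 * 1024 * 1024;  // 4 MB = 1024 contiguous 4 KB Pages
    List<Run> runs;                                   // a chunk is carved into runs
}

class Bin {
    int sizeClass;                                    // each bin serves one size class
    List<Run> runs;                                   // and manages the runs of that size
}

class Arena {
    Bin[] bins;                                       // bins for the different size classes
    List<Chunk> chunks;                               // chunks owned by this arena
}

class ThreadCache {                                   // "tcache"
    Arena arena;                                      // each tcache is bound to one arena
    Bin[] tbins;                                      // per-thread bins, with no runs of their own
}
```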

Next, let’s analyze jemalloc’s overall memory allocation and deallocation process, which can be divided into three scenarios: Small, Large, and Huge.

First, the Small scenario. If the requested size is smaller than the smallest bin in the arena, the allocation is preferably served from the thread’s corresponding tcache. The matching tbin is checked first: if it holds a cached memory block, the allocation succeeds immediately. Otherwise, the arena that the tbin belongs to is located, regions are allocated from the corresponding bin of that arena and stored in the tbin’s avail array, and finally one address from the avail array is handed out. When the memory is released, the reclaimed block is cached in the tbin again.

Memory allocation in the Large scenario is similar to the Small scenario. If the requested size is larger than the smallest bin in the arena but no larger than the largest block the tcache can cache, the allocation still goes through the tcache; in this case, however, the memory comes from a chunk and its runs, where the matching space is found. Release also works like the Small case: freed blocks are cached in the tcache’s tbin. In addition, if the requested size is larger than the largest block the tcache can cache but no larger than a chunk, the tcache mechanism is bypassed and the allocation is made directly from a chunk.

In the Huge scenario, if the requested memory size is larger than the size of a chunk, it will be allocated directly using mmap and deallocated using munmap.
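
Putting the three scenarios together, the allocation path is chosen purely by the requested size. The following Java sketch illustrates that dispatch; the threshold constants are assumptions made for the example, since real jemalloc derives its size-class boundaries from the page size, the chunk size, and build-time options:

```java
// An illustrative dispatcher for the three allocation paths described above.
public class JemallocStyleDispatch {
    static final long SMALL_MAX = 14 * 1024;          // assumed upper bound of the small classes
    static final long TCACHE_MAX = 32 * 1024;         // assumed largest size cached per thread
    static final long CHUNK_SIZE = 4L * 1024 * 1024;  // 4 MB chunk, as in the text

    enum Path { SMALL_VIA_TCACHE, LARGE_VIA_TCACHE, LARGE_VIA_CHUNK, HUGE_VIA_MMAP }

    static Path pathFor(long size) {
        if (size <= SMALL_MAX) return Path.SMALL_VIA_TCACHE;   // small: tbin first, then arena bin/region
        if (size <= TCACHE_MAX) return Path.LARGE_VIA_TCACHE;  // large but cacheable: tcache, backed by runs in a chunk
        if (size <= CHUNK_SIZE) return Path.LARGE_VIA_CHUNK;   // large: allocate directly from a chunk
        return Path.HUGE_VIA_MMAP;                             // huge: mmap/munmap the whole allocation
    }

    public static void main(String[] args) {
        System.out.println(pathFor(64));                // SMALL_VIA_TCACHE
        System.out.println(pathFor(16 * 1024));         // LARGE_VIA_TCACHE
        System.out.println(pathFor(1024 * 1024));       // LARGE_VIA_CHUNK
        System.out.println(pathFor(8L * 1024 * 1024));  // HUGE_VIA_MMAP
    }
}
```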

So far, the introduction to the basic knowledge of jemalloc is complete. You need to take some time to digest it, as it will be helpful for studying Netty’s memory management in the future.

Summary #

Memory management is essential knowledge for every advanced programmer. The essence stays the same amid endless variations, and the ideas behind jemalloc apply in many scenarios: you can see them at work in well-known high-performance components such as Redis and Netty, whose implementations take similar approaches, allocating memory in large blocks up front to avoid death by a thousand cuts. Strike while the iron is hot: in the next lesson, we will continue with how Netty designs its high-performance memory management.