12 Why MC Can Maintain High Performance in Long Term Read and Write Operations

12 Why MC can Maintain High Performance in Long-term Read and Write Operations #

Hello, I am your cache course instructor, Chen Bo. Welcome to Lesson 12 “Memcached Memory Management Slab Mechanism”.

Slab Mechanism for Memory Management #

After discussing the eviction strategy, we will now learn about the slab mechanism for memory management.

MC allocates memory using the slab mechanism, which can avoid memory fragmentation and is crucial for MC to sustain high performance in data reading and writing.

Slabclass #

MC’s slab mechanism operates through slabclasses, as shown in the following diagram. When MC starts, it constructs an array of slabclasses with a length of 64. Slabclass 0 is used for slab reallocation, while slabclasses 1-63 are used to store data items. Each slabclass that stores data items records its chunk size, and the chunk sizes for different slabclasses increase according to a multiplication factor. The chunk size of the last slabclass (slabclass 63) is set to the maximum chunk size, which is 0.5MB by default. When a slabclass has no free chunks, MC allocates a slab for it with a default size of 1MB. The slab is then split into chunks based on the chunk size of the slabclass. These split chunks are initialized using the Item structure and recorded in the freelist linked list of the slabclass. When a key/value pair needs to be stored in a slabclass, an Item is allocated from the freelist for it to use. Additionally, if an Item expires, is invalidated by flush_all, or is evicted due to insufficient memory, it will be recycled back to the freelist at an appropriate time for future allocations.

img

Storage Slab Allocation #

As shown in the following diagram, MC allocates storage space in units of slabs, and the default size of each slab is 1MB. Therefore, when storing data, MC’s minimum memory allocation unit is 1MB. After allocating this 1MB slab, the slab is further divided into chunks of the same size as the chunk size of the corresponding slabclass. These chunks are used to store Item data, which includes the fields of the Item structure as well as the key/value pairs.

Generally, the Item structure and key/value pairs do not fully fill a chunk and there will be some wasted bytes. However, this byte waste is minimal and can be ignored. Once a slab is allocated in MC, it will not be recycled, but it may be reallocated among different slabclasses based on the runtime situation.

img

When a slabclass has no free chunks and new data needs to be inserted, an attempt is made to add a new slab to it. When adding a new slab to a slabclass, it first tries to reuse a previously allocated slab from the global slabclass 0. If slabclass 0 has no slab available, it will attempt to directly allocate a slab from the memory heap. If the global slabclass 0 has no free slab and the memory allocation of MC has reached the upper limit set by MC, it means that there is no slab available for reallocation at this time, and the allocation of a new slab fails, resulting in a direct return.

Of course, even if slabclass fails to allocate a slab, it does not mean that Item allocation will fail. As mentioned earlier, previously allocated Items can be reclaimed through synchronous LRU eviction and used for new storage requests.

Item #

In Memcached (Mc), the chunk in slabclass is first initialized using the Item structure. It is then stored in the freelist linked list. When data storage is needed, the chunk is retrieved from the freelist and used to store key/value and various auxiliary attributes. It is then stored in the LRU linked list and Hashtable, as shown in the diagram below.

The Item structure has two prev and next pointers used to connect the freelist linked list before allocating it for data storage. After allocation, they are used to connect the corresponding LRU linked list. There is also an h_next pointer used to connect the bucket unidirectional linked list in the Hashtable after allocation. The Item structure also stores expiration time, slabclass ID, key length, cas unique ID, etc. Finally, at the end of the Item structure, the key, flag, value length, and value block data are stored. The chunk space after the value is wasted. The Item is managed by the freelist during the idle period, i.e., initial allocation and after being reclaimed. During the storage period, it is managed by the Hashtable and LRU.

img

Storage Item Allocation #

Memcached (Mc) uses the slab mechanism to manage memory allocation. The allocation of memory for storing key/value is thus converted to the allocation of the Item. During the allocation process, there is a loop that repeats 10 times until the Item space is allocated. If the loop has run 10 times and still no Item space is allocated, the storage fails and returns a SERVER_ERROR response.

During the allocation process, if the slabclass’s freelist has space, it is directly allocated. Otherwise, it tries to allocate a new slab for the slabclass. The new slab is allocated by reusing an idle slab from the global slab pool (i.e., slabclass 0) one by one. If the global slab pool doesn’t have a slab, it tries to allocate directly from memory. After a successful allocation of a new slab, the slab is split based on the chunk size recorded for the slabclass. The split chunks are then initialized with the Item structure and stored in the freelist. If the global slab pool is empty and the memory allocation of Mc has reached the set limit, the attempt to add a new slab fails, and it continues with a smaller loop that repeats 5 times. It tries to reclaim expired keys from the COLD LRU, and if there are no expired keys, it forcibly evicts a normal key from the end of the queue. If the COLD LRU of that slabclass does not have an Item, it processes the HOT LRU and either reclaims or migrates the Item at the end of the HOT linked list to find an available Item space in the next loop.

Data Storage Mechanism #

After discussing Mc’s Hash Table positioning, LRU eviction, and slab memory allocation, let’s look at how Mc stores key/value data. By analyzing the data storage and maintenance process, we can relate and connect Mc’s core modules.

First, let’s see how Mc writes the data into pre-allocated storage space using the slab mechanism.

As shown in the following diagram, when key/value data needs to be stored, the required number of bytes for storing this key/value is calculated based on the key/value size and the size of the Item structure. Then, the slabclass with the smallest chunk size that can accommodate this number of bytes is selected. An available chunk from the freelist of this slabclass is allocated for this key/value. If the freelist is empty, it attempts to allocate a new slab for the slabclass. If the slab allocation is successful, slabs are split into chunks according to the size and then initialized with the Item structure. These Items are then filled into the freelist. If the slab allocation fails, the invalid Item is evicted through LRU eviction or forcibly evicts a normal Item. These evicted Items are also filled into the freelist. The process is retried 10 times until the Item position is allocated. In general, Item allocation will succeed, but there is a very low probability of failure. If the allocation fails, a SERVER_ERROR response is returned, notifying the client that the storage has failed. After allocating an available Item, the expiration time, flag, slabclass ID, key, and value are written into this Item. For the set command, if there is an old value for this key, it is deleted before storing the new value.

img

After successfully allocating an Item for key/value and writing the data, the next step is to store this Item in the Hash Table. Since the Mc Hash Table can be migrated, for normal cases, the Item is directly stored in the main Hash Table. During Hash Table migration, depending on the migration position, the Item is stored in either the main Hash Table or the old Hash Table. After storing the Item in the Hash Table, the key can be quickly located. This Item is also stored in the LRU. Mc determines whether to store it in the TEMP LRU or HOT LRU based on the expiration time of this key.

At this point, the key/value is correctly stored in Mc. The data content is written into a chunk position in the slabclass. This chunk is filled with the Item structure, which is simultaneously recorded in the Hashtable and LRU, as shown in the following diagram. The Hashtable allows fast key lookup, while the LRU is used for the daily maintenance of the Item lifecycle.

img

Mc’s daily maintenance of the Item lifecycle includes asynchronous maintenance and synchronous maintenance. Asynchronous maintenance is performed by the LRU maintenance thread and does not affect normal client requests. In the LRU maintenance thread, expired or invalid keys are reclaimed, and the four LRUs are internally rearranged and migrated. This is the main form of Item lifecycle management. Synchronous maintenance is performed by worker threads while processing command requests. When processing a delete command, the key/value is deleted directly. When storing a new key/value, if the allocation fails, invalid keys are reclaimed or normal Items are forcibly evicted. These evicted Items enter the freelist of the slabclass for reuse.