08 How Is the MC System Architecture Laid Out

08 How is the MC System Architecture Laid Out #

Hello, I am your cache teacher Chen Bo. Welcome to Lesson 8, “Memcached System Architecture”.

System architecture #

Let’s take a look at the system architecture of MC.

As shown in the diagram below, MC’s system architecture mainly consists of five parts: network processing module, multi-threaded processing module, hash table, LRU, and slab memory allocation module. MC implements the network processing module based on Libevent to handle user requests concurrently with multiple threads. It uses the hash table to quickly locate keys, LRU (Least Recently Used) to manage the eviction of cold data, and the slab mechanism for fast memory allocation and storage.

System architecture #

MC has implemented a multi-threaded network model based on Libevent. MC’s multi-threaded network model consists of a main thread and worker threads. These threads use multiplexing IO for network IO access and read/write processing. In Linux, epoll is commonly used. With multiplexing IO, especially the use of epoll, MC threads do not need to iterate over the entire set of listened descriptors. They only need to iterate over the ready queue of descriptors after being notified. These descriptors are asynchronously notified by the kernel IO events only after all preparations are done. In other words, IO operations are only performed when the connection is ready. This avoids blocking and allows MC to support high concurrency while achieving very high IO throughput efficiency.

In addition to the main thread and worker threads used for IO, MC also has multiple auxiliary threads, such as the Item crawler thread, LRU maintenance thread, and hash table maintenance thread. By working concurrently with multiple threads, MC can fully utilize the multiple cores of the machine and achieve good network IO performance and data processing capability.

MC uses a hash table, specifically Hashtable, to quickly locate keys. When data is stored, the data item structure is stored in a chunk within a slab and is also stored in the hashtable. Meanwhile, for each bucket in the hash table, MC uses an item to record a singly linked list to solve hash collisions for different keys in the hash table. When looking up a given key’s item, the hash value of the key is first calculated, and then a search is performed in the bucket that corresponds to the hash value in the hash table. By polling the singly linked list in the bucket, the item pointer corresponding to the key is found, and thus the storage item corresponding to the key is found, as shown in the diagram below.

Under normal circumstances, MC’s insert and lookup operations on the hash table are performed in the main table. When the number of items in the table exceeds 1.5 times the number of hash table bucket nodes, the hash table is resized. As shown in the diagram below, during the resizing process, MC internally uses two hashtables: the primary hashtable and the old hashtable. When resizing starts, the original primary hashtable becomes the old hashtable, while a new hashtable with double the capacity is allocated as the new main table. During the migration process, the maintenance thread gradually copies and inserts the item pointers from the old table to the new main hashtable. During the migration, user requests will simultaneously query the old table and the new main table based on the migration position. When all the data has been migrated, all operations will be performed in the main table again.

LRU mechanism #

Mc mainly uses the LRU mechanism for cold data eviction. Since version 1.4.24, Mc has continuously optimized the LRU algorithm and the current version of Mc has enabled segmented LRU by default. Before enabling segmented LRU, each slabclass ID corresponds to only one COLD LRU, and when memory is insufficient, data is directly evicted from the COLD LRU. However, after enabling segmented LRU, each slabclass ID has four LRUs: TEMP, HOT, WARM, and COLD.

As shown in the figure below, the remaining expiration time of the items in the TEMP LRU is usually short, typically within 61 seconds. Items in this queue will never be moved within the queue or migrated to other queues. When inserting a new key/value pair, if the remaining expiration time of the key is less than 61 seconds, it will directly enter the TEMP LRU. Later, expiration can be performed when necessary. This avoids lock contention and improves performance.

For the HOT LRU, there is no movement within the queue. When the queue is full, if the last item in the queue is in active state, meaning it has been accessed, it will be migrated to the WARM queue, otherwise it will be migrated to the COLD queue.

For the WARM LRU, if an item in the queue is accessed again, it will be moved to the head of the queue, otherwise it will be migrated to the COLD queue.

For the COLD LRU, it stores the least active items. Once the memory is full, the last item in the queue will be evicted. If an item in the COLD LRU is accessed again, it will be migrated to the WARM LRU.

Slab Allocation Mechanism #

In general, memory allocation in application systems is done directly using malloc and free. Over time, the number of memory fragments increases, greatly increasing the burden on the system memory manager. The continuous generation of fragments not only leads to a lot of memory waste, but also makes fragment compaction more complex, resulting in slower and worse memory allocation speed and storage efficiency. The introduction of the slab allocation mechanism in Mc solves the problem of memory fragmentation. Let’s first understand the Mc’s slab allocation mechanism.

Mc uses the slab mechanism to allocate and manage memory, as shown in the figure below. It can be said that the use of slab allocation mechanism is the key to Mc’s high-performance allocation and storage. When Mc starts, it creates 64 slab classes, but slab class with index 0 is used for slab redistribution and is not involved in the regular allocation activities of other slab classes. Each slab class continuously allocates slabs with a default size of 1MB as needed.

Each slab is divided into chunks of the same size. Chunk is the basic storage unit for storing data in Mc. The chunk size of slab class 1 is the smallest, with a default minimum chunk size of 102 bytes, and the following slab classes will gradually increase the chunk size according to the growth factor, and the specific value will be further rounded to the nearest multiple of 8. The default growth factor of Mc is 1.25, and it can be set to other values using -f at startup. For example, with the default value, the chunk size of slab class 1 is 102, the chunk size of slab class 2 is 102×1.25, and after rounding to the nearest multiple of 8, it is 128.

The chunks in the Mc slab contain key/value key-value pairs through the Item structure. The Item structure’s header stores the pointers, flags, expiration time, etc. of the linked list, and then stores the key and value. In general, an Item will not fill the entire chunk, but since each key/value is stored according to the key/value size, choosing the closest slab class, the wasted bytes in the chunk are very limited and can be ignored.

After allocating a new slab, the slab space is divided into chunks of the same size, and these chunks are added to the freelist of the slab class for allocation when needed. The chunks that are allocated and store Item data will reenter the freelist after being evicted upon expiration, available for future use.