14 How MC Responds to Common Issues in the Era of Big Data #

Hello, I am Chen Bo, your instructor for this caching course. Welcome to Lesson 14, "Classic Issues and Solutions of Memcached".

Memcached Classic Issues in the Era of Big Data #

With the rapid development and popularization of the Internet, humanity has entered the era of big data. Mobile devices are now fully integrated into people's work and life, and all kinds of data are being produced, mined, and consumed at unprecedented speed. Mobile Internet systems keep evolving as well, storing, computing, and analyzing massive amounts of data to meet users' needs. In this era, large and medium-sized Internet systems share the following characteristics.

  1. First, the amount of data stored in the system is enormous. For example, a microblogging system receives hundreds of millions of new records every day, and its historical data runs to tens of billions or even hundreds of billions of records.
  2. Second, there are many users and a huge amount of traffic; the daily peak can reach millions of queries per second (QPS).
  3. To store hundreds of billions or even trillions of records and handle highly concurrent access from such a large user base, Internet systems need to deploy a large number of service instances. Many large and medium-sized systems run tens of thousands or even hundreds of thousands of instances.
  4. Furthermore, because information now spreads in a flat, viral way, hot events or emergencies can draw the attention of a huge number of users all at once, producing sudden traffic peaks.
  5. Finally, every hardware resource has some probability of failure, and there is a noticeable four-year effect: the failure probability of a resource rises sharply after about four years of use. Large and medium-sized Internet systems deploy huge numbers of servers, routers, and switches across IDCs in multiple regions, and many of these resources have been in service for well over four years, so hardware failures and regional network-access anomalies are quite common.

Because Internet systems use Memcached extensively as a cache, Memcached is affected by the same system characteristics described above, which gives rise to a set of classic issues.

Capacity Issues #

The first issue is capacity. When Memcached runs, memory is consumed not only by the stored data but also by connection read/write buffers, the hash table, auxiliary threads, and the process itself, and the operating system takes a considerable amount of memory as well. To keep Memcached running stably, its memory limit is therefore generally set to about 80% of physical memory. Moreover, even the configured memory is not entirely used for valid data. As I mentioned in the previous lesson, storing each Item in a chunk wastes some bytes, and expired or invalidated keys are not deleted immediately but are cleaned up by lazy eviction and asynchronous LRU scans, so these not-yet-evicted keys also occupy a considerable amount of space. In the current era of big data, many core businesses in Internet systems need to cache 300-500 GB or more of hot data, which far exceeds the physical memory capacity of a single machine.
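As a rough, illustrative calculation (the machine size, the 70% effective-data ratio, and the 500 GB figure below are assumptions for the sketch, not numbers from the lesson), this is why such a data set has to be spread across multiple nodes:

```python
import math

physical_mem_gb = 128                 # assumed physical memory of one Memcached machine
mc_mem_gb = physical_mem_gb * 0.8     # Memcached is typically capped at ~80% of physical memory
effective_ratio = 0.7                 # assumed share left after chunk waste and expired-but-unevicted keys
hot_data_gb = 500                     # hot data volume of one core business (upper end of 300-500 GB)

usable_per_node_gb = mc_mem_gb * effective_ratio
nodes_needed = math.ceil(hot_data_gb / usable_per_node_gb)
print(f"~{usable_per_node_gb:.0f} GB usable per node -> at least {nodes_needed} nodes for {hot_data_gb} GB of hot data")
```

Under these assumptions a single core business already needs a pool of roughly 7 nodes, which is why the cache has to be split across a resource pool rather than kept on one machine.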

Performance Bottlenecks #

The second issue is the performance bottleneck. For the sake of system stability, the QPS (queries per second) hitting a single online Memcached node should be kept below roughly 100,000-200,000; beyond that, slow queries start to appear. However, for large-scale Internet systems, cache requests for core businesses can reach millions of QPS, which a single physical machine or a single ordinary resource pool simply cannot satisfy in a production environment.

Connection bottleneck #

The third issue is the connection bottleneck. For stability reasons, the number of connections to a single online Memcached node should be kept below 100,000; too many connections occupy a large amount of memory, lowering the hit rate and even causing slow queries and timeouts. In a large-scale system, however, there can be tens or even hundreds of thousands of business instances online, and each instance's connection pool is generally configured with a minimum of 5 and a maximum of 60 connections. For example, 50,000 instances holding only 5 connections each already amounts to 250,000 connections, so the total connection count from business instances far exceeds what a single machine can stably support.

Localized hardware failures #

The fourth issue is cache availability in the face of localized hardware failures. Any hardware resource has a certain probability of failing, and the failure rate climbs steeply after about four years of use. With tens of thousands of devices in service, machine failures can occur at any time, degrading or taking down Memcached nodes, letting massive traffic penetrate to the database, and overloading it, ultimately leading to system-wide unavailability, that is, an avalanche.

Rapid scaling during traffic peaks #

The fifth issue is how to scale rapidly during traffic peaks. In the era of big data, because information spreads in a flat, viral way, emergencies or major events send a huge number of users flocking to the system in a short period. Overall traffic can rise by more than 70% above the normal daily peak, and a large number of extremely hot keys are accessed. The Memcached nodes holding these extremely hot keys can see 2 to 3 times (or more) their usual peak traffic, easily driving CPU usage to spike, saturating bandwidth, and severely overloading the machines.

Memcached classic issues and countermeasures #

To solve these issues that large-scale internet systems face when using Memcached, we can adopt the following solutions.

Splitting the cache pool #

Firstly, the core business data in the system should be split so that data with high access volume gets its own cache pools. Each cache pool should have 4 to 8 nodes, which provides enough capacity while avoiding excessive pressure on any single cache node. The distribution strategy within a cache pool can be either consistent hashing or modulo hashing.

In the consistent hashing distribution algorithm, the hash value of each Memcached service node is calculated first and placed on a hash ring; in practice each cache node is mapped to N virtual hash points scattered around the ring. As shown in the figure below, when storing or requesting data, the same hash algorithm is applied to the key to map it onto the ring; then, moving clockwise from that position, the first Memcached node encountered is the target node.

img

The modulo hashing distribution algorithm is simpler: after hashing the key, we take the hash value modulo the number of Mc nodes to find the target Mc node for storage and retrieval.
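To make the two distribution strategies concrete, here is a minimal Python sketch; the node names, the number of virtual points per node, and the use of MD5 as the hash function are illustrative assumptions, not details from the lesson:

```python
import bisect
import hashlib

def hash32(s: str) -> int:
    """Map a string to a 32-bit integer (MD5 is just an illustrative choice)."""
    return int(hashlib.md5(s.encode()).hexdigest()[:8], 16)

class ConsistentHashPool:
    """Consistent hashing: each Mc node is mapped to N virtual hash points on a ring."""
    def __init__(self, nodes, points_per_node=160):
        self.ring = sorted((hash32(f"{node}#{i}"), node)
                           for node in nodes for i in range(points_per_node))
        self.points = [h for h, _ in self.ring]

    def pick(self, key: str) -> str:
        # Walk clockwise from the key's hash position to the first node point.
        idx = bisect.bisect(self.points, hash32(key)) % len(self.ring)
        return self.ring[idx][1]

class ModuloPool:
    """Modulo hashing: hash(key) % node_count selects the target node."""
    def __init__(self, nodes):
        self.nodes = list(nodes)

    def pick(self, key: str) -> str:
        return self.nodes[hash32(key) % len(self.nodes)]

nodes = [f"mc{i}:11211" for i in range(6)]   # a hypothetical 6-node resource pool
print(ConsistentHashPool(nodes).pick("user:123:feed"))
print(ModuloPool(nodes).pick("user:123:feed"))
```

With the consistent-hash pool, a failed node can be handled by remapping its points on the ring so its load spreads over the remaining nodes; with modulo distribution the failed node keeps receiving its share of keys, which is exactly the 1/N miss described next.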

During the operation of the system, it is inevitable that Mc nodes will fail, sometimes even several times within a short period. When a failure occurs, if consistent hashing is used, the hash points and access load of the failed Mc node can conveniently be redistributed to the other Mc nodes through a rehashing strategy. If modulo distribution is used, roughly 1/N of accesses will simply miss, where N is the number of nodes in the Mc resource pool.

Therefore, for a single-layer Mc cache architecture, consistent hashing combined with a rehashing strategy is the better choice. By splitting business data into independent Mc resource pools and using an appropriate distribution algorithm within each pool, we can effectively solve the capacity issue, the performance bottleneck, and the connection bottleneck in Mc usage.

Master-Slave Two-Tier Architecture #

When the system's access volume is large, for example when peak QPS exceeds 200,000, a failed cache node will still put significant pressure on the DB for a period of time even with consistent hashing, producing a large number of slow queries and access timeouts. In addition, if certain cache servers fail repeatedly and keep going offline and coming back online, repeated rehashing will also leave dirty data behind. To address this, a Master-Slave two-tier architecture can be adopted.

In this architecture, the master Memcached pool that the business normally accesses is backed by a slave resource pool acting as a hot standby. The slave pool also consists of 6 to 8 nodes, but its memory is set to only 1/2 to 1/3 of the master's. The slave's main purpose is to keep the overall hit rate of the Mc cache pool from dropping significantly when the master misses or fails, so there is no need to allocate a large amount of memory to it.

Under normal operation, read requests go directly to the master; on a cache miss, the slave is queried, and if the slave hits, the retrieved key is written back to the master. Write operations such as set and touch update the master and the slave directly. For commands such as cas, add, and append, the master is the reference: after the operation succeeds on the master, the resulting key is set directly in the slave to keep the two layers consistent.
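A minimal sketch of this read/write path, assuming a hypothetical in-memory McPool stand-in for each resource pool (the class and method names are illustrative, not a real Memcached client API):

```python
class McPool:
    """Hypothetical in-memory stand-in for one Memcached resource pool (illustrative only)."""
    def __init__(self, name):
        self.name, self.data = name, {}

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value
        return True

    def add(self, key, value):
        # add succeeds only if the key is absent, mirroring Memcached's add semantics
        if key in self.data:
            return False
        self.data[key] = value
        return True

master, slave = McPool("master"), McPool("slave")

def cache_get(key):
    """Read path: master first; on a miss, try the slave and write back to the master."""
    value = master.get(key)
    if value is None:
        value = slave.get(key)
        if value is not None:
            master.set(key, value)        # write back so the master stays hot
    return value

def cache_set(key, value):
    """Write path: set/touch-style commands update master and slave directly."""
    master.set(key, value)
    slave.set(key, value)

def cache_add(key, value):
    """add/cas/append follow the master; on success the slave is overwritten with a plain set."""
    if master.add(key, value):
        slave.set(key, value)
        return True
    return False
```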

As shown in the figure below, when some master nodes fail, the slave layer takes over. Failures of some nodes at either layer do not affect the overall cache hit rate, request latency, or other SLA metrics. At the same time, modulo hashing is used for distribution, and no rehashing takes place when an Mc node fails, which avoids the dirty-data problem that consistent hashing can introduce during rehashing.

img

The Master-Slave architecture can effectively handle partial device failures under heavy traffic. When a node fails or a request misses, accessing the slave adds roughly 1 ms of latency, trading a small amount of time for overall system availability.

M-S-L1 Architecture #

Around the turn of the 20th century, the Italian economist Vilfredo Pareto observed that in any given group, a small number of important factors usually account for most of the effect, while the majority of factors matter relatively little; controlling the vital few is therefore enough to control the whole. Over the years this observation has become known as the 80/20 principle. It applies widely to Internet systems as well: 80% of user visits concentrate on 20% of a system's functions, and 80% of requests concentrate on 20% of the data. Internet system data therefore shows a clear split between hot and cold, and the skew is often even greater than 80/20. For example, on platforms like Weibo and WeChat, data from the most recent day is accessed heavily while data from a week ago is rarely touched, and among the recent hot data some feed items are shared and interacted with far more than the rest, forming a clear pattern of head requests.

Head requests account for a large share of daily traffic and are concentrated on a small fraction of keys. Moreover, during breaking news or major events, request volume can rise by 50-70% or more in a short period, and those requests concentrate on keys related to the event, producing many hot keys. Which keys become hot is largely random, and if they happen to land on a few nodes, the load on those nodes can increase several-fold, causing severe overload and a large number of slow or timed-out queries.

To handle the daily peak of hot data access, especially when dealing with the extreme access caused by sudden events, we can add an L1 layer. As shown in the diagram below, the L1 layer includes 2-6 sets of L1 resource pools, with 4-6 nodes per pool, but with only about 1/10 of the memory capacity of the master.

img

As shown in the diagram, for a read request, one L1 pool is selected at random and read first. If it misses, the master is accessed; if the master also misses, the slave is accessed. Whichever layer hits, the result is written back to the pools that missed earlier along the path.

For write requests, the handling is similar to the Master-Slave architecture: set commands update all three layers directly, while add/cas/append operations take the master as the reference, and after the master operation succeeds, the resulting key/value is set in every L1 pool and in the slave.
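Extending the Master-Slave sketch above (and reusing its hypothetical McPool class), the M-S-L1 access path might look like this; the number of L1 pools and the random selection are illustrative:

```python
import random

# Reuses the hypothetical McPool stand-in from the Master-Slave sketch above.
l1_pools = [McPool(f"L1-{i}") for i in range(4)]   # e.g. 4 L1 resource pools
master, slave = McPool("master"), McPool("slave")

def msl1_get(key):
    """Read path: a random L1 pool, then master, then slave; write back on the way out."""
    l1 = random.choice(l1_pools)
    value = l1.get(key)
    if value is not None:
        return value
    value = master.get(key)
    if value is None:
        value = slave.get(key)
        if value is not None:
            master.set(key, value)      # write back to the master
    if value is not None:
        l1.set(key, value)              # write back to the L1 pool that was read
    return value

def msl1_set(key, value):
    """Write path: set updates all three layers directly."""
    master.set(key, value)
    slave.set(key, value)
    for l1 in l1_pools:
        l1.set(key, value)

def msl1_add(key, value):
    """add/cas/append follow the master; on success the value is set in every L1 pool and the slave."""
    if master.add(key, value):
        slave.set(key, value)
        for l1 in l1_pools:
            l1.set(key, value)
        return True
    return False
```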

Since L1 has only 1/10 of the master's memory and is accessed first, only the hottest keys stay in L1; once a key cools slightly, it moves toward the tail of the COLD LRU and is eventually evicted. Although L1's memory is small, it holds exactly the hottest data carrying the bulk of the system's traffic, so it can serve 60-80% or more of the system's requests, in line with the 80/20 principle.

The master stores the full set of hot data and handles the traffic when L1 misses or fails. The slave stores most of the hot data in a copy independent of the master and handles the traffic when both L1 and the master miss or fail.

There is room for further optimization: keeping the master and slave warm. To retain the hottest portion of the data in the master and slave, a certain probability can be configured for reading the master or slave directly instead of L1, so that the hottest keys keep being accessed there and are not evicted. The access path then needs a slight adjustment: if the master is read first and misses, the slave is read next; if the slave is read first and misses, the master is read next.

With the Master-Slave-L1 architecture, we can use minimal resources to rapidly deploy additional L1 resource pools and add them to the L1 layer, significantly improving the system's ability to absorb peak load. Extremely hot keys are then spread across N sets of L1, and each L1 pool handles only 1/N of the requests. Besides absorbing peaks, this architecture also copes easily with local failures and avoids cascading failures.
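A sketch of the warm-keeping adjustment described above, building on the M-S-L1 functions from the previous sketch; the 5% probability is an assumed tuning value, not a figure from the lesson:

```python
import random

WARM_READ_PROB = 0.05   # assumed probability; tuned so master/slave stay hot without being overloaded

def msl1_get_with_warmup(key):
    """Occasionally skip L1 and read the master or slave first so the hottest keys are not evicted there."""
    if random.random() < WARM_READ_PROB:
        first, second = random.sample([master, slave], 2)  # randomly master-then-slave or slave-then-master
        value = first.get(key)
        if value is None:
            value = second.get(key)
        return value            # a miss here is handled like any other cache miss (e.g. fall back to the DB)
    return msl1_get(key)        # otherwise follow the normal L1 -> master -> slave path
```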

This lesson covered the characteristics of large and medium-sized Internet systems in the era of big data and the classic issues that arise when they access Memcached caches: how splitting the cache pool and adopting a Master-Slave two-tier architecture address the capacity, performance, connection, and local-failure problems, and how the three-tier Master-Slave-L1 architecture, a multi-layer, multi-copy Memcached setup, better handles sudden traffic peaks and local failures.

The mind map below can be referenced for a review and summary of these key points.

img