
06 How to Address Problems Triggered by Hot Keys and Big Keys #

Hello, I am your caching instructor, Chen Bo. Welcome to Lesson 6, “Classic Problems Related to Cache Keys”.

Hot keys #
Problem description #

The sixth classic problem is hot keys. In most Internet systems, data can be classified as hot or cold. For example, recently published news articles or blog posts are accessed most frequently, while older ones receive far fewer visits. When a sudden event occurs, however, a huge number of users access the hot key associated with that event at the same time, and the cache node that stores this hot key can easily become overloaded and slow, or even crash.

Analysis of reasons #

Hot keys cause anomalies in the caching system mainly because, during a sudden hot event, an extremely large number of requests hit the key associated with that event. For example, hundreds of thousands or even millions of users may simultaneously access a trending topic on a microblogging platform. When that many requests all target the same key, the traffic becomes concentrated on a single cache machine, which quickly reaches the limits of its network card, bandwidth, and CPU, resulting in slow and lagging cache access.

Business scenarios #

Hot keys can occur in various business scenarios, such as special sudden events like celebrity weddings, divorces, or scandals, major events or festivals like the Olympics or Chinese New Year, and online promotional activities like flash sales or shopping festivals (e.g., “Double 12” or “618”).

Solutions #

To solve the problem of extremely hot keys, we first need to identify them. For important holidays, online promotions, or concentrated push events that can be anticipated, we can proactively evaluate the likely hot keys in advance. For sudden events that cannot be predicted, we can use Spark to analyze the request stream in real time and promptly identify newly emerging hot keys. For keys that were published earlier and only gradually become hot, we can use Hadoop to run offline batch jobs that identify high-frequency keys in recent historical data.

Once the hot keys are identified, there are several solutions:

  1. We can disperse a hot key across different cache nodes. For example, if the hot key is named “hotkey,” it can be dispersed as “hotkey#1,” “hotkey#2,” “hotkey#3,” … “hotkey#n,” and these n keys can be stored on different cache nodes. When the client sends a request, it picks one of the suffixed copies at random, which spreads the requests for the hot key and avoids overloading a single cache node (see the sketch after this list).

img

  2. Another approach is to keep the key name unchanged and design a cache architecture that combines multiple replicas and multiple levels in advance.

  3. If there are many hot keys, we can monitor the cache’s SLA in real time and scale out quickly to reduce the impact of hot keys.

  4. Lastly, the business can also use a local cache to store these hot keys, reducing the pressure on the remote cache.
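As a minimal sketch of the first approach, the following Python snippet (assuming a redis-py client on localhost and a hypothetical shard count N_SHARDS; neither comes from the lesson) shows a writer fanning a hot key out to suffixed copies and a reader picking one copy at random:

```python
import random

import redis

r = redis.Redis(host="localhost", port=6379)

N_SHARDS = 8  # hypothetical shard count; tune to the number of cache nodes


def set_hot_key(key: str, value: str, ttl: int = 60) -> None:
    """Write the same value under every suffixed copy of the hot key."""
    pipe = r.pipeline()
    for i in range(1, N_SHARDS + 1):
        pipe.set(f"{key}#{i}", value, ex=ttl)
    pipe.execute()


def get_hot_key(key: str):
    """Read a random suffixed copy, spreading the load across nodes."""
    return r.get(f"{key}#{random.randint(1, N_SHARDS)}")


set_hot_key("hotkey", "breaking-news payload")
print(get_hot_key("hotkey"))
```

Because the suffixed copies hash to different slots, a clustered cache naturally places them on different nodes; the trade-off is that every update must now be written n times.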

Big keys #
Problem description #

The last classic problem is big keys: when accessing the cache, some keys have values that are so large that reads, writes, or loading from the DB time out.

Analysis of reasons #

There are several reasons why big keys cause slow queries. If big keys account for only a small proportion of the overall data stored in Mc, they map to fewer slabs and are easily evicted again and again, so they are repeatedly reloaded from the DB, producing slow queries. If the business has many big keys and they are heavily accessed, the network card and bandwidth of the caching component are easily saturated, producing even more slow queries. In addition, if a big key caches many fields, every change to any field rewrites the entire cached value; since these keys are also read frequently, reads and writes interfere with each other, again causing slow queries. Finally, once a big key is evicted from the cache, reloading it from the DB can take a long time, which also leads to slow queries.

Business scenarios #

Business scenarios involving big keys are also quite common. For example, an Internet system may need to store a user’s most recent 10,000 fans, or cache a user’s personal information, including basic details, relationship-graph counts, and feed statistics. Caching the content of a Weibo feed can also produce big keys: most Weibo posts are within 140 characters, but many users post messages of 1,000 characters or more, and these long posts become big keys, as shown in the following image.

img

Solutions #

For big keys, there are three solutions.

  • The first solution applies when the data is stored in Mc: design a cache threshold, and when the length of a value exceeds the threshold, compress the content so that the KV stays as small as possible. Next, evaluate the proportion of big keys in the data and, when Mc starts, immediately pre-write enough big-key data so that Mc pre-allocates sufficiently large chunk sizes and slabs. This ensures there is enough space for caching big keys during the subsequent operation of the system. A compression sketch follows the image below.

img
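As a rough sketch of the first solution, assuming a pymemcache client and a hypothetical 16 KB threshold (both illustration choices, not from the lesson), values above the threshold are zlib-compressed before being written, with a one-byte flag telling the reader whether to decompress:

```python
import zlib

from pymemcache.client.base import Client

mc = Client(("localhost", 11211))

COMPRESS_THRESHOLD = 16 * 1024  # hypothetical 16 KB threshold


def set_value(key: str, value: bytes, expire: int = 0) -> None:
    """Compress oversized values so the stored KV stays small."""
    if len(value) > COMPRESS_THRESHOLD:
        mc.set(key, b"Z" + zlib.compress(value), expire=expire)
    else:
        mc.set(key, b"R" + value, expire=expire)


def get_value(key: str):
    """Return the original bytes, decompressing when the flag says so."""
    data = mc.get(key)
    if data is None:
        return None
    flag, payload = data[:1], data[1:]
    return zlib.decompress(payload) if flag == b"Z" else payload
```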

  • The second solution is shown in the following image. It applies when the data is stored in Redis. For example, if the business data uses the set format and the set behind a big key contains several thousand or even tens of thousands of elements, writing it to Redis takes a long time and can leave Redis unresponsive. In this case, a new data structure can be extended; alternatively, the client can serialize and construct the big key in advance and then write it into Redis all at once using the restore command, as sketched after the image below.

img
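A minimal sketch of the serialize-then-restore idea, assuming redis-py plus a scratch Redis instance used only for construction (the two instances and the key name are illustration assumptions): the big set is built off to the side, serialized with DUMP, and then written to the target instance in a single RESTORE round trip:

```python
import redis

scratch = redis.Redis(host="localhost", port=6380)  # assumed construction-only instance
target = redis.Redis(host="localhost", port=6379)   # assumed serving instance


def write_big_set(key: str, members: list, ttl_ms: int = 0) -> None:
    """Build the set on the scratch instance, then RESTORE it in one shot."""
    scratch.delete(key)
    scratch.sadd(key, *members)
    payload = scratch.dump(key)  # RDB-serialized value of the whole set
    target.restore(key, ttl_ms, payload, replace=True)
    scratch.delete(key)


write_big_set("user:123:fans", [f"fan-{i}" for i in range(10000)])
```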

  • The third solution is shown in the following image: split the big key into multiple keys so that big keys are avoided as much as possible. And because reloading a big key that has penetrated to the DB takes a long time, these keys deserve special treatment, such as setting longer expiration times or, where conditions allow, exempting them from eviction. A splitting sketch follows the image below.

img
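As a sketch of the splitting approach, assuming redis-py and a hypothetical 64 KB piece size, a big value is cut into numbered sub-keys plus a small count key, and reads reassemble the pieces with MGET:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

CHUNK = 64 * 1024  # hypothetical piece size


def set_big_value(key: str, value: bytes, ttl: int = 3600) -> None:
    """Split a big value into numbered sub-keys plus a count key."""
    pieces = [value[i:i + CHUNK] for i in range(0, len(value), CHUNK)]
    pipe = r.pipeline()
    pipe.set(f"{key}:count", len(pieces), ex=ttl)
    for i, piece in enumerate(pieces):
        pipe.set(f"{key}:{i}", piece, ex=ttl)
    pipe.execute()


def get_big_value(key: str):
    """Reassemble the pieces; treat any missing piece as a full miss."""
    count = r.get(f"{key}:count")
    if count is None:
        return None
    pieces = r.mget([f"{key}:{i}" for i in range(int(count))])
    if any(p is None for p in pieces):
        return None
    return b"".join(pieces)
```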

With this, we have covered all seven classic caching problems.

We need to realize that for Internet systems, because real business scenarios are complex and data volumes and traffic are huge, the various pitfalls of cache usage need to be avoided in advance. By familiarizing yourself with these classic cache problems ahead of time, you can build defenses early: avoid having a large number of keys expire at the same time, prevent cache penetration from nonexistent keys, reduce the expiration of big keys and hot keys, and divert traffic away from hot keys. With a series of such measures, you can achieve a high cache hit rate while keeping data consistent. In addition, you can plan the cache system’s SLA in advance according to the business model, such as QPS, response-time distribution, and average response time, implement monitoring, and make it easy for operations to respond quickly. When encountering abnormal nodes, sudden traffic, or other extreme events, you can also avoid failures through strategies such as pooling, layering, and key splitting.

In the end, you will be able to maintain high performance and high availability of the service in various complex scenarios, such as high concurrency, massive traffic, and network or machine hardware failures.