04 How to Handle Cache Invalidation, Penetration and Avalanche Problems #

Hello, I’m Chen Bo, your caching instructor. Welcome to the 4th lesson, “Classic Issues Related to Cache Access”.

In the previous lessons, we covered the principles of caching and its design architecture, and summarized many techniques and key considerations for using and designing caching systems. In practice, cache design is full of pitfalls, and getting it wrong can have serious consequences: slower requests, degraded performance, data inconsistency, reduced availability, or even a cache avalanche that leaves the entire system unable to serve requests.

Next, we will discuss the 7 classic issues in caching design, as shown in the figure below. For each, we will describe the problem, analyze its causes, give examples of business scenarios where it may occur, and finally present solutions. In this lesson, we will cover the first three: cache invalidation, cache penetration, and cache avalanche.

img

Cache Invalidation #
Problem Description #

The first classic issue is cache invalidation. As mentioned in the previous lesson, when a service needs data, it first checks the cache; on a miss, it retrieves the data from the database (DB), stores it in the cache, and returns it. Caches typically deliver 50 to 100 times the performance of a DB, so we want queries to hit the cache as often as possible to minimize system load. Cached data is stored and retrieved by key. When a large number of keys expire at the same time, many cache accesses miss and fall through to the DB, sharply increasing its load and slowing queries. This is the cache invalidation problem.

Analysis of Reasons #

Cache invalidation, especially when many keys expire together, is closely related to the expiration time we set when writing the cache.

When writing to the cache, we generally set an expiration time based on the access characteristics of the business. The expiration time is attached to the write, and the cached data is evicted once it elapses. Normally, because cache data is written gradually, it also expires and is evicted gradually. In certain scenarios, however, the system loads a large amount of data from the DB into the cache at once, either actively or passively. If this data is all written with the same expiration time, it all expires together, and every request for this batch then misses the cache and penetrates to the DB. Under that volume of queries, the DB is easily overwhelmed and requests slow down.

Business Scenarios #

Many business scenarios can trigger mass cache invalidation and, if not handled carefully, increased DB load and slower requests. For example, when a batch of train or flight tickets goes on sale, the system loads them all into the cache in one go. If every entry gets the same fixed expiration time, the whole batch expires together and the system slows down from the resulting cache invalidation. Similar scenarios abound: a microblogging service may have an offline system that continuously computes popular posts and writes each completed batch to the cache in bulk, and many businesses warm the cache when deploying a new IDC or launching a new service, loading a large amount of hot data at once.

Solution #

To address batch key invalidation, start with the fixed expiration time. When setting a cache entry's expiration, use the formula: expiration time = base time + random time. That is, when writing cache entries for the same batch of business data, add a random offset on top of the base expiration time so that the data expires gradually over a window rather than all at once, avoiding an instantaneous spike of pressure on the DB, as shown in the figure below.

img
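As an illustration, the formula above can be sketched in a few lines of Python (the 30-minute base and 5-minute jitter are arbitrary example values, not from the lesson):

```python
import random

def ttl_with_jitter(base_seconds: int, max_jitter_seconds: int) -> int:
    """Return an expiration time of base + random jitter, so keys written
    in the same batch do not all expire at the same instant."""
    return base_seconds + random.randint(0, max_jitter_seconds)

# Example: a batch of hot items cached for ~30 minutes, spread over 5 minutes.
# With a real Redis client this would be something like:
#   redis_client.set(key, value, ex=ttl_with_jitter(1800, 300))
ttl = ttl_with_jitter(1800, 300)
```

Because each key in the batch gets a different random offset, expirations are spread evenly over the jitter window instead of hitting the DB all at once.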

Cache Penetration #
Problem Description #

The second classic problem is cache penetration. It is an interesting one: because its probability is low, it usually goes unnoticed, but once you do discover it at scale, you may be in for a long night. With normal access, even data absent from the cache can be loaded back into it from the DB. Cache penetration, however, means a special visitor is querying a key that does not exist at all, so every query falls through to the DB. If this visitor also controls a batch of zombie machines and keeps hammering non-existent keys, the DB comes under heavy pressure and normal service suffers.

Reason Analysis #

The reason for cache penetration is that when designing the system, we tend to focus more on normal access paths and relatively lack consideration for special and abnormal access paths.

The normal cache access path is: check the cache first; on a miss, query the DB; load the result back into the cache and return it. This works fine for keys that exist. But if a user requests a non-existent key, the DB returns null, and that null is never written back to the cache, so every subsequent query for the key misses the cache and hits the DB again. The system degrades into a “front-end + DB” system. Since a DB's throughput is only 1%~2% of a cache's, a flood of requests for non-existent keys severely degrades performance and affects normal users.

Business Scenario #

Cache penetration appears in many business scenarios, such as looking up a user by a non-existent UID or querying booking information for a non-existent train ID. The occasional such request is harmless, but a large volume of them has a significant impact on the system.

Solution #

So how do we solve this problem? As shown in the figure below, there are two solutions.

  • The first solution is to record non-existent keys in the cache the first time they are queried: even though the DB finds nothing and returns NULL, the key is still written to the cache, with its value set to a special sentinel.
  • The second solution is to build a Bloom filter that records all existing keys. On each access, the Bloom filter can directly determine whether the key exists; if not, the request returns immediately without touching the cache or the DB.

img

However, there are some pitfalls to be aware of when implementing these two solutions.

  • For solution one, if special visitors keep accessing large numbers of non-existent keys, those keys occupy significant cache space even though each stores only a simple default value, reducing the hit rate for normal keys. An improvement is to give these non-existent keys a much shorter expiration time so they are evicted quickly, or to store them in a separate shared cache for illegal keys. On lookup, check the normal cache first; on a miss, check the illegal-key cache. If the latter hits, return directly; otherwise query the DB. If the DB result is empty, record the key in the illegal-key cache; otherwise store the value in the normal cache.
  • For solution two, the Bloom filter needs to hold the full set of keys, so the key count should be relatively small, ideally within 1 billion, since 1 billion keys take up about 1.2GB of memory. A Bloom filter can also be used the other way around, to record illegal keys: each time a key is found not to exist, it is added to the filter. Recorded this way, the keys in the filter grow rapidly, so the filter must be cleared periodically to keep the false positive rate from rising.
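The improved version of solution one, with a separate short-TTL store for illegal keys, might look like the following Python sketch. Plain dicts stand in for real cache components, and `db_lookup` simulates the database; all names and TTL values here are illustrative, not from the lesson:

```python
import time

class TwoTierCache:
    """Sketch of solution one: empty DB results go into a separate,
    short-TTL 'illegal key' store so they cannot crowd out normal keys."""

    def __init__(self, db_lookup, normal_ttl=1800, illegal_ttl=60):
        self.db_lookup = db_lookup
        self.normal = {}        # key -> (value, expires_at)
        self.illegal = {}       # key -> expires_at
        self.normal_ttl = normal_ttl
        self.illegal_ttl = illegal_ttl

    def get(self, key):
        now = time.time()
        # 1. Check the normal cache first.
        hit = self.normal.get(key)
        if hit and hit[1] > now:
            return hit[0]
        # 2. Then the illegal-key cache; a hit means "known absent".
        if self.illegal.get(key, 0) > now:
            return None
        # 3. Fall through to the DB only when both caches miss.
        value = self.db_lookup(key)
        if value is None:
            # Empty result: remember the illegal key briefly.
            self.illegal[key] = now + self.illegal_ttl
            return None
        self.normal[key] = (value, now + self.normal_ttl)
        return value
```

With this structure, a non-existent key hits the DB only once per `illegal_ttl` window, no matter how often it is queried, and the short TTL keeps illegal keys from accumulating.
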
BloomFilter #

BloomFilter is a very interesting data structure. It can not only block attacks using illegal keys but also answer membership questions over massive data sets at low cost and high performance. For example, in a system with hundreds of millions of users and billions of feed items, a BloomFilter can be used to determine whether a user has already read a given item. Its data structure is analyzed in the figure below.

img

The purpose of BloomFilter is to determine whether an element exists in a set. It represents the set as a bit array. A key is hashed by multiple different hash functions; if all the corresponding bit positions are 1, the key most likely exists — the judgment is probabilistic, with roughly 99% accuracy at an average cost of only 1.2 bytes per record. If any of the hashed bit positions is 0, the key definitely does not exist in the set.

The BloomFilter algorithm works as follows. First allocate a block of memory as a bit array with all bits initialized to 0. To add an element, compute k positions for it using k independent hash functions and set all k bits to 1. To check a key, compute the same k positions; if all of them are 1, the key is judged to exist (with a small false positive probability); otherwise it definitely does not.
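A minimal BloomFilter along these lines can be sketched in Python. Here MD5 with per-function salts simulates the k independent hash functions, and the sizes are illustrative:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: a bit array plus k salted hash
    functions. add() sets k bits; might_contain() returns False only
    when the key is definitely absent."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _positions(self, key: str):
        # Salt the key with the hash-function index to get k positions.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```

A "True" answer may occasionally be a false positive (two keys can set overlapping bits), but a "False" answer is always certain, which is exactly the property that makes the filter safe for blocking non-existent keys.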

The advantages of BloomFilter are high performance (pure in-memory operations), extreme space efficiency, and a low false positive rate. Achieving a 1% false positive rate takes only about 1.2 bytes per record on average, and each additional 0.6 bytes cuts the rate by another factor of 10: with an average of 1.8 bytes per record the rate reaches about 1/1000, with 2.4 bytes about 1/10000, and so on. The false positive rate is the probability that BloomFilter judges a key to exist when it actually does not; it arises because the filter stores hashes of keys rather than the keys themselves, so different keys can produce the same hash values. For keys that BloomFilter judges non-existent, however, the answer is 100% certain: if a key had been added, every bit position produced by its hashes would be 1, not 0.
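The byte figures above follow from the standard false-positive approximation p = (1 − e^(−k/(m/n)))^k, where m/n is the number of bits per record and k the number of hash functions (k near (m/n)·ln 2 is optimal). A few lines of Python can verify them:

```python
import math

def bloom_false_positive_rate(bits_per_record: float, num_hashes: int) -> float:
    """Classic approximation of a Bloom filter's false positive rate
    for m/n bits per record and k hash functions."""
    k = num_hashes
    return (1 - math.exp(-k / bits_per_record)) ** k

# 1.2 bytes = 9.6 bits per record, k = 7   -> about 1%
# 1.8 bytes = 14.4 bits per record, k = 10 -> about 1/1000
# 2.4 bytes = 19.2 bits per record, k = 13 -> about 1/10000
```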

Cache Avalanche #
Problem Description #

The third classic problem is cache avalanche, a very serious issue during system operation. A cache avalanche occurs when some cache nodes become unavailable, dragging down the entire cache system or even the service itself. Cache avalanches fall into two cases, depending on whether the cache supports rehashing (i.e., whether keys can drift to other nodes):

  • An avalanche in which the system becomes unavailable because the cache does not support rehashing.
  • An avalanche in which rehashing itself spreads the failure until the cache becomes unavailable.
Analysis of Reasons #

In the first case, where the cache does not rehash, the avalanche generally stems from a large number of cache nodes becoming unavailable at once: requests penetrate to the DB, overload it, and take the whole system down. In the second case, where the cache does rehash, the avalanche is usually tied to a traffic peak: the peak overloads and crashes some cache nodes, rehashing then spreads their load to other nodes, and the entire cache system cascades into failure.

The first scenario is relatively easy to understand. When a significant number of cache nodes are unavailable and rehashing is not supported, a large number of cache accesses fail. Under the cache read/write model, these requests fall through to the DB. The volume of requests a DB can handle is far smaller than a cache's, so the excess load easily overwhelms it, producing a flood of slow queries and ultimately blocking or crashing the system and interrupting service.

What about the second scenario? When designing cache distribution, many engineers choose consistent hashing and a rehashing strategy for abnormal nodes, which spreads a failed node's requests evenly across the remaining cache nodes. In general this combination works well. But when a large traffic peak arrives, if the high-traffic keys are concentrated on one or two cache nodes, those nodes can be overloaded on memory or network bandwidth and crash. Once they go offline, their high-traffic keys are rehashed to other cache nodes, which then overload and crash in turn. The failure keeps spreading until the entire cache system is abnormal and can no longer serve requests.
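The rehashing behavior described above can be illustrated with a toy consistent-hash ring: removing a node reassigns only that node's keys to the survivors, which is exactly the load "drift" that can cascade during a traffic peak. Node names and the vnode count below are illustrative:

```python
import bisect
import hashlib

def h(s: str) -> int:
    """Hash a string to a large integer position on the ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Toy consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        self.ring = []  # sorted list of (position, node)
        for node in nodes:
            for i in range(vnodes):
                self.ring.append((h(f"{node}#{i}"), node))
        self.ring.sort()

    def node_for(self, key: str) -> str:
        # A key maps to the first vnode clockwise from its hash.
        idx = bisect.bisect(self.ring, (h(key), "")) % len(self.ring)
        return self.ring[idx][1]

    def remove(self, node: str):
        # Node failure: drop its vnodes; its keys drift to the survivors.
        self.ring = [(pos, n) for pos, n in self.ring if n != node]
```

Only the failed node's keys move; every other key keeps its original node. That locality is what makes consistent hashing attractive — and also what funnels a crashed node's hot keys onto its neighbors.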

Business Scenarios #

Cache avalanches are not rare in practice; systems such as Weibo and Twitter ran into them repeatedly in their early years. For example, Weibo initially used consistent hashing plus rehashing for many business caches. When a sudden flood of traffic arrived, some cache nodes were overloaded and crashed, their requests were redirected to other nodes, those nodes overloaded in turn, and eventually the entire cache pool was overwhelmed. Likewise, a power outage in a machine room can take down multiple cache nodes at once, sending a large volume of requests straight to the DB, which then overloads and blocks, leaving the system in an abnormal state. Service gradually returns to normal only after the cache machines are powered back on, the DB restarts, and the data is warmed up again.

Solutions #

To prevent cache avalanches, here are three solutions.

  • Solution 1: Add read and write switches in front of the business DB. When DB reads are found to be slow and blocking, and the number of slow requests exceeds a threshold, turn the read switch off so that some or all DB reads fail fast and return immediately. Once the DB recovers, turn the read switch back on. The figure below shows the process.

img
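Solution 1 can be sketched as a simple read switch driven by a sliding window of slow-read timestamps. All thresholds and names below are illustrative, not from the lesson:

```python
import time

class DBReadSwitch:
    """Sketch of solution 1: count slow DB reads in a sliding window and,
    past a threshold, fail fast instead of piling more load on the DB."""

    def __init__(self, slow_threshold_s=1.0, max_slow=100, window_s=10.0):
        self.slow_threshold_s = slow_threshold_s  # what counts as "slow"
        self.max_slow = max_slow                  # slow reads before tripping
        self.window_s = window_s                  # sliding window length
        self.slow_times = []                      # timestamps of slow reads
        self.read_enabled = True

    def record(self, elapsed_s: float):
        now = time.time()
        if elapsed_s >= self.slow_threshold_s:
            self.slow_times.append(now)
        # Drop entries outside the window, then update the switch. When the
        # DB recovers, old entries age out and reads re-enable automatically.
        self.slow_times = [t for t in self.slow_times
                           if now - t < self.window_s]
        self.read_enabled = len(self.slow_times) < self.max_slow

    def read(self, db_query):
        if not self.read_enabled:
            return None  # fail fast: skip the overloaded DB entirely
        start = time.time()
        result = db_query()
        self.record(time.time() - start)
        return result
```

In production this logic would typically live in the DB access layer or a middleware proxy rather than application code, and "fail fast" might return a degraded default instead of None.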

  • Solution 2: Keep multiple copies of the cache. When one copy fails or misses, read from another, and deploy the copies across different racks as much as possible so that the cache system can keep serving under any single failure.

  • Solution 3: Monitor the cache system in real time. When the ratio of slow requests exceeds a threshold, alert promptly and recover by replacing machines or services. Automatic failover strategies can also close abnormal interfaces, stop edge services, or disable non-core features to keep core functions running in extreme scenarios.

In fact, Weibo’s platform system has adopted all three solutions to avoid cache avalanches.