26 Cache Anomalies: How to Resolve Cache Avalanche, Breakdown, and Penetration Issues #

In the previous lesson, we learned about the data inconsistency problem between the cache and the database, as well as the methods to address it. Besides data inconsistency, we often face three other cache issues: cache avalanche, cache breakdown, and cache penetration. Once any of these occurs, a large number of requests piles up at the database layer; if the request concurrency is high, the database can become overloaded or fail outright, which is a serious production incident.

In this lesson, I will discuss the manifestations, causes, and solutions of these three issues. As the saying goes, "Know yourself and know your enemy, and you will never be defeated in a hundred battles." By understanding what causes these problems, we can configure the Redis cache sensibly and make the corresponding settings at the entry point of our business applications, so that we are prepared in advance.

Next, let’s first take a look at the issue of cache avalanche and its corresponding solution.

Cache Avalanche #

Cache avalanche refers to the situation where a large number of application requests cannot be processed in the Redis cache, so the application sends them all to the database layer, causing a surge in pressure on the database.

Cache avalanche generally has two causes, each with its own solutions. Let's look at them one by one.

The first cause is that a large amount of data in the cache expires at the same time, so a large number of requests cannot be processed in the cache.

Specifically, when we store data in the cache and set an expiration time, if a large amount of data expires at the same moment, the application's accesses to that data all miss the cache. The application then sends requests to the database to retrieve the data. If the application's concurrent request volume is high, the pressure on the database is also high, which in turn affects the database's processing of other normal business requests. Let's look at a simple example, shown in the figure below:

To address the cache avalanche problem caused by a large number of data expiring simultaneously, I will provide you with two solutions.

Firstly, we can avoid setting the same expiration time for a large amount of data. If the business does require some data to expire around the same time, you can add a small random offset (for example, an extra 1 to 3 minutes) to each key's expiration time when setting it with the EXPIRE command. This way, the expiration times of different keys differ slightly, but not by much; it avoids a large amount of data expiring at the same moment while still meeting the business requirement. A code sketch of this jitter technique follows.
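
Below is a minimal sketch of the jitter idea using the redis-py client. The client setup, the key name, the base TTL, and the `set_with_jitter` helper are all illustrative assumptions, not part of the lesson; adjust them to your own business.

```python
import random

import redis

# Assumes a local Redis instance; adjust host/port to your deployment.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

BASE_TTL = 3600  # base expiration time: 1 hour

def set_with_jitter(key: str, value: str) -> None:
    # Add a random 1-3 minute offset so that keys written in the same
    # batch do not all expire at the same moment.
    jitter = random.randint(60, 180)
    r.set(key, value, ex=BASE_TTL + jitter)

set_with_jitter("product:1001", "serialized-product-data")
```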

In addition to fine-tuning the expiration time, we can also use service degradation to cope with cache avalanche.

Service degradation refers to adopting different processing methods for different data when a cache avalanche occurs.

  • When the business application accesses non-core data (such as e-commerce product attributes), temporarily stop querying the cache for this data and directly return predefined information, an empty value, or an error message.
  • When the business application accesses core data (such as e-commerce product inventory), still allow it to query the cache; on a cache miss, it can continue reading the data from the database.

This way, only requests for a portion of the expired data reach the database, which reduces the pressure on it. The figure below shows how data requests are handled during service degradation, and a code sketch follows it.
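
To make the idea concrete, here is a rough sketch of per-data-type degradation in Python with redis-py. The `degraded` flag, the `CORE_PREFIXES` convention, and the `query_database` helper are hypothetical names introduced for illustration only.

```python
import redis

r = redis.Redis(decode_responses=True)

CORE_PREFIXES = ("inventory:",)  # keys holding core data, e.g. inventory
degraded = False  # assumed to be flipped by monitoring during an avalanche

def query_database(key: str):
    """Hypothetical fallback that reads the value from the database."""
    ...

def get_with_degradation(key: str):
    if degraded and not key.startswith(CORE_PREFIXES):
        # Non-core data: skip the cache entirely and return a predefined
        # default (here, None) instead of hitting cache or database.
        return None
    value = r.get(key)               # core data still goes through the cache
    if value is None:
        value = query_database(key)  # and falls back to the database on a miss
    return value
```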

In addition to a large amount of data expiring at the same time, there is another situation that can lead to a cache avalanche: the Redis cache instance fails and goes down. When that happens, the instance cannot process requests, and a large number of requests suddenly piles up at the database layer, producing a cache avalanche. Generally speaking, a Redis instance can support a throughput of tens of thousands of requests per second, while a single database can only support a throughput of thousands of requests per second; their processing capacities can differ by nearly a factor of ten. With the cache out of service, the database may therefore have to bear nearly ten times its usual request pressure and crash under the load.

When the Redis instance is down, we need to take other measures to cope with the cache avalanche. I have two suggestions for you.

The first suggestion is to implement service degradation or request rate limiting mechanisms in the business system.

Here, service degradation means pausing the business application's access to the cache system when a cache avalanche occurs, in order to prevent a chain reaction that crashes the database or even the entire system. Specifically, when the business application calls the cache interface, the cache client does not send the request to the Redis cache instance but returns directly; once the Redis instance resumes service, application requests are again allowed to be sent to the cache system.

By doing this, we avoid a large number of requests piling up in the database due to cache misses, and the database system can keep operating normally.

While the business system is running, we can monitor the load metrics of both the Redis cache and the database, such as requests per second, CPU utilization, and memory utilization. If we find that the Redis cache instance is down and the load on the machine hosting the database suddenly spikes (for example, a surge in requests per second), a cache avalanche is occurring and a large number of requests is being sent straight to the database. We can then start the service degradation mechanism and pause the business application's access to the cache service, reducing the access pressure on the database, as shown in the figure and the sketch below:
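
Here is one way such a degradation switch might look in code. This is a simplified sketch, assuming the flag is flipped by your monitoring system or on a connection error; it is not a production-grade circuit breaker.

```python
import redis

r = redis.Redis(decode_responses=True)
cache_circuit_open = False  # set to True when monitoring detects Redis is down

def cache_get(key: str):
    global cache_circuit_open
    if cache_circuit_open:
        return None  # return directly; don't send the request to Redis
    try:
        return r.get(key)
    except redis.ConnectionError:
        cache_circuit_open = True  # open the switch when the instance fails
        return None
```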

Although service degradation ensures the normal operation of the database, it suspends access to the entire cache system and has a large impact on the business application. To minimize this impact, we can also implement request rate limiting. Request rate limiting refers to controlling the number of requests entering the system per second at the entry point of the business system, to avoid too many requests being sent to the database.

Let me give you an example. Suppose that under normal operation of the business system, the entry point allows 10,000 requests to enter the system per second, of which 9,000 requests can be processed in the cache system and only 1,000 requests will be sent to the database for processing.

Once a cache avalanche occurs and the number of requests per second reaching the database suddenly rises to 10,000, we can start the request rate limiting mechanism and allow only 1,000 requests per second into the system at the entry point, rejecting anything beyond that. With request rate limiting in place, we avoid transferring a large amount of concurrent request pressure to the database layer. A sketch of one way to implement such a limiter follows.
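
One of many possible implementations is a token bucket kept in process memory, sketched below; keeping it in memory (rather than in Redis) means it keeps working even while the Redis instance is down. The class name and rate are illustrative assumptions.

```python
import threading
import time

class RateLimiter:
    """A simple in-process token bucket: at most `rate` requests per second."""

    def __init__(self, rate: int):
        self.rate = rate
        self.tokens = float(rate)
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def allow(self) -> bool:
        with self.lock:
            now = time.monotonic()
            # Refill tokens in proportion to elapsed time, capped at `rate`.
            self.tokens = min(self.rate,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False  # over the limit: reject this request

limiter = RateLimiter(rate=1000)  # allow 1,000 requests per second
```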

Using service degradation or request rate limiting to deal with a cache avalanche caused by a Redis instance outage is an after-the-fact remedy: these two mechanisms reduce the impact of an avalanche on the database and the overall business system only once the avalanche has already occurred.

My second suggestion to you is prevention.

Build a highly reliable Redis cache cluster using master-slave replication. If the master node of the Redis cache fails and goes down, a slave node can be promoted to master and continue providing cache services, which avoids the cache avalanche that an instance outage would otherwise cause.
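
With redis-py, for example, the application can connect through Redis Sentinel so that failover to a promoted slave is transparent to the client. The sentinel address and the master name "mymaster" below are assumptions for illustration.

```python
from redis.sentinel import Sentinel

# Connect via Sentinel so the client always finds the current master,
# even after a failover promotes a slave.
sentinel = Sentinel([("localhost", 26379)], socket_timeout=0.5)
master = sentinel.master_for("mymaster", socket_timeout=0.5,
                             decode_responses=True)

master.set("product:1001", "serialized-product-data")
print(master.get("product:1001"))
```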

Cache avalanche happens when a large amount of data expires at the same time. By contrast, cache breakdown, which I will introduce next, happens when a single piece of hot data expires. Compared with cache avalanche, far less data is involved in cache breakdown, and the countermeasures differ accordingly. Let's take a look.

Cache Breakdown #

Cache breakdown refers to the situation where requests for a piece of hot data, one that is accessed very frequently, cannot be processed in the cache. As a result, the large number of requests for this data is sent to the backend database, causing a sudden increase in database pressure and affecting the processing of other requests. Cache breakdown often occurs when the hot data expires, as shown in the figure below:

To avoid the tremendous pressure that cache breakdown puts on the database, the solution is straightforward: do not set an expiration time on hot data that is accessed very frequently. Requests for the hot data can then all be processed in the cache, and Redis's throughput of tens of thousands of requests per second can comfortably absorb the large number of concurrent requests. In code this is trivial, as the sketch below shows.
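
In redis-py this simply means writing the hot key without an `ex` argument, or removing an existing TTL with PERSIST. The key name is illustrative; the application is assumed to update or delete the key explicitly when the data changes.

```python
import redis

r = redis.Redis(decode_responses=True)

# Write hot data without an expiration time: omit the `ex` argument.
r.set("hot:product:1", "serialized-hot-product")

# Or, if the key already has a TTL, remove it with PERSIST.
r.persist("hot:product:1")
```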

So far, you have learned about the issues of cache avalanche and cache breakdown, as well as their solutions. When cache avalanche or breakdown occurs, the data that the application needs to access is still stored in the database. Next, I will introduce the issue of cache penetration, which is different from avalanche and breakdown. When cache penetration occurs, the data is not in the database either, which puts both the cache and the database under pressure. So what should we do? Let’s take a closer look.

Cache Penetration #

Cache penetration refers to the situation where the data to be accessed is neither in the Redis cache nor in the database. Accessing the cache results in a miss, and querying the database then reveals that it does not hold the data either, so the application cannot read the data from the database and write it back into the cache to serve subsequent requests. The cache thus becomes a mere "decoration". If a large number of requests keeps arriving for such data, both the cache and the database come under tremendous pressure, as shown in the figure below:

So, when does cache penetration occur? Generally speaking, there are two situations.

  • Mistaken operation at the business layer: Data in the cache and the database is mistakenly deleted, so both the cache and the database do not have the data.
  • Malicious attacks: Specifically accessing data that does not exist in the database.

To avoid the impact of cache penetration, I will provide you with three response strategies.

The first strategy is to cache empty values or default values.

Once cache penetration occurs, we can cache an empty value or a default value (such as an inventory of 0) in Redis for the queried data. Subsequent queries from the application then read the empty or default value directly from Redis, which avoids sending a large number of requests to the database and keeps the database operating normally. A sketch of this read path follows.
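
Below is a minimal sketch of empty-value caching, assuming an illustrative sentinel string and a hypothetical `query_database` helper. Giving the empty value a short TTL lets the entry expire soon after real data is eventually written to the database.

```python
import redis

r = redis.Redis(decode_responses=True)
EMPTY = "__empty__"  # assumed sentinel meaning "known to be missing"

def query_database(key: str):
    """Hypothetical database lookup; returns None when the row is missing."""
    ...

def get_data(key: str):
    value = r.get(key)
    if value is not None:
        # The empty value is served from the cache, shielding the database.
        return None if value == EMPTY else value
    row = query_database(key)
    if row is None:
        # Cache the fact that the data is missing, with a short TTL.
        r.set(key, EMPTY, ex=60)
        return None
    r.set(key, row, ex=3600)
    return row
```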

The second strategy is to use a Bloom filter to quickly determine if the data exists, avoiding querying the existence of data from the database and reducing the burden on the database.

Let’s first look at how a Bloom filter works.

A Bloom filter consists of a bit array initialized to all 0s and N hash functions, and it can quickly determine whether a piece of data exists. When we want to mark a piece of data as existing (for example, after the data has been written to the database), the Bloom filter completes the marking in three steps:

  • First, apply the N hash functions to the data, obtaining N hash values.
  • Then, take each of the N hash values modulo the length of the bit array, obtaining each hash value's corresponding position in the array.
  • Finally, set the bits at those positions to 1, which completes the marking of the data in the Bloom filter.

If the data does not exist (for example, it has never been written to the database), the Bloom filter has not marked it, so the corresponding bits in the array remain 0.

When querying a piece of data, we perform the same calculation to obtain its N positions in the bit array and then check the bits at those positions. If any one of the N bits is not 1, the Bloom filter has not marked this data, so the queried data definitely does not exist in the database. To help you understand this, I have drawn a diagram, which you can take a look at.

In the diagram, the Bloom filter is a 10-bit array using 3 hash functions. To mark data X, X is hashed three times and each hash value is taken modulo 10, giving 1, 3, and 7, so bits 1, 3, and 7 of the array are set to 1. When the application wants to query X, it only needs to check whether bits 1, 3, and 7 are all 1; if any of them is 0, X definitely does not exist in the database. A sketch of this marking and querying logic follows.
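
Here is a sketch of a Bloom filter built on Redis bitmaps with SETBIT and GETBIT. The array size, the three salted SHA-1 "hash functions", and the key name are all illustrative choices; a real deployment would size the array and hash count from the expected data volume and acceptable false-positive rate.

```python
import hashlib

import redis

r = redis.Redis(decode_responses=True)

M = 1 << 20                    # bit array length (about one million bits)
SEEDS = (b"h1", b"h2", b"h3")  # 3 "hash functions" via salted SHA-1

def _positions(data: str):
    # Hash the data N times, then take each value modulo the array length.
    for seed in SEEDS:
        digest = hashlib.sha1(seed + data.encode()).hexdigest()
        yield int(digest, 16) % M

def bloom_add(data: str) -> None:
    for pos in _positions(data):
        r.setbit("bloom:data", pos, 1)  # set the corresponding bits to 1

def bloom_might_contain(data: str) -> bool:
    # If any of the N bits is 0, the data was definitely never marked.
    return all(r.getbit("bloom:data", pos) for pos in _positions(data))
```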

Based on the Bloom filter's fast-detection property, we can mark data in the filter whenever we write it to the database. Then, on a cache miss, the application can query the Bloom filter first to quickly determine whether the data exists at all; if it does not, there is no need to query the database. This way, even when cache penetration occurs, the large number of requests only queries Redis and the Bloom filter rather than piling up in the database, so the database keeps operating normally. The Bloom filter itself can be implemented on top of Redis and can withstand significant concurrent access pressure.
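
Wired into the read path, it might look like the following, continuing the previous sketch (the `r` client and `bloom_might_contain` come from it, and `query_database` is again a hypothetical helper):

```python
def get_data(key: str):
    value = r.get(key)
    if value is not None:
        return value
    if not bloom_might_contain(key):
        return None  # definitely not in the database: skip the query entirely
    row = query_database(key)  # hypothetical database lookup
    if row is not None:
        r.set(key, row, ex=3600)  # write back so later reads hit the cache
    return row
```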

The final strategy is to perform request detection at the front of the request entry point. One cause of cache penetration is a large number of malicious requests for data that does not exist. An effective countermeasure is therefore to check the validity of requests at the entry point of the business system, filter out malicious ones (for example, requests with unreasonable parameters, illegal parameter values, or nonexistent fields), and prevent them from ever reaching the backend cache and database. Cache penetration then cannot occur in the first place, as the small sketch below illustrates.
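
For instance, if item IDs are known to be positive integers within a bounded range, the entry point can reject obviously invalid requests before they touch the cache or the database. The bound and the validation rules below are assumptions; adapt them to your own API.

```python
MAX_PRODUCT_ID = 10_000_000  # assumed upper bound on legal product IDs

def is_valid_product_request(raw_id: str) -> bool:
    if not raw_id.isdigit():
        return False                   # reject non-numeric IDs outright
    pid = int(raw_id)
    return 0 < pid <= MAX_PRODUCT_ID   # reject IDs outside the legal range
```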

Compared with cache avalanche and cache breakdown, cache penetration has a greater impact, so I hope you can pay special attention to it. From the prevention perspective, we need to avoid mistakenly deleting data from the database and cache. From the response perspective, we can use caching empty values or default values, use Bloom filters, and perform detection on malicious requests in the business system, among other methods.

Summary #

In this lesson, we have learned about three types of abnormal cache issues: cache avalanche, cache breakdown, and cache penetration. From the causes of the problems, cache avalanche and cache breakdown occur mainly when the data is not in the cache, while cache penetration occurs when the data is neither in the cache nor in the database. Therefore, when cache avalanche or cache breakdown occurs, once the data in the database is written back to the cache, the application can quickly access the data in the cache again, and the database pressure will be reduced accordingly. On the other hand, when cache penetration occurs, both the Redis cache and the database will continue to bear the request pressure.

To help you better understand, I have summarized the causes and countermeasures of these three major issues in a table for you to review.

| Problem | Cause | Countermeasures |
| --- | --- | --- |
| Cache avalanche | A large amount of data expires at the same time | Add a small random offset to expiration times; service degradation |
| Cache avalanche | The Redis cache instance fails and goes down | Service circuit breaking or request rate limiting in the business system; build a highly reliable master-slave cluster |
| Cache breakdown | A piece of frequently accessed hot data expires | Do not set an expiration time on hot data |
| Cache penetration | The data is in neither the cache nor the database (accidental deletion or malicious attack) | Cache empty or default values; use a Bloom filter; detect malicious requests at the entry point |

Finally, I would like to emphasize that service circuit breaking, service degradation, and request rate limiting are all lossy solutions: they protect the stability of the database and of the overall system, but at a cost to the business application. For example, with service degradation, some data requests can only receive an error response and cannot be processed normally. With service circuit breaking, the entire cache service is suspended, which affects an even larger scope of the business. And with request rate limiting, the throughput of the whole system drops, fewer concurrent user requests are served, and user experience suffers.

Therefore, my recommendation to you is to use preventive solutions as much as possible:

  • For cache avalanche, set data expiration times sensibly and build a highly reliable cache cluster.
  • For cache breakdown, do not set expiration times on frequently accessed hot data.
  • For cache penetration, detect malicious requests at the request entry point, and standardize database deletion operations to avoid accidental deletions.

One Question per Lesson #

As usual, I have a small question for you. When discussing cache avalanche, I mentioned that service circuit breaking, service degradation, and request rate limiting can be used to cope with it. Please take a moment to think: can these three mechanisms also be used to address cache penetration?

Feel free to write down your thoughts and answers in the comment section. Let’s exchange and discuss together. If you find today’s content helpful, feel free to share it with your friends or colleagues. See you in the next lesson.