
15 The Use of Cache Scenario III - What to Do When Cache Penetration Occurs #

Hello, I am Tang Yang.

In the past three sessions, I have taken you deep into the world of caching. As you may know, the cache hit rate is crucial for the effectiveness of caching.

In a system with a low cache hit rate, a large number of requests for product information bypass the cache and hit the database directly, and databases are relatively fragile in the face of concurrent requests. Once the database can no longer keep up with the load of refreshing product pages, searching for clothing information, and so on, queries slow down and large numbers of requests pile up at the database layer. This exhausts the connection and thread resources of the application servers and can ultimately bring your e-commerce system down.

Generally speaking, we should aim for a cache hit rate of over 99% for core caches, and strive to maintain a hit rate of at least 90% for non-core caches. If the hit rate falls below these thresholds, it is an indication that you may need to optimize your caching strategies.

Since cache penetration can have such a significant impact, how can we reduce its occurrence? In this session, I will comprehensively explore the different measures we can take when facing cache penetration. However, before we delve into those solutions, you need to understand what cache penetration actually means. Only then can we better consider how to design solutions to address it.

What is cache penetration #

Cache penetration refers to the situation where data is not found in the cache and needs to be queried from the backend system (such as a database). You can think of the database as a mobile phone that cannot withstand too many scratches and bumps, so you need to apply a protective film and put on a protective case to protect the phone.

However, a small amount of cache penetration is unavoidable and does not harm the system, mainly for several reasons:

  • On one hand, internet systems often face the challenge of processing a huge amount of data, and the cache system has limited capacity, so it is not possible to store all the data from the system. Therefore, cache penetration occurs when querying uncached data.
  • On the other hand, the data access pattern of internet systems generally follows the “80/20 principle.” Also known as the Pareto principle, after the Italian economist Vilfredo Pareto, it says that in any group of things, the most important ones usually account for only about 20%, while the remaining 80% matter far less. Applied to data access, it means that roughly 20% of the data is hot and receives most of the requests, while the remaining 80% is rarely accessed. For example, you may have bought a lot of clothes and books, but in reality you only wear and read a small portion of them.

Since the cache capacity is limited, and most accesses only request the 20% hot data, theoretically speaking, we only need to store the 20% hot data in the limited cache space to effectively protect the fragile backend system. Therefore, this small amount of cache penetration is unavoidable but harmless to the system.

So, what kind of cache penetration is harmful to the system? The answer is when a large number of penetration requests exceed the capacity of the backend system, causing it to crash. If we compare a small number of requests to a drizzle, then once it becomes a downpour that triggers a flood and collapses the house, it is definitely not acceptable.

There are many scenarios that can cause such a large number of penetration requests. Next, I will analyze these scenarios and provide corresponding solutions.

Solutions to Cache Penetration #

Let’s consider a scenario: in your e-commerce system’s user table, we need to query user information by user ID, and the cache read/write strategy adopted is the Cache Aside strategy.

So, what happens if we try to read an unregistered user from the user table? Under this strategy, we first read from the cache and then penetrate to the database. Since the user does not exist, neither the cache nor the database returns any data, so no data is ever seeded in the cache (“seeding” means writing a value into the cache). As a result, every subsequent request for this user still penetrates to the database. In this scenario, the cache cannot shield the database from the request at all, so it provides almost no protection.
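To make this concrete, here is a minimal sketch of that Cache Aside read path, where cache.get, cache.set, and getFromDB are assumed helpers (the same ones used in the pseudocode later in this section):

Object value = cache.get(uid);      // 1. read from the cache first
if (value == null) {                // 2. cache miss: penetrate to the database
  value = getFromDB(uid);           //    returns null for an unregistered user
  if (value != null) {
    cache.set(uid, value, 1000);    // 3. seed the cache only when the data exists
  }
}
return value;                       // an unregistered user returns null, and nothing is ever seeded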

So, how do we solve the problem of cache penetration? Generally speaking, we have two solutions: seeding empty values and using Bloom Filters.

Let’s first look at the first solution.

Seeding Empty Values #

Reviewing the scenario above, the core problem is that the user’s data simply does not exist in the database: no matter how many times you query, it will never be found, so penetration occurs every single time.

There are also some similar scenarios. For example, due to a bug in the code, an exception is thrown when querying the database; in that case, the data retrieved from the database can be treated as empty, and nothing is seeded in the cache.

So, when we retrieve an empty value from the database or encounter an exception, we can seed an empty value in the cache. However, since an empty value is not real business data and occupies cache space, we give it a relatively short expiration time so that it expires and is evicted quickly. The following is the pseudocode of this process:

Object nullValue = new Object();        // sentinel object representing "no data"
try {
  Object valueFromDB = getFromDB(uid);  // query the data from the database
  if (valueFromDB == null) {
    cache.set(uid, nullValue, 10);      // seed the empty value with a short expiration time
  } else {
    cache.set(uid, valueFromDB, 1000);  // seed the real data with a normal, longer expiration time
  }
} catch (Exception e) {
  cache.set(uid, nullValue, 10);        // on a database error, also seed the empty value to shield the database
}
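One detail worth noting: the read path must recognize the seeded empty value and treat it as “not found,” rather than handing the sentinel back to callers. A minimal sketch, assuming the same hypothetical cache helpers and that nullValue is the shared sentinel from the code above:

Object cached = cache.get(uid);
if (cached == nullValue) {
  return null;          // an empty value was seeded earlier: report "no such user" without touching the database
}
if (cached != null) {
  return cached;        // normal cache hit
}
// cache miss: fall through to the database query and seeding logic above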

Although seeding empty values can block a large number of penetration requests, if there are many requests for unregistered users, the cache fills up with empty-value entries, wasting its storage space. If the cache space runs out, it may even evict genuinely cached user information, lowering the cache hit rate.

So before adopting this solution, I suggest you evaluate whether your cache capacity can support it. If supporting it would require a large number of extra cache nodes, then seeding empty values is not viable, and you should consider using a Bloom filter instead.

Using Bloom Filters #

In 1970, Burton H. Bloom proposed the Bloom filter algorithm, which is used to determine whether an element is in a set. The algorithm consists of a binary array and a hash algorithm. Its basic idea is as follows:

For each value in the set, we compute its hash with the chosen hash algorithm, then take the hash modulo the array length to get an index into the array, and change the value at that position from 0 to 1. To judge whether an element is in the set, we compute its index with the same algorithm: if the value at that position is 1, the element is considered to be in the set; otherwise, it is considered absent.
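A minimal sketch of this idea in Java, using a single hash function over a plain bit array (a real Bloom filter uses several hash functions, as discussed below):

import java.util.BitSet;

class SimpleBloomFilter {
  private final BitSet bits;
  private final int size;

  SimpleBloomFilter(int size) {
    this.size = size;
    this.bits = new BitSet(size);
  }

  private int index(String value) {
    // hash the value, then take the modulus by the array length
    return Math.floorMod(value.hashCode(), size);
  }

  void add(String value) {
    bits.set(index(value));        // flip the bit at the mapped position from 0 to 1
  }

  boolean mightContain(String value) {
    return bits.get(index(value)); // 1 means "possibly in the set"; 0 means "definitely not"
  }
}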

The following figure is a schematic diagram of the Bloom Filter. Let’s analyze the information in the figure.

[Figure: schematic diagram of a Bloom filter]

Elements A, B, C, and others form the set. Element D hashes to a position whose value in the array is 1, so D is considered to be in the set. Element F, however, hashes to a position whose value is 0, so F is definitely not in the set.

So how do we use Bloom Filters to solve the problem of cache penetration?

Let’s take the table of user information as an example. First, we initialize a large array, say one with a length of 2 billion. Next, we choose a hash algorithm, compute the hash of every existing user ID, map each one into the array, and set the value at each mapped position to 1, leaving all other positions at 0. A newly registered user must not only be written to the database but must also set the corresponding position in the Bloom filter’s array using the same algorithm. Then, when we need to query a user’s information, we first check whether the ID exists in the Bloom filter. If it does not, we return an empty value directly, with no need to query the database or the cache. This greatly reduces the cache penetration caused by such abnormal queries.

[Figure: checking the Bloom filter before querying the cache and the database]
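Putting this together, the query path might look like the following sketch; loadAllUserIds, saveToDB, and getFromCacheOrDB are assumed helpers:

SimpleBloomFilter filter = new SimpleBloomFilter(2_000_000_000);

// at startup: map every existing user ID into the filter
for (String uid : loadAllUserIds()) {
  filter.add(uid);
}

// on registration: write the database, then update the filter with the same algorithm
saveToDB(newUser);
filter.add(newUser.getId());

// on query: consult the filter before the cache and the database
if (!filter.mightContain(uid)) {
  return null;                    // definitely unregistered: answer immediately
}
return getFromCacheOrDB(uid);     // otherwise follow the normal Cache Aside path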

The Bloom filter performs well: both writes and reads run in O(1), constant time. It also has a big advantage in space compared with other data structures. For example, a bit array with 2 billion positions takes only about 238 MB (2,000,000,000 / 8 / 1024 / 1024), whereas storing 2 billion IDs as 4-byte integers would take about 7,600 MB (2,000,000,000 × 4 / 1024 / 1024), 32 times the size of the Bloom filter.

However, everything has two sides, and the Bloom filter is no exception. It has two main disadvantages:

  1. There is a probability of error when determining whether an element is in the set, such as incorrectly classifying an element that is not in the set as being in the set.

  2. It does not support element deletion.

Regarding the first disadvantage, it is mainly a problem with the hash algorithm. Because the Bloom filter is composed of a binary array and a hash algorithm, there is a certain probability of collision in the hash algorithm. Hash collision means that different input values produce the same hash result through the hash operation.

A hash maps an input to a fixed-length value according to some algorithm. For example, if I input the string “1” and use the CRC32 algorithm, the result is 2212294583. But the possible inputs to a hash algorithm are unlimited, while its output space is fixed. A 16-bit hash value, for example, has only 65,536 possible values, so any two random inputs collide with probability 1/65,536, and once the number of distinct inputs exceeds 65,536, a collision is guaranteed.
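You can check the CRC32 example with java.util.zip.CRC32:

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

CRC32 crc = new CRC32();
crc.update("1".getBytes(StandardCharsets.UTF_8)); // hash the string "1"
System.out.println(crc.getValue());               // prints 2212294583, matching the value above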

So you might ask why not map to a longer hash value?

Because longer hash values bring higher storage and computation costs. Even with a 32-bit hash algorithm, the value space holds 2 to the power of 32, roughly 4.3 billion, possible values. Mapping 2 billion user IDs into that space fills nearly half of it, so the probability that a new ID collides with an existing one is still close to 50%.

Hash collision causes two user IDs, A and B, to calculate the same hash value. If A is a registered user and the value in the array corresponding to its hash value is 1, then even if B is not a registered user, its position in the array is the same as A’s, and the corresponding value is also 1, which results in a false positive.

Bloom filter misjudgments have an important one-sided property: they occur only in the “false positive” direction. When a Bloom filter says an element is in the set, that element may in fact be absent; but when it says an element is not in the set, the element is definitely not there. This property is exactly what cache penetration needs. Why? Imagine the opposite: if the filter could also produce false negatives, then even when it judged a user to be unregistered, we could not trust that judgment; we would still have to query the cache and the database, and the Bloom filter would be useless. As it is, every request the filter rejects in the scenario above can safely be answered with an empty value, without touching the cache or the database.

Therefore, although Bloom filters can produce false positives, they still reduce cache penetration; we just need to keep the false positive rate as low as possible so that the filter makes correct judgments more often. One way to do this is to use multiple hash algorithms to compute several hash values for each element; only when the array values at all of those positions are 1 do we consider the element to be in the set.
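As a sketch, the earlier SimpleBloomFilter could be extended along these lines, replacing its add and mightContain methods (mixing a per-function seed into the hash is only an illustrative stand-in for k truly independent hash functions):

int[] seeds = {7, 11, 13};          // one seed per simulated hash function

private int index(String value, int seed) {
  return Math.floorMod(31 * seed + value.hashCode(), size);
}

void add(String value) {
  for (int seed : seeds) {
    bits.set(index(value, seed));   // set one bit per hash function
  }
}

boolean mightContain(String value) {
  for (int seed : seeds) {
    if (!bits.get(index(value, seed))) {
      return false;                 // any 0 bit means "definitely not in the set"
    }
  }
  return true;                      // all bits are 1: "probably in the set"
}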

The fact that Bloom filters do not support deletion is also related to hash collisions. For example, if elements A and B are both in the set and happen to have the same hash value, they map to the same position in the array. If we delete A, the value at that position changes from 1 to 0; then, when we check whether B is in the set, we see a 0 and incorrectly conclude that B is absent.

So how do we solve this problem? One option is to let the array store a count instead of just 0 or 1. For example, if A and B both map to the same index, the value at that position becomes 2; deleting A then decrements it from 2 to 1. In this scheme, the array stores numbers rather than bits, which increases space consumption, so you need to decide whether a Bloom filter fits your specific business scenario. For data such as registered users, where deletion is rare, a Bloom filter remains a good way to curb cache penetration.
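A sketch of this counting variant, trading the bit array for an int array so that deletion becomes possible:

class CountingBloomFilter {
  private final int[] counts;

  CountingBloomFilter(int size) {
    this.counts = new int[size];
  }

  private int index(String value) {
    return Math.floorMod(value.hashCode(), counts.length);
  }

  void add(String value) {
    counts[index(value)]++;     // A and B mapping to the same slot yields a count of 2
  }

  void remove(String value) {
    int i = index(value);
    if (counts[i] > 0) {
      counts[i]--;              // deleting A leaves the count at 1, so B still reads as present
    }
  }

  boolean mightContain(String value) {
    return counts[index(value)] > 0;
  }
}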

Now that we have discussed Bloom filter usage, I will give you some recommendations:

  1. Choose multiple hash functions to calculate multiple hash values to reduce the probability of false positives.

  2. Bloom filters consume a certain amount of memory, so you need to evaluate how much memory is needed in your business scenario and whether the storage cost is acceptable.

In summary, seeding empty values and using a Bloom filter are the two main solutions to cache penetration, but each has its own scope of applicability, and neither solves every problem. For example, when an extremely hot cache item expires, a flood of requests can penetrate to the database all at once and put it under immense pressure. We call this scenario the “dog-pile effect.”

How do we solve this problem? The idea is to reduce the concurrency that follows cache penetration as much as possible, and the solutions are relatively simple:

  1. In code, arrange that when a hot cache item expires, a background thread penetrates to the database and reloads the data into the cache; until the cache has been reseeded, requests hitting the expired item return directly instead of penetrating (see the sketch after this list).

  2. Use a distributed lock in Memcached or Redis, allowing only the requests that acquire the lock to penetrate the database.
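A minimal sketch of the first approach, assuming a hypothetical cache API that exposes an item’s value together with its logical-expiry status (CacheItem, getItem, isLogicallyExpired, and getFromDB are all assumptions):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

ExecutorService refresher = Executors.newSingleThreadExecutor();
AtomicBoolean refreshing = new AtomicBoolean(false);

Object getHotData(String key) {
  CacheItem item = cache.getItem(key);            // hypothetical call that also returns expiry metadata
  if (item != null && !item.isLogicallyExpired()) {
    return item.getValue();                       // fresh hit: return directly
  }
  // logically expired: let exactly one background task penetrate to the database
  if (refreshing.compareAndSet(false, true)) {
    refresher.submit(() -> {
      try {
        cache.set(key, getFromDB(key), 1000);     // reload the data and reseed the cache
      } finally {
        refreshing.set(false);
      }
    });
  }
  return item == null ? null : item.getValue();   // other requests return the stale value instead of penetrating
}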

The distributed lock approach is also straightforward. For example, suppose user ID 1 is a hot user. When their cached user information expires and we need to reload it from the database, we first write a cache item with the key “lock.1” to Memcached, then load the data from the database, and delete that key once the cache has been reseeded. Meanwhile, if another thread requests this user’s data and finds the “lock.1” item in the cache, it knows another thread is currently loading the value from the database, so it can retry reading the cache shortly afterwards instead of penetrating to the database.
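As a sketch, the flow might look like this; it relies on Memcached’s add semantics (add succeeds only when the key does not already exist), and the client method names here are assumptions:

Object value = cache.get(uid);
if (value == null) {
  // "add" is atomic: only one thread wins the lock key
  boolean gotLock = cache.add("lock." + uid, "1", 10);  // 10-second TTL so a crashed holder cannot block forever
  if (gotLock) {
    value = getFromDB(uid);           // the winner penetrates to the database
    cache.set(uid, value, 1000);      // reseed the cache
    cache.delete("lock." + uid);      // release the lock
  } else {
    Thread.sleep(50);                 // another thread is loading; wait briefly (handle InterruptedException in real code)
    value = cache.get(uid);           // then retry the cache instead of the database
  }
}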

Lesson Summary #

In this class, I introduced several solutions to cache penetration, so you will have ideas to draw on when your cache hit rate drops. I want to emphasize the following points:

  1. Seeding empty values is the most common and simplest solution. If the space occupied by empty value cache is acceptable, you should prioritize using this approach.

  2. Bloom filters introduce a new component, adding development complexity and operational cost, so they should be reserved for scenarios with massive numbers of queries for nonexistent data. When using Bloom filters, you should also pay attention to memory consumption.

  3. The “dog-pile effect” caused by penetration on extremely hot cache data can be mitigated with distributed locks or background reloading threads.

Furthermore, you need to understand that the database is a fragile resource. It is inferior to cache in terms of scalability, performance, and concurrency. Therefore, the core goal of solving cache penetration issues is to reduce concurrent requests to the database. Understanding this core concept may help you find better solutions to cache penetration problems in your daily work.