27 Cache Pollution: What to Do?

27 Cache Pollution: What to Do? #

When we use Redis as a cache, keeping data that will be accessed repeatedly in the cache speeds up access for our business applications. However, once cache pollution occurs, this acceleration effect is weakened.

So, what is cache pollution? In some scenarios, certain data is accessed very infrequently, or even only once. If such data stays in the cache after those access requests have been served, it only occupies cache space without providing any further benefit. This situation is called cache pollution.

When cache pollution is mild, only a small amount of data occupies cache space and the impact on the cache system is limited. Once cache pollution becomes severe, however, a large amount of data that will never be accessed again lingers in the cache. If this data fills up the cache space, then every time we write new data to the cache, some of the stale data must first be evicted. This introduces extra operational overhead and, in turn, hurts application performance.

Today, let’s take a look at how to solve the cache pollution problem.

How to Solve Cache Pollution Problems? #

The obvious solution to cache pollution is to identify the data that will no longer be accessed and evict it early, instead of waiting until the cache is full and then evicting old data piece by piece before new data can be written. Which data gets to stay in the cache is determined by the cache eviction policy.

By now, do you still remember the 8 data eviction policies we learned together in [Lesson 24]? They are: noeviction, volatile-random, volatile-ttl, volatile-lru, volatile-lfu, allkeys-lru, allkeys-random, and allkeys-lfu.
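
For reference, the eviction policy in use is selected with the maxmemory-policy configuration option, and eviction only kicks in once a maxmemory limit is configured. A minimal sketch (the 100mb limit and the chosen policy are just illustrative values):

# In redis.conf (illustrative values)
maxmemory 100mb
maxmemory-policy allkeys-lfu

# Or at runtime, via redis-cli
CONFIG SET maxmemory-policy allkeys-lfu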

Among these 8 policies, noeviction does not evict any data at all, so it certainly cannot be used to solve cache pollution. The other 7 policies evict data according to certain rules. The key phrase here is “certain rules”: do all of these rules actually help against cache pollution? Let’s analyze them one by one.

Because the LRU algorithm is so widely used for cache eviction, I will first analyze the other strategies, then look separately at eviction based on the LRU algorithm, and finally examine how cache pollution is countered when the LFU algorithm is used for eviction. Two pairs of strategies use the LRU and LFU algorithms respectively (volatile-lru and allkeys-lru; volatile-lfu and allkeys-lfu). For ease of discussion, I will simply refer to them as the LRU strategy and the LFU strategy.

First, let’s look at volatile-random and allkeys-random. Both of these strategies select data to be evicted randomly.

Since the selection is random, Redis does not take the data’s access pattern into account. If evicted data is accessed again, a cache miss occurs: the application has to fetch that data from the backend database, which slows down its response. Therefore, volatile-random and allkeys-random do very little to prevent cache pollution.

Let me give you an example. As shown in the figure below, let’s assume we configure Redis cache to use the allkeys-random eviction strategy. When the cache is full, the allkeys-random strategy randomly selects data 20 for eviction. Unfortunately, data 20 is accessed again soon after, causing a cache miss in Redis.

Let’s continue to analyze whether the volatile-ttl strategy can effectively handle cache pollution. The volatile-ttl strategy targets data that has expiration times set and selects and evicts data with the shortest remaining time to live.

Although the volatile-ttl strategy no longer picks eviction victims at random, the remaining time to live does not directly tell us whether the data will be accessed again. So, just like random eviction, evicting data according to volatile-ttl can still cause cache misses when evicted data is accessed again.

At this point, you may think of an exception: when the business application sets the expiration time, it may know exactly how long the data will keep being accessed, and set the TTL based on that knowledge. In that case, Redis can indeed use the shortest remaining time to live to pick out data that will no longer be accessed and avoid cache pollution. For example, if the business knows the data will only be accessed for one hour and sets its expiration to one hour later, the evicted data really will not be accessed again.
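
As a hedged illustration (the key name and the 3600-second TTL below are made up for this example), the business layer would write the data with a TTL matching its known access window, and the cache would run with the volatile-ttl policy:

# Via redis-cli; key name, value, and TTL are illustrative
SET product:detail:1001 "..." EX 3600
CONFIG SET maxmemory-policy volatile-ttl

Once the hour has passed, this key either expires on its own or, under memory pressure, becomes a preferred eviction candidate precisely because it is known to be done with.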

Now, let’s summarize: only in this case, where the expiration time is set based on explicit knowledge of how long the data will be accessed, can volatile-ttl effectively avoid cache pollution. In all other cases, volatile-random, allkeys-random, and volatile-ttl cannot solve the cache pollution problem effectively.

Next, let’s analyze the LRU strategy, followed by the LFU strategy that Redis implemented in version 4.0. The LRU strategy selects eviction victims based on how recently data was accessed and is very widely used. We already learned how Redis implements the LRU strategy in Lesson 24, so here we will focus on how effective it is at solving cache pollution.

LRU Cache Strategy #

Let’s review the core idea of the LRU (Least Recently Used) strategy: if a piece of data has just been accessed, it is definitely hot data and will be accessed again.

Based on this core idea, Redis’s LRU strategy records an access timestamp in an lru field inside the RedisObject structure that corresponds to each piece of data. When evicting, the LRU strategy removes from the candidate set the data with the smallest lru value, that is, the data whose last access was longest ago.
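
For reference, here is a simplified sketch of that structure; the full definition in the Redis source has the same field layout but more surrounding detail:

/* Simplified sketch of Redis's redisObject (the "RedisObject" above). */
typedef struct redisObject {
    unsigned type:4;      /* value type (string, list, hash, ...) */
    unsigned encoding:4;  /* internal encoding of the value */
    unsigned lru:24;      /* LRU: access timestamp; under LFU: ldt + counter */
    int refcount;         /* reference count */
    void *ptr;            /* pointer to the actual value */
} robj;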

Therefore, in business scenarios where data is accessed frequently, the LRU strategy does a good job of keeping the most recently accessed data in the cache, and because that data is likely to be accessed again, it improves the applications’ access speed.

However, because it only considers access time, the LRU strategy cannot handle scan-style single queries and thus cannot solve the cache pollution they cause. A scan-style single query is one in which the application reads a large amount of data in a single pass, and each piece of data is read only once. Because all of this data has just been accessed, its lru values are all very large (i.e., very recent).

Under the LRU strategy, this data then stays in the cache for a long time, polluting it. If the amount of data queried is large, it can fill up the cache even though it will never be requested again; whenever new data needs to be written, this old data must first be evicted, which hurts cache performance.

To help you understand, let me give you an example. As shown in the following figure, after data 6 is accessed, it is written to the Redis cache. However, after that, data 6 has not been accessed again, which causes data 6 to linger in the cache, resulting in pollution.

Therefore, for Redis caches that adopt the LRU strategy, scan-style single-query operations can cause cache pollution. To address this cache pollution problem, Redis introduced the LFU (Least Frequently Used) eviction strategy starting from version 4.0.

Compared to the LRU strategy, the LFU strategy selects and evicts data from two dimensions: 1) the timeliness of data access (how long ago the access occurred), and 2) the number of times the data has been accessed.

So how does Redis implement the LFU strategy and solve the cache pollution problem? Let’s take a look.

Optimization of the LFU Cache Strategy #

The LFU cache strategy builds on the LRU strategy by adding, for each piece of data, a counter that tracks how many times it has been accessed. When selecting eviction victims, the LFU strategy first filters by access count and evicts the data with the lowest count. If two entries have the same access count, the LFU strategy then compares how recently they were accessed and evicts the one that has gone unaccessed for longer.
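
To make that two-level rule concrete, here is a minimal sketch of the comparison (my own illustration of the rule just described, not Redis source code):

/* Sketch: should candidate a be evicted before candidate b? */
struct lfu_candidate {
    unsigned long counter;   /* access count */
    unsigned long atime;     /* last access time; larger means more recent */
};

static int evict_a_before_b(const struct lfu_candidate *a,
                            const struct lfu_candidate *b) {
    if (a->counter != b->counter)
        return a->counter < b->counter;   /* fewer accesses: evict first */
    return a->atime < b->atime;           /* same count: older access first */
}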

Unlike frequently accessed data, data that a scan-style query reads only once never sees its access count grow any further. The LFU strategy therefore evicts these low-count entries first, which keeps them from polluting the cache.

So, how is the LFU strategy actually implemented? Since it is an optimization of the LRU strategy, the two implementations are related. Let’s first review the implementation of the LRU strategy discussed in Lesson 24.

To avoid the overhead of maintaining a linked list, Redis uses two approximation techniques in its LRU implementation:

  • Redis uses the RedisObject structure to store data, and this structure includes an lru field to record the access timestamp of the data.
  • Rather than maintaining a global linked list for all data, Redis randomly samples a certain number (e.g., 10) of data entries and puts them into a candidate set. Subsequently, the candidate set is filtered based on the value of the lru field.
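
A minimal sketch of that sampling-based selection, under the simplifying assumption that entries sit in a flat array (Redis’s real implementation samples from its internal dict and maintains an eviction pool across rounds):

#include <stdlib.h>

/* Sketch: sample `samples` random entries and return the one with the
 * smallest lru value, i.e. the least recently accessed candidate. */
typedef struct { unsigned long lru; /* key and value omitted */ } entry;

static entry *pick_lru_victim(entry *table, size_t nentries, int samples) {
    entry *victim = NULL;
    for (int i = 0; i < samples && nentries > 0; i++) {
        entry *e = &table[rand() % nentries];      /* random candidate */
        if (victim == NULL || e->lru < victim->lru)
            victim = e;                            /* keep the oldest so far */
    }
    return victim;
}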

Building on this, when implementing the LFU strategy, Redis further splits the 24-bit lru field into two parts:

  1. ldt value: the high 16 bits of the lru field, recording the data’s access timestamp.
  2. counter value: the low 8 bits of the lru field, recording the data’s access count.

To summarize: when filtering data under the LFU strategy, Redis first uses the low 8 bits of each candidate’s lru field to select the entry with the smallest access count for eviction. If the access counts are equal, it then uses the high 16 bits to select the entry with the oldest access timestamp.
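
In code form, decomposing the 24-bit field looks roughly like this (a sketch of the interpretation described above, not the actual Redis source):

/* Sketch: split the 24-bit lru field the way the LFU strategy reads it. */
static void lfu_split(unsigned long lru24,
                      unsigned long *ldt, unsigned long *counter) {
    *ldt = lru24 >> 8;        /* high 16 bits: access timestamp */
    *counter = lru24 & 255;   /* low 8 bits: access counter */
}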

But wait, Redis only uses 8 bits to record the access count, and the maximum value that can be recorded with 8 bits is 255. Is this sufficient?

In practice, a piece of data may be accessed thousands or even tens of thousands of times. If the counter were simply incremented by 1 on every access, then once the access count exceeded 255 all such entries would end up with the same counter value. During eviction, the LFU strategy would then be unable to tell them apart, and might even end up keeping less frequently accessed data in the cache.

Let’s look at an example.

Consider data entry A, with an accumulated access count of 256 and an access timestamp of 202010010909; its counter value is 255. Data entry B has an accumulated access count of 1024 and an access timestamp of 202010010810. Since the counter can record at most 255, B’s counter value is also 255. Now suppose the cache is full and Redis evicts data using the LFU strategy. A and B both have a counter value of 255, so the LFU strategy compares their access timestamps, finds that B’s last access is earlier than A’s, and evicts B. In reality, however, B is accessed far more often than A and is more likely to be accessed again, so evicting it is the wrong choice.

Indeed, Redis recognizes this issue. Therefore, when implementing the LFU strategy, Redis does not simply increment the counter value by 1 each time the data is accessed. Instead, it adopts a more optimized counting rule.

In simple terms, the counting rule implemented by the LFU strategy is as follows: each time the data is accessed, first multiply the current counter value by the configuration option lfu_log_factor, add 1, and take the reciprocal; call the result p. Then compare p with a random value r between 0 and 1. Only when p is greater than r is the counter incremented by 1.

The following part of Redis source code demonstrates the calculation logic for incrementing the counter value using the LFU strategy. Here, baseval represents the current value of the counter. The initial value of the counter is set to 5 by default (configured by the constant LFU_INIT_VAL in the code), instead of 0. This prevents data from being immediately evicted due to low access frequency right after being written to the cache.

/* r: a random value in the range [0, 1] */
double r = (double)rand()/RAND_MAX;
...
/* p shrinks as the current counter (baseval) and lfu_log_factor grow */
double p = 1.0/(baseval*server.lfu_log_factor+1);
/* the counter is incremented only with probability p */
if (r < p) counter++;

By using this calculation rule, we can control the speed at which the counter value increases by setting different values for the lfu_log_factor configuration option. This allows us to avoid the counter value quickly reaching 255.

To further illustrate the effect of the LFU strategy’s counter increment, you can refer to the following table. This table, provided on the Redis website, shows how the counter value changes under different actual access frequency conditions when lfu_log_factor takes different values.

As can be seen, when lfu_log_factor is set to 1, the counter value reaches 255 after an actual access frequency of 100K, making it unable to differentiate data with higher access frequency. However, when lfu_log_factor is set to 100, the counter value only reaches 255 after an actual access frequency of 10M. At this point, different data with actual access frequency lower than 10M can still be distinguished based on the counter value.

Because the counter grows non-linearly, the LFU strategy can still distinguish different access frequencies and make reasonable eviction choices even when cached data is accessed thousands or tens of thousands of times. From the table, we can see that with lfu_log_factor set to 10, the counter values for access frequencies in the hundreds, thousands, and hundreds of thousands are already clearly distinguishable. Therefore, when applying the LFU strategy, it is generally recommended to set lfu_log_factor to 10.
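
If you want to get a feel for this saturation behavior yourself, the following standalone simulation (my own code, not Redis’s; it simply applies the increment rule shown earlier) prints the final counter value for a few access counts and lfu_log_factor values:

#include <stdio.h>
#include <stdlib.h>

#define LFU_INIT_VAL 5   /* same default initial counter value as Redis */

/* Apply the probabilistic increment rule `hits` times and return the counter. */
static unsigned int simulate_counter(long hits, int lfu_log_factor) {
    unsigned int counter = LFU_INIT_VAL;
    for (long i = 0; i < hits; i++) {
        double r = (double)rand() / RAND_MAX;
        double p = 1.0 / (counter * lfu_log_factor + 1);
        if (r < p && counter < 255) counter++;
    }
    return counter;
}

int main(void) {
    int factors[] = {1, 10, 100};
    long hits[] = {1000, 100000, 10000000};
    for (int f = 0; f < 3; f++)
        for (int h = 0; h < 3; h++)
            printf("factor=%3d  hits=%8ld  ->  counter=%u\n",
                   factors[f], hits[h], simulate_counter(hits[h], factors[f]));
    return 0;
}

With a small factor, the counter saturates at 255 after relatively few accesses; with a larger factor it climbs much more slowly, which is exactly the differentiation the table illustrates.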

As mentioned earlier, application workloads are complex. In some scenarios, certain data is accessed heavily for a short period and then never again. If eviction were based on access count alone, such data would keep its high count and stay in the cache without contributing anything to the hit rate. To address this, Redis added a decay mechanism for the counter value in its LFU implementation.

In simple terms, the LFU strategy uses the decay factor configuration option lfu_decay_time to control the decay of access frequency. The LFU strategy calculates the difference between the current time and the timestamp of the most recent access to the data, and converts this difference into minutes. Then, the LFU strategy divides this difference by the lfu_decay_time value to obtain the value by which the data’s counter needs to be decayed.

To give an example, assuming lfu_decay_time is set to 1, if the data has not been accessed for N minutes, its access frequency will be reduced by N. If lfu_decay_time takes a larger value, the corresponding decay value will be smaller, resulting in weaker decay effect. Therefore, if the application involves data with short-term high-frequency access, it is recommended to set lfu_decay_time to 1. This way, the LFU strategy will decay the access frequency of these data more quickly after they are no longer accessed, and evict them from the cache as soon as possible to avoid cache pollution.
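
Here is a self-contained sketch of that decay rule (modeled on, but simplified from, the Redis implementation; wraparound of the 16-bit minute clock is ignored):

#include <time.h>

/* Current time in minutes, truncated to 16 bits like the ldt field. */
static unsigned long lfu_minutes_now(void) {
    return ((unsigned long)time(NULL) / 60) & 65535;
}

/* Return the decayed counter: one point is subtracted for every
 * lfu_decay_time minutes elapsed since the stored timestamp. */
static unsigned long lfu_decayed_counter(unsigned long lru24,
                                         unsigned long lfu_decay_time) {
    unsigned long ldt = lru24 >> 8;        /* high 16 bits: timestamp */
    unsigned long counter = lru24 & 255;   /* low 8 bits: access counter */
    unsigned long idle = lfu_minutes_now() - ldt;
    unsigned long periods = lfu_decay_time ? idle / lfu_decay_time : 0;
    if (periods)
        counter = (periods > counter) ? 0 : counter - periods;
    return counter;
}

In redis.conf, the two options discussed in this lesson appear as lfu-log-factor and lfu-decay-time, so a configuration matching the recommendations above would set lfu-log-factor 10 and lfu-decay-time 1.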

Summary #

In today’s lesson, we learned about the problem of “how to solve cache pollution”.

Cache pollution refers to data that remains in the cache but will not be accessed again, yet still occupies cache space. If the volume of such data is large and even fills up the cache, each time new data is written to the cache, these data need to be gradually eliminated from the cache, which increases the time overhead of cache operations.

Therefore, the key to solving cache pollution is being able to identify data that is rarely accessed, or accessed only once, and evict it first. Since the noeviction strategy does not evict data at all, in this lesson we analyzed Redis’s other 7 eviction strategies along one dimension: whether they can effectively solve cache pollution.

volatile-random and allkeys-random randomly select data for eviction, and cannot filter out data that is no longer accessed, which may cause cache pollution. If the business layer knows the access duration of the data, it can set a reasonable expiration time for the data and use the volatile-ttl strategy of Redis cache. When the cache is full, the data with the shortest remaining survival time will be evicted from the cache to avoid lingering in the cache and causing pollution.

The LRU strategy only considers the access time of data, so it cannot quickly filter out data that is accessed only once. The LFU strategy optimizes on top of LRU: it first evicts the data with the fewest accesses and, among data with the same access count, evicts the data whose last access was longest ago.

In terms of implementation, compared with the LRU strategy, Redis splits the original 24-bit lru field into a 16-bit ldt and an 8-bit counter, representing the data’s access timestamp and access count respectively. To avoid being capped by the 8-bit maximum of 255, the LFU strategy counts accesses non-linearly.

In actual business applications, both the LRU and LFU strategies are used. The LRU and LFU strategies focus on different data access characteristics. The LRU strategy focuses more on the timeliness of the data, while the LFU strategy focuses more on the access frequency of the data. In general, real-world loads have good temporal locality, so the application of the LRU strategy is more widespread. However, in scan query application scenarios, the LFU strategy can cope with cache pollution problems well, so it is recommended to use it first.

In addition, if there is data with short-term high-frequency access in the business application, besides the self-decay of the access count by the LFU strategy itself, I have another suggestion for you: you can use the volatile-lfu strategy first and set their expiration time based on the access time limit of these data to prevent them from causing pollution by lingering in the cache.

One Question per Lesson #

As usual, I have a small question for you: do you think the cache can still be polluted after switching to the LFU strategy?

Please write down your thoughts and answers in the comments section. Let’s discuss and communicate together. If you find today’s content helpful, feel free to share it with your friends or colleagues. See you in the next class.