24 Replacement Strategy: What to Do When Cache Is Full #

A Redis cache stores data in memory so that the business application does not have to read it from the backend database, which improves the application’s response speed. So, is it a good design choice to put all the data that will be accessed into the cache? In fact, this approach is not cost-effective.

Let’s take an example. Suppose there is 1TB of data in MySQL and we use Redis to cache all of it. The application could then read everything from memory, but the configuration would not be cost-effective. On the one hand, 1TB of memory costs about 35,000 yuan, while 1TB of disk costs only about 1,000 yuan. On the other hand, data access has locality: typically, 80% of the requests touch only 20% of the data. So there is no need to spend 1TB of memory on caching.

To ensure a higher cost-effectiveness, the capacity of the cache must be smaller than the total amount of data in the backend database. However, memory size is limited, and as the amount of data to be cached increases, the limited cache space will inevitably be filled. So what should be done in this case?

Solving this problem involves an important mechanism of the cache system: the cache replacement mechanism. Simply put, replacement consists of two steps: first, selecting the data that is “less important” to the application according to a certain policy; second, removing that data from the cache to make room for new data.

In this lesson, I will talk to you about the data replacement mechanism when the cache is full. Usually, we also call it the cache eviction mechanism, and we will also discuss a series of specific strategies for selecting data to be evicted. By understanding the data eviction mechanism and the corresponding strategies, we can choose a reasonable Redis configuration, improve cache hit rate, and enhance the application’s access performance.

However, before learning about eviction strategies, we first need to know the basis and methods for setting the cache capacity. After all, when using a cache in practice, we need to decide how much space to use for caching data.

What is the appropriate cache capacity? #

The appropriateness of the cache capacity directly affects the cost-effectiveness of using a cache. We usually want maximum benefit at minimum cost, so it is important to spend the expensive memory resources where they matter most.

As I just mentioned, data access in real applications has locality. The chart below shows how much access volume different proportions of the data contribute, using two curves: the blue line represents the data locality described by the “80/20 rule”, while the red line represents how data locality looks under today’s application workloads.

Let’s first look at the blue line. It represents the “80/20 rule”: 20% of the data contributes 80% of the accesses, while the remaining data, although large in volume, contributes only 20% of the accesses. That remaining 80% of the data forms a long tail in terms of access volume, which is why this pattern is also called the “long tail effect”.

Therefore, if we set the cache capacity to 20% of the total data volume according to the “80/20 rule”, we may be able to intercept 80% of the accesses with the cache.

Why do I say “may”? Because the “80/20 rule” is a statistical ratio between data volume and access volume observed across many real applications. In a specific application, the access pattern depends on the business scenario: the most frequently accessed 20% of the data may contribute more than 80% of the accesses, or less.

Let’s use an e-commerce example to explain this. On the one hand, during a promotion the information of the popular products may account for only 5% of all product data, yet it may absorb more than 90% of the access requests; in that case, caching just this 5% of the data already yields a good performance benefit. On the other hand, if the business application needs to query and analyze all product information, then even if we cache 20% of the product data according to the “80/20 rule”, access performance will still be poor, because 80% of the data still has to be fetched from the backend database.

Next, let’s take a look at the red line in the chart of data access locality. In recent years, some researchers have specially analyzed the distribution of user requests for content in Internet applications (such as video streaming websites), and obtained the red line in this chart.

On this red line, the less popular 80% of the data contributes more access volume than it would under the traditional long tail effect. The reason is that users’ needs are increasingly personalized: within one business application, different users may access very different content. As a result, the relationship between the amount of data requested and the access volume it contributes no longer follows the “80/20” long-tail distribution. In other words, 20% of the data may not contribute 80% of the accesses, while the remaining 80% of the data contributes more access volume than before. This is called the heavy tail effect.

Because 20% of the data does not necessarily contribute 80% of the access volume, we cannot simply set the maximum cache capacity to “20% of the total data volume”. In practice, I have seen cache capacities ranging from 5% to 40% of the total data volume. Capacity planning cannot be generalized; it has to be based on the application’s actual data access characteristics and on cost.

This is actually the experience I have been sharing with you. The design choices for a system are a process of balancing: a large cache capacity can bring performance acceleration benefits, but it also comes with higher costs, while a small cache capacity may not necessarily accelerate access. Generally speaking, I would recommend setting the cache capacity to 15% to 30% of the total data volume, considering both access performance and memory space overhead.

For Redis, once the maximum cache capacity is determined, such as 4GB, you can use the following command to set the cache size:

CONFIG SET maxmemory 4gb
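
If you manage Redis from application code, the same setting can be applied through a client library. Here is a minimal sketch using the redis-py client against an assumed local instance; note that CONFIG SET only changes the running server, so for the limit to survive a restart it should also be written into redis.conf (or persisted with CONFIG REWRITE if Redis was started from a configuration file):

import redis

r = redis.Redis(host="localhost", port=6379)

# Cap the memory Redis may use for data, equivalent to CONFIG SET maxmemory 4gb.
r.config_set("maxmemory", "4gb")

# Read the setting back; Redis reports the limit in bytes.
print(r.config_get("maxmemory"))   # e.g. {'maxmemory': '4294967296'}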

However, it is inevitable that the cache will be filled. Even if you carefully select and determine the cache capacity, you still need to deal with the replacement operation when the cache is full. Cache replacement needs to address two issues: deciding which data to evict and how to handle the evicted data.

Next, let’s learn about the data eviction strategy in Redis.

What are the eviction policies for Redis cache? #

Redis implemented a total of 6 memory eviction policies before version 4.0, and 2 more were added starting from version 4.0. We can divide them into two categories based on whether data eviction occurs:

  • Policies that do not perform data eviction; there is only one, called ‘noeviction’.
  • Policies that do perform eviction; there are 7 of them.

Among the 7 policies that perform eviction, we can further classify them based on the range of the eviction candidate dataset:

  • Eviction only among data with an expiration time set, including four policies: ‘volatile-random’, ‘volatile-ttl’, ‘volatile-lru’, and ‘volatile-lfu’ (added in Redis 4.0).
  • Eviction among all data, including three policies: ‘allkeys-lru’, ‘allkeys-random’, and ‘allkeys-lfu’ (added in Redis 4.0).

I have illustrated the classification of these 8 policies in a diagram:

Now let’s explain each policy in detail.

By default, when Redis’s memory usage exceeds the ‘maxmemory’ value, it does not evict any data; this is the ‘noeviction’ policy. For a Redis cache, it means that once the cache is full, any further write request gets an error instead of being served. When Redis is used as a cache, the actual data set is usually larger than the cache capacity and new data keeps being written, so a policy that never evicts anything makes little sense in a caching scenario.
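
Because ‘noeviction’ is the default, you have to switch to one of the eviction policies explicitly when using Redis as a cache. Below is a quick sketch using the redis-py client (the connection details are illustrative); the maxmemory-policy parameter is what selects the policy:

import redis

r = redis.Redis(host="localhost", port=6379)

# The default policy: once maxmemory is reached, writes fail with an error.
print(r.config_get("maxmemory-policy"))   # e.g. {'maxmemory-policy': 'noeviction'}

# Switch to an eviction policy so the cache keeps accepting new data when full.
r.config_set("maxmemory-policy", "allkeys-lru")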

Let’s analyze the four eviction policies: ‘volatile-random’, ‘volatile-ttl’, ‘volatile-lru’, and ‘volatile-lfu’. The candidate data range considered by these policies is limited to key-value pairs that have already set an expiration time. Therefore, even if the cache is not full, if these data expire, they will be deleted.

For example, if we use the EXPIRE command to set an expiration time for a batch of key-value pairs, then once Redis’s memory usage reaches the ‘maxmemory’ threshold, Redis will evict some of these key-value pairs according to the specific selection rules of ‘volatile-ttl’, ‘volatile-random’, ‘volatile-lru’, or ‘volatile-lfu’ (a small example follows the list below).

  • When selecting for eviction, the ‘volatile-ttl’ policy deletes key-value pairs with earlier expiration times.
  • The ‘volatile-random’ policy randomly selects and deletes key-value pairs with expiration times.
  • The ‘volatile-lru’ policy uses the least recently used (LRU) algorithm to select key-value pairs with expiration times.
  • The ‘volatile-lfu’ policy uses the least frequently used (LFU) algorithm to select key-value pairs with expiration times.
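
For instance, here is a small sketch with the redis-py client that gives a batch of keys an expiration time so they fall into the candidate range of the volatile-* policies (the key names and TTLs are made up for illustration):

import redis

r = redis.Redis(host="localhost", port=6379)

# Keys written with a TTL become candidates for the volatile-* policies.
r.set("product:1001", "hot item", ex=3600)      # SET with a 1-hour expiration
r.setex("product:1002", 3600, "another item")   # SETEX does the same thing

# An existing key can also be given a TTL afterwards with EXPIRE.
r.set("product:1003", "plain item")
r.expire("product:1003", 3600)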

As you can see, the selection rules for ‘volatile-ttl’ and ‘volatile-random’ are relatively simple. The ‘volatile-lru’ policy uses the LRU algorithm, which I will explain in detail together with the ‘allkeys-lru’ policy later in this lesson. The ‘volatile-lfu’ policy uses the LFU algorithm, which I will explain in detail in Lesson 27; for now, you just need to know that it is an optimization built on top of LRU-based eviction that takes both the recency and the frequency of data access into account.

Compared to the 4 eviction policies: ‘volatile-ttl’, ‘volatile-random’, ‘volatile-lru’, and ‘volatile-lfu’, which only evict data with expiration times, the candidate eviction range for the 3 policies: ‘allkeys-lru’, ‘allkeys-random’, and ‘allkeys-lfu’ expands to all key-value pairs, regardless of whether they have an expiration time. The selection rules for data eviction in these policies are as follows:

  • ‘allkeys-random’ policy: randomly selects and deletes data from all key-value pairs.
  • ‘allkeys-lru’ policy: uses the least recently used (LRU) algorithm to select data from all key-value pairs.
  • ‘allkeys-lfu’ policy: uses the least frequently used (LFU) algorithm to select data from all key-value pairs.

Under the allkeys policies, if a key-value pair is selected for eviction, it is deleted even if its expiration time has not yet arrived. Of course, a key-value pair whose expiration time has already arrived is deleted anyway, even if it is never selected by the eviction policy.

Next, let’s take a look at the LRU algorithm used by the volatile-lru and allkeys-lru policies. The working mechanism of the LRU algorithm is not complicated, so let’s learn it together.

The full name of the LRU algorithm is Least Recently Used, which means data is selected according to how recently it was used: the data that has gone the longest without being accessed is selected for eviction, while recently used data stays in the cache.

So how exactly does it select the data? LRU organizes all the data in a linked list whose head is the Most Recently Used (MRU) end and whose tail is the Least Recently Used (LRU) end, holding the most recently and least recently used data respectively. Let’s look at an example.

Suppose we have data 6, 3, 9, 20, and 5. If data 20 and 3 are accessed one after the other, they are moved from their original positions to the MRU end of the list, and the data that was ahead of them shifts back accordingly. Since the LRU algorithm always evicts from the LRU end, moving recently accessed data to the MRU end lets it stay in the cache as long as possible.

If a new piece of data, 15, has to be written to the cache but there is no free slot left in the list, the LRU algorithm does the following:

  1. Data 15 has just been accessed, so it is placed at the MRU end.
  2. The algorithm deletes the data 5 from the cache, and there will be no record of data 5 in the list.

In fact, the idea behind the LRU algorithm is very simple: it assumes that recently accessed data will be accessed again, so it is placed at the MRU end; data that hasn’t been accessed for a long time will not be accessed again, so it gradually moves to the LRU end. When the cache is full, it is deleted first.
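
To make this concrete, here is a minimal sketch of a classic LRU cache in Python. It uses an OrderedDict instead of a hand-written linked list, but the behavior matches the description above: accessed items move to the MRU end, and eviction removes the item at the LRU end (the capacity and the data values mirror the example):

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()       # ordered from the LRU end (front) to the MRU end (back)

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)      # accessing a key moves it to the MRU end
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        elif len(self.items) >= self.capacity:
            self.items.popitem(last=False)   # no free slot: evict from the LRU end
        self.items[key] = value

cache = LRUCache(capacity=5)
for k in [5, 20, 9, 3, 6]:               # 6 ends up at the MRU end, 5 at the LRU end
    cache.put(k, f"value-{k}")

cache.get(20)                            # accessing 20 and 3 moves them to the MRU end
cache.get(3)
cache.put(15, "value-15")                # cache is full, so the LRU-end item (5) is evicted

print(list(cache.items))                 # [9, 6, 20, 3, 15], from the LRU end to the MRU end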

However, when implementing the LRU algorithm, managing all the cache data using a linked list incurs additional space overhead. Moreover, when data is accessed, it needs to be moved on the linked list to the MRU end. If a large amount of data is accessed, this requires many list manipulation operations, which can be time-consuming and reduce the performance of the Redis cache.

Therefore, Redis simplifies the LRU algorithm to reduce the impact of data eviction on cache performance. Specifically, Redis records the timestamp of each key’s most recent access in the lru field of the redisObject structure. When it needs to evict data, Redis first randomly selects N keys as a candidate set, then compares the lru fields of these N keys and evicts the key with the smallest lru value.

Redis provides the configuration parameter maxmemory-samples, which is the number N of keys that Redis samples. For example, after executing the following command, Redis samples 100 keys as candidates:

CONFIG SET maxmemory-samples 100

When Redis needs to evict data again, it does not build a brand-new candidate set. Instead, it picks keys that may join the candidate set created during the first eviction; the criterion is that a key’s lru value must be smaller than the smallest lru value currently in the candidate set. Once the number of keys in the candidate set reaches maxmemory-samples, Redis evicts the key with the smallest lru value in the set.

By doing so, Redis does not need to maintain a large linked list for all data, and does not need to move the list items on every data access, which improves the performance of the cache.
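
Here is a toy sketch of the sampling idea in Python. It is a deliberate simplification for illustration, not Redis’s actual C implementation, and it leaves out the shared candidate pool described above: every key carries an lru timestamp, and eviction samples N random keys and removes the one with the smallest lru value.

import random
import time

MAXMEMORY_SAMPLES = 10    # plays the role of the maxmemory-samples parameter

data = {}                 # key -> value
lru_clock = {}            # key -> timestamp of the last access (the "lru field")

def access(key, value=None):
    # Every read or write refreshes the key's lru timestamp.
    if value is not None:
        data[key] = value
    lru_clock[key] = time.monotonic()
    return data.get(key)

def evict_one():
    # Randomly sample a handful of keys instead of maintaining a full LRU list...
    sampled = random.sample(list(data), min(MAXMEMORY_SAMPLES, len(data)))
    # ...and evict the sampled key with the smallest lru value, i.e. the one
    # that has gone the longest without being accessed.
    victim = min(sampled, key=lambda k: lru_clock[k])
    data.pop(victim)
    lru_clock.pop(victim)
    return victim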

All right, by now we have covered all of the cache eviction policies except the two based on the LFU algorithm. Here are three suggestions for you:

  • Prefer the allkeys-lru strategy. It lets you take full advantage of the classic LRU algorithm by keeping the most recently accessed data in the cache, which improves the application’s access performance. If there is a clear distinction between hot and cold data in your business, I recommend using the allkeys-lru strategy.
  • If the data access frequency in your business application does not vary much and there is no clear distinction between hot and cold data, it is recommended to use the allkeys-random strategy to randomly select data for eviction.
  • If your business needs pinned items, such as pinned news or videos, you can use the volatile-lru strategy and simply not set an expiration time on the pinned items. That way the pinned items are never evicted, while the other data, which does have expiration times set, is selected for eviction by the LRU rule when the cache fills up (see the sketch right after this list).
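
As a concrete illustration of the third suggestion, here is a small sketch with the redis-py client (the key names and TTL are made up): the policy is set to volatile-lru, pinned items are written without an expiration time so they never become eviction candidates, and ordinary items get a TTL so they fall into the eviction range.

import redis

r = redis.Redis(host="localhost", port=6379)

# Only keys with an expiration time set are candidates for eviction.
r.config_set("maxmemory-policy", "volatile-lru")

# Pinned content: no TTL, so it is never selected by volatile-lru.
r.set("news:pinned:site-notice", "...")

# Ordinary content: give it a TTL so it joins the eviction candidate set.
r.set("news:feed:12345", "...", ex=24 * 3600)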

Once the evicted data is determined, how does Redis handle these data? This leads to the specific operations for cache replacement.

How to handle the evicted data? #

Generally, once the data to be evicted has been identified, it can simply be deleted if it is clean; if it is dirty, it needs to be written back to the database, as shown in the following figure:

But how do we determine whether a piece of data is clean or dirty?

The difference between clean and dirty data is whether the value has been modified since it was originally read from the backend database. Clean data has never been modified, so the copy in the backend database is still the latest value; during replacement, clean data can simply be deleted.

Dirty data, on the other hand, has been modified at some point and is no longer consistent with the data stored in the backend database. In this case, if we don’t write the dirty data back to the database, the latest value of this data will be lost, which will affect the normal use of the application.

In this way, cache replacement not only frees up cache space to cache new data, but also ensures that the dirty data is written back to the database, so that the latest data is not lost.

However, once Redis decides which data to evict, it simply deletes it; even if the evicted data is dirty, Redis does not write it back to the database. Therefore, when using Redis as a cache, if data is modified, it has to be written back to the database at the time of the modification. Otherwise, when the dirty data is evicted, Redis deletes it and the database never receives the latest value.
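
In other words, the application has to keep the database up to date itself whenever it changes a cached value. Below is a minimal sketch of that pattern, with a hypothetical update_product_in_db function standing in for the real database write:

import redis

r = redis.Redis(host="localhost", port=6379)

def update_product_in_db(product_id, value):
    # Placeholder for the real database write (e.g. an UPDATE statement in MySQL).
    pass

def update_product(product_id, value):
    # 1. Write the latest value to the backend database first, so nothing is
    #    lost even if this key is evicted from the cache later.
    update_product_in_db(product_id, value)
    # 2. Then refresh the cached copy.
    r.set(f"product:{product_id}", value, ex=3600)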

Summary #

In this lesson, I have discussed data eviction strategies for cache replacement and how to handle the evicted data.

Starting from Redis 4.0, there are 8 data eviction policies in total. There are two candidate ranges for eviction: all data, or only the data with an expiration time set. Whichever range is used, there are three ways to select the data to evict: randomly, by LRU (least recently used), or by LFU (least frequently used). In addition, when evicting from the range with expiration times, we can also select by how close the data is to its expiration time.

In general, after selecting the evicted data, a cache system will either directly delete or write it back to the database, depending on whether the data is clean or dirty. However, in Redis, the evicted data will always be deleted, regardless of its cleanliness. Therefore, when using Redis cache, we should be particularly careful: when the data becomes dirty, the corresponding modification should be made in the database as well.

Choosing the right eviction policy is worth careful thought, because it directly affects the efficiency of the cache: the policy determines how likely it is that the evicted data will be accessed again.

To give a simple comparison: with the random policy, the evicted data may well be accessed again soon, and the application then has to spend a few milliseconds fetching it from the database. With the LRU policy, the evicted data has already gone unused for a while, and data that has not been touched for some time is unlikely to be accessed again soon.

Therefore, my recommendation is to first decide whether there is data that must always stay cached (such as pinned items) and choose the eviction candidate range accordingly: all data, or only data with an expiration time set. Once the range is decided, I suggest prioritizing the LRU algorithm, that is, the allkeys-lru or volatile-lru strategy.

Of course, setting the cache capacity is also important. My suggestion is to consider the total amount of data, the size of hot data, and the budget, and set the cache space within the range of 15% to 30% of the total data size.

One question per lesson #

As usual, here is a small question for you. In this lesson, I introduced the Redis cache, where dirty data has to be written back to the database at the moment the data is modified. Recalling the cache read/write patterns we discussed in the previous lesson, the read-only cache and the two write-back strategies of the read/write cache, which one (or ones) does the Redis cache correspond to?

Feel free to write down your thoughts and answer in the comments section. Let’s exchange and discuss together. If you find today’s content helpful, you are also welcome to share it with your friends or colleagues. See you in the next lesson.