40 Caching and Storms: Who Says Cache Storms Can't Be Avoided?

Hello, I am Wen Ming.

In the previous lesson on caching, I introduced some performance optimization techniques based on the shared dictionary and the LRU cache. However, we left out a very important problem, one that deserves its own lesson today: the “cache storm”.

What is a cache storm? #

So what is a cache storm? Let’s first imagine the following scenario.

The data source is stored in a MySQL database, and the cached data is stored in a shared dictionary with a timeout of 60 seconds. During those 60 seconds, all requests fetch data from the cache, and MySQL is under no pressure. However, once the 60 seconds are up and the cached data expires, if a large number of concurrent requests arrive and find nothing in the cache, the function that queries the data source is triggered, and all of these requests hit the MySQL database directly. This can degrade the database server’s performance or even bring it to a halt.

This phenomenon is called a “cache storm,” and it also has an English name: “dog-pile.” Clearly, the caching code we have seen so far does not handle this issue. For example, the following pseudocode has the potential for a cache storm:

local value = get_from_cache(key)
if not value then
    -- cache miss: every request that reaches this branch queries the database
    value = query_db(sql)
    set_to_cache(key, value, 60)  -- cache the result for 60 seconds
end
return value

At first glance, this pseudocode appears logically sound, and you won’t trigger a cache storm with unit tests or end-to-end tests. Only long-term stress testing will reveal the problem, as regular spikes in the number of database queries will appear every 60 seconds. And the longer you set the cache expiration time, the lower the chance of discovering the cache storm problem at all.

How to Avoid Cache Storms? #

Now that we understand what a cache storm is, our next step is to figure out how to avoid it. Let’s discuss several different scenarios.

Proactive Cache Updates #

In the pseudocode above, the cache is updated passively: the database is queried for new data only when a client request finds that the cache has expired. To avoid cache storms, we can change the cache update from passive to proactive.

In OpenResty, we can achieve this by using ngx.timer.every to create a scheduled task that runs every 60 seconds, fetches the latest data from the MySQL database, and stores it in the shared dictionary:

local function refresh_cache(premature, sql)
    if premature then  -- the timer was destroyed prematurely (e.g. the worker is exiting)
        return
    end
    local value = query_db(sql)
    set_to_cache(key, value, 60)
end

local ok, err = ngx.timer.every(60, refresh_cache, sql)

Then, in the request-handling code, remove the MySQL query and keep only the part that reads from the shared dictionary cache:

local value = get_from_cache(key)
return value

These two changes sidestep the cache storm issue. However, this approach is not perfect: each cache entry needs its own periodic task (and OpenResty limits the number of timers, so there can’t be too many of them), and the cache expiration time must be coordinated with the scheduled task’s interval. If there is any oversight, clients may keep receiving empty data.
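
As a simple safeguard, one option is to make the cache TTL comfortably longer than the refresh interval, so that a single slow or failed refresh does not immediately leave clients with an empty cache. Here is a minimal sketch of that idea; it assumes a shared dictionary named cache_dict and hypothetical query_db, key, and sql variables, so treat it as an illustration rather than production code:

```lua
-- refresh every 60 seconds, but keep entries valid for 90 seconds,
-- so one missed refresh does not empty the cache (names are illustrative)
local REFRESH_INTERVAL = 60
local CACHE_TTL = 90

local function refresh_cache(premature, sql)
    if premature then
        return
    end
    local value = query_db(sql)            -- hypothetical data-source query
    if value then
        ngx.shared.cache_dict:set(key, value, CACHE_TTL)
    end
    -- if the query fails, the old entry stays valid for up to CACHE_TTL seconds
end

local ok, err = ngx.timer.every(REFRESH_INTERVAL, refresh_cache, sql)
```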

Therefore, in real projects, we usually use locking mechanisms to solve the cache storm problem. Next, I will explain several different locking methods, and you can choose the one that suits your needs.

lua-resty-lock #

I know that at the mention of locking, many people frown, thinking it is a heavy operation. What if a deadlock occurs? And there will obviously be more exception handling to deal with.

However, using the lua-resty-lock library in OpenResty can alleviate these concerns. lua-resty-lock is a built-in resty library in OpenResty that provides a non-blocking lock API based on shared dictionaries. Let’s start with a simple example:

resty --shdict='locks 1m' -e 'local resty_lock = require "resty.lock"
                            local lock, err = resty_lock:new("locks")
                            local elapsed, err = lock:lock("my_key")
                            -- query db and update cache
                            local ok, err = lock:unlock()
                            ngx.say("unlock: ", ok)'

Because lua-resty-lock is based on the shared dictionary, we need to declare the name and size of the shdict in advance, and then use the new method to create a lock object. As you can see, this code only passes the shdict’s name as the first parameter. The new method actually accepts a second parameter that can specify options such as the lock’s expiration time and the timeout for waiting to acquire the lock, but here we rely on the defaults, which are designed to avoid deadlocks and other exceptional problems.
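
By way of illustration, here is a small sketch of passing that second parameter explicitly; exptime and timeout are options documented by lua-resty-lock, and the values chosen here are arbitrary:

```lua
local resty_lock = require "resty.lock"

-- a sketch of passing explicit options instead of relying on the defaults
local lock, err = resty_lock:new("locks", {
    exptime = 10,   -- the lock automatically expires after 10 seconds, guarding against deadlock
    timeout = 5,    -- lock() waits at most 5 seconds to acquire the lock
})
if not lock then
    ngx.log(ngx.ERR, "failed to create lock: ", err)
end
```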

After that, we can call the lock method to try to acquire the lock. If we acquire it, we can ensure that only one request goes to the data source. If the lock is already taken or acquiring it times out, we fetch the stale data from the cache and return it to the client. This may sound familiar: yes, this is exactly where the get_stale API from the previous lesson comes in:

local elapsed, err = lock:lock("my_key")
-- a nil elapsed indicates that acquiring the lock failed; err is either "timeout" or "locked"
if not elapsed and err then
    local dict = ngx.shared.cache_dict
    -- fall back to the stale (expired) value and return it to the client
    local stale_value = dict:get_stale("my_key")
    return stale_value
end


If `lock` succeeds, we can safely query the database and write the result to the cache. Finally, we call the `unlock` interface to release the lock.

Combining `lua-resty-lock` and `get_stale`, we have a complete solution to the cache storm problem. The documentation of `lua-resty-lock` contains a very complete handling example; I recommend following the [link](https://github.com/openresty/lua-resty-lock#for-cache-locks) to read it.
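
To give you a feel for how the pieces fit together, here is a condensed sketch along the lines of that pattern. It assumes two shared dictionaries (cache_dict for the data and locks for the lock) and a hypothetical fetch_from_db function; it is an illustration, not the library’s official code:

```lua
local resty_lock = require "resty.lock"
local cache = ngx.shared.cache_dict

local function get_data(key, sql)
    -- 1. try the cache first
    local value = cache:get(key)
    if value then
        return value
    end

    -- 2. cache miss: only one request should query the database
    local lock, err = resty_lock:new("locks")
    if not lock then
        return nil, "failed to create lock: " .. err
    end

    local elapsed, err = lock:lock(key)
    if not elapsed then
        -- could not get the lock: serve the stale value instead of hitting the DB
        return cache:get_stale(key)
    end

    -- 3. someone else may have refreshed the cache while we waited for the lock
    value = cache:get(key)
    if value then
        lock:unlock()
        return value
    end

    -- 4. we hold the lock: query the data source and update the cache
    value = fetch_from_db(sql)            -- hypothetical database query
    if value then
        cache:set(key, value, 60)
    end
    lock:unlock()

    return value
end
```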

However, whenever we encounter an interesting implementation, we always want to see how the source code works. So let's take a closer look at how the `lock` method acquires the lock. Here is an excerpt of its source code:

```lua
local ok, err = dict:add(key, true, exptime)
if ok then
    cdata.key_id = ref_obj(key)
    self.key = key
    return 0
end
```

In the shared dictionary section, I mentioned that all of the shared dictionary's APIs are atomic operations, so there is no need to worry about races. Using the shared dictionary to mark the lock status is therefore a good idea.

The lock implementation uses the dict:add interface to try to set the key. If the key does not exist in shared memory, add succeeds, which means the lock has been acquired. For the other concurrent requests that reach the dict:add line, the call fails, and based on the returned error information they can either return immediately or retry several times.
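
If you want to see this behavior in isolation, here is a minimal sketch of using a shared dictionary's add as a try-lock; it assumes an shdict named locks, and the key name is purely illustrative:

```lua
local dict = ngx.shared.locks

-- add() only succeeds if the key does not already exist, so it works as an atomic try-lock
local ok, err = dict:add("my_lock_key", true, 10)   -- auto-expires after 10 seconds
if ok then
    -- we "hold the lock": query the data source, update the cache, then release
    dict:delete("my_lock_key")
else
    -- err is "exists" when another request already holds the lock
    ngx.log(ngx.INFO, "lock busy: ", err)
end
```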

lua-resty-shcache #

However, with the lua-resty-lock approach above, you still have to handle locking, unlocking, fetching stale data, retrying, exception handling, and so on yourself, which is quite cumbersome. So here I will introduce another simple encapsulation: lua-resty-shcache.

lua-resty-shcache is a lua-resty library open-sourced by Cloudflare. It adds a layer of encapsulation on top of the shared dictionary and external storage, and also provides serialization and deserialization interfaces, so that you don’t have to worry about the details mentioned above:

    local shcache = require("shcache")

    local my_cache_table = shcache:new(
            ngx.shared.cache_dict,
            { external_lookup = lookup,
              encode = cmsgpack.pack,
              decode = cmsgpack.decode,
            },
            { positive_ttl = 10,           -- cache good data for 10s
              negative_ttl = 3,            -- cache failed lookup for 3s
              name = 'my_cache',     -- "named" cache, useful for debug / report
            }
        )

    local my_table, from_cache = my_cache_table:load(key)

This example code comes from the official example, with the details hidden to help you grasp the key points. In fact, this caching library is not our best choice, but it is well suited for beginners, so I introduced it first. In the next lesson, I will introduce several other better and more commonly used encapsulations, so that we can pick the most appropriate one.
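
For completeness, here is a rough sketch of what the hidden pieces might look like. The lookup function, SQL, and helper names are assumptions made for illustration; check the shcache README for the exact callback contract:

```lua
-- a rough sketch of the pieces hidden above (names are illustrative)
local cmsgpack = require "cmsgpack"

-- shcache calls external_lookup on a cache miss to fetch data from the external store
local lookup = function ()
    -- hypothetical DB helper returning the value to be cached, or nil on failure
    return query_db("SELECT value FROM my_table WHERE id = 1")
end
```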

Nginx Configuration Directive #

In addition, even if you don’t use OpenResty’s lua-resty libraries, you can use Nginx configuration directives to implement locking and serve stale data, namely proxy_cache_lock and proxy_cache_use_stale. However, I do not recommend the Nginx-directive approach, as it is clearly less flexible, and its performance is not as good as Lua code.
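
For reference, here is a minimal nginx.conf sketch of these two directives; the cache zone name, path, and upstream are placeholders:

```nginx
proxy_cache_path /tmp/cache keys_zone=my_cache:10m;

server {
    location / {
        proxy_cache           my_cache;
        proxy_cache_valid     200 60s;
        # only one request per cache key is allowed to populate the cache at a time
        proxy_cache_lock      on;
        proxy_cache_lock_timeout 5s;
        # serve the stale copy while the entry is being refreshed or the upstream fails
        proxy_cache_use_stale updating error timeout;
        proxy_pass            http://backend;   # placeholder upstream
    }
}
```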

Conclusion #

In this lesson, I mainly introduced cache storms and several approaches to handling them. Like the race issues we mentioned before, cache storms are difficult to discover through code review and testing. The best approach is to improve our coding habits or use a well-encapsulated library to solve this class of problems.

Finally, I will leave you with a homework question: how do you handle cache storms or similar issues in the languages and platforms you are familiar with? Are there better ideas and methods than OpenResty’s? Feel free to leave a comment and discuss with me. You are also welcome to share this article with your colleagues and friends, so we can learn and make progress together.