39 High-Performance Secrets: ‘shared dict’ Cache and ‘lru’ Cache #

Hello, I’m Wen Ming.

In the previous lessons, I introduced OpenResty’s optimization techniques and performance-tuning tools, covering strings, tables, the Lua API, LuaJIT, SystemTap, flame graphs, and so on.

These are the foundations of system optimization, and it’s crucial that you master them. However, understanding them alone is not enough to handle real-world business scenarios. In a slightly more complex business, maintaining high performance is systematic work that goes beyond code and gateway optimization: it involves databases, networks, protocols, caching, disks, and more, which is exactly where architects come in.

In today’s lesson, let’s look at caching, a key component in performance optimization, and see how it is used and optimized in OpenResty.

Caching #

At the hardware level, most computer hardware uses caching to improve speed: CPUs have multiple levels of cache, and RAID cards have read/write caches. At the software level, the databases we use are a good example of well-designed caching; it is involved in SQL statement optimization, index design, and all kinds of disk I/O operations.

Here, I also suggest that before designing your own cache, you first understand the various caching mechanisms in MySQL. I recommend the book “High Performance MySQL.” I benefited greatly from this book when I was in charge of databases many years ago, and later I also borrowed from MySQL’s design in many other optimization scenarios.

Speaking of caching, we know that a production environment caching system needs to find the best solution based on its own business scenarios and system bottlenecks. This is a balancing act.

Generally speaking, there are two principles for caching.

  • First, the closer the cache is to the user’s request, the better. For example, if the data can be cached locally, there is no need to send an HTTP request; if it can be cached in a CDN, there is no need to hit the origin server; if it can be cached in OpenResty, there is no need to hit the database.
  • Second, prefer in-process and local caching solutions. Crossing processes, machines, or even data centers adds network overhead to every cache access, which becomes particularly apparent under high concurrency.

Naturally, the design and use of caching in OpenResty also follow these two principles. OpenResty has two caching components: the shared dict cache and the LRU cache. The former can only cache string values, and there is one and only one copy of the cached data, accessible from every worker, so it is commonly used for data communication between workers. The latter can cache any Lua object, but it is only accessible within a single worker process: there are as many copies of the cached data as there are worker processes.

The following table summarizes the differences between shared dict and LRU cache:

                          shared dict cache                    LRU cache
  Cacheable data types    string values only                   all Lua objects
  Visibility              shared by all worker processes       a single worker process
  Copies of cached data   exactly one                          one per worker process
  Typical use             data communication between workers   worker-local caching

There is no definitive answer as to whether shared dict or LRU cache is better; it depends on your scenario.

  • If you don’t need to share data between workers, then the LRU cache, which can cache complex data types such as arrays and functions and has the best performance, is the preferred option.
  • However, if you need to share data between workers, you can add a shared dict cache on top of the LRU cache to create a two-level cache architecture, as sketched below.
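
Here is a minimal sketch of such a two-level lookup, assuming a “dogs” shared dict declared in nginx.conf; fetch_from_db is a hypothetical loader standing in for your real data source:

local lrucache = require "resty.lrucache"

local lru = lrucache.new(200)       -- level 1: per-worker LRU cache
local shared = ngx.shared.dogs      -- level 2: shared dict, one copy for all workers

local function get_cached(key)
    -- level 1: worker-local, no locks and no serialization
    local value = lru:get(key)
    if value then
        return value
    end

    -- level 2: shared dict, visible to every worker
    value = shared:get(key)
    if value then
        lru:set(key, value, 60)     -- repopulate this worker's LRU cache
        return value
    end

    -- miss on both levels: hit the data source and fill both caches
    value = fetch_from_db(key)      -- hypothetical
    if value then
        shared:set(key, value, 60)
        lru:set(key, value, 60)
    end
    return value
end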

Next, let’s take a closer look at these two caching methods.

Shared Dictionary Cache #

In the Lua section, we have already provided a detailed introduction to shared dict. Here, let’s briefly review its usage:

$ resty --shdict='dogs 1m' -e 'local dict = ngx.shared.dogs
                               dict:set("Tom", 56)
                               print(dict:get("Tom"))'

You need to declare a memory zone called “dogs” in the Nginx configuration file before you can use it in Lua code. If you find during use that the space allocated for “dogs” is not enough, you need to modify the Nginx configuration file and then reload Nginx for the change to take effect: we cannot expand or shrink the shared dictionary at runtime.
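
For reference, the declaration in nginx.conf is a single directive inside the http block; the --shdict option of the resty CLI above generates an equivalent configuration for you:

http {
    # a 1m shared memory zone named "dogs", visible to all workers;
    # resizing it requires editing this file and reloading Nginx
    lua_shared_dict dogs 1m;
}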

Now, let’s focus on several performance-related issues in shared dictionary cache.

Serialization of Cached Data #

The first issue is the serialization of cached data. Since a shared dict can only cache string values, if you want to cache an array, you must serialize it when calling set and deserialize it when calling get:

resty --shdict='dogs 1m' -e 'local dict = ngx.shared.dogs
                             dict:set("Tom", require("cjson").encode({a=111}))
                             print(require("cjson").decode(dict:get("Tom")).a)'

However, these serialization and deserialization operations consume a significant amount of CPU resources. If there are several such operations in each request, you can clearly see their impact on the flame graph.

So, how can we avoid this overhead in the shared dictionary? There is no good solution here: either avoid putting arrays in the shared dict at the business level, or manually concatenate the strings into JSON format yourself. Of course, the latter brings its own string-concatenation overhead and may introduce more bugs.

Most serialization can be decomposed at the business level: you can split the contents of the array and store them in the shared dictionary separately as strings, as shown below. If that doesn’t work, you can also cache the array in the LRU cache, trading memory space for convenience and performance.
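
For example, instead of serializing a whole table into one value, you can store each field under its own key. A minimal sketch (the “dog:*” key names are invented for illustration):

resty --shdict='dogs 1m' -e 'local dict = ngx.shared.dogs
                             -- store each field as a plain string or number
                             dict:set("dog:name", "Tom")
                             dict:set("dog:age", 56)
                             -- reading back needs no deserialization at all
                             print(dict:get("dog:name"))
                             print(dict:get("dog:age"))'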

In addition, the keys in the cache should be as short and meaningful as possible. This not only saves space but also facilitates subsequent debugging.

Stale Data #

The shared dictionary also provides a get_stale method for reading data. Compared with the get method, it can additionally return data that has already expired:

resty --shdict='dogs 1m' -e 'local dict = ngx.shared.dogs
                             dict:set("Tom", 56, 0.01)
                             ngx.sleep(0.02)
                             local val, flags, stale = dict:get_stale("Tom")
                             print(val)'

In the above example, the data is cached in the shared dictionary for only 0.01 seconds, so 0.02 seconds after the set it has already expired. At this point, get will not return the data, but get_stale may still return the expired value. It is only “may” because, following the LRU algorithm, there is a certain probability that the space occupied by expired data has already been reclaimed and reused for other data.

You may wonder, what is the use of retrieving expired data? Don’t forget that we store cached data in the shared dict, and even if the cached data expires, it does not necessarily mean that the source data has been updated.

For example, the data source is stored in MySQL. After obtaining the data from MySQL and setting a timeout of 5 seconds in the shared dict, we have two options when the data expires:

  • When the data does not exist, query MySQL again and put the result into the cache.
  • Verify if the data in MySQL has changed. If there are no changes, retrieve the expired data from the cache, modify its expiration time, and keep it effective.

Obviously, the latter is the more optimized solution: it minimizes the interaction with MySQL and lets end-user requests fetch data from the fastest cache.
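
Here is a rough sketch of the second option, where source_changed stands in for whatever check your business runs against MySQL (a version column, an updated-at timestamp, and so on) and query_mysql is a hypothetical loader:

local dict = ngx.shared.dogs

local function get_data(key)
    local value = dict:get(key)
    if value then
        return value
    end

    -- expired (or never cached): try the stale copy first
    local stale_val, _, is_stale = dict:get_stale(key)
    if is_stale and not source_changed(key) then
        -- the source has not changed, so renew the stale entry
        -- instead of querying MySQL for the same content again
        dict:set(key, stale_val, 5)
        return stale_val
    end

    -- the source changed, or the stale copy was already evicted
    local new_val = query_mysql(key)
    if new_val then
        dict:set(key, new_val, 5)
    end
    return new_val
end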

At this point, how to determine whether the data in the data source has changed becomes a problem that we need to consider and solve. Next, let’s take an LRU cache as an example to see how an actual project solves this problem.

LRU Cache #

The LRU cache interface has only five methods: new, set, get, delete, and flush_all. For the problem above, the one we care about is get. Let’s take a look at how this method is used:

resty -e 'local lrucache = require "resty.lrucache"
local cache, err = lrucache.new(200)
cache:set("dog", 32, 0.01)
ngx.sleep(0.02)
local data, stale_data = cache:get("dog")
print(stale_data)'

As you can see, in the LRU cache, the second return value of the get method is directly stale_data, instead of being split into separate get and get_stale APIs like in shared dictionaries. This kind of interface encapsulation is obviously more user-friendly for using expired data.

In real-world projects, it is generally recommended to use version numbers to distinguish different data, so that when the data changes, its version number changes with it. For example, in etcd, the modifiedIndex can serve as a version number to indicate whether the data has changed. With the concept of version numbers, we can build a simple wrapper around the LRU cache, as shown in the following pseudocode, adapted from https://github.com/iresty/apisix/blob/master/lua/apisix/core/lrucache.lua:

local function fetch(key, version, create_obj_fun, ...)
    local obj, stale_obj = lru_obj:get(key)
    -- If the data is not expired and the version has not changed, return the cached data directly
    if obj and obj._cache_ver == version then
        return obj
    end

    -- If the data has expired but is still retrievable and the version has not changed,
    -- renew the expired data and return it directly from the cache
    if stale_obj and stale_obj._cache_ver == version then
        lru_obj:set(key, stale_obj, item_ttl)
        return stale_obj
    end

    -- If no usable data is found or the version number has changed, fetch the data from the data source
    local err
    obj, err = create_obj_fun(...)
    if not obj then
        return nil, err
    end

    obj._cache_ver = version
    lru_obj:set(key, obj, item_ttl)
    return obj
end
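
With this wrapper, callers pass the current version along with the key. Using etcd’s modifiedIndex as the version, a call might look like this (the names are illustrative):

-- the route object is rebuilt only when modifiedIndex has changed
local route = fetch(route_key, route_modified_index, create_route_obj, route_conf)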

From this code snippet, you can see that by introducing the concept of version numbers, we can make full use of expired data to reduce pressure on the data source in the absence of version changes, thereby achieving optimal performance.

In addition to the above solution, there is actually another potentially significant optimization point: separate the version number from the key and make it an attribute of the value instead.

We know that the more common practice is to write the version number into the key itself, producing keys like key_1234. However, in the OpenResty environment, this approach is actually inefficient. Why is that?

Let’s take an example. If the version number changes once every minute, then key_1234 becomes key_1235 after one minute, and in one hour 60 different keys and 60 values are generated. This means the Lua GC has to collect the Lua objects behind the 59 obsolete key-value pairs. If your updates are more frequent, the creation and GC of objects will obviously consume even more resources.

Of course, these costs can easily be avoided by moving the version number from the key into the value. This way, regardless of how frequently a key is updated, there are only two fixed Lua objects. This optimization technique is very clever; but behind such a simple, clever technique lies a deep understanding of the OpenResty API and its caching mechanisms.
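
To make the contrast concrete, here is a sketch of the two styles (the variable names are invented for illustration):

-- version in the key: every version change creates a new key and a new
-- value, and the old pair is left behind for the Lua GC to collect
lru:set("route#" .. version, route_obj)

-- version in the value: the key never changes, and only the object's
-- _cache_ver field is compared, as in the wrapper above
route_obj._cache_ver = version
lru:set("route", route_obj)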

Conclusion #

It is true that the documentation for OpenResty is quite detailed, but how to combine it with your own business to maximize the optimization results is something you need to explore and figure out on your own. Often, a small phrase or concept in the documentation, such as “stale data”, can make a huge difference in performance.

So, have you had similar experiences while using OpenResty? Feel free to leave a comment and share your thoughts with me. Also, please feel free to share this article so that we can learn and progress together.