
02 Cache Consistency How to Deal with Data Update Cache Inconsistencies in Read-Heavy Write-Light Situations #

Hello, I’m Xu Changlong. Let’s continue exploring caching techniques for improving performance in the user center.

In the previous lesson, we categorized and organized the data to make it more cache-friendly. To reduce the pressure on the database, we need to gradually introduce caching to the system. Therefore, in this lesson, I will show you how to use temporary caching or persistent caching to handle high-concurrency queries in various user center business scenarios. This will help you master the relevant techniques for ensuring cache data consistency under high-concurrency traffic.

As we mentioned before, most data in internet business scenarios is read-heavy and write-light: writes often account for only one percent, or even one thousandth, of overall requests.

For the user center, the ratio is even more skewed, since users rarely update their own information or passwords. This read-heavy, write-light pattern is therefore especially suitable for read caching: caching can significantly reduce query pressure on the data layer and deliver much better concurrent query performance. However, caching often introduces data-inconsistency problems after updates. Let’s take a closer look at this problem.

Cost-effectiveness of Caching #

Can caching be applied indiscriminately? This interesting question came up while we were optimizing the user center.

As mentioned earlier, we believed that caching user information would greatly improve performance, so our initial plan was to cache the user account table in the user center. This table holds 20 million records and is used mainly during login: we query it with the account and password the user submits, confirming that the credentials are correct and checking whether the account is banned, to decide whether the user may log in:

# Table structure
CREATE TABLE `accounts` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `account` varchar(15) NOT NULL DEFAULT '',
  `password` char(32) NOT NULL,
  `salt` char(16) NOT NULL,
  `status` tinyint(3) NOT NULL DEFAULT '0',
  `update_time` int(10) NOT NULL DEFAULT '0',
  `create_time` int(10) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

# Login query
select id, account, update_time from accounts 
where account = 'user1'
and password = '6b9260b1e02041a665d4e4a5117cfe16'
and status = 1

This is a very simple query. You may think: if we cache all 20 million user records, system performance should improve significantly.

This idea is partly right, but not entirely: the cost-effectiveness is poor. The table is used almost exclusively for account login, and even frequent logins do not generate much traffic, so the cache would sit idle most of the time. There is no need to put rarely accessed data into the cache; it would waste our budget.

This leads to a crucial question: we must weigh the cost-effectiveness of caching. If we painstakingly cache data that does not actually improve system performance, we waste both time and money. We need to evaluate what caching buys us; generally, only hot data is worth keeping in a cache.

Temporary Caching #

After rejecting the idea of putting all account information into the cache, we shifted our focus to information that would be frequently queried, which is user information.

User information is used frequently and will be queried and displayed in many scenarios. For example, the avatar, nickname, and gender of a user that we see on a forum are all data that needs to be displayed frequently. However, the total amount of this data is large, and putting it all into the cache would be a waste of space.

For this type of data, I suggest using a temporary cache: when user information is first read, it is also written into the cache, so that similar queries arriving shortly afterwards can be served straight from the cache. This effectively reduces query pressure on the database. A common implementation of temporary caching looks like this:

// Try to directly retrieve user information from the cache
userinfo, err := Redis.Get("user_info_9527")
if err != nil {
  return nil, err
}

// Cache hit, directly return the user information
if userinfo != nil {
  return userinfo, nil
}

// Cache miss, retrieve from the database
userinfo, err = userInfoModel.GetUserInfoById(9527)
if err != nil {
  return nil, err
}

// User information found
if userinfo != nil {
  // Cache the user information and set a TTL (Time to Live) timeout for it to expire in 60 seconds
  Redis.Set("user_info_9527", userinfo, 60)
  return userinfo, nil
}

// Not found, insert an empty data into the cache to avoid redundant queries to the database within a short period of time
// Optional, this is used to prevent cache penetration and query attacks
Redis.Set("user_info_9527", "", 30)
return nil, nil

As you can see, the data lives in the cache only temporarily: after the 60-second TTL it is evicted, and the next query for the same data refills the cache. This temporary-caching approach suits tables that are large overall but contain only a small set of hot rows, and it effectively relieves the database of hot-data traffic.

The reason for setting a TTL is to save memory: data that goes unused for a while is evicted, so we do not need to buy large amounts of memory. This approach is cost-effective, simple to maintain, and widely used.

Problem of Delayed Cache Updates #

Temporary caches have a TTL, which means that if a user’s nickname is modified within the 60-second window, the cache is not updated immediately; in the worst case the cached nickname is refreshed only 60 seconds later, which obviously causes trouble for the system. In fact, there are several situations for refreshing this kind of cached data, each with its own method. Let me explain them one by one.

1. Refreshing the Cache for a Single Entity #

Refreshing the cache for a single entity is the simplest case. For example, if we have cached the information of the user with ID 9527, then whenever we modify that data, we update the corresponding cache along with the database write:

type UserInfo struct {
    Id         int    `gorm:"column:id;type:int(11);primary_key;AUTO_INCREMENT" json:"id"`
    Uid        int    `gorm:"column:uid;type:int(4);NOT NULL" json:"uid"`
    NickName   string `gorm:"column:nickname;type:varchar(32);NOT NULL" json:"nickname"`
    Status     int16  `gorm:"column:status;type:tinyint(4);default:1;NOT NULL" json:"status"`
    CreateTime int64  `gorm:"column:create_time;type:bigint(11);NOT NULL" json:"create_time"`
    UpdateTime int64  `gorm:"column:update_time;type:bigint(11);NOT NULL" json:"update_time"`
}

// Update the user's nickname
func (m *UserInfo) UpdateUserNickname(ctx context.Context, name string, uid int) (bool, error) {
    // Update the database first
    ret, err := m.db.UpdateUserNickNameById(ctx, uid, name)
    if ret {
        // Then delete the cache so it is refreshed on the next read; deleting
        // rather than rewriting prevents stale data from entering the cache
        // under concurrent modifications. This method is fast, simple to use,
        // and cheap to maintain.
        Redis.Del("user_info_" + strconv.Itoa(uid))
    }
    return ret, err
}

In summary, we first identify the ID of the modified data, and then delete the modified data cache based on the ID. When the next request arrives, the latest data will be updated in the cache, which effectively reduces the possibility of dirty data being brought into the cache by concurrent operations.

Beyond that, we can publish update messages to a queue and let subsystems refresh their own caches, or build middleware that forwards data-change events to subsystems, which then decide the scope of the refresh.

However, sending update messages to a queue runs into a problem: a conditional update that touches multiple rows cannot know in advance which IDs were modified. The common practice is to first query for all the related IDs using the same condition, then perform the update, and finally refresh the cache for each of those IDs, as sketched below.
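Here is a minimal sketch of this select-then-update-then-invalidate pattern, assuming the gorm ORM and the go-redis client; BanUsersByStatus, the status values, and the key format are hypothetical, while UserInfo is the model defined above.

import (
    "context"
    "strconv"

    "github.com/go-redis/redis/v8"
    "gorm.io/gorm"
)

// BanUsersByStatus is a hypothetical conditional update that touches many rows.
func BanUsersByStatus(ctx context.Context, db *gorm.DB, rdb *redis.Client) error {
    // Step 1: collect the IDs the condition will touch, before updating.
    var uids []int
    err := db.WithContext(ctx).Model(&UserInfo{}).
        Where("status = ?", 0).Pluck("uid", &uids).Error
    if err != nil {
        return err
    }

    // Step 2: perform the conditional update itself.
    err = db.WithContext(ctx).Model(&UserInfo{}).
        Where("status = ?", 0).Update("status", 2).Error
    if err != nil {
        return err
    }

    // Step 3: invalidate the cache entry of every affected ID.
    for _, uid := range uids {
        rdb.Del(ctx, "user_info_"+strconv.Itoa(uid))
    }
    return nil
}

Note that rows may begin matching the condition between the select and the update; running both statements inside one transaction (or selecting with FOR UPDATE) closes that gap.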

2. Cache Refresh for Relational and Statistical Data #

There are many methods for cache refresh of relational or statistical data, and I will explain some of the most commonly used ones.

First is manual cache maintenance. Refreshing caches for relational data or statistical results is difficult because these values are computed from multiple rows: when the underlying data changes, it is hard to tell which related caches must be refreshed. We therefore record, or define in one central place, the special refresh logic for all the relevant caches; a minimal registry of this kind is sketched after the figure.

[Figure: manually maintaining the refresh logic for related caches in one place]
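As a rough illustration of keeping all refresh logic in one place, the sketch below maps each table to the cache keys its rows affect; the table names, key formats, and go-redis client are all assumptions.

import (
    "context"
    "strconv"

    "github.com/go-redis/redis/v8"
)

// cacheDeps records, in one place, which cache keys depend on each table.
var cacheDeps = map[string]func(uid int) []string{
    "user_info": func(uid int) []string {
        return []string{
            "user_info_" + strconv.Itoa(uid),         // the entity cache itself
            "forum_post_author_" + strconv.Itoa(uid), // a related display cache
        }
    },
    "user_score": func(uid int) []string {
        // A statistic computed from many rows: only one shared key to drop.
        return []string{"user_rank_summary"}
    },
}

// invalidate drops every cache key registered for the given table and uid.
func invalidate(ctx context.Context, rdb *redis.Client, table string, uid int) {
    if keyFn, ok := cacheDeps[table]; ok {
        rdb.Del(ctx, keyFn(uid)...)
    }
}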

However, this method is complex: if many caches need refreshing, updates become slow and delayed. Manually written logic must also work out every ID associated with newly added data, which makes maintenance troublesome.

In addition to manual cache maintenance, another way is to identify ID data changes through subscribing to the database. As shown in the following diagram, we can use Maxwell or Canal to monitor updates to MySQL.

[Figure: Maxwell/Canal subscribing to MySQL binlog changes and pushing them to Kafka for the cache-update service]

This way, the change information will be pushed to Kafka, and we can determine the data IDs involved in the updates based on the corresponding table and specific SQL. Then, we can update the corresponding keys of the cache based on the pre-set logic in the scripts. For example, if a user updates their nickname, the cache update service will know that the cache for “user_info_9527” needs to be updated, and it will also find and delete all other related caches based on the configuration.

Clearly, the benefit of this approach is that simple caches are updated promptly and the core system can broadcast its data changes to subsystems; the code is not complex either. The disadvantage is that refreshing complex relational caches still requires manually coded logic, as the sketch below illustrates.
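To make the pipeline concrete, here is a hedged sketch of such a cache-update consumer. It assumes Maxwell’s JSON event format, the segmentio/kafka-go client, and go-redis; the broker address, topic, and consumer-group names are placeholders.

import (
    "context"
    "encoding/json"
    "fmt"

    "github.com/go-redis/redis/v8"
    "github.com/segmentio/kafka-go"
)

// maxwellEvent models the fields of a Maxwell change event that we care about.
type maxwellEvent struct {
    Table string                 `json:"table"`
    Type  string                 `json:"type"` // insert / update / delete
    Data  map[string]interface{} `json:"data"`
}

func runCacheRefresher(ctx context.Context, rdb *redis.Client) error {
    r := kafka.NewReader(kafka.ReaderConfig{
        Brokers: []string{"127.0.0.1:9092"}, // placeholder broker
        Topic:   "maxwell",                  // placeholder topic
        GroupID: "cache-refresher",
    })
    defer r.Close()

    for {
        msg, err := r.ReadMessage(ctx)
        if err != nil {
            return err // context cancelled or fatal read error
        }
        var ev maxwellEvent
        if err := json.Unmarshal(msg.Value, &ev); err != nil {
            continue // skip malformed events
        }
        // For user_info changes, drop the per-user cache so the next read refills it.
        if ev.Table == "user_info" {
            if uid, ok := ev.Data["uid"].(float64); ok { // JSON numbers decode as float64
                rdb.Del(ctx, fmt.Sprintf("user_info_%d", int(uid)))
            }
        }
    }
}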

If data updates in our table are rare, we can use the version-based cache design.

This approach is straightforward: whenever anything in the table is updated, all cached entries for that table expire together. For example, set a key for the user_info table, say “user_info_version”. Whenever we update data in this table, we increment “user_info_version” by 1; and whenever we write an entry to the cache, we also record the current value of user_info_version inside the cached data.

When a business reads a specific user from user_info, it also fetches the table’s current version; if the version recorded in the cached entry differs from the table’s current version, the entry is treated as stale and refreshed. However, if the version changes frequently, the cache hit rate drops sharply, so this approach suits tables that are rarely updated. A minimal sketch follows.
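This sketch of the version-key design assumes go-redis; the entry layout and helper names are illustrative.

import (
    "context"
    "encoding/json"

    "github.com/go-redis/redis/v8"
)

// versionedEntry stores the table version alongside the cached payload.
type versionedEntry struct {
    Version int64  `json:"version"`
    Payload string `json:"payload"`
}

// getWithVersion returns (payload, hit); a version mismatch counts as a miss.
func getWithVersion(ctx context.Context, rdb *redis.Client, key string) (string, bool, error) {
    cur, err := rdb.Get(ctx, "user_info_version").Int64()
    if err != nil && err != redis.Nil {
        return "", false, err
    }
    raw, err := rdb.Get(ctx, key).Result()
    if err == redis.Nil {
        return "", false, nil // plain miss: the caller reloads from the database
    }
    if err != nil {
        return "", false, err
    }
    var e versionedEntry
    if json.Unmarshal([]byte(raw), &e) != nil || e.Version != cur {
        return "", false, nil // stale or corrupt entry: treat as a miss
    }
    return e.Payload, true, nil
}

// bumpVersion is called after any write to user_info; one INCR expires
// every cached entry of the table at once.
func bumpVersion(ctx context.Context, rdb *redis.Client) error {
    return rdb.Incr(ctx, "user_info_version").Err()
}

When refilling after a miss, the caller writes the payload back together with the version it just read, so the entry is automatically treated as stale after the next bump.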

Of course, we can also split the table into blocks, for example assigning one version per ID range, to narrow the scope and reduce the frequency of cache refreshes.

[Figure: splitting the table into ID ranges, each with its own version key]

In addition, relational-data caches can be refreshed by identifying the primary entity ID. This requires that every related cache be keyed by the primary entity’s ID, so that whenever any associated data changes, all of those caches can be invalidated through that one ID, as in the sketch after the figure. The drawback is that the cache layer must be able to derive the primary entity ID from whatever data was modified.

[Figure: keying all related caches by the primary entity ID so one change invalidates them all]
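For instance (go-redis assumed, key names illustrative), if every cache related to a user is keyed by that user’s ID, a single call can clear them all:

import (
    "context"
    "strconv"

    "github.com/go-redis/redis/v8"
)

// invalidateByPrimaryID drops every cache keyed by the owning user's ID.
func invalidateByPrimaryID(ctx context.Context, rdb *redis.Client, uid int) {
    id := strconv.Itoa(uid)
    rdb.Del(ctx,
        "user_info_"+id,   // the entity itself
        "user_orders_"+id, // relational data keyed by the same uid
        "user_stats_"+id,  // statistics keyed by the same uid
    )
}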

Finally, let me introduce one more approach: asynchronously traversing the database with a script to refresh all related caches. It suits synchronizing data between two systems and reduces interface calls between them. The disadvantage is that after data is deleted, the corresponding cache must be removed by hand, so updates lag behind; combined with subscribing to broadcast update messages, however, it can achieve near-real-time synchronization.

[Figure: a script asynchronously traversing the database to refresh related caches]

Long-term Hot Data Caching #

Now, let’s look back at the temporary-caching pseudocode we discussed earlier. It solves most problems, but consider this: if a large number of requests miss the cache, will the traffic that falls through to the database overwhelm it? This is the cache-penetration problem often discussed in the industry: a large-scale concurrent penetration of the cache can bring our service down.

Therefore, if the database cannot withstand the usual traffic, we cannot use the temporary cache approach to design the cache system. We can only use the long-term cache approach to implement hot data caching in order to avoid the problem of cache penetration overwhelming the database. However, to achieve long-term caching, we need to do more work manually to maintain the consistency between the cache and the data in the data table.

It is worth noting that long-term caching became popular only after the rise of NoSQL, because its implementation differs from temporary caching: the business must largely bypass the database, finding all the data it needs in the cache during operation, while also ensuring the cached data is not lost while in use.

As a result, we need to know exactly what data is in the cache and preheat these data in advance. Of course, if the data scale is small, we can consider caching all the data, which would be relatively simpler.

To deepen our understanding and demonstrate special techniques, let’s take a look at an interesting implementation of “temporary cache + long-term hot cache”. This approach may have a small-scale cache penetration and the code is relatively complex, but overall, the cost is relatively low:

// Try to directly retrieve user information from the cache
userinfo, err := Redis.Get("user_info_9527")
if err != nil {
  return nil, err
}

// If the cache hit, return the user information directly
if userinfo != nil {
  return userinfo, nil
}

// Check whether the current key is hot data
// We use a Redis set here instead of a Bloom filter because Bloom filters
// have false positives; once the number of hot keys passes roughly a thousand,
// a Bloom filter becomes the more economical choice
// This check can also live in the business-logic code, kept in sync with configuration
isHotKey, err := Redis.SISMEMBER("hot_key", "user_info_9527")
if err != nil {
  return nil, err
}

// The key is hot but absent from the cache
if isHotKey {
  // Hot data is always preloaded, so a miss means the data
  // does not exist (it may have been deleted)
  return "", nil
}

// The cache missed and the key is not hot, so this is the
// temporary-cache path: load the data from the database
// Take the update lock first: SET user_info_9527_lock 1 NX EX 5
// This prevents multiple threads from querying the database
// concurrently and overloading it
lock, err := Redis.Set("user_info_9527_lock", "1", "nx", 5)
if !lock {
  // If the lock is not obtained, wait for 1 second, 
  // then fetch the result again, similar to the implementation of singleflight
  // Common caching services have strong read concurrency capabilities, 
  // but their write concurrency capabilities are not good, 
  // too high parallel refreshing will saturate the cache
  time.Sleep(time.Second)
  // Get the data after waiting for 1 second, 
  // this data is filled in by the request that obtains the lock
  // This way, the database load is reduced
  userinfo, err := Redis.Get("user_info_9527")
  if err != nil {
    return nil, err
  }
  return userinfo, nil
}

// Get the lock, query the database, and then fill it into the cache
userinfo, err = userInfoModel.GetUserInfoById(9527)
if err != nil {
  return nil, err
}

// User information found
if userinfo != nil {
  // Cache the user information and set a TTL timeout to expire it after 60 seconds
  Redis.Set("user_info_9527", userinfo, 60)
  return userinfo, nil
}

// Not found, put an empty data in, do not query the database in the short term
Redis.Set("user_info_9527", "", 30)
return nil, nil

As you can see, this approach combines long-term and temporary caching. When a query misses the cache, the long-term (hot) path returns “not found” directly, while the temporary path goes through the normal refill process. In other words, if the requested user information belongs to a hot key but is absent from the cache, we return “data does not exist” without touching the database.

During refills, to keep high-concurrency queries from overwhelming the database, we apply a simple singleflight (request-merging) optimization to the update path: only the thread that grabs the cache-update lock queries the database and fills the result into the cache. Threads that fail to obtain the lock sleep for one second and then read the result directly from the cache. This prevents many threads from querying the database for the same data at once and saturating the cache and database services (a cache’s write concurrency is much weaker than its read performance). Within a single process, the same merging can be expressed with Go’s singleflight package, as sketched below.
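Here is a hedged sketch using golang.org/x/sync/singleflight and go-redis; loadUserFromDB is a hypothetical stand-in for the lesson’s userInfoModel.GetUserInfoById.

import (
    "context"
    "strconv"
    "time"

    "github.com/go-redis/redis/v8"
    "golang.org/x/sync/singleflight"
)

var group singleflight.Group

// loadUserFromDB stands in for the lesson's userInfoModel.GetUserInfoById;
// replace it with the real data-layer call.
func loadUserFromDB(ctx context.Context, uid int) (string, error) {
    return "", nil // placeholder
}

func getUserInfo(ctx context.Context, rdb *redis.Client, uid int) (string, error) {
    key := "user_info_" + strconv.Itoa(uid)
    if val, err := rdb.Get(ctx, key).Result(); err == nil {
        return val, nil // cache hit
    } else if err != redis.Nil {
        return "", err
    }
    // Every goroutine asking for the same key shares one database query.
    v, err, _ := group.Do(key, func() (interface{}, error) {
        info, err := loadUserFromDB(ctx, uid)
        if err != nil {
            return nil, err
        }
        rdb.Set(ctx, key, info, 60*time.Second)
        return info, nil
    })
    if err != nil {
        return "", err
    }
    return v.(string), nil
}

Note that singleflight merges calls only within one process; the Redis lock in the code above is what protects the database when many servers try to refill the same key.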

In addition, the hot_key list (the set of keys under long-term caching) is replicated across multiple Redis instances; a reader picks one shard at random to fetch the full configuration.

These hot keys come from statistical analysis of data-access traffic over a period of time. An asynchronous script periodically scans the hot-key list and proactively pushes fresh data into the cache with a long TTL, so a hot entry never expires while it stays on the list. Once a key’s popularity fades, it is removed from the set, freeing space for other data. A sketch of such a refresher follows.
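This sketch reuses the hypothetical loadUserFromDB from the singleflight example; the scan interval, TTL, and key format are all assumptions.

import (
    "context"
    "strconv"
    "strings"
    "time"

    "github.com/go-redis/redis/v8"
)

func refreshHotKeys(ctx context.Context, rdb *redis.Client) {
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
        }
        keys, err := rdb.SMembers(ctx, "hot_key").Result()
        if err != nil {
            continue // skip this round on error
        }
        for _, key := range keys {
            // Keys look like "user_info_9527"; recover the uid from the suffix.
            uid, err := strconv.Atoi(strings.TrimPrefix(key, "user_info_"))
            if err != nil {
                continue
            }
            info, err := loadUserFromDB(ctx, uid)
            if err != nil {
                continue
            }
            // Push proactively with a long TTL so the key never expires
            // as long as this script keeps refreshing it.
            rdb.Set(ctx, key, info, 10*time.Minute)
        }
    }
}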

Of course, if we have a large cache cluster and all our data is hot data, we can completely separate from the database and store the data directly in the cache for external service. In this way, we will achieve better throughput and concurrency.

Finally, there is one more way to absorb high-concurrency hot-spot queries: deploy a small Redis instance on each business server to hold hot data, synchronized to each server by a script. Every query checks the local Redis first and only falls back to the large shared cache on a miss, which improves cache read performance, as sketched below.
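A sketch of that two-level lookup, assuming go-redis clients for both the server-local instance and the shared cluster; the short local TTL is an assumption to keep stale hot data from lingering.

import (
    "context"
    "time"

    "github.com/go-redis/redis/v8"
)

// getWithLocalCache checks the per-server Redis first, then the shared cluster.
func getWithLocalCache(ctx context.Context, local, central *redis.Client, key string) (string, error) {
    if val, err := local.Get(ctx, key).Result(); err == nil {
        return val, nil // served from the local hot copy
    }
    val, err := central.Get(ctx, key).Result()
    if err != nil {
        return "", err // redis.Nil here means a true miss in both tiers
    }
    // Keep the local copy short-lived so stale hot data ages out quickly.
    local.Set(ctx, key, val, 10*time.Second)
    return val, nil
}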

Summary #

Through this lesson, I hope you understand that not all data benefits from being cached: we need to weigh data volume, access frequency, and cache hit rate. Caching data with a high read-to-write ratio reduces pressure on the data layer, but the cached data must be refreshed according to our consistency requirements. A single entity’s cache is the easiest to update, while statistical results from conditional queries are hard to refresh in real time.

In addition, if the database cannot withstand even the baseline traffic, we must promote some hot data to long-term caching to keep large numbers of requests from bypassing the cache and destabilizing the service. Meanwhile, we use the singleflight approach to keep temporary caches from being penetrated by bursts of requests, so that hot data does not crush the database before it graduates from temporary cache to hot cache.

I have also created a mind map of caching techniques for data with a high read-to-write ratio:

[Figure: mind map of caching techniques for read-heavy, write-light data]

Reflection Questions #

  1. When using a Bloom Filter to identify hot keys, the filter may occasionally misjudge a key as hot (a false positive), so a cache miss on that key returns “not found” even though the data exists in the database. How can this situation be avoided?

  2. A Bloom Filter only supports adding keys; a specific key cannot be removed once inserted. Is there a better way to update and maintain the Bloom Filter?

Feel free to discuss and exchange ideas in the comments section. See you in the next class!