08 Case Analysis: How Redis Helps Achieve Lightning-Fast Business Transactions #

In the previous lesson, we used Guava’s LoadingCache as an example to introduce the characteristics of in-memory caching and some precautions for using it. We also looked at the scenarios where caching applies, and these apply equally to distributed caching.

So, what is distributed caching? At its core, it is centralized cache management. If our service runs on multiple nodes, each node keeps its own copy of the in-memory cache. With distributed caching, all nodes share one cache, which saves space and reduces management costs.

In the field of distributed caching, Redis is the most commonly used. Redis supports a wide range of data types, including string, list, set, sorted set, hash, and other commonly used data structures. Of course, it also supports some other data structures such as bitmaps.

Speaking of Redis, we cannot fail to mention another distributed cache, Memcached (hereinafter referred to as MC). MC is rarely used nowadays, but interviewers often ask about the differences between the two. Here is a brief comparison:

(Figure: a comparison table of Redis and Memcached)

Redis is almost a standard in the internet industry. Next, let’s take a quick look at how Redis is used in Spring, and then we’ll introduce how Redis helps us deal with instantaneous traffic in flash sale businesses.

How to Use Redis with Spring Boot #

Using Spring Boot, it’s easy to operate Redis (see complete code in the repository). There are three commonly used Java Redis clients: Jedis, Redisson, and Lettuce. Spring defaults to using Lettuce.

Lettuce is built on Netty and performs its operations asynchronously, which makes it more performant than the widely used Jedis. Redisson is also asynchronous, and it additionally encapsulates many common distributed operations, making it well suited to writing code with business meaning.

Redis can be easily used by adding the following jar dependency:

<dependency> 
    <groupId>org.springframework.boot</groupId> 
    <artifactId>spring-boot-starter-data-redis</artifactId> 
</dependency>

With the above approach, we mainly use the RedisTemplate class, which exposes a dedicated group of methods for each data type, such as opsForValue, opsForHash, and opsForList.

(Figure: RedisTemplate method groups for the different data types)
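
For instance, a minimal sketch of these method groups might look like the following (the keys and values are illustrative; Spring Boot auto-configures the StringRedisTemplate bean used here):

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Service;

import java.util.concurrent.TimeUnit;

@Service
public class RedisDemoService {

    // Spring Boot auto-configures this bean when spring-boot-starter-data-redis is present
    @Autowired
    private StringRedisTemplate redisTemplate;

    public void demo() {
        // string operations with a 10-minute expiration
        redisTemplate.opsForValue().set("user:1:name", "Alice", 10, TimeUnit.MINUTES);
        String name = redisTemplate.opsForValue().get("user:1:name");

        // hash, list, and set operations follow the same pattern
        redisTemplate.opsForHash().put("user:1", "age", "30");
        redisTemplate.opsForList().leftPush("recent:logins", name);
        redisTemplate.opsForSet().add("online:users", "user:1");
    }
}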

Another approach is to use Spring’s cache abstraction, spring-cache. It is annotation-driven and uses AOP to abstract the cache layer, allowing you to switch between various in-memory and distributed caching frameworks. Here are its Maven coordinates:

<dependency> 
    <groupId>org.springframework.boot</groupId> 
    <artifactId>spring-boot-starter-cache</artifactId> 
</dependency>

Similar to spring-cache, there is also Alibaba’s JetCache, which is also very useful.

Using spring-cache involves three steps:

  • Add the @EnableCaching annotation to the startup class.
  • Initialize the caching framework to be used using CacheManager and inject the required resources using the @CacheConfig annotation.
  • Use annotations such as @Cacheable to cache resources.

In our case, we use RedisCacheManager, and since there is currently only one initialization instance, the second step can be omitted.

There are three annotations for caching operations:

  • @Cacheable caches the return value of the method if the value is not present in the cache system.
  • @CachePut caches the return value of the method every time it’s executed.
  • @CacheEvict clears certain cached values when the method is executed.
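
As an illustration, here is a minimal sketch of all three annotations (the User type and UserDao repository are hypothetical):

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.cache.annotation.CacheEvict;
import org.springframework.cache.annotation.CachePut;
import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

@Service
public class UserService {

    @Autowired
    private UserDao userDao; // hypothetical data-access object

    // queries the database only on a cache miss, then caches the result
    @Cacheable(value = "users", key = "#id")
    public User findUser(String id) {
        return userDao.findById(id);
    }

    // always executes and refreshes the cache with the return value
    @CachePut(value = "users", key = "#user.id")
    public User updateUser(User user) {
        return userDao.save(user);
    }

    // removes the corresponding entry from the cache
    @CacheEvict(value = "users", key = "#id")
    public void deleteUser(String id) {
        userDao.deleteById(id);
    }
}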

For a flash sale system, these three annotations alone are too limiting, so we need lower-level APIs such as RedisTemplate to implement the logic. The following sections introduce some important features.

Introduction to Flash Sale Business #

A flash sale is a stress test for normal business processes. It generates a sudden traffic surge: requests that would normally be spread throughout the day must be completed within a few seconds. In some flash sales on JD.com, for example, the inventory may be only a few hundred items while the instantaneous traffic reaches the millions.

(Figure: instantaneous flash sale traffic far exceeding the available inventory)

If participants in the flash sale have to wait a long time, the user experience is terrible. Just imagine the congestion at a highway toll booth and you can empathize with flash sale customers. At the same time, the items being sold become hotspots that are fought over concurrently. For example, if 12306 (China Railway’s official ticket-booking website) relied solely on a database to handle flash sale ticket requests, it would suffer severe lock contention. This is the core challenge of flash sale businesses.

Recall the previous lesson: there is a huge speed mismatch between what the flash sale frontend demands and what the database can deliver, and the flash sale items are hotspots. This is exactly the scenario where caching fits.

There are three key techniques for handling flash sale businesses:

  • First, select the fastest memory as the data write destination.
  • Second, use asynchronous processing instead of synchronous requests.
  • Third, use distributed horizontal scaling.

Now, let’s take a look at how Redis helps with flash sale.

Completing the Flash Sale with Lua Script #

A flash sale system is very complex and can generally be divided into the following three stages:

  • Preparation stage: In this stage, some necessary data is loaded into the cache in advance, and the business data is preheated. Users will continuously refresh the page to check if the flash sale has started.
  • Purchase stage: This is the actual flash sale stage, which generates instantaneous high-concurrency traffic to perform concentrated operations on resources.
  • Settlement stage: This stage mainly ensures data consistency, handles exceptional situations, and performs inventory replenishment operations.

(Figure: the three stages of a flash sale)

Next, I will introduce the most important flash sale stage.

We can design a Redis Hash structure to support inventory deduction:

seckill:goods:${goodsId} {
    total: 100,
    start: 0,
    alloc: 0
}

In this Hash data structure, there are three important parts:

  • total is a static value that represents the quantity of goods to be flash sold. It will be loaded into the cache before the flash sale starts.

  • start is a flag. Its initial value is 0 before the flash sale starts. A backend or scheduled operation changes it to 1 to indicate that the flash sale has begun.

  • alloc records the quantity of goods already sold, growing until it reaches the upper limit defined by total.

static final String goodsId = "seckill:goods:%s";

String getKey(String id) {
    return String.format(goodsId, id);
}

// load the goods hash into Redis before the flash sale starts
public void prepare(String id, int total) {
    String key = getKey(id);
    Map<String, Integer> goods = new HashMap<>();
    goods.put("total", total);
    goods.put("start", 0);
    goods.put("alloc", 0);
    redisTemplate.opsForHash().putAll(key, goods);
}

During the flash sale, we need to check the inventory before deducting it. These two steps are not atomic: in a distributed environment, when multiple machines operate on Redis at the same time, synchronization problems occur.

To solve the synchronization problem, one way is to use Lua scripts to encapsulate these operations to ensure atomicity. Another way is to use distributed locks, which will be introduced in Lessons 13 and 14.

Below is a debugged Lua script. It bundles the key comparison logic and the HINCRBY command together, and Redis executes the entire script as a single atomic operation.

local falseRet = "0"
local n = tonumber(ARGV[1])
local key = KEYS[1]
local goodsInfo = redis.call("HMGET",key,"total","alloc")
local total = tonumber(goodsInfo[1])
local alloc = tonumber(goodsInfo[2])
if not total then
    return falseRet
end
if total >= alloc + n  then
    local ret = redis.call("HINCRBY",key,"alloc",n)
    return tostring(ret)
end
return falseRet

The corresponding seckill code is shown below. Since we use String serialization, the quantity to deduct is first converted to a string before the Lua script is called.

public int secKill(String id, int number) {
    String key = getKey(id);
    // KEYS[1] is the goods key; ARGV[1] is the quantity, passed as a String
    Object alloc = redisTemplate.execute(script, Arrays.asList(key), String.valueOf(number));
    // the script returns "0" on failure, or the new alloc count on success
    return Integer.valueOf(alloc.toString());
}
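
The script object referenced above can be built with Spring’s DefaultRedisScript. Here is a minimal sketch, assuming the Lua source is saved as seckill.lua on the classpath:

import org.springframework.context.annotation.Bean;
import org.springframework.core.io.ClassPathResource;
import org.springframework.data.redis.core.script.DefaultRedisScript;
import org.springframework.scripting.support.ResourceScriptSource;

@Bean
public DefaultRedisScript<String> script() {
    DefaultRedisScript<String> redisScript = new DefaultRedisScript<>();
    // seckill.lua is an assumed file name for the Lua script shown earlier
    redisScript.setScriptSource(new ResourceScriptSource(new ClassPathResource("seckill.lua")));
    redisScript.setResultType(String.class);
    return redisScript;
}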

Execute the testSeckill method in the repository: it starts 1,000 threads to compete for 100 items. You can see that 100 records are generated, while the other threads return 0, indicating that they did not grab an item.

(Figure: results of the testSeckill simulation)

Cache Penetration, Cache Breakdown, and Cache Avalanche #

Setting the flash sale scenario aside, let’s look at the three major problems a distributed caching system may encounter: cache penetration, cache breakdown, and cache avalanche.

1. Cache Penetration #

The first major problem is cache penetration. This concept is closely related to the cache hit rate mentioned in the previous lesson. If the hit rate is low, the pressure will be concentrated on the database persistence layer.

If the relevant data exists, we can cache it. The problem arises when a request hits neither the cache nor the persistence layer; this is what we call cache penetration.

(Figure: cache penetration, where requests pass through the cache to the database)

For example, in a login system under external attack, someone keeps trying to log in with non-existent usernames. These users are fictitious and cannot be cached effectively, so every query reaches the database and causes performance problems.

There are multiple solutions to this problem, so let’s briefly introduce them.

The first solution is to cache null objects. Can’t find the data in the persistence layer, right? Then we can set the result of this request as null and put it into the cache. By setting a reasonable expiration time, we can ensure the security of the backend database.
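
Here is a minimal sketch of this idea (serialize, deserialize, and userDao are hypothetical helpers; the sentinel value and expiration times are illustrative):

private static final String NULL_SENTINEL = "NULL";

public User findUser(String id) {
    String key = "user:" + id;
    String cached = redisTemplate.opsForValue().get(key);
    if (NULL_SENTINEL.equals(cached)) {
        return null; // known-missing user: skip the database entirely
    }
    if (cached != null) {
        return deserialize(cached); // hypothetical deserializer
    }
    User user = userDao.findById(id); // hypothetical DAO
    if (user == null) {
        // cache the miss with a short expiration so attacks cannot reach the database
        redisTemplate.opsForValue().set(key, NULL_SENTINEL, 60, TimeUnit.SECONDS);
        return null;
    }
    redisTemplate.opsForValue().set(key, serialize(user), 30, TimeUnit.MINUTES);
    return user;
}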

Caching null objects will occupy additional cache space and there will be a time window of data inconsistency. So the second solution is to use Bloom filters to handle large amounts of data with regular keys.

Whether a record exists is a yes-or-no answer that takes only 1 bit to store, and a Bloom filter compresses a huge number of such answers into a single compact structure. Data with regular keys, such as phone numbers, is a good fit.
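
A minimal sketch using Guava’s BloomFilter (Guava appeared in the previous lesson); the capacity and false-positive rate are illustrative:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.nio.charset.StandardCharsets;

public class UserExistenceFilter {

    // sized for one million keys with a ~1% false-positive rate (illustrative numbers)
    private final BloomFilter<String> filter = BloomFilter.create(
            Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

    // load every existing user id at startup, e.g. from the database
    public void preload(Iterable<String> existingUserIds) {
        existingUserIds.forEach(filter::put);
    }

    // false means the id definitely does not exist:
    // reject the request before touching the cache or the database
    public boolean mightExist(String id) {
        return filter.mightContain(id);
    }
}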

2. Cache Breakdown #

Cache breakdown also describes user requests falling directly on the database. In most cases, it is caused by a large number of cached items expiring at the same time.

We usually set an expiration time for cached data. If a large batch of data is loaded from the database at some moment and given the same expiration time, it will all expire simultaneously, causing a cache breakdown.

For hot data, we can set it to never expire, or refresh its expiration time on each access. For batched cache items, it is also advisable to spread the expiration times out, for example by adding a random offset, so they do not all expire at once (see the sketch below).
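
A minimal sketch of staggered expiration, assuming the StringRedisTemplate from earlier; the base TTL and jitter range are illustrative:

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

// a base TTL of 30 minutes plus up to 5 minutes of random jitter,
// so items cached in the same batch do not all expire at the same moment
public void cacheWithJitter(String key, String value) {
    long jitterSeconds = ThreadLocalRandom.current().nextLong(300);
    redisTemplate.opsForValue().set(key, value, 1800 + jitterSeconds, TimeUnit.SECONDS);
}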

3. Cache Avalanche #

The term “avalanche” looks terrifying, and the actual situation is indeed quite serious. The cache is designed to speed up the system, while the backend database is only a backup of the data and not a high-availability alternative.

When the cache system fails, traffic will instantly shift to the backend database. Before long, the database will be overwhelmed by the high traffic, leading to a cascade failure that is aptly called an avalanche.

(Figure: cache avalanche, with traffic shifting to the database after the cache fails)

Building a highly available cache is crucial. Redis provides the Master-Slave and Cluster modes, with Cluster mode being simpler to use. Each shard in the Cluster mode can also function as a Master-Slave, ensuring high availability.

In addition, we should have a rough estimate of the database’s performance ceiling. If the cache system goes down, the requests that fall through to the database can be intercepted by a traffic-throttling component, as sketched below.
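
For example, a minimal sketch using Guava’s RateLimiter as the throttling component (the limit, loadFromCache, and userDao are illustrative assumptions):

import com.google.common.util.concurrent.RateLimiter;

// cap database fallback traffic at 100 queries per second (illustrative limit)
private final RateLimiter dbLimiter = RateLimiter.create(100.0);

public User findUserWithThrottle(String id) {
    User cached = loadFromCache(id); // hypothetical cache lookup
    if (cached != null) {
        return cached;
    }
    // if the cache is down, only a bounded trickle of requests reaches the database
    if (!dbLimiter.tryAcquire()) {
        throw new IllegalStateException("system busy, please retry later");
    }
    return userDao.findById(id); // hypothetical DAO
}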

Cache Consistency #

After introducing the cache component, another difficult problem arises: cache consistency.

Let’s first look at how the problem occurs. For a cache item, there are four common operations: write, update, read, and delete.

  • Write: Since the cache and the database are two different components, there’s a possibility of only one write succeeding when dual write is involved, resulting in inconsistent data.
  • Update: The situation is similar for updates, as they need to be performed on two different components.
  • Read: When reading, we need to ensure that the information retrieved from the cache is the latest and consistent with that in the database.
  • Delete: How do we delete the data in the cache when deleting a database record?

Given that business logic is often complex, updates are expensive. For example, a user’s balance is calculated based on a series of assets. If the cache is refreshed every time any of these associated assets changes, the code structure will become very messy and difficult to maintain.

I recommend a trigger-based cache consistency approach that uses lazy loading, which greatly simplifies cache synchronization (see the sketch after this list):

  • When reading from the cache, if the related data is not found in the cache, the relevant business logic is executed to construct the cache data and store it in the cache system.
  • When any resource related to a cache item changes, the corresponding cache item is deleted first, and then the resource is updated. Even if the resource update fails at this point, it is not a problem: the stale cache entry is already gone, and the next read simply reloads the current database state.
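
A minimal sketch of this pattern (Product, productDao, and a RedisTemplate configured with object serialization are illustrative assumptions):

// read path: lazy loading on a cache miss
public Product getProduct(String id) {
    String key = "product:" + id;
    Product cached = (Product) redisTemplate.opsForValue().get(key);
    if (cached != null) {
        return cached;
    }
    Product product = productDao.findById(id); // hypothetical DAO
    if (product != null) {
        redisTemplate.opsForValue().set(key, product, 30, TimeUnit.MINUTES);
    }
    return product;
}

// write path: delete the cache item first, then update the resource
public void updateProduct(Product product) {
    redisTemplate.delete("product:" + product.getId());
    productDao.update(product);
    // even if this update fails, the stale entry is already gone,
    // so the next read reloads whatever the database currently holds
}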

Besides its simplicity, this approach has a clear advantage: data is loaded into the cache system only when it is actually needed. If we created or refreshed a cache entry on every modification, the cache would fill up with large amounts of cold data.

However, there is still a problem with this approach. The next scenario I am about to introduce is also frequently mentioned in interviews.

As previously mentioned, the cache deletion and the database update occur in different transactions. Suppose one request deletes the cache, and before its database update completes, another request arrives, finds no cache entry, and loads a copy of the old data from the database into the cache. The database update then completes, leaving the cache inconsistent with the database.

The following diagram visually explains this inconsistency. At this point, both cache read operation B and subsequent read operations will retrieve incorrect cache values.

(Figure: cache inconsistency caused by a read interleaving with a delete-then-update)

In interviews, as long as you point out this issue, the interviewer will give you a thumbs-up.

You can use distributed locks to solve this problem: a lock isolates the cache-deletion and database-update actions from other cache reads. Read operations generally do not take the lock; when they encounter one, they retry and wait until a timeout occurs.

Summary #

This lesson contains a lot of content, but it is very important. If you have participated in interviews at large internet companies, you will find that many of the frequently asked questions are covered in this lesson, so it is worth pondering over.

This and the previous lesson are both centered around caching, and there are many similar topics between them. Redis is currently the most widely used for distributed caching. We first briefly introduced the differences between Redis and Memcached, and then discussed the usage of Redis in Spring Boot projects. We focused on the Lua code for inventory deduction, which is a core feature in the scenario of flash sales. This piece of code makes the conditional judgment and deduction command an atomic operation.

Redis APIs are easy to use and fast, but they also introduce many issues. If these exceptional scenarios cannot be resolved, the value of Redis will be greatly reduced. Therefore, we mainly discussed cache penetration, cache breakdown, and cache avalanche scenarios, with a particular emphasis on cache consistency and its solution.

In the next section, I will introduce object pooling, a performance optimization method very similar to caching. It utilizes object reuse to increase efficiency. See you in the next lesson.