
18 Fluctuating Response Latency: How to Respond When Redis Slows Down (Part 1) #

In practical Redis deployments, there is one very serious problem: Redis suddenly becoming slow. Once this happens, it not only directly affects the user experience, it may also drag down “others”, that is, the other systems in the same business system as Redis, such as the database.

Let’s take a small example. In a flash-sale scenario, if Redis becomes slow, a large number of user order requests will be delayed; in other words, users submit orders but receive no response. This results in a very poor user experience and may even lead to user churn.

Furthermore, in actual production environments, Redis often serves as a component of the business system (e.g., as a cache or database). Once the request latency on Redis increases, it may trigger a series of “chain reactions” within the business system.

Let me explain with a small example that includes the business logic of Redis.

The application server (App Server) needs to perform a transactional operation, which includes executing a write transaction on MySQL, inserting a marker on Redis, and sending a completion message to the user through a third-party service.

All three operations must maintain transactional atomicity. Therefore, if Redis’s latency increases at this point, the entire transaction on the App Server side stalls: the transaction cannot complete, the resources held by the write transaction on MySQL cannot be released, and other requests accessing MySQL end up blocked. Clearly, a slow Redis causes serious chain reactions.

Illustration

I believe that many people have encountered this problem, so how do we specifically solve it?

At this point, it is important not to panic and grasp at random remedies. Without a set of effective response strategies, most of the time we can only try one thing after another without making real progress. In Lecture 16 and Lecture 17, we learned about the potential blocking points that can make Redis slow, as well as the corresponding solutions: the asynchronous thread mechanism and CPU core binding. Beyond these, there are other factors that can slow Redis down.

In this lecture and the next, I will show you how to deal with a slow Redis systematically, covering three aspects: identifying the problem, investigating it methodically, and applying response strategies. After completing these two lectures, you will definitely be able to resolve Redis slowdowns in a systematic way.

Does Redis really slow down? #

Before solving the problem, we need to first understand how to determine if Redis is really slowing down.

The most direct method is to check the response latency of Redis.

Most of the time, Redis latency is very low, but at certain moments some instances may exhibit high response latency, sometimes reaching several seconds or even more than ten seconds. Such high latency lasts only a short while, which is why it is called a “latency spike”. When you notice that the execution time of Redis commands suddenly jumps to several seconds, you can basically conclude that Redis has slowed down.

This method looks at the absolute value of Redis latency. However, the absolute performance of Redis itself varies in different software and hardware environments. For example, in my environment, when the latency is 1ms, I consider Redis to be slow. But if you have a high hardware configuration, then in your runtime environment, you may consider Redis to be slow when the latency is only 0.2ms.

Therefore, let’s talk about the second method, which is based on the baseline performance of Redis in the current environment to make judgments. The so-called baseline performance refers to the basic performance of a system under low pressure and no interference, which is determined solely by the current software and hardware configuration.

You may ask, how do we determine the baseline performance? Are there any good methods?

In fact, starting from version 2.8.7, redis-cli provides the --intrinsic-latency option, which measures and reports the maximum latency observed during the test; this value can be used as Redis’s baseline performance. The argument of the --intrinsic-latency option specifies the test duration in seconds.

As an example, running the following command prints the maximum latency detected within 120 seconds. As shown below, the maximum latency is 119 microseconds, so the baseline performance is 119 microseconds. A 120-second run is usually long enough to capture the maximum latency, so we set the argument to 120.

./redis-cli --intrinsic-latency 120
Max latency so far: 17 microseconds.
Max latency so far: 44 microseconds.
Max latency so far: 94 microseconds.
Max latency so far: 110 microseconds.
Max latency so far: 119 microseconds.

36481658 total runs (avg latency: 3.2893 microseconds / 3289.32 nanoseconds per run).
Worst run took 36x longer than the average latency.

It should be noted that the baseline performance is related to the current operating system and hardware configuration. Therefore, we can combine it with the latency during runtime to further determine if Redis performance has slowed down.

Generally, you need to compare the runtime latency with the baseline performance. If the observed runtime latency is 2 times or more than the baseline performance of Redis, you can conclude that Redis has slowed down.
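
To get a quick read on the runtime latency, one simple way is the --latency option of redis-cli, which repeatedly sends PING commands and reports the minimum, maximum, and average round-trip time. The address below (127.0.0.1:6379) is only an example; point it at your own instance and compare the reported average with the baseline you measured with --intrinsic-latency.

# Sample the current response latency by sending PING commands in a loop;
# run it on the Redis server host first, then from a client host if you also
# want the network round trip included
$ ./redis-cli --latency -h 127.0.0.1 -p 6379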

Determining the baseline performance is particularly important for Redis running in virtualized environments. In virtual machines or containers, the virtualization software layer itself introduces a certain performance overhead compared to a physical machine, so the baseline latency will be higher. The following test results show the baseline performance measured when running Redis inside a virtual machine.

$ ./redis-cli --intrinsic-latency 120
Max latency so far: 692 microseconds.
Max latency so far: 915 microseconds.
Max latency so far: 2193 microseconds.
Max latency so far: 9343 microseconds.
Max latency so far: 9871 microseconds.

As we can see, due to the overhead of the virtualization software itself, the baseline performance in this case has reached 9.871ms. If the runtime latency of this Redis instance is 10ms, it should not be considered a performance slowdown, because in this case, the runtime latency is only 1.3% higher than the baseline performance. If you are not familiar with the baseline performance and you see a relatively high runtime latency, you may mistakenly think that Redis has slowed down.

Note, however, that we usually access the Redis service from a client over the network. To keep the network from influencing the baseline measurement, the command above needs to be run directly on the server host, so that only the server-side software and hardware environment is taken into account.

If you want to understand the impact of the network on Redis performance, a simple method is to use tools like iPerf to measure the network latency from the Redis client to the server. If this latency is several tens of milliseconds or even several hundreds of milliseconds, it means that there may be other applications with heavy traffic running in the network environment where Redis is running, resulting in network congestion. In this case, you need to coordinate with network operations and adjust the traffic allocation in the network.
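
As a rough sketch, assuming the iperf3 tool is installed on both hosts and its default port is reachable (the <redis-server-ip> placeholder stands for your server’s address), you can measure the network between the client and the Redis server like this, and also rerun redis-cli --latency from the client host so that the network round trip is included:

# On the Redis server host: start an iperf3 server
$ iperf3 -s

# On the client host: measure throughput to the server for 10 seconds
$ iperf3 -c <redis-server-ip> -t 10

# From the client host: measure end-to-end Redis latency (network + server)
$ ./redis-cli --latency -h <redis-server-ip> -p 6379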

How to Deal with Slow Redis? #

After the previous step, you should be able to determine whether Redis is slow. Once you identify the slowness, the next step is to find the cause and solve the problem, which is actually an interesting diagnostic process.

At this point, you are like a doctor, and Redis is a patient. When treating a patient, you need to understand the mechanisms of the human body and also be aware of external factors that can affect the body, such as unhealthy food and bad emotions. Then you need to take CT scans, electrocardiograms, etc. to find the cause, and finally determine the treatment plan.

Diagnosing the symptom of a “slow Redis” works the same way. Based on your understanding of how Redis itself works, combined with knowledge of the key mechanisms of the external systems it interacts with, such as the operating system, storage, and network, and with the help of some auxiliary tools, you can locate the cause and develop effective solutions.

A doctor’s diagnosis usually follows a certain order, and Redis performance diagnosis is the same: it proceeds along the key factors that affect Redis. You should remember the diagram we drew in the first lesson. Pay attention to the parts I have marked in red, which are the three key elements that affect Redis performance: Redis’s own operational characteristics, the file system, and the operating system.

Redis Architecture

Next, I will start from these three key elements and share some practical experience in troubleshooting and resolving problems caused by different factors in real scenarios. In this lesson, I will first cover the impact of Redis’s own operational characteristics, and in the next lesson we will focus on the impact of the operating system and the file system.

Impact of Redis’s Own Operational Characteristics #

First, let’s learn about the impact of Redis’s key-value command operations on latency performance. I will focus on two types of key operations: slow query commands and expired key operations.

1. Slow Query Commands

Slow query commands refer to commands that run slowly in Redis, which will increase the latency of Redis. Redis provides many command operations, not all of which are slow. This depends on the complexity of the command operations. Therefore, we need to know the complexity of different Redis commands.

For example, when the value type is String, GET/SET mainly operate on Redis’s global hash table, so their complexity is essentially fixed at O(1). But when the value type is Set, the complexity of SORT is O(N+M*log(M)) and the complexity of SUNION/SMEMBERS is O(N), where N is the number of elements in the set and M is the number of elements returned by SORT. The complexity rises considerably. The official Redis documentation lists the complexity of every command, so when you need to know the complexity of a particular command, you can look it up directly.

So, how to deal with this problem? Here, I will provide you with some suggestions and solutions for troubleshooting, which is the first method for today.

When you find that Redis performance has slowed down, you can check for slow requests through the Redis log or the latency monitor tool. Then, based on the specific commands in those requests and the official documentation on command complexity, confirm whether complex, slow commands are being used.
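
For reference, here is a minimal sketch of inspecting slow requests with the slow log and the latency monitor. The thresholds below (10000 microseconds and 100 milliseconds) are just example values; tune them to your own baseline.

# Record commands that take longer than 10000 microseconds (10 ms)
$ ./redis-cli CONFIG SET slowlog-log-slower-than 10000

# Show the 10 most recent slow log entries (command, timestamp, duration)
$ ./redis-cli SLOWLOG GET 10

# Enable the latency monitor for events slower than 100 milliseconds
$ ./redis-cli CONFIG SET latency-monitor-threshold 100

# Show the latest latency spike recorded for each monitored event
$ ./redis-cli LATENCY LATEST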

If there are indeed a large number of slow query commands, there are two ways to handle them:

  1. Use other, more efficient commands as substitutes. For example, if you need to return all the members of a Set, do not use the SMEMBERS command; instead, use the SSCAN command to return them iteratively in multiple batches, which avoids returning a large amount of data at once and blocking the thread (see the sketch after this list).
  2. When you need to perform sorting, intersection, or union operations, do them on the client side instead of using the SORT, SUNION, and SINTER commands, so that the Redis instance is not slowed down. Of course, if the business logic really does require these slow query commands, you should consider using a higher-performing CPU so that they complete faster and their impact is reduced.
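
Below is a minimal sketch of iterating a Set with SSCAN instead of SMEMBERS. The key name myset and the batch size of 100 are just examples, and the script assumes bash and that redis-cli output is being piped, in which case the first line of each reply is the next cursor and the remaining lines are the members of that batch.

# Fetch the members of a Set in batches instead of all at once with SMEMBERS
cursor=0
while true; do
    reply=$(./redis-cli SSCAN myset "$cursor" COUNT 100)
    cursor=$(echo "$reply" | head -n 1)   # first line: cursor for the next call
    echo "$reply" | tail -n +2            # remaining lines: this batch of members
    if [ "$cursor" -eq 0 ]; then          # cursor 0 means the iteration is complete
        break
    fi
done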

There is also a slow query command that is easily overlooked, which is KEYS. It is used to return all keys that match a given pattern. For example, the following command returns all keys that contain the string “name”:

redis> KEYS *name*
1) "lastname"
2) "firstname"

Because the KEYS command needs to traverse all the stored keys, its latency is high. If you use it without understanding its implementation, it can easily slow Redis down. Therefore, the KEYS command is generally not recommended in production environments.

2. Expired key operation

Next, let’s look at the automatic deletion mechanism for expired keys. It is a common mechanism that Redis uses to reclaim memory space, and although it is widely used, it can block Redis operations and degrade performance, so you need to understand its impact on performance.

Redis key-value pairs can be set with expiration time. By default, Redis deletes some expired keys every 100 milliseconds, following these steps:

  1. Sample ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP keys and delete all expired keys among them.
  2. If more than 25% of the keys have expired, repeat the deletion process until the ratio of expired keys drops below 25%.

ACTIVE_EXPIRE_CYCLE_LOOKUPS_PER_LOOP is a Redis parameter with a default value of 20. Since this cycle runs 10 times per second, the first step deletes at most about 200 expired keys per second. This strategy helps clear expired keys and release memory space, and deleting 200 expired keys per second does not have a significant impact on Redis.

However, if the second step of the algorithm is triggered, Redis keeps deleting keys in order to free memory. Note that these deletions are blocking (from Redis 4.0 onwards, the asynchronous lazy-free mechanism can reduce the blocking impact). Once this condition is triggered, the Redis main thread keeps executing deletions and cannot serve other key-value requests in time, which further increases their latency and makes Redis slower.

So how does the second step of the algorithm get triggered? One important source is frequent use of the EXPIREAT command with the same timestamp argument, which causes a large number of keys to expire within the same second.

Now, I will provide the second troubleshooting suggestion and solution.

You should check whether the business code uses the same UNIX timestamp when setting key expiration times with the EXPIREAT command, or uses the EXPIRE command to set the same expiration time for a batch of keys. Both cases cause a large number of keys to expire at the same moment, which slows performance down.
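
To confirm from the server side that centralized expiration is actually happening, a simple sketch is to watch how fast the expired_keys counter in INFO stats grows and, if the latency monitor is enabled, to check the latency events attributed to the active expire cycle:

# Read the expired_keys counter twice, 10 seconds apart; a large jump
# suggests that a big batch of keys expired at (almost) the same time
$ ./redis-cli INFO stats | grep expired_keys
$ sleep 10
$ ./redis-cli INFO stats | grep expired_keys

# With the latency monitor enabled, check spikes recorded for the expire cycle
$ ./redis-cli LATENCY HISTORY expire-cycle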

When you encounter this situation, don’t treat it as a hassle. First, determine the expiration time arguments of EXPIREAT and EXPIRE according to the actual business requirements. Second, if a batch of keys really does need to expire around the same time, add a random number within a certain range to the expiration time arguments of EXPIREAT and EXPIRE. This way, the keys are deleted within a nearby time window rather than all at once, avoiding the pressure caused by simultaneous expiration.
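
As a minimal sketch (assuming bash, with the key name order:1001 and the offset range of 0 to 600 seconds chosen purely as examples), each key can be given a base expiration time plus a small random offset, so that a batch of keys expires spread over a window instead of within the same second:

# Expire roughly one hour from now, plus a random offset of 0-600 seconds,
# so keys set in the same batch do not all expire within the same second
$ ./redis-cli EXPIREAT order:1001 $(( $(date +%s) + 3600 + RANDOM % 600 ))

# The same idea with EXPIRE: a relative TTL plus the random jitter
$ ./redis-cli EXPIRE order:1001 $(( 3600 + RANDOM % 600 ))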

Summary #

In this lesson, I first showed you how serious the impact of Redis performance degradation can be, and I hope you take this issue seriously. I then focused on how to determine whether Redis has slowed down: one method is to observe the response latency, and the other is to compare it against the baseline performance. Additionally, I provided two methods for troubleshooting and resolving Redis slowness:

  1. Start troubleshooting from slow query commands and replace them according to business requirements.
  2. Check the expiration time setting of expired keys and set different expiration times based on actual usage needs.

Performance diagnosis is usually challenging, so we must not poke around at random without a clear direction. The content of this lesson gives you a method for troubleshooting and resolving Redis performance degradation; follow it step by step and you will be able to identify the cause quickly.

Of course, to make good use of Redis, besides understanding the principles of Redis itself, it is also crucial to understand the key mechanisms of the underlying systems that interact with Redis, including the operating system and file system. In most cases, some difficult-to-troubleshoot issues are caused by the mismatch between Redis usage or settings and the working mechanisms of the underlying systems. In the next lesson, I will focus on introducing the impact of file systems and operating systems on Redis performance, as well as corresponding troubleshooting methods and solutions.

One question per lesson #

In this lesson, I mentioned the KEYS command because it has high complexity and can easily cause Redis thread blocking, making it unsuitable for production environments. However, the functionality provided by the KEYS command itself is frequently needed by upper-level business applications, which is to return keys matching a given pattern.

Please consider, in Redis, what other commands can be used instead of the KEYS command to achieve the same functionality? Do these commands affect Redis’s performance and make it slower?

Feel free to write down your thoughts and answers in the comments section. Let’s discuss and learn together. If you find it helpful, you are welcome to share today’s content with your friends.