37 Data Distribution Optimization How to Address Data Skew #

In a sharded cluster, data is distributed across different instances according to certain distribution rules. For example, when using Redis Cluster or Codis, a key is first hashed with a CRC algorithm (CRC16 in the case of Redis Cluster), and the result is taken modulo the number of Slots (logical slots) to decide which Slot the data belongs to. The Slots, in turn, are allocated to different instances by system administrators. This way, data ends up saved on the respective instances.
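To make the hash-then-modulo step concrete, here is a minimal sketch of the Slot calculation, assuming Redis Cluster's 16384 slots and using Python's binascii.crc_hqx, which implements the XMODEM CRC16 variant that Redis Cluster uses:

```python
import binascii

def slot_for(key: str, n_slots: int = 16384) -> int:
    # CRC16 (XMODEM polynomial, as in Redis Cluster), then modulo the slot count
    return binascii.crc_hqx(key.encode(), 0) % n_slots
```

The same key always hashes to the same Slot, and which instance serves that Slot is decided separately by the Slot-to-instance assignment.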

Although this method is relatively simple to implement, it can easily lead to a problem: data skew.

There are two types of data skew:

  • Data Volume Skew: Data is distributed unevenly across instances, and one instance holds significantly more data than the others.
  • Data Access Skew: Although each instance holds a similar amount of data, one instance contains hot data that is accessed far more frequently than the rest.

If data skew occurs, the instance(s) holding a large amount of data or hot data will come under increased processing pressure, resulting in slower responses and, in extreme cases, memory exhaustion and a crash. This is something we need to avoid when using sharded clusters.

In today’s lesson, I will discuss how these two types of data skew occur and how we can deal with them.

Causes and Solutions for Data Volume Skew #

First, let’s take a look at the causes of data volume skew and their solutions.

When data volume skew occurs, data is distributed unevenly across the instances of the sharded cluster, with a large amount of data concentrated on one or a few instances, as shown in the diagram below:

So, how does data volume skew happen? There are three main causes: the presence of bigkeys on certain instances, unbalanced Slot allocation, and Hash Tags. Next, let’s analyze these causes one by one, and I will also explain the corresponding solutions.

Skew Caused by Bigkeys #

The first cause is that a particular instance happens to hold bigkeys. A bigkey either has a large value (String type) or stores a large number of elements (collection types), which increases the amount of data on that instance and consumes more of its memory.

Moreover, operations on bigkeys generally block the I/O thread of the instance. If bigkeys are accessed frequently, this also slows down the other requests handled by that instance.

In fact, big keys have been repeatedly mentioned as a key point in our course. To avoid data skewness caused by big keys, a fundamental solution is to try to avoid storing too much data in the same key-value pair at the business layer.

In addition, if the big key happens to be a collection type, we have another method, which is to split the big key into many small collection-type data and store them in different instances.

Let me give you an example. Suppose a Hash type collection “user:info” stores the information of one million users, which is a big key. In this case, we can split this collection into 10 small collections based on the range of user IDs. Each small collection only stores the information of 100,000 users (for example, small collection 1 stores information of users with IDs from 1 to 100,000, and small collection 2 stores information of users with IDs from 100,001 to 200,000). By doing this, we can divide a big key into smaller pieces and distribute it, avoiding the access pressure that a big key would bring to a single sharded instance.
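The ID-range split above can be expressed as a small routing helper. This is a minimal sketch under stated assumptions: the key names are hypothetical, and user IDs are assumed to be numeric and to start at 1.

```python
BUCKET_SIZE = 100_000  # 100,000 users per small hash

def split_key(user_id: int) -> str:
    # Route each user to one of 10 smaller hashes by ID range:
    # IDs 1-100000 -> user:info:1, IDs 100001-200000 -> user:info:2, ...
    bucket = (user_id - 1) // BUCKET_SIZE + 1
    return f"user:info:{bucket}"
```

The business layer then reads and writes a user's fields under `split_key(user_id)` instead of a single `user:info` key, so the million-user hash is spread across ten keys that can land on different instances.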

Note that when bigkeys are accessed heavily, they also cause data access skew. I will explain how to deal with that in detail later.

Next, let’s take a look at the second cause of data skew: unbalanced Slot allocation.

Skew Caused by Unbalanced Slot Allocation #

If cluster operators do not allocate Slots evenly, a large number of Slots end up on the same instance. Since the data mapped to a Slot can only live on the one instance that Slot is assigned to, this concentrates a large amount of data on that instance and results in data skew.

Let me use Redis Cluster as an example to explain unbalanced slot allocation.

Redis Cluster has a total of 16384 slots. Let’s say there are 5 instances in the cluster, and Instance 1 has higher hardware specifications. In this case, the operators may assign more slots to Instance 1 to fully utilize its resources.

However, we don’t know the mapping between data and slots. This practice may lead to a large amount of data being mapped to slots on Instance 1, causing data skewness and access pressure on Instance 1.

To address this issue, we can follow operational standards when allocating Slots in the first place and avoid assigning too many of them to the same instance. If the cluster's Slots have already been allocated, we can check the actual mapping between Slots and instances to determine whether too many Slots are concentrated on one instance; if so, we can migrate some Slots to other instances to avoid data skew.

The way to check Slot allocation differs between clusters: in Redis Cluster, we can use the CLUSTER SLOTS command; in Codis, we can check the Codis dashboard.

For example, let’s execute the CLUSTER SLOTS command to check the slot allocation. The command returns the following result: Slots 0 to 4095 are allocated to Instance 192.168.10.3, and Slots 12288 to 16383 are allocated to Instance 192.168.10.5.

127.0.0.1:6379> cluster slots
1) 1) (integer) 0
   2) (integer) 4095
   3) 1) "192.168.10.3"
      2) (integer) 6379
2) 1) (integer) 12288
   2) (integer) 16383
   3) 1) "192.168.10.5"
      2) (integer) 6379
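To spot an instance holding too many Slots, one could tally the slot counts per instance from a reply shaped like the one above. This is a minimal sketch, assuming the CLUSTER SLOTS reply has already been parsed into nested Python lists (replica entries and node IDs are omitted for brevity):

```python
def slot_counts(slots_reply):
    # slots_reply mimics CLUSTER SLOTS: [[start, end, [host, port], ...], ...]
    counts = {}
    for start, end, master, *replicas in slots_reply:
        host, port = master[0], master[1]
        # a range start..end covers end - start + 1 slots
        counts[(host, port)] = counts.get((host, port), 0) + (end - start + 1)
    return counts
```

Running it on the two ranges shown above reports 4096 slots each for 192.168.10.3 and 192.168.10.5; a strongly lopsided count would flag a candidate for slot migration.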

If there are too many slots on a particular instance, we can use migration commands to move these slots to other instances. In Redis Cluster, we can use three commands to complete slot migration.

  1. CLUSTER SETSLOT: Marks a Slot's migration state. On the target instance, it records which source instance the Slot is being imported from; on the source instance, it records which target instance the Slot is migrating to.
  2. CLUSTER GETKEYSINSLOT: Retrieves a specified number of keys from a Slot.
  3. MIGRATE: Migrates a key from the source instance to the target instance.

Let me provide you with an example to help you understand how to use these three commands.

Suppose we want to migrate Slot 300 from the source instance (ID 3) to the target instance (ID 5). How do we do that? Actually, we can divide it into 5 steps.

Step 1: First, on the target instance 5, we execute the following command to set the source instance of Slot 300 to instance 3, indicating that Slot 300 will be migrated in from instance 3.

CLUSTER SETSLOT 300 IMPORTING 3

Step 2: On the source instance 3, we set the target instance of Slot 300 to 5, indicating that Slot 300 will be migrated to instance 5.

CLUSTER SETSLOT 300 MIGRATING 5

Step 3: Get 100 keys from Slot 300. Since the number of keys in the Slot may be large, we need to execute the following command multiple times on the client to obtain and migrate the keys in batches.

CLUSTER GETKEYSINSLOT 300 100

Step 4: Migrate key1, one of the 100 keys obtained, to the target instance 5 (IP: 192.168.10.5), writing it into database 0 on the target and setting the migration timeout (here, timeout is a placeholder for the actual timeout, in milliseconds). We repeat the MIGRATE command until all 100 keys have been migrated.

MIGRATE 192.168.10.5 6379 key1 0 timeout

Lastly, we repeat steps 3 and 4 until all keys in the Slot are migrated.
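The steps above can be lined up as a command plan. The sketch below only builds (instance, command) pairs and does not talk to a cluster; the instance IDs, the batch size of 100, and the `<key>` placeholder are illustrative, and in a real migration steps 3 and 4 repeat until GETKEYSINSLOT returns no keys.

```python
def migration_commands(slot, src_id, dst_id, dst_host, dst_port, timeout_ms=1000):
    """Yield (instance_to_run_on, command) pairs for one round of slot migration."""
    # Step 1: on the target, mark the slot as importing from the source
    yield (dst_id, f"CLUSTER SETSLOT {slot} IMPORTING {src_id}")
    # Step 2: on the source, mark the slot as migrating to the target
    yield (src_id, f"CLUSTER SETSLOT {slot} MIGRATING {dst_id}")
    # Step 3: on the source, fetch a batch of keys belonging to the slot
    yield (src_id, f"CLUSTER GETKEYSINSLOT {slot} 100")
    # Step 4: each fetched key is moved with MIGRATE (<key> stands for one of them)
    yield (src_id, f"MIGRATE {dst_host} {dst_port} <key> 0 {timeout_ms}")
```

A small driver could feed each pair to the right connection and loop over steps 3 and 4 until the slot is empty.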

Starting from Redis 3.0.6, you can also use the KEYS option to migrate multiple keys at once (key1, key2, key3), which can improve migration efficiency.

MIGRATE 192.168.10.5 6379 "" 0 timeout KEYS key1 key2 key3

For Codis, we can use the following command to migrate data. We set the connection address of the dashboard component to ADDR and migrate Slot 300 to codis server group 6.

codis-admin --dashboard=ADDR --slot-action --create --sid=300 --gid=6

In addition to bigkeys and uneven Slot allocation, there is a third cause of data skew: the use of Hash Tags for data sharding.

Skew Caused by Hash Tags #

A Hash Tag is a pair of curly braces {} placed inside a key, enclosing part of its content. When a key contains a Hash Tag, the client computes the CRC16 value of only the content inside the braces; without a Hash Tag, the client computes the CRC16 value of the entire key.

For example, suppose the key is user:profile:3231, and we use 3231 as the Hash Tag. In this case, the key becomes user:profile:{3231}. When the client calculates the CRC16 value of this key, it only calculates the CRC16 value of 3231. Otherwise, the client would calculate the CRC16 value of the whole “user:profile:3231”.

The benefit of using Hash Tags is that if different keys have the same Hash Tag content, the data corresponding to these keys will be mapped to the same Slot and allocated to the same instance.
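The tag-extraction rule can be sketched in a few lines. This is a minimal illustration using Python's binascii.crc_hqx for CRC16 (the XMODEM variant Redis Cluster uses) and taking the first non-empty {...} pair as the tag:

```python
import binascii

def hash_slot(key: str) -> int:
    s = key.find('{')
    if s != -1:
        e = key.find('}', s + 1)
        if e > s + 1:              # only a non-empty tag replaces the key
            key = key[s + 1:e]     # hash just the content between the braces
    return binascii.crc_hqx(key.encode(), 0) % 16384
```

With this rule, any two keys sharing the same tag content land in the same Slot, which is exactly what makes Hash Tags useful for co-locating related data.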

The table below shows an example of data being mapped to the same Slot when using Hash Tags. Take a look.

Both user:profile:{3231} and user:order:{3231} have the same Hash Tag, which is 3231. Their CRC16 calculation values modulo 16384 are also the same, so they are mapped to the same Slot 1024. The same mapping result applies to user:profile:{5328} and user:order:{5328}.

So, in what scenarios are Hash Tags generally used? They are mainly used in Redis Cluster and Codis when transaction operations and range queries are needed. Neither Redis Cluster nor Codis natively supports cross-instance transactions or range queries; when business applications need these operations, the data has to be read into the business layer for transaction processing, or each instance has to be queried individually and the range-query results assembled there.

This process is cumbersome. Therefore, we can use Hash Tags to map the data that needs to be transactionally operated or range queried to the same instance, making it easy to implement transactions or range queries.

However, the potential issue with using Hash Tags is that a large amount of data may be concentrated on one instance, resulting in data skew and an unbalanced load in the cluster. So, how do we deal with this problem? We need to strike a balance between the requirements of range queries and transaction execution and the access pressure caused by data skew.

My suggestion is this: if sharding with Hash Tags would bring significant access pressure, it is preferable to give up Hash Tags and avoid the data skew, handling transactions and range queries at the client side instead. After all, data skew can destabilize instances and make the service unavailable.

Alright, now we have a complete understanding of the causes and countermeasures of data skew. Next, let’s take a look at the causes and countermeasures of data access skew.

Causes and Countermeasures of Data Access Skew #

The root cause of data access skew is the presence of hot data on an instance (such as hot news content in a news application, or popular product information during e-commerce promotions).

Once hot data exists on an instance, the request traffic to that instance will be much higher than other instances, resulting in tremendous access pressure, as shown in the figure below:

So, how should we deal with it?

Unlike data volume skew, hot data usually amounts to just one or a few keys. Therefore, simply reallocating Slots cannot solve the hot data problem.

Generally, hot data is mainly used for read operations. In this case, we can use the multiple copies of hot data method to deal with it.

The specific approach of this method is to replicate the hot data multiple times and add a random prefix to the key of each data copy. This ensures that they are not mapped to the same slot. As a result, multiple copies of hot data can simultaneously handle requests, and these copies of data with different keys will be mapped to different slots. When assigning instances to these slots, we should also make sure they are assigned to different instances. This way, the access pressure of hot data is distributed to different instances.
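The copy-and-prefix approach can be sketched as follows. This is a minimal illustration of the idea only; the copy count of 3 and the key names are hypothetical, and the actual Redis reads and writes are omitted:

```python
import random

N_COPIES = 3  # hypothetical number of hot-data copies

def copy_keys(base_key: str):
    # Each copy gets a numeric prefix, so the copies hash to different slots
    # (and, with careful slot assignment, land on different instances).
    return [f"{i}:{base_key}" for i in range(N_COPIES)]

def pick_read_key(base_key: str) -> str:
    # Reads are spread across the copies at random.
    return f"{random.randrange(N_COPIES)}:{base_key}"
```

At write time (for read-only hot data, typically a one-off load), every key in `copy_keys(...)` is populated with the same value; at read time, the client fetches `pick_read_key(...)`, so the traffic fans out over several instances.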

Here, one thing to note is that the multiple copies of hot data method can only be used for read-only hot data. If the hot data is both read and write, it is not suitable to use the multiple copies method because ensuring data consistency among multiple copies will incur additional overhead.

For hot data that is both read and write, we need to increase the resources of the instance itself, such as using higher configuration machines to handle the large amount of access pressure.

Summary #

In this lesson, I introduced two situations of data skew: data volume skew and data access skew.

There are three main reasons for data volume skew:

  1. The presence of bigkeys in the data, leading to an increase in the data volume of a particular instance.
  2. Unequal manual allocation of slots, resulting in a large amount of data on one or more instances.
  3. The use of Hash Tags, causing the data to be concentrated on certain instances.

The main reason for data access skew is the existence of hot data, which leads to a concentration of access requests on the instance where the hot data is located.

To deal with data skew issues, I introduced four methods, each corresponding to one of the four reasons for data skew. I have summarized them in the table below, which you can refer to.

| Cause of skew | Type of skew | Countermeasure |
| --- | --- | --- |
| Bigkeys on an instance | Data volume skew | Avoid storing too much data in a single key-value pair at the business layer; split collection-type bigkeys into smaller collections |
| Unbalanced Slot allocation | Data volume skew | Allocate Slots evenly up front; check the Slot-to-instance mapping and migrate Slots off overloaded instances |
| Hash Tags | Data volume skew | Weigh transaction and range-query needs against the skew they cause; avoid Hash Tags when they bring heavy access pressure |
| Hot data | Data access skew | Keep multiple copies of read-only hot data under keys with random prefixes; scale up instance resources for read-write hot data |

Of course, if data skew has already occurred, we can alleviate its impact through data migration. Redis Cluster and Codis clusters provide commands for viewing slot allocation and manually migrating slots, which you can apply.

Finally, regarding the instance resource configuration of the cluster, I have another suggestion for you: when building a sharded cluster, try to use instances with similar configurations (such as keeping the instance memory configuration the same). This can avoid allocating different numbers of slots on different instances due to resource imbalance.

One question per lesson #

As usual, I have a small question for you. Suppose data access skew exists, Redis is serving as a cache, and the authoritative value of each piece of data is stored in the backend database. If the hot data suddenly expires, what problems will occur?

Feel free to write your thoughts and answers in the comments section. Let’s discuss and exchange ideas together. If you find today’s content helpful, feel free to share it with your friends or colleagues. See you in the next lesson.