28 How to Build a High-Performance and Easily Scalable Redis Cluster #

From the previous lesson, we learned that replication can increase the read performance of a Redis node by N times, and that clustering, as a distributed solution, can increase the write performance of Redis by N times. In addition to improving performance, a Redis cluster also provides larger capacity and improves the availability of the overall system.

There are three main distributed solutions for Redis clusters. They are client-side partitioning, proxy partitioning, and native Redis Cluster partitioning.

Client-Side Partitioning #


In the client-side partitioning solution, the client decides which Redis shard to store data on or read data from. Its core idea is to use a hash algorithm to map different keys to specific Redis shards. For a single-key request, the client hashes the key, determines the target shard, and sends the request directly to it. For a multi-key request, the client first groups the keys by hash shard, splits the request into one sub-request per shard, and then sends each sub-request to the corresponding shard node.

Clients distribute data through hash algorithms, usually modulo hashing, consistent hashing, or interval (range) hashing. The first two have been analyzed in detail in previous lessons, so they will not be repeated here. Interval hashing is actually a variant of modulo hashing: modulo hashing maps a key to a storage node by taking the hash value modulo the number of nodes, while interval hashing divides the hash space into multiple ranges and assigns each range to a storage node. For example, the hash space can be split into 1024 hash points, with points 0~511 assigned to shard 1 and points 512~1023 assigned to shard 2.
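As a minimal sketch of these two distribution strategies (the shard names, the CRC32 hash, and the 1024 hash points are illustrative assumptions, not part of any particular client library), the shard selection could look like this in Python:

```python
import binascii

# Hypothetical shard identifiers; in practice each would be a Redis connection.
SHARDS = ["redis-shard-1", "redis-shard-2"]
HASH_POINTS = 1024  # total hash points for interval hashing, as in the example above

def key_hash(key: str) -> int:
    # CRC32 is used here only as a stand-in for whatever hash the client chooses.
    return binascii.crc32(key.encode())

def modulo_shard(key: str) -> str:
    # Modulo hashing: hash the key, then take the result modulo the shard count.
    return SHARDS[key_hash(key) % len(SHARDS)]

def interval_shard(key: str) -> str:
    # Interval hashing: map the key to one of 1024 hash points, then map
    # point ranges to shards (0~511 -> shard 1, 512~1023 -> shard 2).
    point = key_hash(key) % HASH_POINTS
    return SHARDS[0] if point < 512 else SHARDS[1]

print(modulo_shard("user:1001"), interval_shard("user:1001"))
```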

For client-side partitioning, since the Redis cluster has multiple master shards, and each master has multiple slaves, each Redis node has a separate IP and port. If a master fails and needs to switch to another master, or if there is high read pressure and new slaves need to be added, these will involve changes to the cluster’s storage nodes, requiring the client to switch connections.


To avoid frequent IP list changes on the client side, you can use DNS to manage the master-slave cluster: different DNS domain names are used for the masters and the slaves of each shard in the Redis cluster. The client obtains all the IPs under a domain name through domain name resolution and then accesses the cluster nodes. Since there are multiple slaves under each master shard, the client needs to balance the load among them; it can establish connections to the slaves according to their weights and then use those connections in a weighted round-robin manner.

In the DNS access mode, the client needs to asynchronously and periodically probe the master and slave domain names. If an IP change is discovered, it should promptly establish connections to the new nodes and close the old ones. In this way, when a master fails and needs to be switched, or when slaves need to be added or removed, any change in a shard's master-slave topology only requires operations staff or a management process to change the IP list under the DNS name; the client needs no configuration change and keeps accessing the cluster normally.
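A minimal sketch of this approach is shown below; the domain name, the per-IP weights, and the refresh behavior are made-up examples, and socket.gethostbyname_ex stands in for whatever DNS resolution the client actually uses:

```python
import itertools
import socket

SLAVE_DOMAIN = "shard1-slaves.redis.example.com"  # hypothetical slave domain name

def resolve_slaves(domain):
    """Resolve the slave domain name to its current IP list."""
    _, _, ips = socket.gethostbyname_ex(domain)
    return ips

def build_round_robin(ips, weights=None):
    """Repeat each IP according to its weight and cycle over the result."""
    weights = weights or {}
    weighted = [ip for ip in ips for _ in range(weights.get(ip, 1))]
    return itertools.cycle(weighted)

# The client would re-run resolve_slaves() periodically (say, every 30 seconds),
# rebuild the cycle when the IP list changes, and close connections to IPs
# that have disappeared from the list.
slaves = build_round_robin(resolve_slaves(SLAVE_DOMAIN))
next_slave_ip = next(slaves)
```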

The advantages of the client-side partitioning solution are that the partitioning logic is simple, the configuration is simple, there is no need for coordination between client nodes or Redis nodes, and it has strong flexibility. Moreover, the client directly accesses the corresponding Redis node without any additional steps, resulting in high efficiency. However, this solution is not convenient for scaling. On the Redis end, it can only be scaled by multiples or by pre-allocating enough shards. On the client side, each time the shards are split, the business client needs to modify the distribution logic and restart.

Proxy-Side Partitioning #

In the proxy-side partitioning solution, the client sends its requests to a proxy component; the proxy parses each client request, routes it to the correct Redis node, waits for the Redis response, and finally returns the result to the client.


If a request contains multiple keys, the proxy splits it into multiple sub-requests according to the sharding logic, sends them to the corresponding Redis shards, and waits for the responses. Once all the split responses have arrived, it aggregates and assembles them and returns the result to the client. Throughout this process, the proxy is responsible for receiving and parsing requests, computing key routes, and reading, parsing, and assembling results. If the master-slave topology changes or the cluster is scaled, the proxy handles it and the business client is essentially unaffected.

There are two common proxy partitioning schemes: a simple partitioning scheme based on Twemproxy, and a partitioning scheme based on Codis that supports smooth data migration.
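For illustration, a hedged sketch of how a proxy might split a multi-key GET by shard and reassemble the results is shown below; the shard addresses and the CRC32-based routing function are assumptions, and redis-py is used only as a convenient client:

```python
import binascii
import redis  # redis-py

# Hypothetical shard endpoints; a real proxy would hold long-lived connections.
SHARDS = [redis.Redis(host="10.0.0.1", port=6379),
          redis.Redis(host="10.0.0.2", port=6379)]

def shard_for(key: str) -> int:
    # Stand-in routing function: hash the key and take it modulo the shard count.
    return binascii.crc32(key.encode()) % len(SHARDS)

def proxy_mget(keys):
    # 1. Split the multi-key request into one sub-request per shard.
    by_shard = {}
    for key in keys:
        by_shard.setdefault(shard_for(key), []).append(key)
    # 2. Send each sub-request to its shard and collect the partial results.
    partial = {}
    for shard_id, shard_keys in by_shard.items():
        values = SHARDS[shard_id].mget(shard_keys)
        partial.update(zip(shard_keys, values))
    # 3. Reassemble the results in the order the client asked for them.
    return [partial[key] for key in keys]
```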

Twemproxy is an open-source component developed by Twitter. It is a proxy component that supports both the Redis and Memcached protocols. I gave a detailed introduction to its principles and implementation architecture in the earlier lesson on distributed Memcached in practice, so I will not repeat it here. In general, Twemproxy is simple to implement and highly stable, and it can meet the needs of business scenarios with low traffic and infrequent scaling. However, Twemproxy uses a single-process, single-threaded model, and for multi-key requests it has to split and aggregate the sub-requests, so its processing performance is lower. Additionally, when scaling the backend Redis resources, i.e., adding or removing shards, you need to modify the configuration and restart the proxy, so smooth scaling is not possible. Moreover, Twemproxy is only a proxy component and provides no management backend, which makes various operations and maintenance changes less convenient.

Codis, on the other hand, is a more mature distributed Redis solution. There is almost no difference between connecting to Codis-proxy and connecting to a single Redis instance when it comes to business client access. In addition to automatically parsing and distributing requests, Codis can also perform data migration online, making it very convenient to use.

The Codis system mainly consists of Codis-server, Codis-proxy, Codis-dashboard, and ZooKeeper.

  • Codis-server is the storage component of Codis, which is an extension based on Redis and adds slot support and data migration capabilities. All data is stored in the pre-allocated 1024 slots, and synchronous or asynchronous data migration can be performed according to the slots.
  • Codis-proxy handles client requests, parses business requests, and routes them to the backend Codis-server group. Each server group of Codis is equivalent to a Redis shard, consisting of one master and N replicas.
  • ZooKeeper is used to store metadata, such as the nodes of the proxy and the routing table for data access. In addition to ZooKeeper, Codis also supports other components such as etcd for metadata storage and notifications.
  • Codis-dashboard is the management interface of Codis, which can be used to manage data and proxy node addition or deletion, as well as perform data migration operations. The commands for various changes in the dashboard are distributed through ZooKeeper.
  • Codis provides a feature-rich management interface, which makes it easy to monitor and operate the entire cluster.

The advantage of the proxy partitioning scheme is that it decouples the client access logic from the Redis distribution logic, making it convenient and simple for business access. When the resources change or scaling is needed, you only need to modify a limited number of proxies, without the need to adjust a large number of business client endpoints.

However, in the proxy partitioning scheme, the request needs to be forwarded through the proxy, adding an extra hop to the access, resulting in a performance loss. Generally, the performance loss is around 5-15%. Additionally, the system architecture becomes more complex with the addition of an extra layer of proxy.

Redis Cluster Partitioning #

The Redis community edition introduced the Cluster strategy in version 3.0, generally referred to as the Redis-Cluster scheme. Redis-Cluster manages data reads and writes based on slots. A Redis-Cluster consists of 16384 slots, and each Redis shard is responsible for a portion of them. When the cluster starts, all the slots are assigned to the different nodes as needed. Once the cluster is running, keys are hashed and routed to the corresponding nodes for access according to the slot assignment.

As the business access pattern changes, some Redis nodes may come under excessive pressure and access may become unbalanced. In this case, slots can be migrated between the Redis shard nodes to rebalance access. If the business keeps growing and the data volume and TPS become too high, some slots can be migrated from existing Redis nodes to new nodes to increase the number of Redis-Cluster shards, scaling the Redis resources and improving the capacity and read/write capability of the entire cluster.

When starting a Redis cluster, before it begins serving reads and writes, you can use Redis's cluster addslots command to assign the 16384 slots to the different Redis shard nodes. You can also use the cluster delslots command to remove slots from a specific node, and the cluster flushslots command to clear all slot information from a node, in order to adjust the slot assignment. Redis Cluster has a decentralized architecture in which every node records the topological distribution of all slots, so a Redis server can redirect a client to the correct Redis node if a key's request arrives at the wrong node.
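For example, assuming a three-shard cluster with made-up addresses, the 16384 slots could be split into three roughly equal ranges and assigned with cluster addslots; the sketch below uses redis-py's generic execute_command to issue the command:

```python
import redis  # redis-py

# Hypothetical master addresses for a three-shard cluster.
masters = [("10.0.0.1", 6379), ("10.0.0.2", 6379), ("10.0.0.3", 6379)]
ranges = [(0, 5460), (5461, 10922), (10923, 16383)]  # the 16384 slots in three ranges

for (host, port), (start, end) in zip(masters, ranges):
    node = redis.Redis(host=host, port=port)
    # CLUSTER ADDSLOTS takes the explicit list of slot numbers this node will own.
    node.execute_command("CLUSTER ADDSLOTS", *range(start, end + 1))
```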

In Redis Cluster, the Redis shard nodes communicate with each other through the gossip protocol. The advantage of gossip is that there is no central control node, so updates do not depend on any single node and change notifications can be issued through any node in the cluster. The downside is that metadata updates have some latency, so cluster operations are only reflected on all Redis nodes after a certain delay. Because Redis Cluster uses gossip for inter-node communication, scaling up or down only requires sending a cluster meet command to any node in the cluster to add a new node; knowledge of the new node is then propagated throughout the cluster. There is no need to meet every node individually, which makes the operation convenient.

In Redis-Cluster, keys are accessed through a smart client. The client first sends a request to some Redis node. After receiving and parsing the command, Redis computes the key's slot: it takes the CRC16 hash of the key and performs a bitwise AND with 16383, i.e., slot = CRC16(key) & 16383. If Redis finds that the key's slot is local, it executes the command directly and returns the result.

If Redis finds that the key's slot is not local, it returns a MOVED error response, along with the key's slot and the host and port of the Redis node responsible for that slot. The client parses the response, obtains the correct node's IP and port, and redirects the request to that node to complete it. To speed up access, the client should cache the mapping between slots and Redis nodes so that it can reach the correct node directly, improving access performance.
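The sketch below shows the slot calculation and the kind of slot-to-node routing table a smart client maintains. Python's binascii.crc_hqx computes the CRC-16/XMODEM checksum that Redis Cluster uses; hash tags (the {...} syntax) are ignored for simplicity, and the default node address is a made-up placeholder:

```python
import binascii

def key_slot(key: str) -> int:
    """Redis Cluster slot: CRC16/XMODEM of the key ANDed with 16383.
    (Hash tags such as {user} are ignored in this simplified sketch.)"""
    return binascii.crc_hqx(key.encode(), 0) & 16383

# A smart client keeps a local routing table, slot -> (host, port), learned
# from CLUSTER SLOTS / CLUSTER SHARDS or from MOVED responses, so later
# requests for the same slot go straight to the right node.
slot_to_node = {}

def route(key: str, default_node=("10.0.0.1", 6379)):
    # Look up the cached owner of the key's slot; fall back to any known node.
    # If that node replies with "MOVED <slot> <host>:<port>", the client should
    # update slot_to_node[slot] and resend the request to the returned address.
    return slot_to_node.get(key_slot(key), default_node)

print(key_slot("user:1001"), route("user:1001"))
```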

Redis-Cluster provides a flexible node scaling solution, allowing dynamic addition or removal of nodes to the cluster without affecting user access. Since scaling up is the most common operation in production, let’s first analyze how Redis-Cluster performs the scaling operation.

When preparing to scale Redis, first prepare the new nodes: deploy Redis, set cluster-enabled to yes in the configuration, and start the nodes. Then, connect to any Redis node in the cluster with a client and use the cluster meet command to add the new node to the cluster; the other nodes in the cluster are then notified that a new node has joined. Since the newly added node has not been assigned any slots yet, it does not accept any reads or writes.

Next, use the cluster setslot command on the new (target) node to set the slot to be migrated to the importing state, and use the cluster setslot command on the source node to set the same slot to the migrating state.

Next, obtain the keys to be migrated from the source node with the cluster getkeysinslot command, which returns up to N keys from the slot. Then use the migrate command to move these keys, one by one or in bulk, to the target node. To migrate a single key, use migrate (host) (port) (key) (dbid) (timeout). To migrate multiple keys at once, add the keys option at the end of the command followed by the keys, and leave the single-key argument empty. Keep repeating these two steps, fetching keys in the slot and migrating them, until all data in the slot has been moved to the new node. Finally, use the cluster setslot command to assign the slot to the new node. The setslot command can be sent to any node in the cluster, and that node will propagate the assignment to the entire cluster. With these steps, the slot has been migrated to the new node; if multiple slots need to be migrated, repeat the procedure until all of them have been moved.
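Putting the steps together, a hedged sketch of migrating one slot is shown below. The node addresses, the slot number, and the batch size are placeholders; redis-py's generic execute_command is used only as a convenient way to issue the cluster commands described above:

```python
import redis  # redis-py

SLOT = 4096        # the slot being migrated (example value)
BATCH = 100        # how many keys to fetch per round
TIMEOUT_MS = 5000  # migrate timeout in milliseconds

src = redis.Redis(host="10.0.0.1", port=6379)  # source shard (made-up address)
dst = redis.Redis(host="10.0.0.9", port=6379)  # new target node (made-up address)
src_id = src.execute_command("CLUSTER MYID")
dst_id = dst.execute_command("CLUSTER MYID")

# 0. The new node is assumed to have already joined via: cluster meet 10.0.0.9 6379
# 1. Mark the slot as importing on the target and migrating on the source.
dst.execute_command("CLUSTER SETSLOT", SLOT, "IMPORTING", src_id)
src.execute_command("CLUSTER SETSLOT", SLOT, "MIGRATING", dst_id)

# 2. Repeatedly fetch keys in the slot and migrate them in bulk until it is empty.
while True:
    keys = src.execute_command("CLUSTER GETKEYSINSLOT", SLOT, BATCH)
    if not keys:
        break
    # MIGRATE host port "" db timeout KEYS key1 key2 ...
    # (the single-key argument is left empty when the KEYS option is used)
    src.execute_command("MIGRATE", "10.0.0.9", 6379, "", 0, TIMEOUT_MS, "KEYS", *keys)

# 3. Assign the slot to the new node; the assignment is propagated to the cluster.
dst.execute_command("CLUSTER SETSLOT", SLOT, "NODE", dst_id)
src.execute_command("CLUSTER SETSLOT", SLOT, "NODE", dst_id)
```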

The new node that has received the migrated slots is a master node. For online applications, slave nodes should be added to it to increase read capacity and availability; otherwise, if the master crashes, the data of the entire shard becomes inaccessible. Adding a slave in cluster mode is different from non-cluster mode: the slaveof command cannot be used; instead, the cluster replicate command must be used to attach slaves to a cluster shard node. In addition, in cluster mode a slave can only be attached directly to the shard's master node; a slave cannot have slaves of its own.
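As a small hedged sketch (addresses are made up, and the new replica is assumed to have already joined the cluster via cluster meet), attaching a replica to the new master could look like this:

```python
import redis  # redis-py

# The new empty node that will become a replica (made-up address); it must
# already be part of the cluster (added earlier with cluster meet).
replica = redis.Redis(host="10.0.0.10", port=6379)

# Node ID of the master that owns the new shard, fetched here via CLUSTER MYID.
master_id = redis.Redis(host="10.0.0.9", port=6379).execute_command("CLUSTER MYID")

# In cluster mode slaveof is not allowed; cluster replicate attaches this node
# as a replica of the given master node ID instead.
replica.execute_command("CLUSTER REPLICATE", master_id)
```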

The Redis community officially provides redis-trib.rb as a management tool for Redis Cluster in the source code. This tool is developed in Ruby, so before using it, you need to install the required dependencies. Redis-trib encapsulates the Redis commands mentioned earlier, enabling various functionalities such as creating clusters, checking clusters, adding and removing nodes, and online slot migration.

During slot migration in Redis Cluster, the commands that fetch keys and perform the migration are sent and executed batch by batch, without interrupting normal client access. However, while a single key or a batch of keys is being migrated, the Redis node is blocked: once Redis starts executing the migration command, it blocks until the migration succeeds or is confirmed to have failed, and only then does it continue processing other requests. The migration of keys within a slot is performed with the migrate command.

When the source node receives the migrate (host) (port) (key) (destination-db) (timeout) command, it establishes a socket connection with the target node. If this is the first migration, or if the DB being migrated differs from the previous migration's DB, it first sends select $dbid to switch to the correct DB before transferring the data.

The source node then iterates over the key/value pairs to be migrated. For each key it retrieves the expiration time and serializes the value. Serialization dumps the value into the binary format used by RDB storage, which consists of three parts: the type of the value object, the actual binary data of the value, and the current RDB format version together with a CRC64 checksum of the payload. At this point, the data to migrate is ready. The source node sends the restore-asking command to the target node, carrying the expiration time, the key, and the serialized value, and then waits for the target node's response.
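This serialized payload is the same format exposed by the user-visible DUMP and RESTORE commands (MIGRATE internally uses a restore-asking variant of RESTORE), so a hedged sketch with made-up addresses can illustrate it using redis-py's dump and restore helpers:

```python
import redis  # redis-py

src = redis.Redis(host="10.0.0.1", port=6379)  # made-up source node
dst = redis.Redis(host="10.0.0.9", port=6379)  # made-up target node

src.set("user:1001", "some-value")

# DUMP produces the serialized payload: the value in RDB encoding, followed by
# a 2-byte RDB version and an 8-byte CRC64 checksum.
payload = src.dump("user:1001")
ttl_ms = max(src.pttl("user:1001"), 0)  # 0 means "no expiration" for RESTORE

# RESTORE verifies the RDB version and CRC64 checksum, then stores the key
# on the target with the given TTL.
dst.restore("user:1001", ttl_ms, payload)
```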

On the target node, when the commands arrive, it first switches to the correct database if a select command was sent, and then reads and processes the restore-asking command. To process restore-asking, it parses and verifies the received data: it retrieves the key's TTL and checks the RDB version and the CRC64 checksum of the value data. After confirming everything is correct, it stores the data in its redisDb, sets the expiration time, and returns a response.

After receiving a success response from the target node, the source node deletes the migrated keys (unless the copy option was used). At this point the migration of those keys is complete. The migrate command can move one or more keys at a time. Note that throughout the process, after the source node sends the restore-asking command, it waits synchronously and blocks until the target node finishes processing the data, a timeout occurs, or a response is returned; only after handling the result does it stop blocking and continue processing other events. Therefore, the number of keys migrated at a time should not be too large, otherwise the blocking time becomes long and Redis appears to freeze. Even when only one key is migrated at a time, a very large value can also cause Redis to freeze briefly.

During slot migration, keys in slots that are not being migrated can be accessed normally, and keys in the migrating slot can also be read and written normally, so business operations are not affected. However, because key migration is blocking, i.e., the source node processes no other requests while a key is being migrated, a key that is read or written during slot migration can only be in one of three states.

1. Not yet migrated, will be migrated in the future;
2. Already migrated;
3. This key did not exist in the cluster before and is a new key.

During slot migration, the handling of keys in the nodes is as follows.

  • For a key that has not been migrated yet, i.e., the key can still be found in the local DB, the read or write is processed directly on the local node, regardless of whether the slot the key belongs to is being migrated.
  • For a key that cannot be found in the DB but belongs to a migrating slot, which covers both the already-migrated case and the never-existed case, Redis returns an ASK error response to the client, along with the host and port of the migration's target node. After receiving the ASK response, the client redirects the request to the target node to complete the processing (a client-side sketch of this redirect follows this list).
  • For a key that cannot be found in the DB and whose slot does not belong to the current node, the client has sent the request to the wrong node, so Redis directly returns a MOVED error response, along with the host and port of the node responsible for the key, and the client redirects the request.
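For the second case above, a hedged sketch of following the ASK redirect is shown below; the addresses are made up, and the key protocol detail is that ASKING must be sent on the same connection immediately before the retried command, so a single-connection client is used here:

```python
import redis  # redis-py

# Suppose a GET on the source node returned: ASK 4096 10.0.0.9:6379
# The client then contacts the importing node, sends ASKING, and retries the
# command on that same connection. Addresses below are placeholders.
target = redis.Redis(host="10.0.0.9", port=6379, single_connection_client=True)

# ASKING tells the importing node to serve the next command for this slot
# even though the slot has not been fully assigned to it yet.
target.execute_command("ASKING")
value = target.get("user:1001")
```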
Overall, the Redis Cluster solution is implemented by the official community and comes with the redis-trib cluster tool, which makes it convenient to deploy and use. It also supports online scaling up and down, and the status of the cluster can be inspected at any time with the tool. However, this solution also has a number of drawbacks.

  • First, the data storage and cluster management logic are coupled, which makes the code complex and error-prone.
  • Second, each Redis node has to store the mapping between slots and keys, which requires extra memory; the impact is more significant for businesses with small values and relatively large keys.
  • Moreover, key migration is blocking, so migrating a large value can cause the service to freeze; and because keys are first fetched and then migrated, the migration efficiency is low.
  • Finally, in Cluster mode, a replica can only be attached directly to a master node and nested slaves are not supported, which can put excessive pressure on the master and cannot serve business scenarios that need a large number of slaves and very high read TPS.