
38 Communication Overhead: The Key Factor Limiting Redis Cluster Scale #

The amount of data a Redis Cluster can store and the throughput it can sustain are closely tied to the number of instances in the cluster. Redis officially caps Redis Cluster at 1000 instances per cluster.

So why is the cluster's scale limited at all? A key factor is that the communication overhead between instances grows with the number of instances, and once the cluster exceeds a certain size (say, 800 nodes), its throughput may actually drop. That is why the practical scale of a cluster is limited.

In today’s lesson, we will discuss how the communication overhead between cluster instances affects the scale of Redis Cluster and how to reduce the communication overhead between instances. By mastering today’s content, you will be able to expand the scale of Redis Cluster through proper configuration while maintaining high throughput.

Instance Communication Method and its Impact on Cluster Scale #

During the runtime of Redis Cluster, each instance maintains the mapping between slots and instances (known as the slot mapping table) as well as its own state information.

To ensure that each instance in the cluster has knowledge of the state information of all other instances, instances communicate with each other according to a certain rule. This rule is known as the Gossip protocol.

The working principle of the Gossip protocol can be summarized in two points.

Firstly, at regular intervals, each instance randomly selects some instances from the cluster and sends PING messages to them in order to check if they are online and exchange their state information. The PING message includes the state information of the sending instance itself, partial state information of other instances, and the slot mapping table.

Secondly, when an instance receives a PING message, it sends a PONG message back to the sender. The content of the PONG message is the same as the PING message.

The following diagram illustrates the exchange of PING and PONG messages between two instances.

The Gossip protocol guarantees that after a certain period of time, each instance in the cluster can obtain the state information of all other instances.

Therefore, even in the event of new nodes joining, node failures, or slot changes, the cluster state can be synchronized on each instance through the exchange of PING and PONG messages.
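To build intuition for why this convergence happens, here is a toy, self-contained C simulation of gossip-style propagation. It is not Redis code, and it simplifies the round structure so that each instance contacts one random peer per round (real Redis samples 5 peers and pings the longest-silent one, as discussed later):

#include <stdio.h>
#include <stdlib.h>

#define N 16  // toy cluster size

// known[i][j] == 1 means instance i has seen instance j's state.
static int known[N][N];

// One gossip exchange: a and b merge what they know about everyone,
// mirroring how PING and PONG carry partial state in both directions.
static void exchange(int a, int b) {
    for (int j = 0; j < N; j++) {
        if (known[a][j] || known[b][j])
            known[a][j] = known[b][j] = 1;
    }
}

int main(void) {
    srand(42);
    for (int i = 0; i < N; i++) known[i][i] = 1;  // each node knows itself

    for (int round = 1; ; round++) {
        // Each round, every instance gossips with one random peer
        // (simplified from Redis's sample-5-pick-longest-silent rule).
        for (int i = 0; i < N; i++) exchange(i, rand() % N);

        int complete = 1;
        for (int i = 0; i < N && complete; i++)
            for (int j = 0; j < N; j++)
                if (!known[i][j]) { complete = 0; break; }
        if (complete) {
            printf("all %d instances converged after %d rounds\n", N, round);
            return 0;
        }
    }
}

Running it shows that full knowledge spreads in only a handful of rounds, which is the property the Gossip protocol relies on.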

From the previous analysis, we can see that when instances communicate using the Gossip protocol, the communication overhead is affected by two factors: the size of the communication messages and the communication frequency.

The larger the message size and the higher the frequency, the greater the corresponding communication overhead. To achieve efficient communication, we can optimize these two aspects. Next, let’s analyze these two aspects in detail.

First, let’s look at the size of the Gossip messages used for instance communication.

Gossip Message Size #

The instance status entries carried in the PING message sent by a Redis instance are encoded with the clusterMsgDataGossip structure, which is defined as follows:

typedef struct {
    char nodename[CLUSTER_NAMELEN]; // 40 bytes
    uint32_t ping_sent;             // 4 bytes
    uint32_t pong_received;         // 4 bytes
    char ip[NET_IP_STR_LEN];        // 46 bytes
    uint16_t port;                  // 2 bytes
    uint16_t cport;                 // 2 bytes
    uint16_t flags;                 // 2 bytes
    uint32_t notused1;              // 4 bytes
} clusterMsgDataGossip;

In this code block, CLUSTER_NAMELEN and NET_IP_STR_LEN are 40 and 46 respectively, giving the lengths of the nodename and ip byte arrays. Adding up all the fields of the structure, one such Gossip entry comes to 104 bytes.
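If you want to verify the 104-byte figure yourself, here is a standalone C check that mirrors the structure above with the two constants filled in. On typical platforms the fields happen to pack without padding, so sizeof reports exactly 104:

#include <stdio.h>
#include <stdint.h>

#define CLUSTER_NAMELEN 40
#define NET_IP_STR_LEN  46

typedef struct {
    char nodename[CLUSTER_NAMELEN];
    uint32_t ping_sent;
    uint32_t pong_received;
    char ip[NET_IP_STR_LEN];
    uint16_t port;
    uint16_t cport;
    uint16_t flags;
    uint32_t notused1;
} clusterMsgDataGossip;

int main(void) {
    // 40 + 4 + 4 + 46 + 2 + 2 + 2 + 4 = 104 bytes; the 4-byte fields
    // land on aligned offsets, so no padding is inserted.
    printf("%zu\n", sizeof(clusterMsgDataGossip));  // prints 104 on typical platforms
    return 0;
}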

When sending a Gossip message, each instance passes not only its own status information but also, by default, the status information of one-tenth of the cluster's instances.

Therefore, for a cluster containing 1000 instances, each PING message an instance sends carries the status information of 100 instances, for a total of 10,400 bytes (100 × 104 bytes). Adding the sending instance's own information, a Gossip message is approximately 10KB in size.

In addition, in order for the Slot Mapping Table to be propagated among different instances, the PING message also carries a bitmap with a length of 16,384 bits. Each bit of this bitmap corresponds to a slot, with a value of 1 indicating that the slot belongs to the current instance. After converting the size of this bitmap to bytes, it is 2KB. Adding the instance status information and slot allocation information, we can obtain the size of a PING message, which is approximately 12KB.

The content of the PONG message is the same as that of the PING message, so its size is also approximately 12KB. After sending a PING message, each instance will receive a returned PONG message, making the total size of the two messages 24KB.
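The size arithmetic above can be reproduced with a short C sketch. Treating a PING as the gossip entries plus the 2KB slot bitmap is the lesson's approximation; the real message also carries a fixed header, which is why these figures are all approximate:

#include <stdio.h>

int main(void) {
    const int entry_bytes  = 104;                  // one clusterMsgDataGossip entry
    const int cluster_size = 1000;                 // instances in the cluster
    const int entries      = cluster_size / 10;    // 1/10 of the cluster by default
    const int bitmap_bytes = 16384 / 8;            // 16,384-slot bitmap = 2KB

    int ping_bytes = entries * entry_bytes + bitmap_bytes;  // ~12KB
    int pong_bytes = ping_bytes;                            // PONG mirrors PING
    printf("PING ~= %d bytes, PING+PONG ~= %d bytes\n",
           ping_bytes, ping_bytes + pong_bytes);            // ~12KB and ~24KB
    return 0;
}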

Although 24KB does not seem large in absolute terms, if a single request processed by an instance is only several KB, then the PING/PONG messages transmitted by the instance to maintain cluster state consistency will be larger than a single business request. Moreover, each instance will send PING/PONG messages to other instances. As the cluster grows, the number of these heartbeat messages increases, occupying a portion of the cluster's network communication bandwidth, which in turn reduces the throughput of normal client requests for the cluster service.

In addition to the impact of heartbeat message size on communication overhead, if the communication between instances is very frequent, it will also cause the cluster network bandwidth to be frequently occupied. So, what is the communication frequency between instances in Redis Cluster like?

Communication Frequency Between Instances #

After a Redis Cluster instance starts, every second it randomly selects 5 instances from its local instance list, and from those 5 it picks the one it has gone longest without communicating with, sending a PING message to that instance. This is the basic mechanism by which instances periodically send PING messages.

However, there is a catch: the "longest-silent" instance is chosen only from the 5 randomly sampled instances, so there is no guarantee that it is the instance that has gone longest without communication in the entire cluster.

Therefore, it is possible that **some instances never receive a PING message, and the cluster state they maintain becomes stale**.

To avoid this situation, Redis Cluster instances also scan the local instance list once every 100ms. If an instance finds a peer whose last PONG was received more than half of the configuration item cluster-node-timeout ago (cluster-node-timeout/2), it immediately sends that peer a PING message to refresh the cluster state information on it.
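Putting the one-second rule and the 100ms rule together, below is a minimal, self-contained C sketch of this behavior. It is illustrative only: the Node type, its field names, and the helpers are assumptions made for this example, not Redis's actual clusterCron code.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

// Illustrative node record; the field names are assumptions, not Redis internals.
typedef struct {
    int     id;
    int64_t last_pong_ms;   // when we last received a PONG from this node
} Node;

static void send_ping(Node *n) { printf("PING -> node %d\n", n->id); }

// Runs 10 times per second (every 100ms) in Redis: ping every node whose
// last PONG is older than cluster-node-timeout / 2.
static void cron_tick(Node *nodes, int n, int64_t now, int64_t node_timeout) {
    for (int i = 0; i < n; i++)
        if (now - nodes[i].last_pong_ms > node_timeout / 2)
            send_ping(&nodes[i]);
}

// Runs once per second: sample 5 random peers, ping the longest-silent one.
static void ping_random_node(Node *nodes, int n) {
    Node *target = NULL;
    for (int i = 0; i < 5; i++) {
        Node *c = &nodes[rand() % n];
        if (!target || c->last_pong_ms < target->last_pong_ms) target = c;
    }
    send_ping(target);
}

int main(void) {
    Node nodes[3] = {{0, 0}, {1, 9000}, {2, 2000}};
    ping_random_node(nodes, 3);
    // With cluster-node-timeout = 15000ms, peers silent for more than
    // 7500ms (nodes 0 and 2 here) get an immediate PING from the scan.
    cron_tick(nodes, 3, /*now=*/10000, /*node_timeout=*/15000);
    return 0;
}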

When the cluster scale expands, the network communication delay between instances will increase due to network congestion or traffic competition between different servers. If some instances cannot receive PONG messages sent by other instances, it will cause instances to send PING messages frequently to each other, which will bring additional overhead to the cluster network communication.

Let's summarize the number of PING messages sent by a single instance per second as follows:

> PING messages sent per second = 1 + 10 × (number of instances whose last PONG was received more than cluster-node-timeout/2 ago)

Here, 1 is the single PING the instance sends every second, and 10 reflects the 10 checks per second (one every 100 milliseconds), each of which sends a PING to every instance whose PONG has timed out.

Let me use an example to analyze how PING messages can eat into cluster bandwidth.

Suppose that at each 100-millisecond check, a single instance finds 10 instances whose PONG messages have timed out. That instance will then send 101 PING messages per second, occupying approximately 1.2MB/s of bandwidth. If 30 instances in the cluster send messages at this rate, they consume 36MB/s, squeezing the bandwidth available for serving normal requests.
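The same back-of-the-envelope calculation can be written as a small runnable C snippet, using the roughly 12KB PING size estimated earlier:

#include <stdio.h>

int main(void) {
    const double ping_kb        = 12.0;  // approximate PING size from earlier
    const int    base_per_sec   = 1;     // the regular once-per-second PING
    const int    timed_out      = 10;    // peers past cluster-node-timeout/2 per check
    const int    checks_per_sec = 10;    // the 100ms scan runs 10 times per second

    int    pings_per_sec = base_per_sec + timed_out * checks_per_sec;   // 101
    double per_instance  = pings_per_sec * ping_kb / 1024;              // ~1.2MB/s
    printf("per instance: %d PINGs/s, ~%.1f MB/s\n", pings_per_sec, per_instance);
    printf("30 instances: ~%.0f MB/s\n", per_instance * 30);            // ~36MB/s
    return 0;
}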

So, what can we do to reduce the communication overhead between instances?

How to Reduce Communication Overhead Between Instances? #

To reduce the communication overhead between instances, we could in theory shrink the messages they exchange (PING/PONG messages and slot allocation information). However, cluster instances rely on exactly these messages to maintain a consistent view of cluster state; shrinking them would reduce the information exchanged between instances, which would hurt cluster maintenance. So this approach is off the table.

So, can we reduce the frequency of message sending between instances? Let’s analyze it.

From what we have learned so far, message sending between instances happens on two schedules:

  • Each instance sends a PING message every 1 second. This frequency is not high, and if it is further reduced, the state of each instance in the cluster may not be able to propagate in time.
  • Every 100 milliseconds, each instance performs a check and sends a PING message to every node from which it has not received a PONG for more than cluster-node-timeout/2. The 100-millisecond interval is the default period of the Redis instance's background check task, and we generally do not need to modify it.

Therefore, the only configuration item that can be modified is cluster-node-timeout.

The cluster-node-timeout configuration item defines the heartbeat timeout used to judge whether a cluster instance has failed; the default is 15 seconds. If cluster-node-timeout is relatively small, then in a large-scale cluster PONG reception timeouts will occur frequently, constantly triggering the 10-times-per-second "PING the instances whose PONGs have timed out" behavior.

Therefore, to keep excessive heartbeat messages from hogging cluster bandwidth, we can increase the cluster-node-timeout value, say to 20 or 25 seconds. PONG reception timeouts will then be far less common, and individual instances will not need to trigger the 10-times-per-second extra heartbeats so often.
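For reference, cluster-node-timeout is specified in milliseconds in redis.conf. A minimal illustration of the adjustment above (20 seconds is just the example value from the text):

# In redis.conf, the value is in milliseconds: raise the default 15s timeout to 20s
cluster-node-timeout 20000

If your deployment permits runtime reconfiguration, the same change can be applied to a running instance with redis-cli by executing config set cluster-node-timeout 20000.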

Of course, we should not set the cluster-node-timeout too large. Otherwise, if a real instance failure occurs, we would need to wait for the duration of cluster-node-timeout to detect this failure, which will further delay the actual fault recovery time and affect the normal use of cluster services.

To verify whether adjusting the cluster-node-timeout value can reduce the network bandwidth occupied by heartbeat messages, I have a suggestion for you: You can use the tcpdump command to capture the instances’ heartbeat message network packets before and after adjusting the cluster-node-timeout value.

For example, after executing the following command, we can capture the heartbeat network packets sent by instances on the 192.168.10.3 machine from port 16379 and save the contents of the network packets to the r1.cap file:

tcpdump -i {network_interface} -w /tmp/r1.cap host 192.168.10.3 and port 16379

By analyzing the number and size of network packets, you can determine the bandwidth occupied by heartbeat messages before and after adjusting the cluster-node-timeout value.
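Once you have captures taken before and after the adjustment, one simple way to compare them is to count the packets and total bytes in each file. The commands below are one possible approach; capinfos is part of Wireshark's command-line tools and is an extra dependency, not something the lesson requires:

# Count the captured heartbeat packets
tcpdump -r /tmp/r1.cap | wc -l

# If Wireshark's CLI tools are installed, capinfos summarizes the
# packet count and total bytes in one report
capinfos /tmp/r1.cap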

Summary #

In this lesson, I introduced to you the mechanism of communication between Redis Cluster instances using the Gossip protocol. When running Redis Cluster, instances need to exchange information through PING and PONG messages. These heartbeat messages contain the current instance’s status information, as well as slot allocation information for some other instances. This communication mechanism helps ensure that all instances in the Redis Cluster have complete cluster state information.

However, as the size of the cluster increases, the amount of communication between instances also increases. If we blindly scale Redis Cluster, we may encounter performance degradation in the cluster. This is because the large amount of heartbeat messages between instances can take up bandwidth that would otherwise be used for processing normal requests in the cluster. Additionally, some instances may not receive PONG messages in a timely manner due to network congestion. Each instance periodically checks for this scenario (10 times per second) during runtime, and if it occurs, it immediately sends heartbeat messages to the instances that have timed out with PONG messages. The larger the cluster, the higher the probability of network congestion and the higher the probability of PONG message timeouts, which results in a large number of heartbeat messages in the cluster, affecting the processing of normal requests in the cluster.

Finally, I also have a suggestion for you. Although we can reduce the bandwidth occupied by heartbeat messages by adjusting the cluster-node-timeout configuration, in practical applications, unless there is a specific need for a large-scale cluster, I recommend keeping the size of your Redis Cluster between 400 to 500 instances.

Assuming a single instance can handle 80,000 request operations per second (80,000 QPS), and each master instance is configured with one replica, then 400 to 500 instances means 200 to 250 masters, which can support 16 million to 20 million QPS (200 to 250 master instances × 80,000 QPS). This throughput can meet the needs of many business applications.

One Question per Lesson #

As usual, I have a small question for you. If we adopt a method similar to Codis to store the cluster instance status information and slot allocation information in a third-party storage system (such as Zookeeper), how would this method affect the scale of the cluster?

Please feel free to share your thoughts and answers in the comment section. Let’s discuss and exchange ideas together. If you found today’s content helpful, please feel free to share it with your friends or colleagues. See you in the next lesson.