18 RocketMQ Cluster Smooth Operation and Maintenance #

Introduction #

In the operation and maintenance of RocketMQ clusters, whether starting and stopping online broker nodes or scaling the cluster in and out, the goal is for the change to be smooth and transparent to the business. As the saying goes, "It slips into the night on the wind, moistening all things silently." This article walks through a series of such smooth operations based on real-world cases.

Graceful Node Removal #

Background #

In a self-built data center there is a cluster of 4 masters and 4 slaves, with asynchronous flushing and asynchronous replication between master and slave. One day, the login credentials for one of the master machines were lost. Although the node was still running normally in the cluster, a machine that no one can log in to is ultimately a security risk, so it was decided to remove the node gracefully.

How to remove the node gracefully?

Directly shutting down the node is not an option, because data that has not yet been replicated to the slave node would be lost. The safe approach for online changes is to "remove traffic first": once no traffic flows in or out of a node, it can be operated on safely.

Traffic Removal #

1. Removing Write Traffic

We can remove write traffic from the node by disabling the broker's write permission. A RocketMQ broker node has three permission settings: brokerPermission=2 means write-only, brokerPermission=4 means read-only, and brokerPermission=6 means read-write. Setting the broker to read-only with the updateBrokerConfig command redistributes its write traffic to the other nodes in the cluster, so before removing a node, evaluate whether the remaining nodes can carry the extra load.

bin/mqadmin updateBrokerConfig -b x.x.x.x:10911 -n x.x.x.x:9876 -k brokerPermission -v 4
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option PermSize=128m; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
update broker config success, x.x.x.x:10911

After setting the broker to read-only permission, observe the traffic changes on the node until the write traffic (InTPS) drops to 0, indicating that the write traffic has been removed.

bin/mqadmin clusterList -n x.x.x.x:9876
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option PermSize=128m; support was removed in 8.0
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
#Cluster Name #Broker Name #BID #Addr #Version #InTPS(LOAD) #OutTPS(LOAD) #PCWait(ms) #Hour #SPACE
ClusterA broker-a 0 x.x.x.x:10911 V4_7_0_SNAPSHOT 2492.95(0,0ms) 2269.27(1,0ms) 0 137.57 0.1861
ClusterA broker-a 1 x.x.x.x:10911 V4_7_0_SNAPSHOT 2485.45(0,0ms) 0.00(0,0ms) 0 125.26 0.3055
ClusterA broker-b 0 x.x.x.x:10911 V4_7_0_SNAPSHOT 26.47(0,0ms) 26.08(0,0ms) 0 137.24 0.1610
ClusterA broker-b 1 x.x.x.x:10915 V4_7_0_SNAPSHOT 20.47(0,0ms) 0.00(0,0ms) 0 125.22 0.3055
ClusterA broker-c 0 x.x.x.x:10911 V4_7_0_SNAPSHOT 2061.09(0,0ms) 1967.30(0,0ms) 0 125.28 0.2031
ClusterA broker-c 1 x.x.x.x:10911 V4_7_0_SNAPSHOT 2048.20(0,0ms) 0.00(0,0ms) 0 137.51 0.2789
ClusterA broker-d 0 x.x.x.x:10911 V4_7_0_SNAPSHOT 2017.40(0,0ms) 1788.32(0,0ms) 0 125.22 0.1261
ClusterA broker-d 1 x.x.x.x:10915 V4_7_0_SNAPSHOT 2026.50(0,0ms) 0.00(0,0ms) 0 137.61 0.2789

2. Removing Read Traffic

After the broker's write traffic is removed, its read (consumption) traffic will gradually fall off. You can watch the read traffic through the OutTPS column of the clusterList command. In addition, you can check the broker's backlog (Diff) with the brokerConsumeStats command; when the backlog reaches 0, all messages on the node have been consumed.
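
For example, the backlog output below can be produced with the command here; the broker and NameServer addresses are placeholders matching the earlier examples, and the exact flags may vary slightly across RocketMQ versions:

bin/mqadmin brokerConsumeStats -b x.x.x.x:10911 -n x.x.x.x:9876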

#Topic             #Group                #Broker Name    #QID  #Broker Offset   #Consumer Offset  #Diff     #LastTime
test_melon_topic   test_melon_consumer     broker-b        0     2171742           2171742          0       2020-08-13 23:38:09
test_melon_topic   test_melon_consumer     broker-b        1     2171756           2171756          0       2020-08-13 23:38:50
test_melon_topic   test_melon_consumer     broker-b        2     2171740           2171740          0       2020-08-13 23:42:58
test_melon_topic   test_melon_consumer     broker-b        3     2171759           2171759          0       2020-08-13 23:40:44
test_melon_topic   test_melon_consumer     broker-b        4     2171743           2171743          0       2020-08-13 23:32:48
test_melon_topic   test_melon_consumer     broker-b        5     2171740           2171740          0       2020-08-13 23:35:58

3. Node Offline

Typically, the node can be taken offline once its backlog reaches 0. However, to allow for messages being rewound and re-consumed, it is safer to take the node offline only after the message retention period has elapsed. For example, if messages are retained for 3 days, remove the node after 3 days.
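
Once the retention period has passed, the broker process can be stopped with the bundled shutdown script. A minimal sketch, assuming a default installation layout and that your mqadmin version provides the getBrokerConfig subcommand; the retention period itself is governed by the broker's fileReservedTime setting (in hours):

# check how long messages are retained (hours) on the broker being removed
bin/mqadmin getBrokerConfig -b x.x.x.x:10911 -n x.x.x.x:9876 | grep fileReservedTime
# stop the broker process on the node being removed
bin/mqshutdown broker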

Smooth Scaling #

Background #

The operating system of an online cluster needs to be upgraded from CentOS 6 to CentOS 7 (the specific symptoms and reasons are explained in the troubleshooting guide). The cluster is deployed with 4 masters and 4 slaves, as shown in the figure below: broker-a is a master node, and broker-a-s is the slave of broker-a.

(Figure: original 4-master, 4-slave deployment architecture)

The question to consider is how to achieve a smooth replacement. The guiding principle is to “scale up before scaling down”.

Cluster Scaling Up #

Apply for 8 machines with the same hardware configuration, running CentOS 7, and add them to the original cluster as 4 master-slave pairs. The cluster architecture then becomes 8 masters and 8 slaves, as shown below:

(Figure: cluster after scaling up to 8 masters and 8 slaves)

broker-a, broker-b, broker-c, broker-d and their slaves run CentOS 6; broker-a1, broker-b1, broker-c1, broker-d1 and their slaves run CentOS 7. Traffic now flows in and out of all 8 masters, completing the smooth scale-up of the cluster.
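
For reference, a minimal broker configuration sketch for one of the new masters; the file name, NameServer address, and port are assumptions carried over from the placeholders above, while the replication and flush settings match the original cluster:

# conf/broker-a1.properties -- new CentOS 7 master joining ClusterA
brokerClusterName=ClusterA
brokerName=broker-a1
brokerId=0                    # 0 marks the master; its slave uses brokerId=1 with brokerRole=SLAVE
namesrvAddr=x.x.x.x:9876
listenPort=10911
brokerRole=ASYNC_MASTER       # asynchronous replication, as in the original cluster
flushDiskType=ASYNC_FLUSH     # asynchronous flushing, as in the original cluster

The node is then started with bin/mqbroker -c conf/broker-a1.properties.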

Cluster Scaling Down #

Follow the steps in the "Graceful Node Removal" section above to remove the traffic of broker-a, broker-b, broker-c, broker-d and their slaves. For safety, take these nodes offline only after the message retention period (for example, 3 days) has passed. The remaining cluster consists of 4 masters and 4 slaves, all running CentOS 7, as shown in the figure below. This completes the smooth scale-down of the cluster.

(Figure: cluster after scaling down to 4 CentOS 7 masters and 4 slaves)

Notes #

During scale-up, the 8 new CentOS 7 nodes are named broker-a1, broker-b1, broker-c1, broker-d1 (and their slaves) rather than broker-e, broker-f, broker-g, broker-h. The reason lies in how clients allocate queues: by default, consumers use the average allocation algorithm, which hands out queues in sorted broker-name order. Assume the consumer group has four consumer instances; the two naming conventions below then behave very differently, as the sketch at the end of this section makes concrete.

First naming convention

After scaling up, the sorted order is as follows, with the newly added nodes broker-e through broker-h at the end of the original cluster:

broker-a, broker-b, broker-c, broker-d, broker-e, broker-f, broker-g, broker-h

(Figure: queue allocation under the first naming convention)

Note: when the traffic of broker-a, broker-b, broker-c, and broker-d is removed during scale-down, consumer-01 and consumer-02 are left holding only queues on the drained brokers and receive nothing, while the remaining half of the consumers must absorb all of the traffic and cannot bear the load.

Second naming convention

After scaling up, the sorted order becomes as follows, with each newly added master (for example, broker-a1) immediately following the corresponding original master (broker-a):

broker-a, broker-a1, broker-b, broker-b1, broker-c, broker-c1, broker-d, broker-d1

(Figure: queue allocation under the second naming convention)

Note: when the traffic of broker-a, broker-b, broker-c, and broker-d is removed during scale-down, every consumer still holds queues on one of the newly added brokers, so there is no traffic imbalance.
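
To make the difference concrete, here is a small self-contained Java sketch of the average allocation idea: brokers are sorted by name, their queues are laid out in that order, and each of the four consumers takes a contiguous slice. It only mirrors the behavior of the default strategy for this evenly divisible case; the broker names, queue count per broker, and consumer ids are illustrative assumptions.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AllocationDemo {

    // Lay out every queue of every broker in sorted-broker-name order,
    // then give each consumer a contiguous slice, as the default average
    // allocation strategy does when the queue count divides evenly.
    static void allocate(List<String> brokers, int queuesPerBroker, int consumers) {
        List<String> queues = new ArrayList<>();
        brokers.stream().sorted().forEach(b -> {
            for (int q = 0; q < queuesPerBroker; q++) {
                queues.add(b + "#q" + q);
            }
        });
        int avg = queues.size() / consumers; // queues per consumer
        for (int c = 0; c < consumers; c++) {
            System.out.println("consumer-0" + (c + 1) + " -> "
                    + queues.subList(c * avg, (c + 1) * avg));
        }
    }

    public static void main(String[] args) {
        // First naming convention: the new brokers sort to the end.
        System.out.println("First convention (broker-a .. broker-h):");
        allocate(Arrays.asList("broker-a", "broker-b", "broker-c", "broker-d",
                "broker-e", "broker-f", "broker-g", "broker-h"), 2, 4);

        // Second naming convention: each new broker sorts next to its original.
        System.out.println("Second convention (broker-a, broker-a1, ...):");
        allocate(Arrays.asList("broker-a", "broker-a1", "broker-b", "broker-b1",
                "broker-c", "broker-c1", "broker-d", "broker-d1"), 2, 4);
    }
}

Running the sketch shows that under the first convention consumer-01 and consumer-02 receive only queues on the brokers being drained, while under the second convention every consumer keeps queues on one of the new brokers.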