
22 RocketMQ Cluster Failure Recording #

Mysterious Disappearance of Cluster Nodes #

Phenomenon Description #

We received alerts and feedback from the operations team that a RocketMQ node had gone missing. This had never happened before, so there had to be a reason for the disappearance. We searched the cluster logs, including broker.log, stats.log, storeerror.log, store.log, and watermark.log, as well as the system message logs, but found no errors. Meanwhile, the cluster’s traffic inflow and outflow, CPU usage, CPU load, disk I/O, memory, and bandwidth showed no significant changes.

Cause Analysis #

Continuing to dig, we finally checked the operation history with the history command and found that an operations colleague had started the Broker not in the background, but directly in the current session:

sh bin/mqbroker -c conf/broker-a.conf

This command was the source of the problem: when the session ended, the foreground Broker process exited along with it.

Solution #

Standardize operational procedures and review every operation performed by the operations team; better still, automate the standardized operations.

Correct way to start the Broker:

nohup sh bin/mqbroker -c conf/broker-a.conf &

High CPU Usage on Master Node #

Phenomenon Description #

After a period of frequent fluctuation, the CPU usage of the RocketMQ master node stayed high and then dropped back, and business timeouts became severe. Because two slave nodes were deployed on the same machine, the slaves crashed outright as well.

[Figure: CPU spike on the master node]

[Figure: CPU spike on the slave node]

Note: the gap in the middle is where metrics were not collected because the node was disconnected.

System Error Log 1

2020-03-16T17:56:07.505715+08:00 VECS0xxxx kernel: <IRQ>  [<ffffffff81143c31>] ? __alloc_pages_nodemask+0x7e1/0x960
2020-03-16T17:56:07.505717+08:00 VECS0xxxx kernel: java: page allocation failure. order:0, mode:0x20
2020-03-16T17:56:07.505719+08:00 VECS0xxxx kernel: Pid: 12845, comm: java Not tainted 2.6.32-754.17.1.el6.x86_64 #1
2020-03-16T17:56:07.505721+08:00 VECS0xxxx kernel: Call Trace:
2020-03-16T17:56:07.505724+08:00 VECS0xxxx kernel: <IRQ>  [<ffffffff81143c31>] ? __alloc_pages_nodemask+0x7e1/0x960
2020-03-16T17:56:07.505726+08:00 VECS0xxxx kernel: [<ffffffff8148e700>] ? dev_queue_xmit+0xd0/0x360
2020-03-16T17:56:07.505729+08:00 VECS0xxxx kernel: [<ffffffff814cb3e2>] ? ip_finish_output+0x192/0x380

System Error Log 2

2020-03-27T10:35:28.769900+08:00 VECSxxxx kernel: INFO: task AliYunDunUpdate:29054 blocked for more than 120 seconds.
2020-03-27T10:35:28.769932+08:00 VECSxxxx kernel:      Not tainted 2.6.32-754.17.1.el6.x86_64 #1
2020-03-27T10:35:28.771650+08:00 VECS0xxxx kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2020-03-27T10:35:28.774631+08:00 VECS0xxxx kernel: AliYunDunUpda D ffffffff815592fb     0 29054      1 0x10000080
2020-03-27T10:35:28.777500+08:00 VECS0xxxx kernel: ffff8803ef75baa0 0000000000000082 ffff8803ef75ba68 ffff8803ef75ba64

Explanation: The system log shows “page allocation failure” and “blocked for more than 120 seconds” errors. These logs are located in /var/log/messages.

GC Log

2020-03-16T17:49:13.785+0800: 13484510.599: Total time for which application threads were stopped: 0.0072354 seconds, Stopping threads took: 0.0001536 seconds
2020-03-16T18:01:23.149+0800: 13485239.963: [GC pause (G1 Evacuation Pause) (young) 13485239.965: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 7738, predicted base time: 5.74 ms, remaining time: 194.26 ms, target pause time: 200.00 ms]
13485239.965: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 255 regions, survivors: 1 regions, predicted young region time: 0.52 ms]
13485239.965: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 255 regions, survivors: 1 regions, old: 0 regions, predicted pause time: 6.26 ms, target pause time: 200.00 ms]
, 0.0090963 secs]
   [Parallel Time: 2.3 ms, GC Workers: 23]
      [GC Worker Start (ms): Min: 13485239965.1, Avg: 13485239965.4, Max: 13485239965.7, Diff: 0.6]
      [Ext Root Scanning (ms): Min: 0.0, Avg: 0.3, Max: 0.6, Diff: 0.6, Sum: 8.0]
      [Update RS (ms): Min: 0.1, Avg: 0.3, Max: 0.6, Diff: 0.5, Sum: 7.8]
         [Processed Buffers: Min: 2, Avg: 5.7, Max: 11, Diff: 9, Sum: 131]
      [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.8]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.3]
      [Object Copy (ms): Min: 0.2, Avg: 0.5, Max: 0.7, Diff: 0.4, Sum: 11.7]
      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.3]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 23]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.2, Max: 0.3, Diff: 0.3, Sum: 3.6]
      [GC Worker Total (ms): Min: 1.0, Avg: 1.4, Max: 1.9, Diff: 0.8, Sum: 32.6]
      [GC Worker End (ms): Min: 13485239966.7, Avg: 13485239966.9, Max: 13485239967.0, Diff: 0.3]
   [Code Root Fixup: 0.0 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.9 ms]
   [Other: 5.9 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 1.9 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 1.0 ms]
      [Humongous Register: 0.0 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 0.2 ms]
   [Eden: 4080.0M(4080.0M)->0.0B(4080.0M) Survivors: 16.0M->16.0M Heap: 4176.5M(8192.0M)->96.5M(8192.0M)]
 [Times: user=0.05 sys=0.00, real=0.01 secs]

Explanation: The GC log is normal.

Broker Error Log

2020-03-16 17:55:15 ERROR BrokerControllerScheduledThread1 - SyncTopicConfig Exception, x.x.x.x:10911 
org.apache.rocketmq.remoting.exception.RemotingTimeoutException: wait response on the channel <x.x.x.x:10909> timeout, 3000(ms)
        at org.apache.rocketmq.remoting.netty.NettyRemotingAbstract.invokeSyncImpl(NettyRemotingAbstract.java:427) ~[rocketmq-remoting-4.5.2.jar:4.5.2]
        at org.apache.rocketmq.remoting.netty.NettyRemotingClient.invokeSync(NettyRemotingClient.java:375) ~[rocketmq-remoting-4.5.2.jar:4.5.2]

Explanation: The Broker error log shows a “SyncTopicConfig Exception”: the request timed out after waiting 3000 ms for a response on the channel x.x.x.x:10909.

Explanation: From the RocketMQ cluster logs and the GC logs, the only conclusion that can be drawn is that the network was unavailable, which caused the master-slave synchronization problem; no issue with the Broker itself was found.

Cause Analysis #

The system runs CentOS 6 with the 2.6 kernel. The investigation found no problem with the Broker or with GC itself, but the system message logs showed frequent “page allocation failure” and “blocked for more than 120 seconds” errors, so attention shifted to the operating system level. We tried tuning system parameters such as min_free_kbytes and zone_reclaim_mode, but the CPU jitter could not be eliminated. After consulting the community, we concluded that the 2.6 kernel has a bug in memory reclamation and decided to upgrade the cluster’s operating system.

Solution #

Upgrade the cluster from CentOS 6 to CentOS 7, taking the kernel from 2.6 to 3.10. After the upgrade, the CPU jitter no longer occurred. The upgrade was done by scaling out first and then scaling in: the CentOS 7 nodes were added to the cluster, and the CentOS 6 nodes were then removed. For details, see the “RocketMQ Cluster Smooth Operation” section in the earlier practice part.

Kernel version after the upgrade:

Linux version 3.10.0-1062.4.1.el7.x86_64 (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Fri Oct 18 17:15:30 UTC 2019

Frequent Timeouts During Cluster Jitter #

Phenomenon Description #

Monitoring and business colleagues reported frequent timeouts when sending messages. The specific phenomena are as follows.

Preheating phenomenon

[Figures: cluster RT monitoring and Broker logs during preheating]

Explanation: The figures above show the cluster’s RT monitoring with preheating enabled (warmMapedFileEnable=true) and the Broker logs during preheating.

Off-heap memory transfer phenomenon

CPU jitter

[Figures: CPU jitter and Broker logs during off-heap transfer]

Explanation: The figures above show the CPU jitter that occurs when messages are transferred via off-heap memory (transferMsgByHeap=false) and system memory allocation falls short, together with the corresponding Broker logs.

Cause Analysis #

Both of the above manifestations can cause CPU jitter in the cluster and frequent timeouts in client message sending, which affects business operations.

Preheating setting: when a file is preheated, 1 GB of zero bytes is written as a placeholder so that physical memory is preallocated and page faults are avoided when messages are written later. In practice, however, it often causes long disk writes and slight CPU jitter, and some business requests take noticeably longer to send, resulting in more timeouts. Disabling preheating makes no significant difference to cluster TPS, but from a stability point of view it needs to be disabled.
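
For intuition, here is a minimal sketch of the preheating idea (the class name, file path, and sizes are made up for illustration; it is not RocketMQ’s actual MappedFile warm-up code): every OS page of the 1 GB mapped file is touched once with a zero byte so physical memory is allocated up front rather than on the message-write path.

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class WarmUpSketch {
    private static final int OS_PAGE_SIZE = 4 * 1024;            // typical OS page size
    private static final long FILE_SIZE = 1024L * 1024 * 1024;   // 1 GB, like a CommitLog file

    public static void main(String[] args) throws IOException {
        Path path = Paths.get("/tmp/commitlog-demo");             // hypothetical file path
        try (FileChannel channel = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer mapped = channel.map(FileChannel.MapMode.READ_WRITE, 0, FILE_SIZE);
            // Write one zero byte per OS page so the kernel allocates physical pages now,
            // avoiding page faults later when messages are appended.
            for (long offset = 0; offset < FILE_SIZE; offset += OS_PAGE_SIZE) {
                mapped.put((int) offset, (byte) 0);
            }
            mapped.force();   // flushing the 1 GB placeholder is where the long disk writes come from
        }
    }
}

Touching an entire gigabyte page by page, plus the flush, is exactly the kind of burst of memory and disk work that shows up as the jitter described above.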

Off-heap memory: when transferMsgByHeap is set to false, message data is transferred through off-heap memory, which reduces data copying (zero copy) and is more efficient than transferring through the heap. However, it can run into insufficient off-heap memory allocation, triggering system memory reclamation and disk flushing. With the setting left at true, the Broker runs more stably.
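
The difference between the two transfer paths can be sketched roughly as follows (a simplification, not the Broker’s actual code; the method and parameter names are invented for illustration): the heap path copies the message bytes from the CommitLog into a heap buffer before writing them to the client socket, while the off-heap path asks the kernel to move the bytes directly.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public class TransferSketch {

    // Rough equivalent of transferMsgByHeap=true: copy the message bytes into a heap buffer
    // first, then write that buffer to the client socket (one extra copy, steadier memory use).
    static void transferByHeap(FileChannel commitLog, long pos, int size, SocketChannel client)
            throws IOException {
        ByteBuffer heapBuf = ByteBuffer.allocate(size);   // heap memory
        commitLog.read(heapBuf, pos);
        heapBuf.flip();
        while (heapBuf.hasRemaining()) {
            client.write(heapBuf);
        }
    }

    // Rough equivalent of transferMsgByHeap=false: let the kernel move the bytes from the file
    // to the socket (zero copy); faster, but depends on off-heap/page-cache memory being available.
    static void transferZeroCopy(FileChannel commitLog, long pos, int size, SocketChannel client)
            throws IOException {
        long sent = 0;
        while (sent < size) {
            sent += commitLog.transferTo(pos + sent, size - sent, client);
        }
    }
}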

Solution #

The default value of warmMapedFileEnable is false, so it can be kept as the default. If it is enabled, it can be turned off through hot updating.

bin/mqadmin updateBrokerConfig -b x.x.x.x:10911 -n x.x.x.x:9876 -k warmMapedFileEnable -v false

The default value of transferMsgByHeap is true (i.e., transmission through on-heap memory), so it can be kept as the default. If it is disabled, it can be turned on through hot updating.

bin/mqadmin updateBrokerConfig -b x.x.x.x:10911 -n x.x.x.x:9876 -k transferMsgByHeap -v true

Consumption Performance Halved When Using the bornHost Attribute #

Phenomenon Description #

All nodes are configured as 8C16G. RocketMQ runs 20 consumer threads, and the measured consumption performance is only around 15,000 TPS. tcpdump shows frequent DNS resolution on the consumer machine: queries about 10.x.x.185 are repeatedly sent to the DNS servers 100.x.x.136.domain and 10.x.x.138.domain. However, 10.x.x.185 is the IP address of the producer machine, and the producer and consumer tests are deployed on two different machines.

Question: Why does the IP address of the message producer appear on the consumer side at all? And why is this IP address being resolved over and over?

[Figure: tcpdump capture of the repeated DNS lookups]

Cause Analysis #

By dumping thread stacks, the following figure is obtained:

[Figure: consumer thread stack dump]

Code location: during consumption, getMessageId() is used to obtain host information from the MessageExt.bornHost property.

public class MessageExt extends Message {
    private static final long serialVersionUID = 5720810158625748049L;
    private int queueId;
    private int storeSize;
    private long queueOffset;
    private int sysFlag;
    private long bornTimestamp;
    private SocketAddress bornHost;   // address of the producer that sent the message
    private long storeTimestamp;
    private SocketAddress storeHost;  // address of the broker that stored the message
    private String msgId;
    private long commitLogOffset;
    private int bodyCRC;
    private int reconsumeTimes;
    private long preparedTransactionOffset;
}

When getBornHostNameString is called to obtain the host name, it performs a reverse DNS lookup based on the IP address:

// Inside MessageExt#getBornHostNameString(): getHostName() triggers a reverse DNS query
InetSocketAddress inetSocketAddress = (InetSocketAddress) this.bornHost;
return inetSocketAddress.getAddress().getHostName();
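
The cost is easy to reproduce outside RocketMQ with a small stand-alone test (the IP and port below are made-up examples): InetSocketAddress#getHostName issues a reverse DNS query, while InetSocketAddress#getHostString just returns the address literal.

import java.net.InetSocketAddress;

public class ReverseDnsDemo {
    public static void main(String[] args) {
        // A socket address built from a raw IP, much like bornHost looks on the consumer side.
        InetSocketAddress bornHost = new InetSocketAddress("10.1.2.185", 10911);  // hypothetical IP

        long start = System.nanoTime();
        String name = bornHost.getAddress().getHostName();   // triggers a reverse DNS lookup
        System.out.printf("getHostName()   = %s (%d ms)%n", name, (System.nanoTime() - start) / 1_000_000);

        start = System.nanoTime();
        String literal = bornHost.getHostString();           // no DNS traffic at all
        System.out.printf("getHostString() = %s (%d ms)%n", literal, (System.nanoTime() - start) / 1_000_000);
    }
}

If the DNS servers cannot answer the reverse query quickly, each such call blocks on the network, which is consistent with the halved consumption TPS observed here.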

Solution #

Simply do not resolve MessageExt.bornHost to a host name (getBornHostNameString) during consumption. After this call was removed, consumption performance on the same 8C16G machine rose to about 30,000 TPS, double the previous figure.
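
If the producer address is still needed on the consumer side, a DNS-free alternative can be sketched as follows (an assumption-laden sketch: it relies on MessageExt#getBornHost of the 4.5.2 client returning the SocketAddress shown above, and the helper name is invented):

import java.net.InetSocketAddress;
import org.apache.rocketmq.common.message.MessageExt;

public class BornHostWithoutDns {
    // Returns something like "10.x.x.x:10909" without ever querying the DNS servers.
    static String producerAddress(MessageExt msg) {
        InetSocketAddress addr = (InetSocketAddress) msg.getBornHost();
        return addr.getHostString() + ":" + addr.getPort();
    }
}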