21 RocketMQ Cluster Alerts #

Introduction #

Alarm inspection checks cluster health, topic usage, and consumer group resources against configured thresholds, and sends alarm messages to administrators or resource applicants when a threshold is reached. Monitoring is the foundation of alarming: the inspections rely on the data collected in the previous two articles.

The importance of alarms cannot be overstated, because the RocketMQ cluster often carries the company's core business flow. If the cluster becomes unavailable, the whole company's business is usually affected, and the resulting incident is typically rated at the company's highest severity level.

This article provides guidance and suggestions on the design, process, and practical application of alarms. In practice, use this as a starting point to further enhance and implement customized alarms for your own company.

Design of Alarm Items #

The following figure lists the important alarm items and trigger conditions, including topics, consumption groups, and clusters.

Trigger Conditions #

Threshold: The value the indicator is compared against, for example, a consumption backlog of more than 100,000.
Time interval: The time window each check covers, for example, whether the consumption backlog exceeds 100,000 within 5 minutes.
Trigger count: The number of times the threshold must be met within the time interval before an alarm is sent, for example, the consumption backlog exceeds 100,000 three times within 5 minutes.
Alarm time range: The time of day during which alarm notifications are delivered, for example, between 9:00 and 22:00.
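The four trigger conditions can be modeled as one rule object. The following is a minimal sketch, not this article's implementation: the class `AlarmRule`, its fields, and the method `shouldAlarm` are hypothetical names, and it assumes the alarm time range does not wrap past midnight.

```java
import java.time.LocalTime;
import java.time.temporal.ChronoUnit;
import java.util.List;

/** Hypothetical model of one alarm rule; all names are illustrative. */
final class AlarmRule {
    enum Comparison { GREATER_THAN, LESS_THAN }

    private final Comparison comparison;
    private final double threshold;     // e.g. 100_000 backlogged messages
    private final int triggerCount;     // breaches required within the time interval
    private final LocalTime alarmStart; // start of the alarm time range
    private final LocalTime alarmEnd;   // end of the alarm time range (inclusive minute)

    AlarmRule(Comparison comparison, double threshold, int triggerCount,
              LocalTime alarmStart, LocalTime alarmEnd) {
        this.comparison = comparison;
        this.threshold = threshold;
        this.triggerCount = triggerCount;
        this.alarmStart = alarmStart;
        this.alarmEnd = alarmEnd;
    }

    /**
     * @param samples values collected during the rule's time interval
     * @param now     current time of day, used for the alarm time range check
     */
    boolean shouldAlarm(List<Double> samples, LocalTime now) {
        // Compare at minute precision so that 23:59:30 still falls inside 00:00-23:59.
        LocalTime minute = now.truncatedTo(ChronoUnit.MINUTES);
        if (minute.isBefore(alarmStart) || minute.isAfter(alarmEnd)) {
            return false; // outside the alarm time range: stay silent
        }
        long breaches = samples.stream().filter(this::breaches).count();
        return breaches >= triggerCount;
    }

    private boolean breaches(double value) {
        return comparison == Comparison.GREATER_THAN ? value > threshold : value < threshold;
    }
}
```

With a threshold of 100,000, a trigger count of 3, and a range of 9:00-22:00, the rule fires only if at least three samples in the interval exceed the threshold and the check runs inside that range.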

Topic Alarm #

Sending Speed: Send alarm messages when the sending speed meets the threshold set by trigger conditions.

For example: if the sending speed falls below the threshold of 10 at least once within 5 minutes, an alarm message is sent; notifications are active during 00:00-23:59.

Consumption Alarm #

Consumption Speed: Send alarm messages when the consumption speed meets the threshold set by trigger conditions.

For example: if the consumption speed falls below the threshold of 5,000 at least once within 5 minutes, an alarm message is sent; notifications are active during 00:00-23:59.

Consumption Backlog: Send alarm messages when the backlog of consumption meets the threshold set by trigger conditions.

For example: if the consumption backlog exceeds the threshold of 100,000 at least once within 5 minutes, an alarm message is sent; notifications are active during 00:00-23:59.

Cluster Alarm #

Cluster Node Count: Trigger an alarm when the number of cluster nodes meets the threshold set by trigger conditions.

For example: if the number of cluster nodes drops below the threshold of 4 at least once within 5 minutes, an alarm message is sent; notifications are active during 00:00-23:59.

Cluster Response Time: Trigger an alarm when the message-send response time of a cluster node meets the threshold set by trigger conditions.

For example: if the send response time of a node exceeds 1 second at least once within 5 minutes, an alarm message is sent; notifications are active during 00:00-23:59.

Cluster Write TPS: Trigger an alarm when the write TPS of the cluster meets the threshold set by trigger conditions.

For example: if the write TPS of the cluster exceeds 40,000 at least once within 5 minutes, an alarm message is sent; notifications are active during 00:00-23:59.

Cluster Node Availability: Trigger an alarm when the heartbeat detection result of the cluster node meets the threshold set by trigger conditions.

For example: if the heartbeat detection result of a node exceeds 0 (indicating failure) at least once within 5 minutes, an alarm message is sent; notifications are active during 00:00-23:59.

Cluster Write Change Rate: Trigger an alarm when the write TPS change rate of the cluster meets the threshold set by trigger conditions.

For example: if the write TPS change rate of the cluster exceeds 100% at least once within 5 minutes, an alarm message is sent; notifications are active during 00:00-23:59.
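The article does not define how the change rate is computed. One common reading, assumed here, is the relative difference between the current write TPS and the value from the previous check, expressed as a percentage. The class and method names below are hypothetical.

```java
/** Hypothetical helper; assumes the change rate compares two consecutive samples. */
final class WriteTpsChangeRate {
    private WriteTpsChangeRate() {}

    /** Signed change in percent: positive for growth, negative for a drop. */
    static double percentChange(double previousTps, double currentTps) {
        if (previousTps <= 0) {
            // No meaningful baseline; the caller decides how to treat a cold start.
            throw new IllegalArgumentException("previousTps must be positive");
        }
        return (currentTps - previousTps) / previousTps * 100.0;
    }

    /** True when the growth is strictly above the threshold, e.g. 100 (%). */
    static boolean exceeds(double previousTps, double currentTps, double thresholdPercent) {
        return percentChange(previousTps, currentTps) > thresholdPercent;
    }
}
```

Under this reading, a jump from 20,000 to 45,000 TPS is a 125% change and would trigger the example rule, while a jump from 20,000 to 40,000 is exactly 100% and would not. The result is signed, so a sudden drop never exceeds a positive threshold; a rule that should also catch drops would need a separate lower bound.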

Alarm Development Practice #

Alarm Process #

Scheduled task inspection: You can use the company's scheduling platform or write your own scheduling thread with ScheduledExecutorService. Different indicators can run as separate scheduled tasks with different frequencies. For example, cluster alarms can use second-level detection, while topic and consumer group alarms can use minute-level detection.

Retrieve monitoring data: The data is the monitoring data collected in the previous two articles, stored for example in the time-series database InfluxDB.

Send alarm messages: You can send them to the company's unified alarm system, or to DingTalk, email, SMS, and so on.
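The three steps above can be sketched with ScheduledExecutorService as follows. This is an illustration under assumptions, not the article's code: `AlarmInspector` is a hypothetical class, the metric source is abstracted as a `DoubleSupplier` (in practice it would query the monitoring store), and the alarm channel is abstracted as a `Consumer<String>` (in practice a DingTalk, email, or SMS sender).

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.DoubleSupplier;

/** Hypothetical inspector: schedule a check, read the metric, send an alarm. */
final class AlarmInspector {
    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);

    /** Runs one check. Returns true if an alarm message was sent. */
    static boolean inspect(String name, DoubleSupplier metric, double threshold,
                           Consumer<String> sender) {
        double value = metric.getAsDouble();       // step 2: retrieve monitoring data
        if (value > threshold) {
            sender.accept(name + " = " + value + " exceeds threshold " + threshold); // step 3
            return true;
        }
        return false;
    }

    /** Step 1: run the check periodically, with a period chosen per indicator. */
    void schedule(String name, DoubleSupplier metric, double threshold,
                  Consumer<String> sender, long period, TimeUnit unit) {
        scheduler.scheduleAtFixedRate(() -> {
            try {
                inspect(name, metric, threshold, sender);
            } catch (RuntimeException e) {
                // An exception escaping the task would suppress all of its later runs.
                sender.accept("inspection of " + name + " failed: " + e.getMessage());
            }
        }, 0, period, unit);
    }

    void shutdown() {
        scheduler.shutdownNow();
    }
}
```

A cluster-level check could then be registered with a period of a few seconds and a consumer backlog check with a period of one minute, each with its own `schedule` call. Catching exceptions inside the task matters because `scheduleAtFixedRate` stops rescheduling a task once one execution throws.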

Topic/Consumer Dynamic SQL #

We can generate different query statements by configuring different alarm rules on the interface and use the generated statements during scheduled tasks.
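As a rough illustration of the idea, a rule's settings might be mapped to an InfluxQL query string like this. The class `RuleQueryBuilder`, the tag name `consumerGroup`, the filter on `value`, and the time clause are all assumptions made for this sketch; only the measurement `consumer_monitor_info` and the tag `clusterName` appear in this article, and the query shape here is not necessarily the one the article's tool generates.

```java
import java.util.Locale;

/** Hypothetical builder that turns rule settings into an InfluxQL query string. */
final class RuleQueryBuilder {
    enum Op {
        GREATER_THAN(">"), LESS_THAN("<");

        final String symbol;

        Op(String symbol) {
            this.symbol = symbol;
        }
    }

    private RuleQueryBuilder() {}

    /** Counts the points in the last intervalMinutes whose value breaches the threshold. */
    static String backlogQuery(String cluster, String consumerGroup, Op op,
                               long threshold, int intervalMinutes) {
        return String.format(Locale.ROOT,
                "SELECT COUNT(\"value\") FROM \"consumer_monitor_info\""
                        + " WHERE \"clusterName\" = '%s' AND \"consumerGroup\" = '%s'"
                        + " AND \"value\" %s %d AND time > now() - %dm",
                escape(cluster), escape(consumerGroup), op.symbol, threshold, intervalMinutes);
    }

    /** Escapes backslashes and single quotes so user input cannot break out of the literal. */
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("'", "\\'");
    }
}
```

The scheduled task would run the generated query and compare the returned count with the rule's trigger count. Escaping the names matters because they come from user-entered rule settings.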

Selecting options for a topic or consumer group, as in the figure above, dynamically generates an SQL statement. For example, the following rule parameters: cluster name demo_cluster, consumer group name demo_consumer, type consumer, indicator backlog, comparison greater than, threshold 1,000,000, interval 5 minutes, count 1, alarm start time 00:00, and alarm end time 23:59, generate the statement below.

select Count(value) FROM "consumer_monitor_info" WHERE "clusterName" =