06 Common Message Sending Errors and Resolution Solutions #

In this article, I will share some common problems with message sending, based on my experience using RocketMQ. Each case follows the same pattern: encounter the problem, analyze it, and solve it.

No route info of this topic #

The complete error stack information for “No route info of this topic” is as follows:

(image: complete exception stack trace for “No route info of this topic”)

Many readers may ask: why does this error still occur even when automatic topic creation is enabled on the broker side? To answer that, we need to understand RocketMQ's routing mechanism.

The routing process of RocketMQ is shown in the following diagram:

(image: RocketMQ topic route discovery process)

The key points are as follows:

  • If the broker enables automatic topic creation, it will create a default topic named TBW102 during startup and report it to the NameServer via heartbeat packets. Thus, the NameServer can return the routing information when queried.
  • When a message producer sends a message, it first checks the local cache. If the cache exists, the routing information is returned directly.
  • If the cache does not exist, the producer queries the NameServer for routing information. If the NameServer has the routing information for the topic, it returns it.
  • If the NameServer does not have the routing information for the topic and automatic topic creation is not enabled, the error “No route info of this topic” is thrown (a minimal reproduction follows this list).
  • If automatic topic creation is enabled, the producer queries the routing information from the NameServer using the default topic and uses the routing information of the default topic as its own routing information. In this case, the error “No route info of this topic” will not be thrown.
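
For illustration, the error is easy to reproduce with a minimal producer when the topic is unknown to the NameServer and automatic topic creation is disabled. This is a sketch, not production code; the topic name below is a placeholder:

DefaultMQProducer producer = new DefaultMQProducer("dw_test_producer_group");
producer.setNamesrvAddr("127.0.0.1:9876"); // must be the same NameServer the broker registers with
producer.start();
try {
    // With autoCreateTopicEnable=false and no existing route for the topic,
    // this throws MQClientException: No route info of this topic
    producer.send(new Message("not_existing_topic", "hello".getBytes()));
} finally {
    producer.shutdown();
}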

In general, the error “No route info of this topic” is more common when setting up RocketMQ or when first getting started with RocketMQ. The usual troubleshooting process is as follows.

  1. Query the routing information through RocketMQ-Console, or use the following command:
cd ${ROCKETMQ_HOME}/bin
sh ./mqadmin topicRoute -n 127.0.0.1:9876 -t dw_test_0003

The output is as follows:

(image: sample output of the topicRoute command)

  2. If you cannot query the routing information with the command above, check whether the broker has enabled automatic topic creation. The parameter is autoCreateTopicEnable, and its default value is true. However, enabling it is not recommended in production environments.

  3. If automatic topic creation is enabled but the error still occurs, check whether the NameServer address the client (producer) connects to is consistent with the NameServer address configured on the broker.

By following the steps above, the error can be resolved.

Message sending timeout #

When a message sending timeout occurs, the client’s log usually looks like this:

(image: client log for a message sending timeout)

When the client reports a message sending timeout, the first suspect is usually the RocketMQ server: is the broker's performance unstable, leaving it unable to handle the current load?

So how do we troubleshoot whether RocketMQ currently has performance bottlenecks?

First, we execute the following command to view the distribution of time consumed for RocketMQ message writes:

cd ${USER.HOME}/logs/rocketmqlogs/
grep -n 'PAGECACHERT' store.log | more

The output is as follows:

(image: PAGECACHERT write-latency distribution in store.log)

RocketMQ prints the distribution of message write latencies every minute, which gives us a quick view of whether message writes have performance bottlenecks. The intervals are as follows:

  • [<=0ms]: number of writes completed in under 1ms, i.e., at the microsecond level
  • [0~10ms]: number of writes that took between 0ms and 10ms
  • [10~50ms]: number of writes that took between 10ms and 50ms, and so on for the larger buckets

The vast majority of writes complete at the microsecond level. In my experience, if there are more than 20 writes in the 100~200ms bucket or above, the broker has a genuine bottleneck; if there are only a few such writes, they are likely caused by memory or PageCache jitter and are not a big problem.

In most cases, timeouts are not directly related to the broker's processing capacity. There is further supporting evidence: the RocketMQ broker has a fast-failure mechanism. When the broker receives a request from a client, it puts the request into a queue and processes it sequentially; if a request has waited in the queue for more than 200ms, fast failure is triggered and the broker returns [TIMEOUT_CLEAN_QUEUE]broker busy to the client. This will be detailed in Part 3 of this column.

When a network timeout occurs on the RocketMQ client, also consider garbage collection on the application side: GC pauses can cause message sending timeouts. I have encountered this during stress testing in a test environment, though never in production. Please keep it in mind.

RocketMQ network timeouts are usually related to network fluctuations. Since networking is not my specialty, I cannot offer direct evidence at the moment, but I can offer indirect evidence: in an application connected to both a Kafka cluster and a RocketMQ cluster, I observed that connections to both clusters experienced timeouts at the same time.

Nevertheless, when network timeouts occur, we still need to deal with them. Are there any solutions?

Our minimum expectation of a message middleware is high concurrency and low latency, and the write-latency distribution above shows that RocketMQ does meet this expectation: the vast majority of requests complete at the microsecond level. Therefore, the solution I propose is to reduce the message sending timeout, increase the number of retries, and increase the maximum waiting time before fast failure. The specific measures are as follows:

  1. Increase the fast-failure waiting time on the broker side to 1000ms. Add the following configuration to the broker's configuration file:
waitTimeMillsInSendQueue=1000

The main reason is that in the current version of RocketMQ, the error returned by fast failure is system_busy, which does not trigger retries. Increasing this value appropriately avoids triggering the mechanism as much as possible. For more details, refer to Part 3 of this column, where system_busy and broker_busy are discussed.

If the RocketMQ client version is below 4.3.0:

Set the message sending timeout to 500ms and the number of retries to 5, for up to 6 attempts in total (adjust as appropriate, but keep it above 3 if possible). The philosophy is to time out as quickly as possible and retry, because network fluctuations within a local area network are momentary and will have recovered by the next retry. Moreover, RocketMQ has a fault-tolerance mechanism that tries to select a different broker on retry. The relevant code is as follows:

DefaultMQProducer producer = new DefaultMQProducer("dw_test_producer_group");
producer.setNamesrvAddr("127.0.0.1:9876");
producer.setRetryTimesWhenSendFailed(5);      // retries for synchronous sends (up to 6 attempts in total)
producer.setRetryTimesWhenSendAsyncFailed(5); // retries for asynchronous sends
producer.start();
Message msg = new Message("dw_test_0003", "hello".getBytes());
producer.send(msg, 500); // per-send timeout of 500ms; before 4.3.0 it applies to each attempt
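
For asynchronous sends in these versions, the same 500ms timeout can be passed to the callback overload. A brief sketch, reusing the producer above:

producer.send(msg, new SendCallback() {
    @Override
    public void onSuccess(SendResult sendResult) {
        System.out.println("sent: " + sendResult.getMsgId());
    }
    @Override
    public void onException(Throwable e) {
        // reached only after retryTimesWhenSendAsyncFailed attempts are exhausted
        e.printStackTrace();
    }
}, 500); // 500ms timeout for the asynchronous path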

If the RocketMQ client version is 4.3.0 or above: from 4.3.0 on, the timeout set by the client covers the total time of all retries, so the per-attempt timeout can no longer be controlled directly through RocketMQ's send API. Instead, wrap the API and perform the retries externally. The example code is as follows:

public static SendResult send(DefaultMQProducer producer, Message msg, int retryCount) {
    Throwable cause = null;
    for (int i = 0; i < retryCount; i++) { // external retry loop
        try {
            return producer.send(msg, 500); // 500ms total budget per attempt (includes internal retries in 4.3.0+)
        } catch (Throwable e) {
            cause = e; // remember the last failure and try again
        }
    }
    throw new RuntimeException("Message sending exception", cause);
}
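
A hypothetical call site, assuming the producer has already been started as shown earlier:

SendResult result = send(producer, msg, 3); // up to 3 attempts, each with a 500ms budget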

System busy, Broker busy #

When using RocketMQ, System busy and Broker busy are common problems once the cluster load reaches the level of 10,000 transactions per second (TPS). For example, the following exception stack traces may be thrown:

(images: exception stack traces for System busy and Broker busy)

There are five error keywords in RocketMQ related to System busy and Broker busy:

[REJECTREQUEST]system busy
too many requests and system thread pool busy
[PC_SYNCHRONIZED]broker busy
[PCBUSY_CLEAN_QUEUE]broker busy
[TIMEOUT_CLEAN_QUEUE]broker busy

Analysis of the Principles #

Let’s start with a diagram that explains at what point in the full life cycle of message sending these errors are thrown.

(image: the points in the message sending life cycle where these errors are thrown)

According to the five types of error logs mentioned above, the reasons for triggering these errors can be summarized into three categories:

1. High PageCache pressure

The following three types of errors belong to this category:

[REJECTREQUEST]system busy
[PC_SYNCHRONIZED]broker busy
[PCBUSY_CLEAN_QUEUE]broker busy

Whether the PageCache is busy is judged by how long the lock is held while a message is written and appended to memory. By default, if the lock is held for more than 1s, PageCache pressure is considered high and the corresponding error is returned to the client.
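
A minimal sketch of this judgment (the names and the way the lock timestamp is tracked are illustrative, not the broker's exact source):

// beginLockTimestamp is recorded when the append lock is acquired and reset to 0 on release.
static boolean isPageCacheBusy(long beginLockTimestamp, long busyTimeoutMillis) {
    long heldFor = System.currentTimeMillis() - beginLockTimestamp;
    // Busy when the lock has been held longer than the threshold (1000ms by default)
    return beginLockTimestamp > 0 && heldFor > busyTimeoutMillis;
}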

2. Rejection strategy due to pressure on the sending thread pool

RocketMQ processes message sending with a thread pool that contains a single thread, backed by a bounded queue with a default capacity of 10,000. If the number of pending requests in the queue exceeds 10,000, the thread pool's rejection policy is executed and the error too many requests and system thread pool busy is thrown.
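
An equivalent construction in plain java.util.concurrent terms (illustrative; the broker wraps this in its own executor classes):

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

ThreadPoolExecutor sendExecutor = new ThreadPoolExecutor(
        1, 1, 60, TimeUnit.SECONDS,        // a single send thread
        new LinkedBlockingQueue<>(10_000), // bounded queue, default capacity 10,000
        (runnable, executor) -> {
            // Rejection policy: surface "too many requests and system thread pool busy" to the client
            throw new RejectedExecutionException("too many requests and system thread pool busy");
        });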

3. Broker-side fast failure

By default, the broker enables fast failure. Even if the broker is not yet PageCache-busy (i.e., the lock has not been held for more than 1s), any request that has waited in the send queue for more than 200ms is no longer queued: the broker immediately returns [TIMEOUT_CLEAN_QUEUE]broker busy to the client. Because the RocketMQ client currently does not retry on this error, additional handling is required when solving this type of problem.
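
A simplified sketch of that sweep under the 200ms rule (the types and names below are assumptions for illustration, not the broker's actual code):

// Each queued request remembers when it was enqueued (hypothetical wrapper type).
record QueuedRequest(long enqueueTimeMillis, Runnable task) {}

static void cleanExpiredRequests(java.util.Deque<QueuedRequest> sendQueue, long maxWaitMillis) {
    long now = System.currentTimeMillis();
    QueuedRequest head;
    // Evict from the head every request that has waited longer than maxWaitMillis (200ms by default)
    while ((head = sendQueue.peekFirst()) != null
            && now - head.enqueueTimeMillis() > maxWaitMillis) {
        sendQueue.pollFirst();
        // respond to this request with [TIMEOUT_CLEAN_QUEUE]broker busy instead of processing it
    }
}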

PageCache Busy Solutions #

Once a large number of PageCache busy conditions occur on the message server (locking for more than 1s when appending data to memory), it is a serious problem and requires manual intervention to resolve. The solution idea is as follows:

1. Enable transientStorePoolEnable

Enable the transientStorePoolEnable mechanism by adding the following configuration to the broker configuration file:

transientStorePoolEnable=true

The principle of transientStorePoolEnable is shown in the following diagram:

(image: the message write path with transientStorePoolEnable enabled)

The introduction of transientStorePoolEnable can alleviate the pressure on PageCache for the following reasons:

  • Messages are first written to off-heap memory. Because this memory is locked into physical RAM (it cannot be swapped out), writing to it is almost as fast as operating directly on memory, which guarantees performance.
  • After messages land in off-heap memory, a background thread commits them to PageCache in batches. Write operations to PageCache thus change from single writes to batch writes, reducing the pressure on PageCache.

transientStorePoolEnable does increase the possibility of message loss. If the broker JVM process exits abnormally, messages already committed to PageCache are not lost, but messages still sitting in off-heap memory (a DirectByteBuffer) that have not been committed to PageCache are lost. In most cases, however, the chance of the RocketMQ process exiting abnormally is small. If transientStorePoolEnable is enabled, the message sender typically needs a re-push (compensation) mechanism.
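
The two-stage write path can be pictured with the following sketch (the buffer size and commit policy are assumptions, not RocketMQ's actual values):

import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;

// Stage 1: incoming messages are appended to a pre-allocated, page-locked direct buffer.
ByteBuffer writeBuffer = ByteBuffer.allocateDirect(4 * 1024 * 1024); // assumed 4MB pool slice

void append(byte[] message) {
    writeBuffer.put(message); // near pure-memory speed: the buffer is locked, no page faults
}

// Stage 2: a background thread periodically flushes everything accumulated so far
// into the memory-mapped commit log in one batch write to PageCache.
void commitToPageCache(MappedByteBuffer commitLog) {
    ByteBuffer batch = writeBuffer.duplicate();
    batch.flip();            // [0, writePosition): all bytes appended since the last commit
    commitLog.put(batch);    // a single batched write into PageCache
    writeBuffer.clear();     // reuse the off-heap buffer (single-writer assumption)
}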

2. Scaling out

If PageCache-level busy conditions still occur after transientStorePoolEnable is enabled, the cluster needs to be scaled out or the topic in the cluster needs to be split. That is, part of the topic is migrated to another cluster to reduce the load on the cluster. Regarding RocketMQ graceful shutdown and scaling out solutions, I will have a dedicated topic in the Operations section of this column.

Friendly reminder: When Broker busy occurs due to PageCache busy, the RocketMQ client has a retry mechanism.

TIMEOUT_CLEAN_QUEUE Solution #

Because the client currently does not retry when TIMEOUT_CLEAN_QUEUE occurs, the current suggestion is to appropriately increase the timeout threshold by adding the following configuration to the broker configuration file, and to wrap sends with external retries on the client side, as shown earlier:

# The default value is 200, which means 200ms
waitTimeMillsInSendQueue=1000

Finally, note that I have published two other articles on Broker busy that you may also find helpful.

Conclusion #

This article mainly analyzes the common problems encountered in message sending in practice and proposes solutions to address them.