
36 How Should You Monitor Kafka #

Hello, I am Hu Xi. Today I want to share with you the topic of how to monitor Kafka.

Monitoring Kafka has always been a tricky problem. Whether on the WeChat public account I maintain or in the Kafka QQ group, the most frequently asked questions are about monitoring Kafka. The questions take many forms, but what people really want to know boils down to two things: what to monitor and how to monitor it. So today, let's discuss this matter in detail.

Personally, I believe that monitoring only the Brokers is like the proverbial doctor who treats the head when the head aches and the foot when the foot hurts: the view you get is inevitably one-sided. Broker processes are part of the Kafka application, but they are also ordinary Java processes and, further, operating system processes. Therefore, I think it is necessary to monitor along three dimensions: the Kafka hosts, the JVM, and the Kafka cluster itself.

Host Monitoring #

Host-level monitoring is often the first step in identifying online problems. Host monitoring refers to monitoring the performance of the machines on which the Kafka cluster brokers are located. Typically, a host runs various application processes that collectively use all the hardware resources on the host, such as CPU, memory, and disk.

Common host monitoring metrics include but are not limited to the following:

  • Machine Load
  • CPU Usage
  • Memory Usage, including free memory and used memory
  • Disk I/O Usage, including read and write usage
  • Network I/O Usage
  • TCP Connection Count
  • Open File Count
  • Inode Usage

Since our goal is not a systematic study of host performance tuning and monitoring, I won't explain each of these metrics in detail. I will focus on how to monitor machine load and CPU usage, using Linux as the example; other platforms are similar.

First, let’s look at an image. I ran the top command on the host where a Kafka cluster broker is running, and the output is shown in the following figure:

Sample Top Output

In the upper right corner of the image, you can see three load average values: 4.85, 2.76, and 1.26, which are the average load over the past 1 minute, 5 minutes, and 15 minutes, respectively. In this example, my host has 4 CPU cores, yet the 1-minute load is 4.85, which means some runnable processes are temporarily unable to get any CPU time. And since the 1-minute value is higher than the 5-minute and 15-minute values, the load on this host is trending upward.

Using this example, what I really want to talk about is CPU usage. Many people treat the value in the "%CPU" column of the top command as the CPU usage. For example, in the image above, the Java process with PID 2637 is the Broker process, and its "%CPU" value is 102.3. Don't assume this is the actual overall CPU usage: the value is the sum of the process's usage across all the CPUs it runs on, expressed as a percentage of a single CPU, which is why it can exceed 100 on a multi-core host. In this example, my host has 4 CPU cores and the process's total is 102.3, so its average usage per core is roughly 25%.
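To make the "load versus cores" comparison concrete, here is a minimal Java sketch, run on the Broker host, that reads the 1-minute load average from the platform OperatingSystemMXBean and compares it with the number of CPU cores. The class name and output format are mine for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class LoadCheck {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        int cores = Runtime.getRuntime().availableProcessors();
        // 1-minute system load average; returns a negative value if the platform does not support it
        double load = os.getSystemLoadAverage();
        System.out.printf("cores=%d, 1-min load=%.2f%n", cores, load);
        if (load > cores) {
            System.out.println("Load exceeds core count: some runnable tasks are waiting for CPU.");
        }
    }
}
```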

JVM Monitoring #

In addition to host monitoring, another important dimension of monitoring is JVM monitoring. Kafka Broker processes are normal Java processes, so all the monitoring methods related to JVM are applicable here.

The point of monitoring the JVM process is to truly understand your application (Know Your Application). For Kafka specifically, that means understanding the Broker process: how large is its heap, and how big are the young and old generations? Which garbage collector is it using? There are many such metrics and configuration parameters, and you usually don't need to track all of them, but you should at least know the frequency and duration of Minor GC and Full GC, the total size of live objects, and roughly how many application threads the JVM is running. These numbers will be important references when you tune the Kafka Broker later.

Let me give you a simple example. Suppose that after a Full GC, the live objects on the heap of a Broker process amount to 700MB. In practice, you can then safely set the old generation to 1.5 or 2 times that value, i.e. roughly 1.4GB. Don't underestimate that 700MB figure: it is an important basis for sizing the Broker's heap!

Many people ask: how should I set the heap size for the Broker? This is, in fact, the most reasonable way to estimate it. Just think: if only 700MB of objects survive a Full GC on your Broker, does it make sense to give it a 16GB heap? How long would a GC over a 16GB heap take?!

Therefore, let’s summarize. To monitor a JVM process, there are 3 metrics that you need to constantly pay attention to:

  1. Frequency and duration of Full GC. This metric helps you evaluate the impact of Full GC on the Broker process. Long pauses will cause various timeout exceptions on the Broker side.
  2. Size of live objects. This metric is an important basis for setting the heap size, and it can also help you fine-tune the sizes of the JVM's generations.
  3. Total number of application threads. This metric helps you understand the CPU usage of the Broker process.

In short, the more you understand the Broker process, the more effective your JVM tuning will be.
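All three of these numbers are also exposed programmatically through the JVM's platform MXBeans. The sketch below reads them for the JVM it runs in; the same beans can be queried remotely over a Broker's JMX port. The class name is mine for illustration.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class JvmQuickCheck {
    public static void main(String[] args) {
        // GC frequency and cumulative duration per collector
        // (with G1, the beans are named "G1 Young Generation" and "G1 Old Generation")
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: collections=%d, totalTimeMs=%d%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        // Heap usage; the "used" value right after a Full GC approximates the live object set
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.printf("heap used=%dMB, max=%dMB%n", heap.getUsed() >> 20, heap.getMax() >> 20);
        // Total number of live threads in this JVM
        System.out.println("threads=" + ManagementFactory.getThreadMXBean().getThreadCount());
    }
}
```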

Speaking of specific monitoring, the first two can be viewed through GC logs. For example, the following GC log indicates the size of live objects on the heap after GC:

2019-07-30T09:13:03.809+0800: 552.982: [GC cleanup 827M->645M(1024M), 0.0019078 secs]

By default, the Broker JVM uses the G1 collector. This line shows that after the cleanup phase, heap usage dropped from 827MB to 645MB, i.e. about 645MB of objects were still live. You can also work out the interval and frequency of GCs from the timestamps at the start of each line.

Since version 0.9.0.0, the community has made G1 the default collector, and in G1 a Full GC is executed by a single thread, which is very slow. Therefore, you must watch your Broker's GC logs, i.e. the files whose names start with kafkaServer-gc.log, and keep an eye out for the string "Full GC". Once you find that the Broker process is doing frequent Full GCs, you can enable G1's -XX:+PrintAdaptiveSizePolicy option so that the JVM tells you who triggered the Full GC.
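As a trivial illustration of that check, the sketch below counts occurrences of "Full GC" in a GC log file. The log path is a placeholder; point it at your Broker's actual log directory.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class FullGcScan {
    public static void main(String[] args) throws Exception {
        // Placeholder path: adjust to your broker's kafkaServer-gc.log location
        String gcLog = args.length > 0 ? args[0] : "/path/to/kafka/logs/kafkaServer-gc.log";
        try (Stream<String> lines = Files.lines(Paths.get(gcLog))) {
            long fullGcCount = lines.filter(line -> line.contains("Full GC")).count();
            System.out.println("Full GC occurrences in " + gcLog + ": " + fullGcCount);
        }
    }
}
```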

Cluster Monitoring #

After discussing host and JVM monitoring, now I will provide several methods for monitoring Kafka clusters.

1. Check whether the Broker process is up and whether its listening port is open.

Do not underestimate this point. In many containerized Kafka deployments, for example when a Broker is started with Docker, the container may start successfully while a network misconfiguration inside it leaves the process running without the listening port ever being bound. Therefore, you must check both to confirm that the service is actually available.
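A simple way to automate the port check is to attempt a TCP connection to the Broker's listener. Here is a minimal Java sketch; the host name is a placeholder, and 9092 is just Kafka's default listener port.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortCheck {
    public static void main(String[] args) {
        String host = "broker-host"; // placeholder: your broker's advertised host
        int port = 9092;             // default Kafka listener port; adjust if needed
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), 3000); // 3-second timeout
            System.out.println(host + ":" + port + " is reachable.");
        } catch (IOException e) {
            System.out.println("Cannot connect to " + host + ":" + port + ": " + e.getMessage());
        }
    }
}
```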

2. View key logs on the broker side.

The key logs here mainly include the broker server log server.log, the controller log controller.log, and the topic partition state change log state-change.log. Among them, server.log is the most important, and it is best to keep an eye on it at all times. Many critical errors on the broker side will be displayed in this file. Therefore, if your Kafka cluster experiences a failure, you should check the corresponding server.log immediately to find and locate the cause of the failure.

3. Check the running status of key threads on the broker side.

Unexpected failures of these key threads often go unnoticed yet have a huge impact. For example, the Broker has a dedicated thread that performs Log Compaction. Due to bugs in the source code, this thread sometimes "dies" silently, and many JIRA issues have been filed about this in the community. When this thread dies, you receive no notification and the Kafka cluster keeps running as if nothing happened, but all compaction stops, and the internal offsets topic (__consumer_offsets), among others, takes up more and more disk space. Therefore, it is necessary to monitor the status of these key threads.

However, a Kafka Broker process will start several or even dozens of threads, and it is not possible to monitor each thread in real-time. So, let me share with you the two categories of threads that I consider to be the most important. In actual production environments, it is very necessary to monitor the running status of these two categories of threads.

  • Log Compaction threads, which are threads starting with kafka-log-cleaner-thread. As mentioned earlier, this thread is responsible for log compaction. Once it fails, all compaction operations will be interrupted, but users are usually not aware of this.
  • Replica fetcher threads, whose names usually start with ReplicaFetcherThread. These threads implement the logic of Follower replicas pulling messages from Leader replicas. If one of them fails, the corresponding Follower replica stops fetching from its Leader, and that Follower's lag keeps growing.

Whether you use the jstack command or other monitoring frameworks, I recommend that you always pay attention to the running status of these two categories of threads in the Broker process. Once you find that their statuses have changed, immediately check the corresponding Kafka logs to locate the issue, as this usually indicates a more serious error.
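Besides jstack, you can check for these threads over the Broker's JMX port, assuming the Broker was started with remote JMX enabled (e.g. via the JMX_PORT environment variable); the host and port below are placeholders. This sketch lists the Broker's live threads through the remote ThreadMXBean and looks for the two name prefixes mentioned above.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BrokerThreadCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder JMX address: the broker must expose JMX (e.g. JMX_PORT=9999)
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                    conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
            boolean cleanerAlive = false;
            boolean fetcherAlive = false;
            for (ThreadInfo info : threads.getThreadInfo(threads.getAllThreadIds())) {
                if (info == null) {
                    continue; // thread exited between the two calls
                }
                String name = info.getThreadName();
                if (name.startsWith("kafka-log-cleaner-thread")) cleanerAlive = true;
                if (name.startsWith("ReplicaFetcherThread")) fetcherAlive = true;
            }
            System.out.println("log cleaner thread alive: " + cleanerAlive);
            System.out.println("replica fetcher thread alive: " + fetcherAlive);
        } finally {
            connector.close();
        }
    }
}
```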

4. View key JMX metrics on the broker side.

Kafka provides a large number of JMX metrics for real-time monitoring. Let me introduce some important Broker-side JMX metrics:

  • BytesIn/BytesOut: the number of bytes the Broker receives and sends per second. Make sure these values stay well below your network bandwidth; if they approach it, the network card is likely saturated, which easily leads to packet loss.
  • NetworkProcessorAvgIdlePercent: The average idle percentage of the network thread pool threads. Generally, you should ensure that this JMX value is consistently above 30%. If it is lower than this value, it means that your network thread pool is very busy, and you need to reduce the load on the Broker by increasing the number of network threads or transferring the load to other servers.
  • RequestHandlerAvgIdlePercent: The average idle percentage of the I/O thread pool threads. Similarly, if this value is consistently lower than 30%, you need to adjust the number of I/O threads or reduce the load on the Broker.
  • UnderReplicatedPartitions: the number of under-replicated partitions, i.e. partitions for which not all follower replicas are in sync with the leader. When this value is non-zero, the affected partitions have lost redundancy and are at higher risk of data loss, so this is a very important JMX metric.
  • ISRShrink/ISRExpand: The frequency of shrinking and expanding the In-Sync Replicas (ISR). If the replicas frequently enter and exit the ISR in your environment, this group of values will be high. In this case, you need to diagnose the reason for the frequent changes in replica status and take appropriate measures.
  • ActiveControllerCount: The number of currently active controllers. Normally, the value of this JMX metric on the Broker where the controller resides should be 1, and it should be 0 on other Brokers. If you find that this value is 1 on multiple Brokers, you need to handle it promptly, mainly by checking network connectivity. This situation usually indicates a split-brain in the cluster. Split-brain is a very serious distributed failure, and Kafka currently relies on ZooKeeper to prevent split-brain. However, once a split-brain occurs, Kafka cannot guarantee normal operation.

In fact, there are many more JMX metrics on the Broker side. In addition to the above important metrics, you can also check other JMX metrics on the official website according to your business needs and integrate them into your monitoring framework.
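To show how these metrics are read in practice, here is a minimal Java sketch that connects to a Broker's JMX port and fetches two of the metrics above. The JMX address is a placeholder; the MBean names are the ones Kafka registers for UnderReplicatedPartitions and BytesInPerSec, but double-check them against your version's documentation.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BrokerJmxCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder JMX address: the broker must expose JMX (e.g. JMX_PORT=9999)
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // Should normally be 0; anything else deserves immediate attention
            Object underReplicated = conn.getAttribute(
                    new ObjectName("kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"),
                    "Value");
            // One-minute rate of inbound bytes across all topics
            Object bytesInRate = conn.getAttribute(
                    new ObjectName("kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec"),
                    "OneMinuteRate");
            System.out.println("UnderReplicatedPartitions = " + underReplicated);
            System.out.println("BytesInPerSec (1-min rate) = " + bytesInRate);
        } finally {
            connector.close();
        }
    }
}
```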

5. Monitor Kafka clients.

We also need to closely monitor the performance of client programs. Whether it is a producer or a consumer, the first thing to look at is the network round-trip time (RTT) between the client machine and the Broker machines. Put simply, ping the Broker host from the client machine and see what the RTT is.

I once served a client whose Kafka producer had very low TPS. When I logged into the machine and checked, I found that the RTT was 1 second. In this case, no matter how you tune Kafka parameters, the effect will not be very significant. Reducing network latency is the most direct and effective way.

In addition to RTT, there are also critical threads in the client program that you need to monitor. For producers, there is a thread starting with kafka-producer-network-thread that you need to monitor in real-time. It is responsible for sending the actual messages. If it crashes, the producer will not work properly, but your producer process will not automatically shut down, so you may not be aware of it. For consumers, the heartbeat thread related to rebalancing is also a thread that must be monitored. Its name starts with kafka-coordinator-heartbeat-thread.
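Since these client threads live inside your own application's JVM, a simple liveness check can be done in-process by scanning thread names. The helper below is a sketch; the class and method names are mine.

```java
public class ClientThreadCheck {

    // Call periodically from inside your producer application:
    // true while the producer's Sender I/O thread is still running.
    public static boolean producerNetworkThreadAlive() {
        return Thread.getAllStackTraces().keySet().stream()
                .anyMatch(t -> t.getName().startsWith("kafka-producer-network-thread"));
    }

    // Call periodically from inside your consumer application:
    // true while the consumer group heartbeat thread is still running.
    public static boolean consumerHeartbeatThreadAlive() {
        return Thread.getAllStackTraces().keySet().stream()
                .anyMatch(t -> t.getName().startsWith("kafka-coordinator-heartbeat-thread"));
    }
}
```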

In addition, there are some important JMX metrics for clients that can tell you their running status in real-time.

From the producer's perspective, the JMX metric to watch is request-latency, the latency of produce requests; it directly determines how high the producer's TPS can go. For consumers, records-lag and records-lead are the two important JMX metrics. We explained their meanings in Lesson 22 of this column, so I won't repeat that here; in short, they directly reflect the consumer's consumption progress. If you use consumer groups, there are two more JMX metrics to watch: join rate and sync rate. They indicate how often rebalances happen; if they are high in your environment, you need to look into why rebalancing is so frequent.
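If you prefer to read these numbers from inside the application rather than over JMX, the Java clients expose the same values through the metrics() method. Below is a sketch for the producer side; the bootstrap address is a placeholder, and the group/metric names ("producer-metrics", "request-latency-avg"/"request-latency-max") should be verified against your client version.

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerLatencyCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-host:9092"); // placeholder address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // metrics() mirrors what the client registers in JMX
            for (Map.Entry<MetricName, ? extends Metric> entry : producer.metrics().entrySet()) {
                MetricName name = entry.getKey();
                if ("producer-metrics".equals(name.group())
                        && name.name().startsWith("request-latency")) {
                    System.out.printf("%s = %s%n", name.name(), entry.getValue().metricValue());
                }
            }
        }
    }
}
```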

Summary #

Alright, let’s summarize. Today, I introduced various aspects of monitoring Kafka. Besides monitoring the Kafka cluster, I also recommend monitoring from the perspective of the host and JVM. Monitoring the host is often the first step in locating and identifying problems. JVM monitoring is equally important. It’s important to know that many performance issues encountered by Java processes cannot be resolved by adjusting Kafka parameters. Lastly, I listed some important Kafka JMX metrics. In the next lecture, I will specifically explain how to use various tools to view these JMX metrics.

Open Discussion #

Please share your insights on monitoring Kafka and your operational tips.

Feel free to write down your thoughts and answers, let’s discuss together. If you find it helpful, you are also welcome to share this article with your friends.