
11 How to Implement No Message Loss Configuration #

Hello, I’m Hu Xi. Today I want to share with you the topic of how to configure Kafka to avoid message loss.

Many people have their own understanding and solutions when it comes to message loss in Kafka. Before discussing specific countermeasures, I think it is important to first clarify what constitutes message loss in the Kafka world, or under what circumstances Kafka can guarantee that messages are not lost. This is crucial because the boundaries of responsibility are easy to blur: if we don't understand who is responsible for what, we naturally won't know who should provide a solution.

So under what circumstances can Kafka guarantee that messages are not lost?

In a nutshell, Kafka only provides limited persistence guarantees for “committed” messages.

There are two key elements in this sentence; let's look at them one by one.

The first key element is "committed messages". What are committed messages? When several brokers in Kafka successfully receive a message and write it to their log files, they inform the producer program that the message has been successfully committed. At this point, the message becomes a "committed" message in Kafka's view.

Why "several" brokers? It depends on how you define "committed". You can choose to consider a message committed as soon as one broker successfully saves it, or you can require all brokers to save it before considering it committed. Either way, Kafka's persistence guarantee applies only to committed messages.

The second key element is "limited persistence guarantee", which means that Kafka cannot guarantee that messages are never lost under all circumstances. Let's take an extreme example: if the earth no longer exists, can Kafka still save any messages? Obviously not! If you still want Kafka to preserve messages in that situation, you can only deploy Kafka Broker servers on another planet.

Now you should be able to understand the meaning of “limited” here. It means that Kafka’s guarantee of not losing messages has certain prerequisites. If your messages are saved on N Kafka Brokers, the prerequisite is that at least one of these N brokers is alive. As long as this condition is met, Kafka can guarantee that your message will never be lost.

To summarize, Kafka can prevent message loss, but only for committed messages and only under certain conditions. Of course, spelling this out is not about absolving Kafka of responsibility, but about clarifying the boundaries of responsibility when such issues arise.

“Lost Messages” Scenarios #

Okay, now that you understand how Kafka ensures message durability, let's take a look at some common "Kafka message loss" scenarios. Note the quotation marks around "message loss": sometimes we wrongly accuse Kafka of losing messages when it did nothing wrong.

Scenario 1: Producer Program Losing Data

One of the most complained-about scenarios of data loss is when a producer program loses messages. Let me describe a situation for you: you write a producer application to send messages to Kafka, and later on, you realize that Kafka did not save those messages. So you angrily exclaim, “Kafka is terrible! It can’t even handle message sending properly, and it doesn’t even inform me?!” If you have experienced this before, please calm down for a moment, and let’s analyze some possible reasons.

Currently, Kafka producers send messages asynchronously. This means that if you use the producer.send(msg) API, it usually returns immediately. However, you cannot assume that the message has been successfully sent at this point.

This sending method has an interesting name: “fire and forget”. It originates from the missile guidance field but has been borrowed into the computer field. Its meaning is to perform an operation and not worry about the result. Calling producer.send(msg) is a typical example of “fire and forget”. Therefore, if a message is lost, we will not know about it. This sending method seems unreliable, but some companies indeed use this API for message sending.
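
To make this concrete, here is a minimal sketch of a fire-and-forget call. The topic name and the producer setup are illustrative, not from the original text; assume a KafkaProducer<String, String> named producer has already been configured.

```java
// send() hands the record to a background I/O thread and returns a Future
// immediately; since nobody ever inspects that Future, a failed delivery
// goes completely unnoticed by the application.
producer.send(new ProducerRecord<>("my-topic", "some value"));
```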

What factors could cause messages sent this way to fail? There are many, in fact: network instability may prevent the messages from ever reaching the broker, or the messages themselves may be invalid and rejected by the broker (for example, a message larger than the broker's configured maximum size). From this perspective, it is a bit unfair to blame Kafka for these situations. As mentioned earlier, Kafka never considered such messages committed, so the notion of Kafka "losing" them does not apply.

However, even if Kafka is not to blame, we still need to address this issue. In reality, the solution is very simple: producers should always use the sending API with callback notifications, which means not using producer.send(msg) but producer.send(msg, callback) instead. Do not underestimate the importance of the callback, as it accurately tells you whether the message has been successfully committed. Once a message fails to be committed, you can handle it accordingly.

For example, if it is due to momentary errors, simply let the producer retry; if it is due to the message being unqualified, adjust the message format and resend it. In short, the responsibility for handling failed message sending lies with the producer, not the broker.
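
As a rough sketch of this pattern, the example below sends with a callback and branches on the failure type. The bootstrap address, topic, and class name are placeholders I made up for illustration; only the send(msg, callback) API itself comes from the text above.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.RetriableException;

public class SafeSend {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value"), (metadata, exception) -> {
                if (exception == null) {
                    // The broker(s) acknowledged the write: the message is now "committed".
                    System.out.printf("committed to %s-%d@%d%n",
                            metadata.topic(), metadata.partition(), metadata.offset());
                } else if (exception instanceof RetriableException) {
                    // Momentary error (e.g. a network blip): the producer's built-in
                    // retries normally cover this; you may also re-send explicitly.
                } else {
                    // Non-retriable error (e.g. the record is too large):
                    // fix the message format, then re-send.
                    exception.printStackTrace();
                }
            });
        } // close() flushes pending sends, so the callback has a chance to fire
    }
}
```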

You may ask: can a send failure never be caused by problems on the broker side? Of course it can! If all your brokers go offline, the producer will fail no matter how many times it retries, and in that case you need to address the broker-side issue quickly. However, the core argument still holds: Kafka does not consider the message committed, so it provides no persistence guarantee for it.

Scenario 2: Consumer Program Losing Data

Message loss on the consumer side mainly refers to messages that the consumer program was supposed to consume going missing. The consumer program has a concept called the "offset", which marks the consumer's current position in a topic partition (the consumer-offset diagram in the official Kafka documentation illustrates this clearly).

For example, for Consumer A, its current offset value is 9; Consumer B’s offset value is 11.

Here, “offset” is similar to a bookmark we use when reading a book. It marks how many pages we have read, so that we can directly jump to the bookmark page and continue reading when we flip the book next time.

Using a bookmark correctly involves two steps: first read the book, then update the bookmark. If the order is reversed, something like this can happen: the bookmark currently sits at page 90; I move it ahead to page 100 and then start reading, but I am interrupted and stop at page 95. Next time, I jump straight to the bookmarked page 100, so I never read pages 96 through 99; in other words, those messages are lost.

This is exactly how messages get lost on the Consumer side in Kafka. The remedy is simple: keep the order of consuming messages (reading) first and updating the offset (moving the bookmark) second. This goes as far as possible toward guaranteeing that messages are not lost.

Of course, this processing method may cause the problem of duplicate message processing, similar to reading the same page of a book many times, but this does not belong to the situation of message loss. In the content later in the column, I will share with you how to deal with the problem of duplicate consumption.
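
Here is a minimal sketch of this "consume first, commit later" loop, with automatic offset commits disabled. The broker address, group id, and topic name are placeholders of my own choosing.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ConsumeThenCommit {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("group.id", "demo-group");              // placeholder group
        props.put("enable.auto.commit", "false");         // we move the bookmark ourselves
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // "read the pages" first...
                }
                consumer.commitSync(); // ...then move the bookmark
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.println(record.value()); // stand-in for real business logic
    }
}
```

If processing crashes before commitSync(), the group simply re-reads the batch on restart, which is the duplicate-consumption trade-off just mentioned.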

In addition to the above-mentioned scenarios, there is actually a more hidden message loss scenario.

Let's stay with the reading analogy. Suppose you rent a 10-chapter e-book online. It can be read for only 1 day; after it expires, you can no longer open it. But if you finish reading within that day, your rental fee is refunded.

To speed things up, you hand each of the book's 10 chapters to one of 10 friends and ask them to read it and report the gist of their chapter back to you. When the e-book is about to expire, all 10 tell you they have finished their chapters, so you confidently return the book. Unexpectedly, as they recount the main ideas, you discover that one of them lied and never finished the chapter they were responsible for. Obviously, you now have no way of knowing that chapter's content.

For Kafka, this is analogous to a Consumer program that processes messages asynchronously with multiple threads while automatically committing offsets. If one thread fails and the message it was responsible for is never successfully processed, but the offset has already been advanced, then that message is effectively lost to the Consumer.

The key here is that the Consumer automatically commits the offset, which is similar to you returning the book without confirming that all the book’s contents have been fully read. You blindly update the offset without truly confirming whether the message has indeed been consumed.

The solution to this problem is also simple: if you process messages asynchronously with multiple threads, the Consumer program should not enable automatic offset commit; instead, the application should commit offsets manually. A reminder here: using multiple threads within a single Consumer to consume messages is easy to talk about but extremely difficult to code correctly, because getting the offset updates right is genuinely hard. In other words, it is simple to avoid losing unconsumed messages this way, but it is very easy to end up consuming some messages more than once.
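
One conservative way to combine worker threads with manual commits is sketched below: hand each polled batch to a thread pool, wait for every record to finish, and only then commit. This is my own illustrative approach, not the only one, and it trades throughput for simplicity; it assumes a consumer configured with enable.auto.commit = false, as in the previous sketch.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class BatchThenCommit {
    private static final ExecutorService POOL = Executors.newFixedThreadPool(10);

    static void runLoop(KafkaConsumer<String, String> consumer) throws Exception {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            List<Future<?>> inFlight = new ArrayList<>();
            for (ConsumerRecord<String, String> record : records) {
                inFlight.add(POOL.submit(() -> process(record)));
            }
            for (Future<?> f : inFlight) {
                f.get(); // rethrows if any worker failed, so we never commit past an unprocessed message
            }
            consumer.commitSync(); // only after the whole batch succeeded
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.println(record.value()); // stand-in for real business logic
    }
}
```

Finer-grained schemes that track and commit per-partition offsets can recover throughput, but they are exactly the hard-to-get-right code this paragraph warns about.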

Best practices #

Having seen these cases, let me share the Kafka configuration checklist for zero message loss. Each item corresponds to one of the problems discussed above, and a configuration sketch follows the list.

  1. Do not use producer.send(msg); instead, use producer.send(msg, callback). Remember to use the send method with a callback notification.

  2. Set acks = all. acks is a Producer parameter that represents your definition of "committed" messages. If set to all, the message must be received by all in-sync replicas before it is considered "committed". This is the strictest definition of "committed".

  3. Set retries to a large value. retries is also a parameter of the Producer and corresponds to the automatic retry mentioned earlier. When there is a momentary network fluctuation, message sending may fail. However, a Producer configured with retries > 0 can automatically retry message sending to avoid message loss.

  4. Set unclean.leader.election.enable = false. This is a Broker-side parameter that controls whether a replica that has fallen too far behind the Leader (i.e., one that is out of sync) may be elected as the new Leader. If such a lagging Broker becomes the new Leader, message loss is inevitable. Therefore, this parameter is generally set to false to disallow this situation.

  5. Set replication.factor >= 3. This is also a parameter on the Broker side. Essentially, it means it is best to keep multiple copies of messages. After all, the main mechanism to prevent message loss currently is redundancy.

  6. Set min.insync.replicas > 1. This is still a parameter on the Broker side and controls how many replicas the message must be written to in order to be considered “committed”. Setting it greater than 1 can improve message durability. In actual environments, never use the default value of 1.

  7. Ensure replication.factor > min.insync.replicas. If they are equal, as long as one replica goes down, the entire partition will not function properly. We not only want to improve message durability and prevent data loss, but also achieve this without compromising availability. It is recommended to set replication.factor = min.insync.replicas + 1.

  8. Ensure message consumption is completed before committing. The Consumer has a parameter enable.auto.commit, and it is best to set it to false and use manual offset commit. As mentioned earlier, this is crucial for scenarios with a single Consumer and multiple threads.
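
Pulled together, the client-side portion of this checklist looks roughly like the sketch below. The broker address and group id are placeholders; the Broker-side items (4 through 7) are set in server.properties or as topic-level configs, not in client code.

```java
import java.util.Properties;

public class NoLossConfig {

    // Producer side (items 1-3); item 1 is how you call send(), not a config.
    static Properties producerProps() {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092"); // placeholder address
        p.put("acks", "all");                         // item 2: strictest "committed" definition
        p.put("retries", Integer.MAX_VALUE);          // item 3: keep retrying transient failures
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        return p;
    }

    // Consumer side (item 8).
    static Properties consumerProps() {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092"); // placeholder address
        p.put("group.id", "demo-group");              // placeholder group
        p.put("enable.auto.commit", "false");         // item 8: commit manually after processing
        p.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        p.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        return p;
    }

    // Broker side (items 4-7), in server.properties or per-topic configs:
    //   unclean.leader.election.enable=false
    //   default.replication.factor=3   (replication.factor itself is set per topic at creation)
    //   min.insync.replicas=2          (keep it below the replication factor)
}
```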

Summary #

Today, we discussed all aspects of Kafka’s guarantee of no message loss. We started by explaining what message loss is and clarified the responsibility boundary of Kafka’s persistence guarantee. Then, we used this rule as a benchmark to measure some common scenarios of data loss. Finally, by analyzing these scenarios, I provided the “best practices” for achieving no message loss in Kafka. In summary, I hope you have two takeaways from today:

  • Understand the meaning and limitations of Kafka’s persistence guarantee.
  • Become proficient in configuring Kafka’s parameters to ensure no message loss.

Open Discussion #

In fact, Kafka has another particularly insidious message loss scenario: adding topic partitions. When new partitions are added, there is an "unfortunate" window in which the Producer learns of the new partitions before the Consumer does. If the Consumer is configured to start reading from the latest offset (auto.offset.reset = latest), then any messages the Producer sends to the new partitions during this window are "lost" to the Consumer, or rather, can never be read by it. Strictly speaking, this is a small flaw in Kafka's design. Do you have any solutions?

Feel free to write down your thoughts and answers. Let’s discuss together. If you find it helpful, you’re welcome to share the article with your friends.