03 Is Kafka Just a Message Broker System

Hello, I am Hu Xi. Today, let’s talk about a well-known topic: Is Kafka just a messaging engine system?

To answer this question, we need to look at the development history of Apache Kafka. We sometimes assume that learning the past and present of a system or framework is unnecessary, and that diving straight into the concrete technology is faster and better. But whatever the technology, plunging directly into the details, or starting from one small point, will quickly leave you bored. Why? Because even if you quickly master some technical detail, you cannot build a global view of the system. You make progress on isolated points but cannot connect them into a line, let alone expand them into a whole, so your learning never becomes systematic.

I say this from my own experience, because this is exactly how I first learned Kafka. You may not believe it, but I started reading the Kafka source code from the utils package. Obviously, we don’t need the source code to know what a utility package does, and reading code this way is extremely inefficient. As I said, I was learning point by point, yet after working through everything I still had no real sense of Kafka, because I didn’t know what all the code in those packages accomplished when combined. That is why I call it a very inefficient way to learn.

Later, I changed my approach and began to understand Kafka from a top-down perspective. Surprisingly, I discovered many things I had overlooked before. What’s more, I found that this way of learning kept me interested for much longer, without the periodic boredom. In particular, while tracing the entire development history of Apache Kafka, I happily picked up a great deal of knowledge and experience about running large open-source communities, which I count as a bonus beyond the technology itself.

Looking at Kafka’s development trajectory, it did start as a messaging engine. But, as the title asks: is Apache Kafka really just a messaging engine? Typically, an article would first discuss what a messaging engine is and what it can do before answering. Let’s skip that and go straight to the answer: Apache Kafka is a messaging engine system, and it is also a distributed streaming platform. If you remember only one sentence from this entire article, I hope it is this one. To repeat: Kafka is a messaging engine system and a distributed streaming platform.

As we all know, Kafka is a project incubated within LinkedIn. According to my conversations with the Kafka founding team members and the publicly available information I found, LinkedIn initially had a strong demand for real-time data processing, and many of its internal subsystems needed to perform various types of data processing and analysis, including business system and application performance monitoring, as well as user behavior data processing, and so on.

The main problems they encountered at the time were:

  • Insufficient data correctness. Data collection relied mainly on polling, so choosing the polling interval was largely empirical guesswork. Heuristic algorithms could help estimate an interval, but a poorly chosen value inevitably produced significant bias in the data.
  • Heavy customization and high maintenance cost. Each business subsystem had to interface with the data collection module separately, which introduced a great deal of customization overhead and manual effort.

In order to solve these problems, LinkedIn engineers attempted to use ActiveMQ, but the results were not satisfactory. Obviously, there was a need for a “unified” system to replace the existing working methods, and this system was Kafka.

From its birth, Kafka appeared in the public eye as a messaging engine system. If you look at the official documentation before version 0.10.0.0, you will find that the Kafka community clearly positioned it as a distributed, partitioned, replicated commit log service.

Here, let me digress for a moment. You may be curious about the origin of the name Kafka. One of the Kafka authors, Jay Kreps, has explained the choice: because Kafka was a system optimized for writing, naming it after a writer seemed like a good idea; he had taken many literature classes in college, really liked Franz Kafka, and thought the name sounded cool for an open-source project.

Back to the point. Kafka was initially designed to provide three key features:

  • Provide a set of APIs to implement producers and consumers.
  • Reduce network transmission and disk storage costs.
  • Implement a highly scalable architecture.
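
To make the first feature concrete, here is a minimal sketch of the commit-log abstraction that underlies Kafka's producer and consumer APIs. This is a toy in-memory model, not Kafka's actual API: the names `PartitionLog`, `append`, and `read` are my own, chosen only to illustrate the idea that producers append to an ordered log while each consumer tracks its own read offset independently.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class PartitionLog:
    """A single partition modeled as an append-only list of records."""
    records: list = field(default_factory=list)

    def append(self, value: Any) -> int:
        """Producer side: append a record and return its offset."""
        self.records.append(value)
        return len(self.records) - 1

    def read(self, offset: int, max_records: int = 10) -> list:
        """Consumer side: read sequentially starting from a given offset."""
        return self.records[offset:offset + max_records]

# A producer appends; a consumer reads and advances its own offset.
log = PartitionLog()
for msg in ("pageview", "click", "purchase"):
    log.append(msg)

consumer_offset = 0
batch = log.read(consumer_offset)
consumer_offset += len(batch)
print(batch)            # ['pageview', 'click', 'purchase']
print(consumer_offset)  # 3
```

Because consumption is just a cursor over an immutable log, many consumers can read the same data at different positions without coordinating with the producer, which is the core of the decoupling Kafka provides.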

In later installments of this column, we will gradually explore how Kafka achieves these three goals. In short, as Kafka matured, Jay and his colleagues realized that open-sourcing it to benefit more people was a great idea. So in 2011, Kafka officially entered the Apache Incubator, and in October of the following year it graduated as an Apache top-level project.

After being open sourced, Kafka was increasingly adopted by more companies in their internal data pipelines, especially in the field of big data engineering. Kafka played an important role in connecting and processing data streams between upstream and downstream systems. This usage pattern was so common that it led the Kafka community to think: instead of passing data from one system to another for processing, why not implement a stream processing framework of our own? Based on this consideration, the Kafka community officially introduced the stream processing component Kafka Streams in version 0.10.0.0. It was from this version onwards that Kafka “transformed” into a distributed stream processing platform, not just a messaging engine system. Today, Apache Kafka is a real-time stream processing platform on par with Apache Storm, Apache Spark, and Apache Flink.

Admittedly, the understanding of Kafka as a stream processing platform is not yet widespread in China, and its core stream processing component, Kafka Streams, is rarely used by major companies. However, we are delighted to see that with the vigorous promotion by experts at Kafka summits, there are now numerous examples of building stream processing platforms using Kafka. More and more companies are also interested in and willing to use Kafka Streams. Therefore, personally, I am very optimistic about the future of Kafka as a stream processing platform.

You may have the question: as a stream processing platform, what advantages does Kafka have compared to other mainstream big data stream processing frameworks? I can think of two points.

The first point is that Kafka makes it easier to achieve end-to-end correctness. Tyler Akidau, a well-known expert at Google, once said that for stream processing to eventually replace its “sibling,” batch processing, it needs two core capabilities: the ability to guarantee correctness and tools for reasoning about time. Correctness has always been a strength of batch processing, and the cornerstone of matching it is providing exactly-once processing semantics: each message is processed only once and affects the system state only once. Mainstream big data stream processing frameworks all claim to provide exactly-once semantics, but the claim is conditional: it holds only within the framework itself and cannot be extended end to end once an external messaging system is involved.

Why is this the case? Because when these frameworks work with an external messaging system, they cannot control that system’s processing semantics. Suppose you build a pipeline in which Spark or Flink reads messages from Kafka, performs stateful computation, and writes the results back to Kafka. You can only guarantee that, inside Spark or Flink, each message affects the state exactly once; the computed result may still be written to Kafka more than once, because these frameworks cannot control Kafka’s side of the semantics. Kafka is different: when all data movement and computation happen inside Kafka itself, it can achieve end-to-end exactly-once processing semantics.
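To see why exactly-once is hard, consider what happens when a consumer crashes after updating its state but before committing its offset: the message is redelivered. The toy sketch below (my own illustration, not Kafka's implementation) shows how naive processing double-counts under redelivery, and how recording the last applied offset together with the state, conceptually what Kafka does when state changes and offset commits succeed or fail atomically, makes the update idempotent.

```python
def apply_naive(state: dict, offset: int, amount: int) -> None:
    # At-least-once: a redelivered message is counted again.
    state["total"] = state.get("total", 0) + amount

def apply_idempotent(state: dict, offset: int, amount: int) -> None:
    # Skip any offset we have already applied, so redelivery is a no-op.
    if offset <= state.get("last_offset", -1):
        return
    state["total"] = state.get("total", 0) + amount
    state["last_offset"] = offset

# Offset 1 is delivered twice, simulating a crash before the offset commit.
deliveries = [(0, 5), (1, 7), (1, 7)]

naive, exact = {}, {}
for off, amt in deliveries:
    apply_naive(naive, off, amt)
    apply_idempotent(exact, off, amt)

print(naive["total"])  # 19: the redelivered message was counted twice
print(exact["total"])  # 12: each offset affects the state exactly once
```

The key design point is that the deduplication check and the state update live in the same place; once the state and the progress marker are stored in separate systems, you are back to the cross-system coordination problem described above.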

The second point that might help Kafka stand out is its positioning in stream processing. The official documentation states clearly that Kafka Streams is a client library for building real-time stream processing applications, not a complete, self-contained system. This means you cannot expect Kafka to provide out-of-the-box features such as cluster scheduling and elastic deployment; you need to choose suitable tools or systems to provide these operational capabilities for your Kafka stream processing applications.

You may be wondering: how is that an advantage? Frankly, this is a “double-edged sword” by design, and a deliberate choice by the Kafka community to sidestep direct competition with the other stream processing frameworks. Large companies’ stream processing platforms must be deployed at scale, so cluster scheduling and elastic deployment are indispensable. But there are also many small and medium-sized enterprises whose stream processing data volumes are modest and whose logic is not complex; deploying a heavyweight, full-featured platform for them would be overkill. This is exactly where Kafka’s stream processing component shines. From this perspective, Kafka should have a place in the future of stream processing frameworks.

Apart from being a messaging engine and a stream processing platform, does Kafka have any other uses? Of course! Can you imagine that Kafka can be used as a distributed storage system? One of Kafka’s authors, Jay Kreps, has written an article specifically explaining why Kafka can be used as a distributed storage system. But I think it’s enough for you to have a basic understanding of it. I have never seen anyone use Kafka as a persistent storage system in a real production environment.

After all this, I want to emphasize just one point: Apache Kafka started as an excellent messaging engine system and gradually evolved into a distributed stream processing platform. You should not only master its features and usage as a messaging engine, but ideally also understand the design and application cases of its stream processing component.

Open Discussion #

What do you think Kafka will evolve into next? If you were the “helmsman” of the Kafka community, in what direction would you steer it? (Hint: imagine yourself as Linus and then think it over.)

Feel free to write down your thoughts and answers, and let’s discuss together. If you find it valuable, please also consider sharing this article with your friends.