04 Which Kafka Should I Choose

In the previous article, we discussed the positioning of Kafka. Kafka is no longer just a messaging engine system, but a real-time streaming platform that can achieve exactly-once processing semantics.

You may have heard of Apache Storm, Apache Spark Streaming, or Apache Flink, all well-known names in large-scale stream processing. It is exciting that Kafka, after years of continuous iteration, can now rival these frameworks to some extent. I say “to some extent” partly out of the Kafka community’s respect for these frameworks, and partly because, somewhat awkwardly, few large companies in China currently use Kafka for stream processing. After all, Kafka has only recently evolved from a messaging engine into a stream processing platform, and its stream processing capability still needs to prove itself over time.

If we widen our view from the stream processing platform to the stream processing ecosystem, Kafka still has a long way to go. As I mentioned earlier, the Kafka Streams component lets you process stream data in real time with Kafka. However, there is another important component I haven’t mentioned yet: Kafka Connect.

When evaluating a stream processing platform, the performance of the framework itself and the richness of the provided operators are important evaluation criteria. However, the ability of the framework to interact with upstream and downstream systems is also crucial. The more external systems that can transfer data to it, the more robust its ecosystem will be. As a result, more people will be willing to use it, forming a positive feedback loop and continuously promoting the development of the ecosystem. As for Kafka, Kafka Connect connects upstream and downstream external systems through specific connectors.

The entire Kafka ecosystem is shown in the following diagram. Note that the external systems in this diagram are only a subset of those supported by the Kafka Connect component. Encouragingly, more and more users are adopting Kafka Connect, and I believe more people will develop their own connectors in the future.

After discussing so much, you might wonder how this is related to today’s topic. In fact, having a clear understanding of Kafka’s development trajectory and the current state of its ecosystem is very beneficial for guiding us to choose the right Kafka version. Now let’s move on to today’s topic - how to choose a Kafka version.

How many types of Kafka do you know? #

Huh? Isn’t Kafka an open-source framework? What do you mean by “types of Kafka”? In reality, there are indeed several types of Kafka. I’m not referring to version numbers, but to the fact that multiple organizations and companies each release their own Kafka. You must have heard of Linux distributions such as CentOS, Red Hat, and Ubuntu. They are all Linux systems, so why do they have different names? Because they are different Linux distributions released by different organizations. Although the Kafka world has no formal concept of a distribution, you can loosely think of the different Kafka offerings on the market as different “distributions”.

Let me sort out these so-called “distributions” and explain how to choose among them. The term “distribution” is not strictly accurate for Kafka, but I will use it here, reluctantly, to tell these offerings apart. When you discuss this topic with others, it’s best to avoid the term, because it is unfamiliar in the Kafka ecosystem and may draw ridicule.

1. Apache Kafka

Apache Kafka is the most “authentic” Kafka and is probably the one you are most familiar with. After Kafka was open-sourced, it was incubated at the Apache Software Foundation and eventually graduated as a top-level project. It is also known as the community edition of Kafka, and it is the version this column uses for learning. More importantly, it is the foundation for all the other offerings mentioned below: each of them either ships Apache Kafka as-is or extends it with new features. Apache Kafka is the basis for studying and using Kafka.

2. Confluent Kafka

Let’s talk about Confluent first. In 2014, Kafka’s three founders, Jay Kreps, Neha Narkhede, and Jun Rao, left LinkedIn to start Confluent, a company focused on enterprise-level stream processing solutions based on Kafka. In January 2019, Confluent raised $125 million in Series D funding at a valuation of $2.5 billion, a sign of the capital market’s confidence.

As an aside, Jun Rao is Chinese and a distinguished graduate of Tsinghua University. More and more Chinese engineers appear among the founders of top-level Apache projects. Another example is Apache Pulsar, a next-generation messaging engine that aims to surpass Kafka. Countless Chinese contributors are also active in the open-source community, which is truly inspiring.

Returning to Confluent: the company mainly develops commercial tools around Kafka and releases Confluent Kafka on top of them. Confluent Kafka provides advanced features that Apache Kafka lacks, such as cross-datacenter replication, Schema Registry, and cluster monitoring tools.

3. Cloudera/Hortonworks Kafka

Cloudera’s CDH and Hortonworks’ HDP are well-known big data platforms. They integrate the mainstream big data frameworks and cover the full range of data processing, from distributed storage and cluster scheduling through stream processing to machine learning and real-time databases. I know that many startups choose one of these two products when building their data platforms. Both CDH and HDP integrate Apache Kafka, so I refer to the Kafka in these two products as CDH Kafka and HDP Kafka.

In October 2018, the two companies announced a merger to create a world-leading data platform. CDH and HDP may well merge into a single product in the future, but one thing is certain: Apache Kafka will still be included as part of the new data platform.

Feature Comparison #

Alright, now that we’ve covered these Kafka options on the market, let’s compare their advantages and disadvantages.

1. Apache Kafka

Apache Kafka still has the largest developer community and the fastest release cadence of all the Kafka options. In the Apache Software Foundation’s 2018 ranking of its top 5 developer mailing lists, the Kafka community’s list ranked second. If you run into a problem with Apache Kafka and report it to the community, you can expect a timely response. This is undoubtedly very friendly to ordinary Kafka users.

However, Apache Kafka’s disadvantage is that it provides only the most basic components. Take Kafka Connect, mentioned earlier: the community version ships with only one kind of connector, for reading and writing files on disk, and none for interacting with other external systems. In practice, you have to write code to implement such connectors yourself. Additionally, Apache Kafka does not provide any monitoring framework or tool, and running Kafka in production without monitoring is clearly not feasible, so you will inevitably rely on third-party monitoring. The good news is that some open-source monitoring frameworks, such as Kafka Manager, are available to help.
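To make the file connector concrete, here is a minimal sketch that renders a standalone-mode properties file for it. The connector class name `FileStreamSource` and the keys `name`, `tasks.max`, `file`, and `topic` follow the sample configuration bundled with Apache Kafka; the helper function and the file and topic names are hypothetical, chosen only for illustration. Depending on your Kafka version, the file connector may need to be added to the Connect plugin path before it can be loaded.

```python
def file_source_properties(name, path, topic, tasks_max=1):
    """Render a Kafka Connect standalone config for the bundled file source connector.

    Hypothetical helper: it only formats key=value lines and does not talk to Kafka.
    """
    settings = {
        "name": name,                           # unique connector name
        "connector.class": "FileStreamSource",  # file source connector shipped with Apache Kafka
        "tasks.max": str(tasks_max),            # upper bound on parallel tasks
        "file": path,                           # file whose lines are read
        "topic": topic,                         # topic the lines are written to
    }
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


if __name__ == "__main__":
    # Illustrative values only; adjust them to your environment.
    print(file_source_properties("local-file-source", "test.txt", "connect-test"), end="")
```

Saved to a file, the output could be passed to the `connect-standalone` script alongside a worker configuration. Any connector for another external system needs its own class and its own set of keys, which is exactly the part the community version leaves to you.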

In short, if you only need a messaging engine system or a simple stream processing application, and you need a greater degree of control over the system, then I recommend using Apache Kafka.

2. Confluent Kafka

Now let’s take a look at Confluent Kafka. It currently comes in two versions: a free version and an enterprise version. The free version is very similar to Apache Kafka but adds two major components on top of the regular ones: Schema Registry and REST Proxy. Schema Registry lets you centrally manage Kafka message formats so that data stays forward and backward compatible. REST Proxy exposes Kafka’s features through an open HTTP interface. Apache Kafka provides neither.
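As a rough illustration of what these two components look like from a client, the sketch below builds (but does not send) one HTTP request for each. It assumes a local deployment on the default ports, 8082 for REST Proxy and 8081 for Schema Registry, and uses the REST Proxy v2 JSON produce endpoint and the Schema Registry schema registration endpoint. The topic `orders`, the subject `orders-value`, and the `Order` schema are hypothetical names used only for this example.

```python
import json
import urllib.request

REST_PROXY_URL = "http://localhost:8082"       # assumption: local REST Proxy on its default port
SCHEMA_REGISTRY_URL = "http://localhost:8081"  # assumption: local Schema Registry on its default port


def build_produce_request(topic, values):
    """Build a REST Proxy (v2 API) request that produces JSON records to a topic."""
    body = json.dumps({"records": [{"value": value} for value in values]})
    return urllib.request.Request(
        url=f"{REST_PROXY_URL}/topics/{topic}",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
        method="POST",
    )


def build_register_schema_request(subject, avro_schema):
    """Build a Schema Registry request that registers a schema under a subject."""
    # The registry expects the schema itself as a JSON-encoded string.
    body = json.dumps({"schema": json.dumps(avro_schema)})
    return urllib.request.Request(
        url=f"{SCHEMA_REGISTRY_URL}/subjects/{subject}/versions",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical schema and record, for illustration only.
    order_schema = {
        "type": "record",
        "name": "Order",
        "fields": [{"name": "id", "type": "long"}],
    }
    for request in (
        build_register_schema_request("orders-value", order_schema),
        build_produce_request("orders", [{"id": 1}]),
    ):
        print(request.get_method(), request.full_url)
```

Passing either request to `urllib.request.urlopen` would actually send it, which requires the corresponding service to be running. The sketch stops at building the requests so that it runs without a cluster.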

In addition, the free version includes more connectors, all developed and certified by Confluent, and you can use them for free. The enterprise version offers even more features. In my opinion, the most useful ones are cross-datacenter replication and cluster monitoring. Synchronizing data between multiple datacenters and monitoring a cluster have always been pain points for Kafka, and the enterprise version of Confluent Kafka provides powerful solutions to help you “kill” them.

However, one major drawback of Confluent Kafka is that Confluent currently does not have plans for developing its business in China. As a result, there is a lack of related materials and technical support, and many Chinese Confluent Kafka users cannot even find corresponding Chinese documentation. Therefore, the popularity of Confluent Kafka in China is relatively low at the moment.

In short, if you need to use some advanced features of Kafka, then I recommend using Confluent Kafka.

3. CDH/HDP Kafka

Finally, let’s talk about the Kafka offered by big data cloud companies (CDH/HDP Kafka). These big data platforms integrate Apache Kafka natively and unify its installation, operation, management, and monitoring behind a convenient UI. If you are a user of these platforms, you will find this very convenient, because every operation can be done in the frontend UI without running complex Kafka commands. The monitoring interfaces these platforms provide are also very user-friendly, and you usually don’t need to configure anything to monitor Kafka effectively.

However, this convenience comes at a cost: it directly reduces your control over the Kafka cluster. If you know nothing about the underlying Kafka cluster, how can you be aware of its status? Another disadvantage is that these platforms lag behind the latest Kafka versions. Because they follow their own release cycles, there is no guarantee that they will pick up a new Kafka release in a timely manner. For example, when CDH 6.1.0 was released, Apache Kafka had already reached version 2.1.0, but the Kafka in CDH was still at version 2.0.0. The bugs fixed in Kafka 2.1.0 could therefore only reach CDH users in a later CDH release.

In short, if you need to quickly set up a messaging engine system, or if you need to build a multi-framework data platform with Kafka as just one component, then I recommend using Kafka provided by these big data cloud companies.

Summary #

In summary, today we discussed the different “distributions” of Kafka and their pros and cons. With these in mind, we can choose the Kafka that fits our specific needs. In the next issue, I will walk you through how Kafka has evolved from version to version, so that we have a basis for choosing Kafka features and a solid theoretical foundation before we start building applications with Kafka.

Finally, let’s review today’s content:

  • Apache Kafka, also known as the community version of Kafka. Its advantages are fast iteration, a responsive community, and a greater degree of control for the user; its drawback is that it provides only the basic core components and lacks some advanced features.
  • Confluent Kafka, the Kafka provided by Confluent. Its advantages are that it integrates many advanced features and is built by Kafka’s original team, which ensures quality; its drawbacks are that documentation is incomplete, adoption is relatively low, and few reference examples are available.
  • CDH/HDP Kafka, the Kafka provided by big data cloud companies, with Apache Kafka embedded. Its advantages are ease of operation and lower operational cost; its drawbacks are a lower degree of control and slower adoption of new Kafka versions.

Open Discussion #

Imagine you are an architect at a startup company. The company is planning to revamp its existing system and introduce Kafka as a messaging middleware to connect upstream and downstream businesses. As an architect, how would you choose the appropriate Kafka distribution?

Feel free to share your thoughts and questions, and let’s discuss together. If you find it helpful, please consider sharing this article with your friends.