40 The Differences Between Kafka Streams and Other Streaming Platforms #

Hello, I’m Hu Xi. Today I want to share with you the topic of the differences between Kafka Streams and other streaming platforms.

In recent years, there have been many excellent frameworks emerging in the open-source streaming processing field. Just within the Apache Foundation’s incubated projects, there are more than a dozen big data frameworks related to stream processing, such as the early Apache Samza, Apache Storm, as well as the popular Spark and Flink in recent years.

It should be said that each framework has its own unique features and also its own shortcomings. Faced with this multitude of streaming processing frameworks, how should we choose? Today, I will summarize several mainstream streaming platforms and focus on analyzing the differences between Kafka Streams and other streaming platforms.

What is a streaming processing platform? #

First of all, it is necessary to understand the concept of a streaming processing platform. According to the book “Streaming Systems,” a streaming processing platform is defined as a data processing engine for handling unbounded datasets, with streaming processing being the opposite of batch processing.

Unbounded datasets refer to data that never ends. A streaming processing platform is a system or framework specifically designed to handle this type of dataset. This is not to say that batch processing systems cannot handle unbounded datasets, but they are typically more efficient at handling bounded datasets.

So, how do we differentiate between streaming processing and batch processing? The following diagram should help you understand the difference quickly and intuitively.

Now, let me explain in detail the difference between streaming processing and batch processing.

Traditionally, streaming processing was known for its low latency but less accurate results. It can calculate results with every incoming message, but since it primarily processes unbounded data, the processing may never end. As a result, it becomes difficult to accurately describe when the results in streaming processing are precise. In theory, the calculated results in streaming processing continuously approximate the accurate results.

On the other hand, batch processing, which is the opposite of streaming processing, provides accurate results but often comes with higher latency.

To leverage the strengths of both approaches, industry experts have combined them. They use streaming processing to quickly provide less accurate results and rely on batch processing to achieve data consistency. This is known as the Lambda Architecture.

Low latency is a great feature, but if the computed results are not accurate, streaming processing cannot fully replace batch processing. The concept of accurate results also has a special name in textbooks or literature - it’s called correctness. It can be said that the difficulty of achieving correctness is the biggest obstacle for streaming processing to replace batch processing. The foundation of achieving correctness lies in the concept of Exactly Once Semantics (EOS).

Here, “exactly once” refers to a class of consistency guarantees that a streaming processing platform can provide. There are three common types of consistency guarantees:

  • At most once semantics: The impact of messages or events on the application state occurs at most once.
  • At least once semantics: The impact of messages or events on the application state occurs at least once.
  • Exactly once semantics: The impact of messages or events on the application state occurs exactly once.

Note that I am referring to the impact on the application state. For operations with side effects, achieving exactly-once semantics is almost impossible. For example, suppose one step in your stream processing sends an email. Once sent, the email cannot be recalled, even if a later failure forces the entire processing flow to roll back. This is what is meant by a side effect. When your stream processing logic includes operators with side effects, exactly-once processing of those operators cannot be guaranteed, so we usually only ensure that their impact on the application state occurs exactly once. We will discuss later how Kafka Streams achieves EOS.
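To make the three consistency guarantees concrete, here is a minimal Python sketch (plain simulation, not Kafka API code) showing how redelivery under at-least-once affects application state, and how an idempotent state update restores an exactly-once *effect* on state even when a message arrives twice:

```python
# Simulation: the broker redelivered event "a" after a timeout, so the
# application sees it twice (at-least-once delivery).

def apply_at_least_once(events, state):
    # A counter that blindly adds every delivery double-counts retries.
    for event_id, amount in events:
        state["total"] = state.get("total", 0) + amount
    return state

def apply_idempotent(events, state):
    # Deduplicating by event id makes the state update idempotent:
    # a redelivery has no additional effect on the state.
    seen = state.setdefault("seen", set())
    for event_id, amount in events:
        if event_id not in seen:
            seen.add(event_id)
            state["total"] = state.get("total", 0) + amount
    return state

deliveries = [("a", 10), ("b", 5), ("a", 10)]  # "a" delivered twice

print(apply_at_least_once(deliveries, {})["total"])  # 25: "a" counted twice
print(apply_idempotent(deliveries, {})["total"])     # 15: "a" counted once
```

The event ids and amounts are of course made up; the point is only that exactly-once semantics is a statement about the *effect on state*, not about physical delivery counts.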

When we talk about streaming processing today, it includes both real-time streaming processing and micro-batch processing. Micro-batch processing refers to the repetitive execution of batch processing engines to handle unbounded datasets. A typical platform for implementing micro-batch processing is Spark Streaming.
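The micro-batch idea can be sketched in a few lines of Python. This is a conceptual illustration only: in a real system such as Spark Streaming, batches are cut by time interval (e.g. every few hundred milliseconds) rather than by record count, and each slice is handed to a full batch engine rather than a toy function.

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Repeatedly run a batch computation over small slices of an
    unbounded stream -- the essence of micro-batch processing."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        # Stand-in for handing the slice to an ordinary batch engine.
        yield sum(batch)

# A bounded stand-in for an unbounded stream, sliced into batches of 4.
print(list(micro_batches(range(10), 4)))  # [6, 22, 17]
```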

Features of Kafka Streams #

Compared to other stream processing platforms, the biggest characteristic of Kafka Streams is that it is not a platform, or at least not a full-fledged one: it lacks features such as the built-in scheduler and resource manager that other frameworks provide.

The Kafka website clearly defines Kafka Streams as a Java client library. “You can use this library to build distributed applications and microservices that are highly scalable, elastic, and fault-tolerant.”

Applications built using the Kafka Streams API are just regular Java applications. You can choose any familiar technology or framework to compile, package, deploy, and go live with them.

In my opinion, this is the biggest difference between Kafka Streams and Storm, Spark Streaming, or Flink.

The positioning of Kafka Streams as a Java client library can be seen as either a feature or a drawback. One important reason Kafka Streams has been slow to gain adoption in China is precisely this positioning. After all, many companies hope for a feature-complete platform that provides both stream processing APIs and the ability to manage and schedule cluster resources. Whether this positioning is a feature or a drawback therefore depends on your perspective.

Differences between Kafka Streams and other frameworks #

Next, I will analyze the differences between Kafka Streams and other frameworks in terms of application deployment, upstream and downstream data sources, coordination methods, and semantic guarantees.

Application Deployment #

First, let’s distinguish Kafka Streams from other frameworks in terms of application deployment.

As mentioned earlier, Kafka Streams applications need to be packaged and deployed by developers themselves. You can even embed Kafka Streams applications into other Java applications. Therefore, as a developer, you need to not only develop the code but also manage the lifecycle of the Kafka Streams application. You can either package it as a standalone JAR file or embed the stream processing logic into a microservice for other services to call.

However, regardless of the deployment method, you need to handle it yourself, and you can’t expect Kafka Streams to do these things for you.

In contrast, other stream processing platforms provide complete deployment solutions. Take Apache Flink as an example: a piece of stream processing logic is encapsulated into a Flink job. Spark has a similar concept of jobs, while Storm uses the term topology. The lifecycle of a job is managed by the framework itself; in Flink, the framework handles job deployment and updates, with no intervention needed from the application developer.

In addition, frameworks like Flink have a dedicated resource-manager role: the resources a job needs are provisioned at the framework level. Newer stream processing frameworks such as Spark and Flink support common resource managers like YARN, Kubernetes, and Mesos. Both also support a Standalone cluster mode, in which they rely on no external resource manager and manage resources entirely themselves. None of this is something Kafka Streams can provide.

Therefore, from the perspective of application deployment, Kafka Streams tends to leave the deployment to developers instead of relying on the framework to implement it.

Upstream and Downstream Data Sources #

After discussing the differences in deployment methods, let’s talk about the differences in connecting with upstream and downstream data sources. Simply put, Kafka Streams currently only supports reading from and writing to Kafka. Without the support of the Kafka Connect component, Kafka Streams can only read topic data from the Kafka cluster and can only write the results back to Kafka topics after completing the stream processing logic.

In contrast, frameworks like Spark Streaming and Flink integrate many rich upstream and downstream data source connectors, such as common connectors for MySQL, ElasticSearch, HBase, HDFS, Kafka, etc. When using these frameworks, you can easily integrate these external frameworks without any secondary development.

Of course, since developing a connector requires a good understanding of both the stream processing framework and the external system, connector quality varies. Before adopting one, it is worth checking the project's JIRA for any obvious "pitfalls."

I have learned this lesson the hard way. I once used a connector and found that it repeatedly consumed messages from Kafka and wrote them to the other system. After much digging, I discovered this was a known bug that had long been recorded in the project's JIRA. So I recommend browsing JIRA regularly; you may sidestep some pitfalls in advance.

In summary, currently, Kafka Streams only supports interaction with Kafka clusters and does not provide ready-to-use external data source connectors.

Coordination Methods #

In terms of distributed coordination, Kafka Streams applications rely on the coordination capabilities provided by the Kafka cluster to provide high fault tolerance and scalability.

Kafka Streams applications use the consumer group mechanism to scale stream processing arbitrarily. Each instance or node of the application is essentially an independent consumer within the same consumer group, and they do not affect one another. Coordination among them is handled by the coordinator component on a Kafka broker. When instances join or leave, the coordinator automatically detects the change and redistributes the load.

I have drawn a diagram to show the internal structure of each Kafka Streams instance. From this diagram, we can see that each instance is composed of a consumer instance, the specific stream processing logic, and a producer instance. The consumer instances across all application instances collectively form a consumer group.

Through this mechanism, Kafka Streams applications achieve high scalability and fault tolerance, and all of this is automatically provided without the need for manual implementation.
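The rebalancing behavior described above can be sketched in Python. This is a deliberately simplified simulation, not the coordinator's actual assignment algorithm (Kafka supports several assignor strategies); the instance names are made up. The point is that adding an instance merely changes the partition assignment, with no manual work in the application:

```python
def assign(partitions, instances):
    """Round-robin topic partitions over the live instances, roughly as
    the group coordinator would after a rebalance (simplified)."""
    mapping = {inst: [] for inst in instances}
    for i, p in enumerate(partitions):
        mapping[instances[i % len(instances)]].append(p)
    return mapping

partitions = list(range(6))  # a topic with 6 partitions

# Two running instances share the 6 partitions.
print(assign(partitions, ["app-1", "app-2"]))

# Scale out: a third instance joins and the coordinator rebalances.
print(assign(partitions, ["app-1", "app-2", "app-3"]))
```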

On the other hand, frameworks like Flink coordinate the fault tolerance and scalability through dedicated master nodes.

Flink supports high availability of master nodes through ZooKeeper to avoid a single point of failure: if a node fails, recovery is triggered automatically. This global coordination model works well for stream processing jobs, but it is less suitable for standalone stream processing applications. It is not as lightweight as Kafka Streams, and applications must implement specific APIs to participate in checkpointing and take part in the error recovery process themselves.

In different scenarios, Kafka Streams and heavyweight coordination models like Flink have their own advantages and disadvantages.

Message Semantics Guarantee #

We just mentioned EOS (Exactly Once Semantics). Currently, many stream processing frameworks claim to have implemented EOS, including Kafka Streams itself. However, some clarification is needed about what exactly-once processing semantics means here.

In reality, when using Spark, Flink, and Kafka together, if the idempotent producer and transactional producer introduced in Kafka 0.11.0.0 are not used, these frameworks cannot achieve end-to-end EOS.

Since these frameworks are independent of Kafka, no semantic guarantee mechanism spans the two. The situation is different once Kafka's transaction mechanism is used: by leveraging transactions, these frameworks can guarantee EOS along the entire path of reading messages from Kafka, computing, and writing the results back to Kafka. This is known as end-to-end exactly-once processing semantics.

The previously claimed EOS by Spark and Flink was implemented within their respective frameworks and cannot achieve the end-to-end EOS. Only by utilizing Kafka’s transaction mechanism can their corresponding connectors potentially support the end-to-end exactly once processing semantics.

The Spark official website explicitly states that to achieve EOS with Kafka, users must ensure that idempotent output and offset saving are in the same transaction. If you cannot implement this mechanism yourself, then you have to rely on the transaction mechanism provided by Kafka for guarantee.
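The "idempotent output and offset saving in the same transaction" approach can be sketched with a small Python example using SQLite as the downstream store. The table and column names here are illustrative, not from Spark or any connector; the point is that the result and the consumed offset commit atomically, so a crash never leaves one written without the other:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (k TEXT PRIMARY KEY, v INTEGER)")
conn.execute("CREATE TABLE offsets (part INTEGER PRIMARY KEY, last_off INTEGER)")

def process_batch(conn, part, records, last_off):
    # One transaction: output and offset commit together or not at all.
    # INSERT OR REPLACE makes re-running the same batch idempotent.
    with conn:
        for k, v in records:
            conn.execute("INSERT OR REPLACE INTO results VALUES (?, ?)", (k, v))
        conn.execute("INSERT OR REPLACE INTO offsets VALUES (?, ?)",
                     (part, last_off))

process_batch(conn, 0, [("a", 1), ("b", 2)], last_off=2)
# After a restart the app resumes from the offset stored with the output.
print(conn.execute("SELECT last_off FROM offsets WHERE part = 0").fetchone()[0])
```

Replaying the same batch after a crash simply overwrites identical rows and re-commits the same offset, which is what makes the output effectively exactly-once.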

Regarding Flink, although it also claimed to provide EOS before Kafka 0.11, there are some prerequisites, namely each message has one and only one impact on the Flink application state.

For example, if you use Flink to read messages from Kafka and directly write them to MySQL without any processing, this operation is stateless, and Flink cannot guarantee end-to-end EOS.

In other words, the Kafka messages written into MySQL by Flink may be duplicates. Of course, since version 1.4, the Flink community has officially implemented end-to-end EOS; its basic design is a two-phase commit mechanism built on the transactional producer introduced in Kafka 0.11.

The two-phase commit (2PC) mechanism is a distributed transaction mechanism used to achieve atomic submission of transactions across multiple nodes in a distributed system. The following figure from the amazing book “Designing Data-Intensive Applications” illustrates the process of a successful 2PC. In this figure, two databases participate in the distributed transaction submission process and make some changes individually. Now, they need to use 2PC to ensure that the changes in both databases are atomically submitted. As shown in the figure, the 2PC is divided into two phases: the Prepare phase and the Commit phase. Only when these two phases are fully executed can the distributed transaction be considered successfully submitted.

2PC

The 2PC in distributed systems is commonly used for internal implementation within databases or for heterogeneous systems using the XA transaction format. Kafka also drew from the idea of 2PC and implemented a transaction mechanism based on 2PC internally.
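The core of 2PC can be sketched in a few lines of Python. This is a conceptual simulation of the protocol's happy path and abort path only (real implementations must also handle coordinator crashes, logging, and timeouts); the class and method names are illustrative:

```python
class Database:
    """A participant in the distributed transaction."""
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "idle"
    def prepare(self):
        # Phase 1 vote: promise to commit if asked, or refuse.
        self.state = "prepared"
        return self.can_commit
    def commit(self):
        self.state = "committed"
    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    # Phase 1 (prepare): every participant must vote "yes".
    votes = [p.prepare() for p in participants]
    if all(votes):
        # Phase 2 (commit): the decision is now irreversible.
        for p in participants:
            p.commit()
        return "committed"
    # A single "no" vote aborts the transaction everywhere.
    for p in participants:
        p.abort()
    return "aborted"

print(two_phase_commit([Database(), Database()]))                  # committed
print(two_phase_commit([Database(), Database(can_commit=False)]))  # aborted
```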

However, the situation is different for Kafka Streams. It naturally supports end-to-end EOS because it is closely integrated with Kafka.

The following figure shows the execution logic of a typical Kafka Streams application.

Kafka Streams Application Execution

Usually, a Kafka Streams application needs to perform five steps:

  1. Read the latest processed message offset.
  2. Read the message data.
  3. Execute the processing logic.
  4. Write the processing result back to Kafka.
  5. Save the position information.

The execution of these five steps must be atomic, otherwise, it is impossible to achieve exactly once processing semantics.
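To see why atomicity matters, here is a minimal Python simulation (not Kafka Streams code) of the five steps above, with an injected crash between step 4 (write result) and step 5 (save position). On restart, the record whose offset was never saved is read and written again, producing a duplicate:

```python
def run(source, sink, offsets, crash_before_offset_save=False):
    pos = offsets["pos"]                      # 1. read last committed offset
    while pos < len(source):
        record = source[pos]                  # 2. read the message
        result = record.upper()               # 3. execute processing logic
        sink.append(result)                   # 4. write the result
        if crash_before_offset_save:
            return                            # crash! position never saved
        offsets["pos"] = pos = pos + 1        # 5. save the position

source, sink, offsets = ["a", "b"], [], {"pos": 0}
run(source, sink, offsets, crash_before_offset_save=True)
run(source, sink, offsets)                    # restart after the crash
print(sink)  # ['A', 'A', 'B'] -- 'a' was processed twice
```

Making steps 4 and 5 commit in a single transaction, as Kafka's transaction mechanism does, is exactly what removes this failure window.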

In terms of design, Kafka Streams extensively uses Kafka’s transaction mechanism and idempotent producer to achieve atomic writes for multiple partitions. Furthermore, since Kafka Streams only reads and writes to Kafka, it easily achieves end-to-end EOS.

In summary, although Flink has also provided EOS with Kafka since version 1.4, considering adaptability, it can be said that Kafka Streams has the best compatibility with Kafka.

Summary #

Alright, let’s summarize. Today, I focused on the differences between Kafka Streams and other stream processing frameworks or platforms. In general, Kafka Streams is a lightweight client library, while other stream processing platforms are fully-featured stream processing solutions. This is the unique feature of Kafka Streams but it may also be considered a limitation. However, I believe that in many cases we don’t need a heavyweight stream processing solution, and using a lightweight library API to help us achieve real-time computations is very convenient. I think this may be the future breakthrough for Kafka Streams.

In the following content of this column, I will provide detailed instructions on how to use the Kafka Streams API to implement real-time computations, and share with you a practical case study. Hopefully, these will inspire your interest in Kafka Streams and lay the foundation for your future exploration.

Open Discussion #

There is a “soul-searching” question on Zhihu about Kafka Streams: Why are there not many users of Kafka Streams? I recommend you to take a look and share your understanding and answer to this question.

Feel free to write down your thoughts and answers, let’s discuss together. If you find it helpful, you can also share the article with your friends.