
02 An Article to Quickly Get You Up to Speed with Kafka Jargon #

Hello, I’m Hu Xi. Today we officially embark on the journey of learning Apache Kafka.

In the world of Kafka, there are many concepts and terms that you need to understand and master up front; being familiar with them will make it much easier to dig into Kafka’s various functions and features later on. Below, I will walk you through the key Kafka terms one by one.

In the first issue of this column, I mentioned that Kafka belongs to a distributed messaging engine system, and its main function is to provide a complete set of solutions for message publishing and subscribing. In Kafka, the objects of publishing and subscribing are called topics. You can create dedicated topics for each business, each application, or even each type of data.

The client application that publishes messages to topics is called a producer. The producer program usually continuously sends messages to one or more topics, while the client application that subscribes to these topic messages is called a consumer. Similar to producers, consumers can also subscribe to messages from multiple topics. We collectively refer to producers and consumers as clients. You can run multiple instances of producers and consumers simultaneously, and these instances will continuously produce and consume messages to and from multiple topics in the Kafka cluster.
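
To make these roles concrete, here is a minimal sketch using Kafka’s Java client. The broker address localhost:9092 and the topic name test-topic are assumptions made up for illustration; substitute your own values.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class HelloProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed address of a local broker; adjust to your environment.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // A producer publishes messages to a topic; "test-topic" is hypothetical.
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("test-topic", "key-1", "hello, Kafka"));
        }
    }
}
```

The consumer side looks symmetrical; we will see a consumer sketch when we reach consumer groups below.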

With clients, there are naturally servers. The Kafka server is composed of service processes called brokers. A Kafka cluster is composed of multiple brokers. Brokers are responsible for receiving and processing requests sent by clients and persisting messages. Although multiple broker processes can run on the same machine, it is more common to run different brokers on different machines. This way, even if a machine in the cluster crashes and all the broker processes running on it go down, the brokers on other machines can still provide services externally. This is actually one of the means by which Kafka provides high availability.

Another means of achieving high availability is replication. The idea behind replication is simple: copy the same data to multiple machines. In Kafka, these copies of the same data are called replicas; in fact, virtually every distributed system calls them that. The number of replicas is configurable. These replicas store the same data but play different roles and serve different functions. Kafka defines two types of replicas: leader replicas and follower replicas. The former serve requests from the outside world, where “outside” means interaction with client programs; the latter merely follow the leader replica passively and do not interact with the outside world at all. Of course, you may know that in many other systems followers can serve external requests, such as the read operations handled by MySQL slaves; in Kafka, however, follower replicas never serve clients. By the way, here is an interesting aside: the term “Master-Slave” is no longer the favored way to describe this kind of relationship. After all, “Slave” carries the connotation of slavery, and in a country like the United States, where racial discrimination is strictly prohibited, the expression is somewhat politically incorrect, so most systems have switched to “Leader-Follower” instead.

The working mechanism of replicas is also simple: producers always write messages to the leader replica, and consumers always read messages from the leader replica. Follower replicas do only one thing: they send requests to the leader replica asking for the most recently produced messages, so that they can stay in sync with the leader.
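
You can see this division of labor for yourself. The sketch below uses the Java AdminClient to describe a topic and print, for each partition, which broker hosts the leader replica and where the full replica set (leader plus followers) lives; the broker address and topic name are again assumed.

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;

public class DescribeReplicas {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singleton("test-topic"))
                                         .all().get().get("test-topic");
            desc.partitions().forEach(p ->
                // leader() is the broker serving reads and writes;
                // replicas() lists every replica, followers included.
                System.out.printf("partition %d: leader=%s, replicas=%s%n",
                        p.partition(), p.leader(), p.replicas()));
        }
    }
}
```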

Although the replication mechanism guarantees that data is persisted and messages are not lost, it does not solve the scalability problem. Scalability is a crucial issue that every distributed system must address. What is scalability? Let’s take replicas as an example. Although we now have leader and follower replicas, what happens if the leader replica accumulates so much data that a single broker machine can no longer hold it? A natural idea is to split the data into pieces and store them across multiple brokers. If that is what you were thinking, then congratulations: that is exactly how Kafka is designed.

This mechanism is called partitioning. If you are familiar with other distributed systems, you may have heard of terms like sharding and partitioning, such as Sharding in MongoDB and Elasticsearch, and Region in HBase. In fact, they all follow the same principle, but partitioning is the most standard term. The partition mechanism in Kafka refers to dividing each topic into multiple partitions. Each partition is a group of ordered message logs. Each message produced by the producer is sent to only one partition. In other words, if a message is sent to a topic with two partitions, it will either go to partition 0 or partition 1. As you can see, Kafka partitions are numbered starting from 0. If a topic has 100 partitions, their partition numbers will range from 0 to 99.

At this point, you may wonder how the concept of replicas is related to partitions. In fact, replicas are defined at the partition level. Each partition can have multiple replicas, including one leader replica and N-1 follower replicas. Producers write messages to partitions, and the position of each message within the partition is represented by an offset. Partition offsets always start from 0. For example, if a producer writes 10 messages to an empty partition, the offsets of these 10 messages will be 0, 1, 2, …, 9.
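
A quick way to watch partitions and offsets in action is to inspect the RecordMetadata the producer receives for each acknowledged send. In this sketch (same assumed broker and topic as before), sending ten messages to an empty single-partition topic would print offsets 0 through 9:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class OffsetDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                // get() blocks until the broker acknowledges the write.
                RecordMetadata meta = producer
                        .send(new ProducerRecord<>("test-topic", "msg-" + i))
                        .get();
                // Each message lands in exactly one partition, at the next offset.
                System.out.printf("partition=%d offset=%d%n", meta.partition(), meta.offset());
            }
        }
    }
}
```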

Now let’s connect the three layers of Kafka’s messaging architecture:

  • The first layer is the topic layer, where each topic can be configured with M partitions, and each partition can be configured with N replicas (see the sketch after this list).
  • The second layer is the partition layer, where only one replica serves as the leader and provides services to clients, while the other N-1 replicas are follower replicas that provide data redundancy.
  • The third layer is the message layer, where each partition contains multiple messages, and each message is assigned a unique offset starting from 0.
  • Finally, client programs can only interact with the leader replica of a partition.
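
To tie the first two layers together: creating a topic is the moment you choose M and N. Here is a minimal sketch using the Java AdminClient; the topic name and the values 3 and 2 are purely illustrative, and note that the replication factor cannot exceed the number of brokers in the cluster.

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // M = 3 partitions, N = 2 replicas per partition (illustrative values).
            NewTopic topic = new NewTopic("test-topic", 3, (short) 2);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```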

Now let’s talk about how Kafka brokers persist data. Kafka uses logs to store data: a log is a physical file on disk to which messages can only be appended. By avoiding slow random I/O and relying on efficient sequential writes, Kafka achieves high throughput. However, if you keep writing messages to a log without ever deleting anything, you will eventually run out of disk space, so Kafka periodically deletes messages to reclaim disk. How does deletion work? Through the mechanism of log segments. In Kafka, each log is further divided into multiple log segments. Messages are appended to the newest log segment; when a segment fills up, Kafka automatically creates a new segment and closes the old one. Background tasks then periodically check whether old log segments can be deleted, thereby reclaiming disk space.
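
Retention and segment rolling are driven by configuration. As an illustration, the sketch below sets two topic-level configs, retention.ms and segment.bytes, at topic-creation time; the values chosen are arbitrary examples, and equivalent cluster-wide defaults exist as broker configs (log.retention.ms, log.segment.bytes).

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicWithRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("log-demo", 1, (short) 1)
                    // Keep messages for 7 days; roll a new segment every ~1 GB.
                    .configs(Map.of(
                            "retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000),
                            "segment.bytes", String.valueOf(1024L * 1024 * 1024)));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```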

Now let’s focus on consumers. In the first article of this column, I mentioned two messaging models: point-to-point (P2P) and publish-subscribe. In the P2P model, a given message can be consumed by only one downstream consumer; other consumers cannot access it. Kafka implements this P2P model with consumer groups. A consumer group is a set of consumer instances that together consume a set of topics: each partition of those topics is consumed by exactly one instance in the group, and no other instance in the group can consume it. Why introduce consumer groups? Mainly to increase consumer-side throughput: multiple instances consuming in parallel speed up the overall consumption rate. I will explain the consumer group mechanism in detail later; for now, you just need to understand what consumer groups are for. One more note: a consumer instance can be either a process running the consumer application or a thread; both count as consumer instances.
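
Here is the consumer counterpart to the earlier producer sketch. The group.id property is what places an instance into a consumer group: start two copies of this program with the same group.id against a multi-partition topic, and Kafka will split the partitions between them. The broker address, topic, and group name are again assumptions.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "demo-group");              // hypothetical group name
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("test-topic"));
            while (true) {
                // Each partition of the topic is read by exactly one instance in the group.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```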

All consumer instances within a consumer group not only “share” the data of the subscribed topics but also assist one another. Suppose one instance in the group crashes: Kafka automatically detects the failure and transfers the partitions that instance was responsible for to the other surviving consumers. This process is Kafka’s famous, indeed notorious, “rebalance”, because it is the source of many consumer-side problems; in fact, many of the bugs the community is currently busy resolving are rebalance-related.
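
If you want to observe rebalances as they happen, the Java client lets you register a callback at subscription time. This sketch merely logs partition movements; in practice, onPartitionsRevoked is often where applications commit offsets before ownership changes hands.

```java
import java.util.Collection;
import java.util.Collections;

import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RebalanceLogger {
    // Assumes 'consumer' was configured like the previous sketch.
    static void subscribeWithLogging(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(Collections.singletonList("test-topic"),
                new ConsumerRebalanceListener() {
                    @Override
                    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                        // Called before a rebalance takes these partitions away.
                        System.out.println("revoked: " + partitions);
                    }

                    @Override
                    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                        // Called after a rebalance hands these partitions to us.
                        System.out.println("assigned: " + partitions);
                    }
                });
    }
}
```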

Each consumer needs a field to record its current position when consuming messages. This field is called the consumer offset. Note that this is different from the offset mentioned above. The offset mentioned earlier represents the position of a message within a partition, and it is immutable. Once a message is successfully written to a partition, its offset value remains constant. However, the consumer offset is different. It can change at any time because it indicates the consumer’s progress. It is important to differentiate between these two types of offsets. I personally refer to the message’s position within a partition as the partition offset and the consumer-side offset as the consumer offset.
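
In the Java client, the consumer offset is precisely what gets saved when you commit. A common pattern, sketched below under the assumption that the consumer was created with enable.auto.commit=false, is to call commitSync() only after a batch has been fully processed:

```java
import java.time.Duration;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommit {
    // Assumes 'consumer' was created with enable.auto.commit=false.
    static void consumeLoop(KafkaConsumer<String, String> consumer) {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                process(record); // your business logic
            }
            // Persist the consumer offset only after the batch is fully processed,
            // so a crash re-reads messages instead of losing them.
            consumer.commitSync();
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        System.out.println(record.value());
    }
}
```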

Summary #

Let me summarize all the terms mentioned today:

  • Record: A message. Kafka is a messaging engine, and messages are the main objects it processes.
  • Topic: A logical container for messages, commonly used to separate different business scenarios.
  • Partition: An ordered, immutable sequence of messages. Each topic can consist of multiple partitions.
  • Offset: The position of a message within a partition; a monotonically increasing, immutable value.
  • Replica: In Kafka, the same message can be copied to multiple places to provide data redundancy; each of these places is called a replica. Replicas are divided into leader replicas and follower replicas, each with its own role. Replicas are configured at the partition level: each partition can have multiple replicas to achieve high availability.
  • Producer: An application that publishes new messages to topics.
  • Consumer: An application that subscribes to new messages from topics.
  • Consumer Offset: The consumer’s progress in consuming messages; each consumer has its own consumer offset.
  • Consumer Group: A group of multiple consumer instances that together consume multiple partitions to achieve high throughput.
  • Rebalance: The process of automatically reassigning a topic’s partitions among the consumer instances in a group when an instance fails. Rebalance is an important means by which Kafka consumers achieve high availability.

Finally, here is a diagram that pulls together all the concepts mentioned above. I hope it helps you understand them better:

Open Discussion #

Please consider why Kafka, unlike MySQL, does not allow follower replicas to serve read requests from clients.

Feel free to share your thoughts and answers so we can discuss together. If you find this article helpful, you’re also welcome to share it with your friends.