Special Release Four: Detailed Explanations of 20 Classic Kafka Interview Questions #

Hello, I'm Hu Xi. In this Special Release, I would like to share some common Kafka interview questions with you.

Whether as an interviewer or an interviewee, I have come across many Kafka interview questions. Some focus on basic concepts, some emphasize solutions for real-world scenarios, some fall into the category of "show-off" questions, and others can be considered thought-provoking "soul-searching" questions. "Show-off" questions tend to probe obscure corners of Kafka's components, while "soul-searching" questions mostly require deep thinking about Kafka's design principles and demand a high level of technical expertise.

The approach to tackling each type of question is quite different. Today, I will explain 20 interview questions in detail according to these four categories. However, I don't intend to just hand you the answers; I will also explain what each question is really testing. In addition, I will share some interview tips, hoping to help you land your desired offer more smoothly.

So, without further ado, let’s get started.

Basic Questions #

1. What is Apache Kafka? #

This is a very common question, which seems boring but actually tests multiple knowledge points.

First of all, it tests whether you have an accurate understanding of Kafka. As Apache Kafka has evolved over time, it has gradually transformed from a distributed commit log system into a real-time stream processing framework. Therefore, it is best to answer this question as follows: Apache Kafka is a distributed stream processing framework used for building real-time streaming applications; its best-known core capability is the message engine, and it is widely used as an enterprise-level messaging system.

In fact, there is a little trick hidden here: Kafka's positioning as a real-time stream processing framework is not yet widely recognized in China. Therefore, when answering this question, stating its stream processing positioning up front can leave a very professional impression on the interviewer.

2. What is a consumer group? #

To some extent, this can be considered a giveaway question. The consumer group is a concept unique to Kafka, and if the interviewer asks about it, it shows that they have a certain understanding of Kafka. Let me first give you the standard answer: regarding its definition, the introduction on the official website is concise, namely that a consumer group is the scalable and fault-tolerant consumer mechanism provided by Kafka. Remember, be sure to add that sentence at the beginning to show that you are familiar with the official website.

In addition, it is best to explain the principle of consumer groups: In Kafka, a consumer group is a group composed of multiple consumer instances. Multiple instances subscribe to several topics together to achieve concurrent consumption. Each instance under the same group is configured with the same group ID and is assigned different subscription partitions. When one instance fails, other instances will automatically take over the partitions it was responsible for consuming.
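
To make this concrete, here is a minimal sketch of one consumer instance joining a group, assuming a recent Java client; the broker address, group ID, and topic name are placeholders I made up for illustration. Running several copies of this program with the same group.id makes Kafka split the topic's partitions among them.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupMemberDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("group.id", "demo-group");              // all instances share this group ID
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic")); // placeholder topic
            while (true) {
                // Each instance is assigned a disjoint subset of the topic's partitions;
                // if one instance dies, its partitions are rebalanced to the survivors.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```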

At this point, I can give you another small strategy: the consumer group question can, to some extent, help you steer the direction of the interview.

  • If you are good at the offset principle, you may mention the offset commit mechanism of consumer groups.
  • If you are good at Kafka Brokers, you can point out the interaction between consumer groups and Brokers.
  • If your strength is the Producer, which has nothing at all to do with consumer groups, you can say: "The data that consumer groups consume is entirely produced by Producers, and I am quite familiar with Producers."

Using this strategy, the interviewer may be influenced by your wording and deviate from the knowledge path they originally intended to ask. Of course, if you are not good at anything, then continue to read the next question.

3. In Kafka, what is the role of ZooKeeper? #

This is a question that can help you stand out. When you encounter this question, silently laugh three times in your heart. First, let’s start with the standard answer: Currently, Kafka uses ZooKeeper to store cluster metadata, manage members, elect a controller, and perform other administrative tasks. However, after the completion of KIP-500, Kafka will no longer rely on ZooKeeper.

Remember to emphasize the word "currently" to show that you are familiar with the community's evolution plan. "Storing metadata" means that all topic partition data is kept in ZooKeeper as the authoritative copy, and other components must stay aligned with it. "Managing members" refers to the registration, deregistration, and attribute changes of Broker nodes, and so on. "Electing a controller" refers to the election of the cluster controller, and other administrative tasks include, but are not limited to, topic deletion and parameter configuration.
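
If the interviewer pushes for specifics, it helps to name a couple of the znodes involved. Below is a minimal sketch using the plain ZooKeeper Java client to list the registered brokers and read the current controller; the connection string is a placeholder, and /brokers/ids and /controller are znode paths Kafka uses while it still depends on ZooKeeper.

```java
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class ZkMetadataPeek {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string; the watcher is a no-op for brevity.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> { });
        try {
            // Broker membership: one ephemeral child znode per live broker.
            List<String> brokerIds = zk.getChildren("/brokers/ids", false);
            System.out.println("live brokers: " + brokerIds);

            // Controller election result: a small JSON blob naming the controller broker.
            byte[] controller = zk.getData("/controller", false, null);
            System.out.println("controller: " + new String(controller));
        } finally {
            zk.close();
        }
    }
}
```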

However, mentioning KIP-500 could be a double-edged sword. If you encounter a very experienced interviewer, they may further ask about what KIP-500 is. In short, the idea of KIP-500 is to replace ZooKeeper with a community-developed consensus algorithm based on Raft to achieve controller self-election. You may be concerned about how to respond if they continue to ask. Don’t worry, in the next “special issue,” I will discuss this matter specifically.

4. Explain the role of offset in Kafka #

This is also a common interview question. The concept of offset itself is not complicated, and you can answer it like this: In Kafka, each message under a topic partition is assigned a unique ID value called an offset, which identifies its position within the partition. Once a message is written to the partition log, its offset value cannot be modified.

After answering this, you can redirect the interview to the topic you want. Here are three common methods:

  • If you are familiar with the underlying logic of Broker log writing, you can emphasize the format of message storage in the log.
  • If you understand that offset values cannot be modified once determined, you can emphasize the fact that “even the Log Cleaner component cannot affect offset values.”
  • If you are familiar with the consumer side, you can further explain the difference between a message's offset and a consumer's committed offset (see the sketch below).
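
To illustrate that last distinction, here is a minimal sketch assuming a 2.4+ Java client, with the broker address, topic, and group ID made up for illustration: the offset carried by each record marks its immutable position in the partition log, while the consumer offset is the position the consumer has committed for that partition.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class OffsetVsConsumerOffset {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "offset-demo");             // placeholder
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("demo-topic", 0);
            consumer.assign(Collections.singletonList(tp));

            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                // The record's offset: its immutable position in the partition log.
                System.out.println("message offset = " + record.offset());
            }
            consumer.commitSync();

            // The consumer offset: where this group/consumer will resume from.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    consumer.committed(Collections.singleton(tp));
            System.out.println("committed consumer offset = " + committed.get(tp));
        }
    }
}
```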

5. Describe the differences between leader replicas and follower replicas in Kafka #

This question seems to test your understanding of the difference between leaders and followers, but it can easily lead to discussing Kafka’s synchronization mechanism. Therefore, I suggest proactively addressing the implied point to showcase your expertise and possibly temporarily impress the interviewer.

You can answer like this: Currently, Kafka replicas are divided into leader replicas and follower replicas. Only leader replicas provide read and write services and respond to client requests. Follower replicas passively synchronize data from the leader replica using a pull mechanism and stand ready to take over as leader once the broker hosting the leader replica goes down.

Generally, answering to this extent only covers about 60% of the topic. Therefore, I suggest adding two additional bonus points.

  • Emphasize that follower replicas can also provide read services. Starting from Kafka 2.4, the community introduced new broker-side parameters that allow follower replicas to provide read services to a limited extent (see the configuration sketch after this list).
  • Emphasize that the message sequences between leaders and followers are not consistent in practice. Many factors can cause inconsistency between the message sequences stored in leaders and followers, such as program bugs, network issues, etc. This is a serious issue that must be completely avoided. You can add that the primary means to ensure consistency was the high watermark mechanism, but the high watermark value cannot guarantee data consistency in scenarios with continuous leader changes. Therefore, the community introduced the Leader Epoch mechanism to fix the shortcomings of the high watermark value. Unfortunately, there is not much information available about the Leader Epoch mechanism in the domestic community. Its popularity is far less than that of the high watermark. Thus, you can boldly showcase this concept and aim to impress. In the [27th lecture] of the previous column, the principle of the Leader Epoch mechanism is discussed. I recommend that you study it quickly.
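
For the first bonus point, a rough configuration sketch, assuming Kafka 2.4 or later: the broker opts in with a rack-aware replica selector and declares its rack, and the consumer declares its own rack via client.rack so that fetches can be served by a nearby in-sync follower. The rack names and addresses below are placeholders.

```java
import java.util.Properties;

public class FollowerFetchConfigSketch {
    public static void main(String[] args) {
        // Broker side (normally placed in server.properties), shown as comments:
        //   replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
        //   broker.rack=us-east-1a   <- placeholder rack name

        // Consumer side: tell the brokers which rack this client lives in.
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092"); // placeholder
        consumerProps.put("group.id", "follower-read-demo");      // placeholder
        consumerProps.put("client.rack", "us-east-1a");           // matches a follower's broker.rack
        // With both sides configured, fetch requests may be served by an in-sync
        // follower in the same rack instead of always going to the leader.
        System.out.println(consumerProps);
    }
}
```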

Practical Exercises #

6. How to set the maximum message size that Kafka can receive? #

In addition to answering the question regarding parameter settings on the consumer side, it is important to also include the settings on the broker side to provide a complete answer. After all, if the producer cannot send large messages to the broker, there is no point in talking about consumption. Therefore, you need to set both broker-side parameters and consumer-side parameters.

  • Broker-side parameters: message.max.bytes, max.message.bytes (topic-level), and replica.fetch.max.bytes.
  • Consumer-side parameter: fetch.message.max.bytes (for the newer Java consumer, the corresponding settings are fetch.max.bytes and max.partition.fetch.bytes).

The last broker-side parameter is easily overlooked. We must adjust the maximum message size that follower replicas can receive; otherwise, replica synchronization will fail. If you can mention this parameter, it will be considered a bonus point.
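
A minimal sketch of how these settings fit together, using a made-up 10 MB limit; the broker-level values normally live in server.properties, the topic-level override can be applied through the Admin API, and the consumer value is a client property.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class MaxMessageSizeSketch {
    public static void main(String[] args) throws Exception {
        // Broker side (server.properties), shown as comments because they are static configs:
        //   message.max.bytes=10485760        # cluster-wide cap, ~10 MB (made-up value)
        //   replica.fetch.max.bytes=10485760  # must be >= message.max.bytes or replication breaks

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // Topic-level override of max.message.bytes for a single (made-up) topic.
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "big-message-topic");
            AlterConfigOp op = new AlterConfigOp(
                    new ConfigEntry("max.message.bytes", "10485760"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(
                    Collections.singletonMap(topic, Collections.singletonList(op)))
                 .all().get();
        }
        // Consumer side: make sure the fetch size is at least as large as the biggest message,
        // e.g. fetch.max.bytes / max.partition.fetch.bytes on the Java consumer.
    }
}
```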

7. What are the frameworks available for monitoring Kafka? #

In fact, there is no widely recognized solution in the industry currently, and each company has its own monitoring methods. Therefore, the interviewer is actually testing your breadth of knowledge about monitoring frameworks, or in other words, if you are aware of many frameworks or methods available for monitoring Kafka. The following are some well-known monitoring systems in the history of Kafka development.

  • Kafka Manager: This can be considered the most famous dedicated Kafka monitoring framework and is an independent monitoring system.
  • Kafka Monitor: This is a free framework open sourced by LinkedIn, which supports system testing of clusters and provides real-time monitoring of test results.
  • Cruise Control: This is also an open-source monitoring framework developed by LinkedIn, used for real-time monitoring of resource usage and for providing common operational tasks. It does not have a UI and only provides a REST API.
  • JMX monitoring: Since Kafka exposes its monitoring metrics via JMX, any framework that can integrate with JMX, such as Zabbix and Prometheus, can be used (a minimal JMX polling sketch follows this list).
  • Monitoring system of existing big data platforms: Big data platforms like Cloudera’s CDH naturally provide Kafka monitoring solutions.
  • JMXTool: This is a command-line tool provided by the community, which can monitor JMX metrics in real-time. Answering this point is definitely a bonus point, because few people know about it, and it will give the impression that you are very familiar with Kafka tools. If you are not familiar with its usage for now, you can run kafka-run-class.sh kafka.tools.JmxTool in the command line without any arguments to learn about its usage.
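
For the JMX point, the sketch below polls one well-known broker metric over a remote JMX connection. The port (9999) is an assumption on my part and requires the broker to have been started with JMX enabled (for example via the JMX_PORT environment variable); the MBean used is one of the broker's standard throughput meters.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxPollSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port; assumes the broker exposes JMX remotely on 9999.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // One of the broker's throughput meters; the one-minute rate is a moving average.
            ObjectName messagesIn = new ObjectName(
                    "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec");
            Object rate = conn.getAttribute(messagesIn, "OneMinuteRate");
            System.out.println("MessagesInPerSec (1-min rate): " + rate);
        }
    }
}
```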

8. How to set the Broker’s Heap Size? #

The question of how to set the Heap Size is actually not directly related to Kafka; it is a very general interview question. If you do not handle it well, the interview may be steered toward the JVM and GC, which increases the risk of tripping up. Therefore, I suggest that you briefly introduce the general method of setting the Heap Size and then focus on the best practice for setting the Kafka Broker's heap size.

For example, you can reply as follows: The setting of JVM heap size for any Java process needs to be carefully considered and tested. A common practice is to run the program with the default initial JVM heap size, manually trigger a Full GC after the system reaches a stable state, and then use JVM tools to check the size of surviving objects after GC. Afterwards, set the heap size to 1.5 to 2 times the total size of the surviving objects. This method also applies to Kafka. However, the industry has a best practice of setting the Broker’s Heap Size to a fixed value of 6GB. This size has been verified by many companies and is considered sufficient and optimal.

9. How to estimate the number of machines in a Kafka cluster? #

This question tests the relationship between the number of machines and the resources used. The so-called resources refer to CPU, memory, disk, and bandwidth.

Generally speaking, it is relatively easy to ensure the sufficiency of CPU and memory resources, so you need to evaluate the number of machines from the two dimensions of disk space and bandwidth usage.

When estimating disk usage, you must not forget to consider the overhead of replica synchronization. If a message occupies 1KB of disk space, then in a topic with 3 replicas, you need a total space of 3KB to store this message. Explicitly mentioning these considerations will demonstrate your comprehensive thinking skills and is a rare bonus point.

For evaluating bandwidth, common bandwidths are 1Gbps and 10Gbps, but you must remember that these two numbers are only maximums. Therefore, it is best to confirm with the interviewer how much bandwidth is actually available. Then, clearly explain that packet loss can occur once bandwidth usage reaches 90% of the total, which will demonstrate your basic networking knowledge.
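
To show the shape of such an estimate, here is a purely hypothetical worked example (all numbers are made up): suppose the business produces 100 million messages per day at roughly 1KB each, with 3 replicas and 7 days of retention. That is about 100GB of raw data per day, tripled to roughly 300GB by replication, or a bit over 2TB per week, plus headroom for index files and segments awaiting deletion. On the bandwidth side, a 1Gbps NIC kept below 90% utilization sustains roughly 110MB/s, so dividing the estimated peak inbound, replication, and outbound traffic by that figure gives a lower bound on the machine count. The exact numbers matter less than showing the interviewer this line of reasoning.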

10. How to deal with Leader always being -1? #

In a production environment, you must have encountered situations where “a certain topic partition cannot work.” If you check the status using the command line, you will find that the Leader is -1, and then you try various commands, but none of them work, so you can only resort to the “reboot method.”

However, is there any way to solve this problem without restarting the cluster? This is the purpose of this question.

I'll give you the answer directly: delete the ZooKeeper node /controller to trigger a Controller re-election. A Controller re-election refreshes the state of all topic partitions and can effectively resolve a Leader stuck at -1 due to inconsistent state. I can almost guarantee that when the interviewer asks this question, either they really don't know how to solve it and are hoping you do, or they are waiting for you to give exactly this answer. So don't open with "just restart it" or anything like that.
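
If you want to show the fix concretely, here is a minimal sketch using the ZooKeeper Java client (in practice the same thing is often done from a ZooKeeper shell); the connection string is a placeholder. /controller is an ephemeral znode owned by the current controller, so deleting it forces the brokers to elect a new controller, which then refreshes partition state.

```java
import org.apache.zookeeper.ZooKeeper;

public class ForceControllerReelection {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string pointing at the cluster's ZooKeeper ensemble.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> { });
        try {
            // /controller is an ephemeral znode owned by the current controller broker.
            // Deleting it (version -1 = any version) forces a controller re-election.
            zk.delete("/controller", -1);
        } finally {
            zk.close();
        }
    }
}
```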

Show-off Questions #

11. What do LEO, LSO, AR, ISR, and HW mean? #

In my opinion, this is just a boring show-off question. So what if I don't know the answer?! But since it has been asked, let's go through it together.

  • LEO: Log End Offset. It represents the offset value of the next message to be inserted into the log. For example, if there are 10 messages in the log and the offset value starts from 0, then the offset value of the 10th message is 9. In this case, LEO = 10.
  • LSO: Log Stable Offset. This is a concept related to Kafka transactions. If you are not using transactions, then this value does not exist (or it is set to a meaningless value). LSO controls the range of messages that transactional consumers can see. It is often confused with Log Start Offset, which is abbreviated as LSO by some people, but that is incorrect. In Kafka, LSO refers to Log Stable Offset.
  • AR: Assigned Replicas. AR refers to the set of replicas assigned when a topic is created. The number of replicas is determined by the replication factor.
  • ISR: In-Sync Replicas. This is a very important concept in Kafka, referring to the set of replicas in the AR that are in sync with the leader. Replicas in the AR may not be in the ISR, but the leader replica is always in the ISR. A related interview question is how to determine whether a replica belongs in the ISR. Currently, the criterion is whether the time for which the follower replica's LEO has lagged behind the leader's LEO exceeds the broker-side parameter replica.lag.time.max.ms; if it does, the replica is removed from the ISR.
  • HW: High Watermark. This is an important field that controls the range of messages that consumers can read. A regular consumer can only “see” all messages on the leader replica between Log Start Offset and HW (excluding HW). Messages above the watermark are not visible to consumers. There are many ways to ask about HW, and the most advanced way I can think of is to ask you to outline the detailed steps of how follower replicas pull data from the leader replica and perform synchronization. This is our 20th question, and I will provide the answer and analysis in a moment.

12. Can Kafka manually delete messages? #

Actually, Kafka does not require users to manually delete messages. It provides retention policies to automatically delete expired messages. However, it does support manual deletion of messages. Therefore, it is best to answer this question from both dimensions.

  • For topics that have keys set and the parameter cleanup.policy=compact, we can delete all messages with a given key by sending a tombstone, that is, a message with that key and a null value, relying on the functionality provided by the Log Cleaner component.
  • For regular topics, we can use the kafka-delete-records command or write a program that calls the Admin.deleteRecords method to delete messages. The two approaches achieve the same result: both ultimately call Admin's deleteRecords method, which deletes messages indirectly by advancing the partition's Log Start Offset (a minimal sketch of the programmatic route follows).
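
For the programmatic route, a minimal sketch with a made-up topic, partition, and offset, assuming a recent Java client:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.RecordsToDelete;
import org.apache.kafka.common.TopicPartition;

public class DeleteRecordsSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // Delete everything before offset 1000 in partition 0 of a made-up topic.
            // Internally this just advances the partition's Log Start Offset.
            Map<TopicPartition, RecordsToDelete> toDelete = Collections.singletonMap(
                    new TopicPartition("demo-topic", 0), RecordsToDelete.beforeOffset(1000L));
            admin.deleteRecords(toDelete).all().get();
        }
    }
}
```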

13. What is __consumer_offsets used for? #

This is an internal topic, and there is scarce public information about it on the official website. Therefore, I believe this question belongs to the category of fancy questions asked by interviewers. You need to be careful about the focus of this question: there are 3 important points about this topic, and you must answer all of them to demonstrate your expertise in this area.

  • It is an internal topic that does not require manual intervention and is managed by Kafka itself. Of course, we can create this topic.
  • Its main purpose is to register consumers and save their offsets. You may already know about the offset-saving function, but note that this topic also stores consumer metadata. Additionally, "consumers" here includes both consumer groups and standalone consumers, not just consumer groups (the sketch after this list shows how a group's saved offsets can be read back via the Admin API).
  • The GroupCoordinator component of Kafka provides complete management functions for this topic, including creating, writing, reading, and maintaining the leader of the topic.
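
As a small illustration of the offset-saving role (the group name and broker address are placeholders), the Admin API can read back what a group has persisted in __consumer_offsets:

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ReadCommittedOffsetsSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // The offsets returned here are the ones the group persisted in __consumer_offsets.
            Map<TopicPartition, OffsetAndMetadata> offsets = admin
                    .listConsumerGroupOffsets("demo-group")
                    .partitionsToOffsetAndMetadata()
                    .get();
            offsets.forEach((tp, om) ->
                    System.out.println(tp + " -> committed offset " + om.offset()));
        }
    }
}
```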

14. How many types of partition leader election strategies are there? #

The partition leader replica election is completely transparent to the user and is handled independently by the Controller. What you need to answer is the scenarios in which partition leader election needs to be performed. Each scenario corresponds to a leader election strategy. Currently, Kafka has four partition leader election strategies.

  • OfflinePartition Leader election: Leader election needs to be performed whenever a partition comes online. The term “partition comes online” may refer to the creation of a new partition or the reconnection of a previously offline partition. This is the most common scenario for partition leader election.
  • ReassignPartition Leader election: This type of election may be triggered when you manually run the kafka-reassign-partitions command or call the Admin’s alterPartitionReassignments method to perform partition replica reassignment. Assuming the original AR is [1, 2, 3] and Leader is 1, when replica reassignment is executed and the replica set AR is set to [4, 5, 6], it is obvious that the Leader must be changed, and this will trigger a Reassign Partition Leader election.
  • PreferredReplicaPartition Leader election: When you manually run the kafka-preferred-replica-election command or automatically trigger the Preferred Leader election, this strategy will be activated. The Preferred Leader refers to the first replica in the AR. For example, if AR is [3, 2, 1], then the Preferred Leader is 3.
  • ControlledShutdownPartition Leader election: When a broker is shut down normally, all leader replicas on that broker will go offline, so the corresponding leader election needs to be performed for the affected partitions.

The general idea of these four election strategies is similar, which is to select the first replica in the AR that is also in the ISR as the new Leader. Of course, there may be minor differences in individual strategies. However, answering to this level should be sufficient to handle interview questions. After all, minor differences have little impact on the election of the Leader.
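
To tie this back to something operable: preferred-leader election can also be triggered programmatically through the Admin API. Here is a minimal sketch with a made-up topic and partition; the broker address is a placeholder as well.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.common.ElectionType;
import org.apache.kafka.common.TopicPartition;

public class PreferredLeaderElectionSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // Ask the controller to move leadership back to the preferred replica
            // (the first replica in the AR) for one made-up partition.
            admin.electLeaders(ElectionType.PREFERRED,
                    Collections.singleton(new TopicPartition("demo-topic", 0)))
                 .all().get();
        }
    }
}
```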

15. In which scenarios of Kafka is Zero Copy used? #

Zero Copy is a high-level topic that is frequently asked about. In Kafka, Zero Copy is used in two places: the mmap-based index files, and the TransportLayer used when transferring log file data.

Let’s start with the first one. The indexes are based on MappedByteBuffer, which allows user space and kernel space to share the kernel’s data buffer, so the data does not need to be copied to the user space. However, although mmap avoids unnecessary copying, it does not necessarily guarantee high performance. The cost of creating and destroying mmap may vary depending on the operating system. High creation and destruction overhead can offset the performance advantage of Zero Copy. Due to this uncertainty, only the indexes in Kafka use mmap, and the core log does not use the mmap mechanism.

Now for the second one. TransportLayer is the interface of Kafka’s transport layer. One of its implementation classes uses the transferTo method of FileChannel. This method implements Zero Copy using the sendfile system call. For Kafka, if the I/O channel uses plain PLAINTEXT, Kafka can take advantage of the Zero Copy feature and directly send the data from the page cache to the buffer of the network card, avoiding multiple intermediate copies. Conversely, if the I/O channel is enabled with SSL, Kafka cannot leverage the Zero Copy feature.
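
To make the second point tangible, here is a minimal, self-contained sketch of the transferTo pattern; the file names are placeholders standing in for a log segment and the destination channel. On Linux this call typically maps to the sendfile system call, so the bytes move from the page cache to the target channel without a detour through user space.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

public class TransferToSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder files standing in for a log segment and a socket-like destination channel.
        try (FileChannel source = new FileInputStream("segment.log").getChannel();
             WritableByteChannel sink = new FileOutputStream("copy.log").getChannel()) {
            long position = 0;
            long remaining = source.size();
            while (remaining > 0) {
                // transferTo hands the copy to the kernel; no intermediate user-space buffer.
                long transferred = source.transferTo(position, remaining, sink);
                position += transferred;
                remaining -= transferred;
            }
        }
    }
}
```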

Thought-Provoking Questions #

16. Why doesn’t Kafka support read-write separation? #

This question tests your understanding of the Leader/Follower model.

The Leader/Follower model does not specify that Follower replicas cannot provide read services. Many frameworks allow this, but Kafka initially adopted a unified approach of having the Leader provide services to avoid inconsistency problems.

However, right at the start of your answer, you can make your point clear: since Kafka 2.4, Kafka has provided limited read-write separation, meaning that Follower replicas can serve read requests.

After mentioning this, you can give the reasons why previous versions did not support read-write separation:

  • Unsuitable scenarios: Read-write separation suits workloads with heavy reads and relatively infrequent writes, which is not the typical Kafka workload.
  • Synchronization mechanisms: Kafka uses the PULL method to synchronize Followers. As a result, there is an inconsistency window between the Follower and the Leader. If reading from Follower replicas is allowed, the issue of message lag must be addressed.

17. How to optimize Kafka? #

The first step in answering any optimization question is to determine the optimization goals and quantify them! This is particularly important. For Kafka, common optimization goals are throughput, latency, durability, and availability. The optimization approaches for these goals differ and may even conflict with one another.

Once you have determined the goals, you need to clarify the optimization dimensions. Some optimizations are generic, such as tuning the operating system and the JVM, while others are targeted, such as optimizing Kafka's TPS. We need to consider it from three perspectives; a combined configuration sketch follows the list below.

  • Producer: Increase batch.size, linger.ms, enable compression, disable retries, etc.
  • Broker: Increase the num.replica.fetchers to improve Follower synchronization TPS, avoid Broker Full GC, etc.
  • Consumer: Increase fetch.min.bytes, etc.
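
A minimal sketch of the throughput-oriented client settings above; the values are illustrative rather than recommended, and broker-side settings such as num.replica.fetchers belong in server.properties.

```java
import java.util.Properties;

public class ThroughputTuningSketch {
    public static void main(String[] args) {
        // Producer: trade a little latency for larger batches and cheaper network usage.
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092"); // placeholder
        producerProps.put("batch.size", "524288");     // larger batches (illustrative value)
        producerProps.put("linger.ms", "20");          // wait briefly to fill batches
        producerProps.put("compression.type", "lz4");  // enable compression
        producerProps.put("retries", "0");             // disable retries when tuning purely for TPS
        producerProps.put("acks", "1");                // fewer acks, higher throughput

        // Consumer: fetch in bigger chunks.
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092"); // placeholder
        consumerProps.put("fetch.min.bytes", "1048576");          // wait for ~1 MB before returning

        // Broker (server.properties): num.replica.fetchers=4  <- illustrative value
        System.out.println(producerProps + "\n" + consumerProps);
    }
}
```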

18. What happens when the Controller experiences network partitioning in Kafka? #

This question can trigger our thinking on distributed system design, CAP theorem, consistency, and many other aspects. However, for fault localization and analysis questions like this, I suggest taking a “practical first” approach, which means that no matter how theoretical the analysis is, the actual results should always prevail. Once a Controller network partition occurs, the first thing to check is whether the cluster has experienced “split-brain,” that is, when two or more Controller components appear simultaneously. This can be judged based on the Broker-side monitoring metric ActiveControllerCount.

Now, let’s analyze what happens when this situation occurs.

Since the Controller sends three types of requests to Brokers, namely LeaderAndIsrRequest, StopReplicaRequest, and UpdateMetadataRequest, a network partition prevents these requests from reaching the Brokers. Metadata changes from operations such as topic creation, modification, and deletion can then no longer be propagated, so the cluster appears stuck and unaware of any subsequent operations. Therefore, a network partition is usually a very serious problem that needs to be repaired quickly.

19. Why does the Java Consumer use a single thread to fetch messages? #

Before answering, you will earn bonus points if you first state this: the Java Consumer has a dual-threaded design. One thread is the user's main thread, responsible for fetching messages; the other is the heartbeat thread, responsible for reporting the consumer's liveness to Kafka. Putting the heartbeat in a dedicated thread effectively prevents the consumer from being falsely considered dead just because message processing is slow.

As for why a single thread is used to fetch messages: single-threaded polling is easy to implement in an asynchronous, non-blocking way, which makes it easier to extend consumers into real-time stream processing operators, because many stream processing operators cannot afford to block. Another possible benefit is simpler code: code involving interaction between multiple threads is prone to errors.

20. Describe the complete process of Follower replica message synchronization. #

First, the Follower sends a FETCH request to the Leader. Then, the Leader reads the message data from the underlying log file, updates the Follower replica's LEO value in its memory to the fetchOffset value in the FETCH request, and finally attempts to update the partition's high watermark. After receiving the FETCH response, the Follower writes the messages to its underlying log and then updates its own LEO and HW values.

The timing of HW updates differs between the Leader and the Follower: the Follower's HW always lags behind the Leader's. This timing mismatch is the root cause of various inconsistency problems.

Alright, that’s all for today’s interview question sharing. Have you encountered any classic interview questions? Or do you have any good interview experiences?

Feel free to share in the comments section. If you found today’s content helpful, feel free to share it with your friends.