
Bonus - Set Up Development Environment, Read Source Code Methodology, Classic Study Materials Revealed #

Hello, I’m Hu Xi.

Thirty-eight lectures of this column have been published so far. How well have you mastered them? If some of the material has not sunk in yet, that’s okay. Take your time, and remember to leave a message in the comments section if you have any questions, so that we can discuss them together.

Today, let’s talk about something different. I have summarized three highly discussed topics for you, and now I will “reveal” them one by one.

  1. How to build a Kafka development environment? Many people are interested in compiling and debugging Kafka, but they struggle with getting started. Today, I will demonstrate the complete process of building a Kafka development environment.
  2. How to read Kafka source code? I mentioned in the first lecture of this column that I had read Kafka source code myself. Later, I received many messages asking me how I read it. Today, I will share with you some good principles or techniques for reading Kafka source code.
  3. What materials are good for learning Kafka? Fortunately, I have made some summaries in this regard, and today I will share all of those materials with you without reservation.

Setting up the Kafka Development Environment #

Now, let me answer the first question: how to set up the Kafka development environment. I will use IntelliJ IDEA as an example, but Eclipse should be similar.

Step 1: Install Java and Gradle #

To set up the Kafka development environment, you must have Java and Gradle installed. Additionally, you need to install the Scala plugin in IntelliJ IDEA. It is recommended to add Java and Gradle to your environment variables.
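
Before moving on, it is worth sanity-checking that the toolchain is actually available on your PATH. A minimal sketch (the loop only reports what it finds; Kafka 2.x builds generally require Java 8 or later):

```shell
# Report whether each required build tool is on the PATH.
# Install anything flagged as missing before continuing.
for tool in java gradle git; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: missing - install it before continuing"
  fi
done
```

If all three report “found”, you are ready for the next step.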

Step 2: Download the Kafka Source Code #

Once you have completed step 1, download the Kafka source code by executing the following command:

$ cd Projects
$ git clone https://github.com/apache/kafka.git

This command downloads the Kafka source code from the trunk branch, which contains the latest committed patches and may even be ahead of the newest release available for download on the Kafka website. It is worth noting that if you want to contribute code to the Kafka community, you usually develop against trunk.
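
If you would rather study or build a released version instead of trunk, you can switch to a release branch after cloning. The branch name “2.3” below is only an example; run `git branch -r` to see which release branches your clone actually has:

```shell
# From the directory that contains the clone ("kafka" by default):
cd kafka
git branch -r        # lists the remote release branches, e.g. origin/2.3
git checkout 2.3     # "2.3" is an example; pick the release you need
echo "remember to re-run the Gradle steps after switching branches"
```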

Step 3: Download the Gradle Wrapper #

After downloading the code, a subdirectory named “kafka” will be created automatically. Go into this directory and execute the following command to download the Gradle Wrapper:

$ gradle
Starting a Gradle Daemon (subsequent builds will be faster)

> Configure project :
Building project 'core' with Scala version 2.12.9
Building project 'streams-scala' with Scala version 2.12.9

Deprecated Gradle features were used in this build, making it incompatible with Gradle 6.0.
Use '--warning-mode all' to show the individual deprecation warnings.
See https://docs.gradle.org/5.3/userguide/command_line_interface.html#sec:command_line_warning

Step 4: Compile and Package the Kafka Source Code into Jar Files #

Now, you can run the following command to compile and package the Kafka source code into Jar files:

./gradlew clean releaseTarGz

Usually, you need to wait for a while. After Gradle fetches the dependency Jar files, compiles the Kafka source code, and packages it, you will find the generated tgz package (kafka_2.12-2.4.0-SNAPSHOT.tgz) in the core/build/distributions directory. Once you extract it, you will have a functional Kafka environment that can be started and run.
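
As a quick smoke test of the artifact you just built, you can unpack it and start a single-node setup with the bundled default configs. This is a sketch: the version string must match what your build actually produced, and the -daemon flag simply runs each process in the background:

```shell
# Unpack the build output produced in the previous step.
cd core/build/distributions
tar -xzf kafka_2.12-2.4.0-SNAPSHOT.tgz
cd kafka_2.12-2.4.0-SNAPSHOT

# Kafka 2.x still depends on ZooKeeper, so start it first, then the Broker.
bin/zookeeper-server-start.sh -daemon config/zookeeper.properties
bin/kafka-server-start.sh -daemon config/server.properties
echo "check logs/server.log to confirm the Broker came up"
```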

Step 5: Import the Kafka Source Code Project into IntelliJ IDEA #

This is the final step in setting up the development environment. Execute the following command to create the project files required by IntelliJ IDEA:

$ ./gradlew idea  # If you are using Eclipse, execute: ./gradlew eclipse

Then, open IntelliJ IDEA and choose “Open Project”. Select the kafka directory.

At this point, we have set up the Kafka source code environment in IntelliJ IDEA. You can open the Kafka.scala file, right-click, and select “Run”. Now, you should see the command-line usage instructions for starting Kafka Broker as shown in the following screenshot:

Kafka Broker Command-Line Usage

Overall, since the Kafka project switched from sbt to Gradle, the compilation and build process has become much simpler. With just three or four commands, you can set up a test development environment on your local machine.

Methods for Reading Kafka Source Code #

After setting up the development environment, the next step naturally is to read the Kafka source code and try to modify it on your own. The following image shows the complete directory list of the Kafka project in IDEA.

In this image, there are several subdirectories that you need to pay special attention to.

  • core: The Broker-side project, which contains the Broker code.
  • clients: The Client-side project, which contains all the Client code as well as common code shared by the clients and the Broker.
  • streams: The Kafka Streams project, which contains the Kafka Streams code.
  • connect: The Kafka Connect project, which contains the Kafka Connect framework code as well as the File Connector code.

I previously mentioned that the Kafka source code runs to over 500,000 lines, and reading it without a plan is very inefficient. When I started, I read the source code blindly and got little out of it, so I think it is worth recommending a few best practices.

I suggest that you start reading from the core package, specifically from the Broker-side code. You can follow the order below for reading:

  1. The log package: The log package defines the underlying message and index storage mechanism as well as the physical format. It is worth reading, especially the classes Log, LogSegment, and LogManager, which almost define the underlying message storage mechanism of Kafka. Pay special attention to these classes.
  2. The controller package: The controller package implements all the functionalities of the Kafka Controller, especially the KafkaController.scala file, which encapsulates all the event handling logic of the Controller. If you want to understand how the Controller works, it is best to read this large file with nearly 2000 lines multiple times.
  3. The code in the group package under the coordinator package: Currently, the coordinator package has two sub-packages: group and transaction. The former encapsulates the Coordinator used by Consumer Groups, and the latter encapsulates the Transaction Coordinator that supports Kafka transactions. Personally, I think you should read the code in the group package thoroughly to understand how the Broker manages Consumer Groups. The important classes here are GroupMetadataManager and GroupCoordinator, which define Consumer Group metadata and the state machine that manages it.
  4. The code in the network package and some code in the server package: If you still have the energy, you can read this code as well. The SocketServer class in the former implements the complete network path by which the Broker receives external requests. We mentioned in Lecture 24 that Kafka uses the Reactor pattern; if you want to understand how the Reactor pattern is applied in Kafka, you should understand this class.
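
To keep this reading plan honest, it can help to measure how large each recommended package actually is before diving in. A small sketch, run from the root of the source tree (the package paths are those of a 2.x checkout and may differ in other versions):

```shell
# Print the rough size of each recommended Broker-side package.
for pkg in log controller coordinator/group network server; do
  n=$(find "core/src/main/scala/kafka/$pkg" -name '*.scala' \
        -exec cat {} + 2>/dev/null | wc -l)
  printf '%s: %d lines of Scala\n' "$pkg" "$n"
done
```

A package reported as 0 lines simply means the path does not exist in your checkout; adjust the paths to match your version.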

From an overall perspective, the entry class for the Broker-side is KafkaApis.scala. This class is the general entry point for handling all inbound requests. The following image shows some of the request handling methods:

You can go into different methods to see the actual request handling logic. For example, the handleProduceRequest method handles requests for producing messages, while the handleFetchRequest method handles requests for reading messages.
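
A quick way to get an overview of every request type this class dispatches is to grep for its handler methods, since they follow the handleXxxRequest naming convention mentioned above (the path is that of a 2.x source tree):

```shell
# List the request-handling methods defined in KafkaApis.scala,
# with line numbers, sorted for easier scanning.
grep -n 'def handle' core/src/main/scala/kafka/server/KafkaApis.scala | sort
```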

We just covered the important class files under the core package. Under the clients package on the client side, I recommend that you focus on reading four parts.

  1. The org.apache.kafka.common.record package: This package contains various Kafka message entity classes, such as the MemoryRecords class used for in-memory transfers and the FileRecords class used for on-disk storage.
  2. The org.apache.kafka.common.network package: You don’t need to read this package in its entirety. Focus on Selector and KafkaChannel, especially the former. They are the core mechanism for network transmission between the Client and the Broker. If you fully understand the Java code in this package, many of Kafka’s network-related exceptions will become much easier to diagnose.
  3. The org.apache.kafka.clients.producer package: As the name suggests, it is the implementation package of the Producer. There are many Java classes inside, and you can focus on KafkaProducer, Sender, and RecordAccumulator.
  4. The org.apache.kafka.clients.consumer package: It is the implementation package of the Consumer. Similarly, I recommend that you focus on reading KafkaConsumer, AbstractCoordinator, and Fetcher.

Additionally, when reading the source code, whether on the Broker side or the Client side, it is best to do so with a debugger attached. By setting breakpoints and stepping through in Debug mode, you can incrementally inspect the state of each Kafka class and the information it holds in memory. This reading method will greatly improve your efficiency.
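
If you prefer to debug a Broker launched from the command line rather than from inside the IDE, the standard JVM remote-debug flags work as well. The sketch below only builds and prints the launch command; KAFKA_OPTS is the usual hook the launch scripts use for extra JVM flags, and port 5005 is merely the conventional default:

```shell
# JDWP options for attaching a remote debugger from IntelliJ IDEA;
# suspend=y makes the Broker wait until the debugger connects.
JDWP="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"
echo "Launch with: KAFKA_OPTS=\"$JDWP\" bin/kafka-server-start.sh config/server.properties"
```

Once the Broker is waiting, create a “Remote” run configuration in IDEA pointing at localhost:5005, and your breakpoints will be hit.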

If you are not currently interested in setting up a development environment or reading the source code, but still want to learn Kafka quickly and in depth, learning from existing materials is also a good approach. Next, I will recommend some valuable learning materials for Kafka.

Classic Study Materials #

The first and most crucial recommendation is the Kafka official website. Many people overlook the official website, but it is actually the most important learning resource. By thoroughly reading the website and mastering its content, you can already have a good grasp of Kafka.

The second recommendation is the Kafka JIRA list. When you encounter a Kafka exception, you can search for related keywords in the JIRA to see if it is a known bug. Many times, the issues we encounter have already been discovered and submitted to the community by others. In this case, the JIRA list is a helpful tool for troubleshooting.

The third recommendation is the Kafka Improvement Proposals (KIP) list, which is maintained on the Apache Kafka community wiki. KIPs propose and discuss new features for Kafka. If you want to understand Kafka’s future development roadmap, reading the KIPs is essential. Of course, if you have an idea for a feature that Kafka does not currently have, you can submit your own proposal and wait for the community’s review.

The fourth recommendation is the design documents maintained by the Kafka team on the community wiki, where you can find almost all of the Kafka design documents. The articles on the Controller and the new version of the Consumer are particularly insightful, and I recommend focusing on them.

The fifth recommendation is the renowned Stack Overflow. Its importance to programmers today needs no introduction. The Kafka-related questions asked there are often quite deep; in fact, it is not uncommon for a simple Stack Overflow question to evolve into a Kafka bug fix or even a new feature.

The sixth recommendation is the technical blog maintained by Confluent, the commercial company that offers Kafka solutions. The articles on this blog are written by Kafka Committers and are of high quality. For example, the articles on Kafka’s exactly-once semantics and transactions are highly valuable, so you should definitely read them.

The seventh recommendation is my own blog. I regularly update original articles on Kafka on my blog. Some of them are my understanding of Kafka technology, and others are the latest news about Kafka. Although it may not be the best quality blog in China, it has been consistently focused on Kafka for many years.

Finally, I would like to recommend 3 books for learning Kafka.

The first book is my own “Apache Kafka in Action”, in which I summarize my practical experience using and studying Kafka over the years. The book was published in 2018 and is based on Kafka 1.0. Although Kafka has since reached version 2.3, the core messaging engine has not changed significantly, so most of the content is still relevant.

The second book is “Kafka: The Definitive Guide”. I personally really like the author’s writing style, and this book provides detailed and thorough analysis of Kafka principles, with excellent illustrations.

The third book is a new book published in 2019 called “Understanding Kafka in Depth”. The author of this book is a well-known expert in RabbitMQ and Kafka, with unique insights into message middleware.

Each of these resources has its own focus, so you can choose the appropriate materials based on your actual needs.

Summary #

Alright, let’s summarize. In today’s article, I shared a number of experiences with you, such as how to set up a Kafka development environment and how to read the Kafka source code. I hope they help you save time and avoid some detours. I have also listed all the relevant learning materials I have collected over the years, and I hope these resources help you learn Kafka better.

Speaking of which, I would like to emphasize once again that learning is a continuous process. While experience and external help are important, the most crucial thing is to put in your own effort and persevere.

Remember: Stay focused and work hard!

Open Discussion #

Lastly, let’s discuss a question: what do you think is the most important aspect of learning Kafka or any other technology?

Feel free to share your thoughts and answers, and let’s discuss together. If you find this discussion beneficial, don’t hesitate to share this article with your friends.