00 Guide: Building the Kafka Project and Source Code Reading Environment, with a Scala Language Warm-up #

Hello, I’m Hu Xi.

Starting today, we will officially step into the world of Kafka source code. Since this course is designed to teach you how to read the Kafka source code, the first thing you need to master is how to set up the Kafka source code environment on your own computer, and even how to debug it. This lesson walks through many hands-on steps, and I recommend that you follow along and try them yourself; otherwise, it will be difficult to gain a deep understanding.

Without further ado, let’s now proceed to set up the source code environment.

Environment Setup #

Before reading the Kafka source code, we need to do some necessary preparation work. This involves installing some tools and software, such as Java, Gradle, Scala, an IDE, and Git.

If you are setting up the environment on Linux or macOS, you only need to install Java, an IDE, and Git, since the Gradle Wrapper script (introduced below) takes care of Gradle for you. If you are using Windows, you need to install all of them.

In this course, we will use the following versions for source code explanation:

  • Oracle Java 8: We use Oracle’s JDK and Hotspot JVM. If you prefer other vendors or open-source versions of Java (such as OpenJDK), you can choose to install different JVM versions.
  • Gradle 6.3: The Kafka source code that I will guide you through in this course is from the community’s Trunk branch. The Trunk branch has currently evolved to version 2.5 and supports Gradle 6.x versions. It is recommended to install Gradle 6.3 or a higher version.
  • Scala 2.13: The community’s Trunk branch compiles with two Scala versions, 2.12 and 2.13. It defaults to using Scala 2.13 for compilation, so I recommend installing Scala 2.13.
  • IDEA + Scala plugin: This course uses IDEA as the IDE to read and configure the source code. I have great respect for Eclipse, but personally, I prefer using IDEA. Additionally, you need to install the Scala plugin for IDEA, which will enable you to easily read Scala source code.
  • Git: Installing Git is primarily for managing Kafka source code versions. If you want to become a community code contributor, a Git management tool is essential.
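
Once these are in place, a quick sanity check (assuming a Unix-like shell; on Windows, run the same commands without the "$" prompt) is to print each tool's version:

$ java -version
$ gradle -v
$ scala -version
$ git --version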

Building a Kafka Project #

Once you are ready with the prerequisites, we can proceed with building the Kafka project.

First, we need to download the Kafka source code. The method is simple: find a clean source code directory and execute the following command to download the community’s Trunk code:

$ git clone https://github.com/apache/kafka.git

After a long wait, a new subdirectory named “kafka” will appear under your current path; this is the root directory of the Kafka project. If the command fails in your environment, you can instead download the source code ZIP package from the official website link or the link I provided (here, extraction code: ntvd) and unzip it. Note, however, that you will lose Git management this way, and you will have to manually link the directory to the remote repository. You can refer to this Git document for the specific steps.
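
For reference, manually re-linking an unzipped source tree to the remote repository looks roughly like this minimal sketch (run inside the unzipped “kafka” directory):

$ git init
$ git remote add origin https://github.com/apache/kafka.git
$ git fetch origin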

After downloading, you need to navigate to the directory where the project is located, i.e., the “kafka” directory, and then execute the corresponding commands to build the Kafka project.

If you are setting up the environment on macOS or Linux, you can directly run the following command to build:

$ ./gradlew build

This command first downloads the necessary jar files for Gradle Wrapper, and then builds the Kafka project. It is worth noting that when executing this command, you may encounter the following exception:

Failed to connect to raw.githubusercontent.com port 443: Connection refused

If you encounter this exception, there is no need to panic. You can go to the official website link or the link I provided to download the Jar package required by the Wrapper, and then manually copy it into the “gradle/wrapper” subdirectory under the “kafka” directory. After that, re-run the “gradlew build” command to build the project.
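
Assuming the downloaded Jar sits in your current directory (the source path here is just an example), the manual fix looks like this:

$ cp gradle-wrapper.jar kafka/gradle/wrapper/
$ cd kafka
$ ./gradlew build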

I would like to remind you that the official website link contains the version number “v6.3.0,” but this version may change in the future. Therefore, it is best to first open the “gradlew” file and check which Gradle version the community is currently using. If it is no longer “v6.3.0,” do not use the link I provided; instead, download the Jar package for the corresponding version directly from the official website.

For example, suppose the Gradle version used in the “gradlew” file changes to “v6.4.0.” In that case, you need to modify the version number in the URL of the official website link to “v6.4.0” and download the Jar package for this version.
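
A quick way to inspect the version referenced by the “gradlew” file (assuming a Unix-like shell; the exact matching line may vary across versions) is:

$ grep -i version gradlew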

If you are building on Windows, you cannot use the Gradle Wrapper, because Kafka does not provide a runnable Wrapper Bat file for the Windows platform. In this case, you can only use the Gradle installed in your own environment. The specific command is:

kafka> gradle.bat build

Whether you run “gradle.bat build” or “gradlew build,” the first run will take a considerable amount of time to download the necessary Jar packages, so please be patient.

Next, let me show you a diagram of the various directories and files of the Kafka project:

Here, I will briefly introduce some of the main component paths.

  • bin directory: This directory contains scripts for Kafka tools. Scripts such as “kafka-server-start” and “kafka-console-producer” that we are familiar with are stored here.
  • clients directory: This directory contains Kafka client code. For example, the code for producers and consumers is located in this directory.
  • config directory: This directory contains Kafka configuration files. The most important configuration file is “server.properties.”
  • connect directory: This directory contains the source code for the Connect component. As I mentioned earlier, Kafka Connect is used for real-time data transfer between Kafka and external systems.
  • core directory: This directory contains the broker-side code. All Kafka server-side code is stored in this directory.
  • streams directory: This directory contains the source code for the Streams component. Kafka Streams is the component used for real-time stream processing with Kafka.

The other directories are either less important or related to auxiliary configuration, so I won’t go into detail about them.

In addition to the “gradlew build” command mentioned above, let me introduce some commonly used build commands to help you debug the Kafka project.

First, let’s look at the test-related commands. The Kafka source code is divided into four major parts: the Broker-side code, the Clients-side code, the Connect-side code, and the Streams-side code. If you want to test the code for these four parts separately, you can run the following four commands respectively:

$ ./gradlew core:test
$ ./gradlew clients:test
$ ./gradlew connect:[submodule]:test
$ ./gradlew streams:test

You may have noticed that the Connect command looks slightly different from the other three. This is because the Connect project is further divided into multiple submodules, such as “api” and “runtime,” so you need to explicitly specify the name of the submodule you want to test.
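
For example, to test the “runtime” submodule mentioned above, you would run:

$ ./gradlew connect:runtime:test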

If you want to test a specific test case separately, such as testing the “LogTest” class in the “core” package of the Broker-side, you can use the following command:

$ ./gradlew core:test --tests kafka.log.LogTest
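
Gradle’s test filtering also accepts wildcards, so you can narrow the run down to individual test methods. The pattern below is only illustrative; “testAppend*” is a hypothetical method prefix, not necessarily one that exists in LogTest:

$ ./gradlew core:test --tests "kafka.log.LogTest.testAppend*"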

Additionally, if you want to build the entire Kafka project and package it into a runnable binary environment, you need to run the following command:

$ ./gradlew clean releaseTarGz

After a successful run, a binary release package is generated under each of the “core,” “clients,” and “streams” directories:

  • kafka_2.13-2.5.0-SNAPSHOT.tgz: This is the Broker-side release package of Kafka. Extracting this file gives you a standard Kafka runtime environment (see the sketch after this list). The file is located in the “build/distributions” directory under the “core” path.
  • kafka-clients-2.5.0-SNAPSHOT.jar: This Jar package is the binary release package of the Clients-side code after compilation and packaging. This file is located in the “/build/libs” directory under the “clients” directory.
  • kafka-streams-2.5.0-SNAPSHOT.jar: This Jar package is the binary release package of the Streams-side code after compilation and packaging. This file is located in the “/build/libs” directory under the “streams” directory.
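
For example, to unpack the Broker-side package into a runnable environment (the exact file name depends on the Scala version and the trunk version at the time you build):

$ cd core/build/distributions
$ tar -xzf kafka_2.13-2.5.0-SNAPSHOT.tgz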

Setting Up the Source Code Reading Environment #

In the previous section, I explained how to build the Kafka project using Gradle. Now, let me show you how to set up a Kafka source code reading environment in IDEA. The whole process is very simple: open IDEA, click “File,” then “Open,” and select the path of the Kafka directory from the previous step.

After the project is imported, IDEA will automatically build it. When the build is complete, you can find the Kafka.scala file under the core/src/main/scala/kafka directory. Open it, right-click on the Kafka object, and run it; you should see the following output:

This is the result of running the Kafka main file without any arguments. From this output, we learn that starting a Kafka Broker requires one parameter: the path to a server.properties file, i.e., the Broker-side configuration file. This is also the standard way to start a Kafka Broker.
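
In other words, once you have a binary environment, such as the one produced by “releaseTarGz” above, starting a Broker boils down to one command run from the extracted directory:

$ bin/kafka-server-start.sh config/server.properties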

In the introduction, I also mentioned that this course will focus on explaining the Kafka Broker source code. Therefore, before we dive into the source code, let me briefly introduce the organization structure of the Kafka Broker source code. The following diagram shows the code architecture of the Kafka core package:

Let me explain a few key packages.

  • The controller package contains the code for the Kafka controller component. The controller is a core component of Kafka, and we will analyze the code in this package in detail later.
  • The coordinator package contains the code for the GroupCoordinator component on the consumer side and the TransactionCoordinator component for transactions. Analyzing the coordinator package, especially the code for the GroupCoordinator on the consumer side, is crucial for understanding the design principles of the coordinator component on the Broker.
  • The log package contains the code for the core log structures in Kafka, including logs, log segments, index files, etc. We will discuss this in detail later. Additionally, this package encapsulates the mechanism for log compaction, making it an important code package.
  • The network package encapsulates the code for the network layer of the Kafka server, especially the SocketServer.scala file, which implements the Reactor pattern in Kafka. It is definitely worth reading.
  • The server package, as the name suggests, contains the main code for the Kafka server. It contains many important Kafka components, such as the state machine and the Purgatory delay mechanism, which we will discuss later.

In the following courses, I will select the most important classes in Kafka for detailed analysis, helping you gain a deeper understanding of the implementation principles of these important components in the Kafka Broker.

Furthermore, although this course will not cover the analysis of test cases, I believe that understanding test cases is one of the most effective shortcuts to quickly grasp Kafka components. If time allows, I recommend that you read the test cases of each component in Kafka, which are usually located in the src/test directory of the code package. For example, in the Kafka log source code, the corresponding LogTest.scala test file is very comprehensive, with dozens of test cases covering all aspects of the Log. You should definitely take a look at it.

Warm-up with Scala Language #

Before diving into the source code of the Broker, I would like to take a moment to introduce the syntax features of the Scala language. I will use a few real Kafka source code snippets to help you warm up.

Let’s start with the first one:

def sizeInBytes(segments: Iterable[LogSegment]): Long =
    segments.map(_.size.toLong).sum

This is a typical Scala method named sizeInBytes. It takes a collection of LogSegment objects as input and returns a long integer. LogSegment is the log segment we will discuss later: each .log file you see in a Kafka partition directory is essentially a LogSegment. As the name suggests, this method calculates the total size in bytes of the given collection of LogSegments: it iterates over each input LogSegment, calls its size method, and returns the accumulated sum.
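
If you would like to experiment with this style of collection traversal, here is a minimal, self-contained sketch you can paste into the Scala REPL. The one-field LogSegment below is a stand-in I made up for illustration; the real class is far richer:

// A made-up stand-in for Kafka's LogSegment, keeping only a size field
case class LogSegment(size: Int)

// Same shape as the Kafka method: map each segment to its size, then sum
def sizeInBytes(segments: Iterable[LogSegment]): Long =
  segments.map(_.size.toLong).sum

val segments = Seq(LogSegment(1024), LogSegment(2048), LogSegment(512))
println(sizeInBytes(segments)) // prints 3584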

Now let’s take a look at another one:

val firstOffset: Option[Long] = ......

def numMessages: Long = {
  firstOffset match {
    case Some(firstOffsetVal) if (firstOffsetVal >= 0 && lastOffset >= 0) => (lastOffset - firstOffsetVal + 1)
    case _ => 0
  }
}

This method is a member of the LogAppendInfo object, and it calculates the number of messages in a single batch written on the Broker side. Pay attention to the match and case keywords here: you can roughly think of them as Java’s switch statement, but they are much more powerful. The counting logic is as follows: if firstOffset holds a value and both firstOffsetVal and lastOffset are greater than or equal to 0, the number of written messages equals their difference plus 1; otherwise, for example when firstOffset is empty, the count cannot be determined, so the method simply returns 0.
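
Here too is a self-contained sketch you can run in the Scala REPL to watch the match expression in action. The class below is a made-up stand-in for Kafka’s LogAppendInfo, keeping only the two fields this method touches:

// A made-up stand-in for Kafka's LogAppendInfo
class LogAppendInfo(val firstOffset: Option[Long], val lastOffset: Long) {
  // Some value with non-negative offsets: count = last - first + 1; anything else: 0
  def numMessages: Long = firstOffset match {
    case Some(firstOffsetVal) if firstOffsetVal >= 0 && lastOffset >= 0 => lastOffset - firstOffsetVal + 1
    case _ => 0
  }
}

println(new LogAppendInfo(Some(100L), 104L).numMessages) // prints 5
println(new LogAppendInfo(None, 104L).numMessages)       // prints 0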

If you find the above code snippets difficult to understand, I suggest you quickly brush up on the Scala language. What should you focus on? I recommend concentrating on the syntax for traversing collections in Scala and the usage of match-based pattern matching.

In addition, since Scala leans toward a functional programming style, you should at least understand the meaning of lambda expressions in Java; this will greatly help you overcome reading obstacles.

Conversely, if the above code snippets are easy for you, then you should have no problem understanding 80% of the Broker source code. You might be concerned about the remaining 20%, which is harder to read. Actually, that’s okay: you can come back to it once you are more proficient in Scala, and it will not affect your mastery of the core source code. Moreover, when we encounter more challenging Scala syntax features later, I will explain them in more detail, so don’t worry about the language at all!

Summary #

Today, we had a “warm-up lesson” for analyzing the Kafka source code. I walked you through concrete methods for building the Kafka project and setting up a source code reading environment. I suggest you follow the steps above and go through the process yourself, to personally experience building the Kafka project and importing the source code into the IDE. After all, these are prerequisites for reading the actual Kafka code.

Lastly, I want to emphasize that reading the source code of any large project is not an easy task, and I hope you never give up. Often, when you encounter code you don’t understand, reading it a few more times may bring a sudden flash of insight later on.

After-class Discussion #

If you are familiar with Kafka, you must have heard of the kafka-console-producer.sh script. As I mentioned earlier, this script is located in the project’s bin directory. Can you find out which Java class it corresponds to? The search process itself will give you good ideas for locating the Kafka class files you need. Give it a try.

Feel free to express yourself and discuss with me in the comment section. You are also welcome to share this article with your friends.