00 Opening Words: Reading Source Code Is Gradually Becoming a Necessary Step on the Professional Ladder

Preface: Reading source code is gradually becoming a “must-have” for career advancement

Hello, I’m Hu Xi, an Apache Kafka Committer, Head of User Growth Team at Tiger Brokers, and also the author of the book “Apache Kafka in Action”.

In 2019, I started my first column on Geek Time, “Core Technologies and Applications of Kafka”, aimed at helping Kafka users master Kafka’s core design principles and practical application techniques. A year later, I am back with a column on the source code. In this column, I will take you deep into the core Kafka source code and analyze in detail the architectural ideas and programming concepts behind it. I will also offer source-code-level solutions to some challenging problems.

Why read the source code?

When it comes to source code analysis, especially the source code of messaging engines like Apache Kafka, you may say, “I am already using it and I am quite familiar with it, so why should I spend time reading the source code?”

Of course, some non-Kafka users may also say, “I don’t use Kafka, so what’s the use of reading its source code?”

Before reading the source code, that’s exactly what I thought as well. However, a particular incident in a production environment completely changed my mindset.

Here is the story: the Kafka broker has a parameter called log.retention.bytes, which the official documentation describes as the maximum size of the log that is retained. Relying on this parameter, we confidently assured our leadership that Kafka would not eat too far into the company’s already tight disk resources. In the end, however, the disk space actually used far exceeded this limit.

We searched through all kinds of material but could not find the root cause. At that point, I thought I might as well try reading the source code. It turned out that the source code made things clear: whether this parameter takes effect is closely tied to the log segment size. Once we knew that, the problem was easy to solve.

At that moment, I realized that many challenging problems can only be solved by delving into the source code.
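
To make the log.retention.bytes story concrete, here is a minimal sketch in Java. It is not Kafka’s actual code: it assumes that size-based retention removes whole segments, oldest first, and only while the log would still be at or above the limit after the removal. The class RetentionSketch and its methods are hypothetical names used purely for illustration, and the sizes are made-up numbers in megabytes.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of size-based log retention. NOT Kafka's actual implementation.
final class RetentionSketch {

    // segmentSizes: sizes of the log's segments, oldest first (illustrative unit: MB).
    // Returns the sizes of the segments that remain after one retention pass.
    static List<Long> retain(List<Long> segmentSizes, long retentionLimit) {
        // How far the log currently exceeds the limit (negative if it is under).
        long excess = sum(segmentSizes) - retentionLimit;

        List<Long> kept = new ArrayList<>();
        boolean stillDeleting = true;
        for (long size : segmentSizes) {
            // Assumption: a segment is removed only as a whole, and only if the
            // log would still be at or above the limit after removing it.
            if (stillDeleting && excess - size >= 0) {
                excess -= size;
            } else {
                stillDeleting = false; // stop at the first segment that must stay
                kept.add(size);
            }
        }
        return kept;
    }

    // Total size of a list of segments.
    static long sum(List<Long> sizes) {
        long total = 0;
        for (long size : sizes) {
            total += size;
        }
        return total;
    }

    public static void main(String[] args) {
        long limit = 1500; // hypothetical retention limit
        // Two full 1024 MB segments plus a partly filled newest segment.
        System.out.println(sum(retain(List.of(1024L, 1024L, 500L), limit)));
        // One full segment plus a partly filled newest segment.
        System.out.println(sum(retain(List.of(1024L, 900L), limit)));
    }
}
```

With 1024 MB segments and a 1500 MB limit, the modeled log ends up at 1524 MB in the first case and 1924 MB in the second, both above the configured value. Under these assumptions the overshoot can approach the size of one full segment, which is one way the segment size can decide whether the limit appears to “work”.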

In addition, I noticed that the job requirements for senior technical positions at many internet companies prominently list “has read the source code of at least one open-source framework.” In other words, reading source code is gradually shifting from a “bonus” to a “required skill,” and mastering the implementation of well-built frameworks has gone from “nice-to-have” to “must-do.”

So, why has reading the source code become a “must-do” skill? What is its significance? I will share with you a few benefits of reading the source code based on my own experience.

1. It helps you gain a deeper understanding of the internal design principles, improve your system architecture abilities, and enhance your coding skills.

Kafka is an excellent messaging engine, and many aspects of its architecture are widely praised. Mastering these design principles can greatly improve our system architecture abilities and coding skills.

Even if you don’t use Kafka, you can still learn from its excellent design concepts and improve your system architecture abilities in other frameworks.

You may ask, can’t I just rely on the official documentation, which explains these principles?

In fact, I have always believed that the community’s official documentation still has a lot of room for improvement. Kafka has many great design concepts and features that the documentation does not fully explain.

Let me give you a simple example. Kafka has a very important concept called the “current log segment”. Many Kafka components (such as LogCleaner) treat the current log segment differently from non-current log segments. However, Kafka’s official website hardly mentions it at all.

So you see, if you rely solely on the official documentation, you cannot gain a deep understanding of Kafka.

2. It helps you quickly identify issues, formulate optimization plans, and reduce the time cost of problem-solving.

Many people believe that reading the source code takes a lot of time and is not worthwhile. This is a big misconception.

In fact, the knowledge you gain from the source code can guide your future practice, help you quickly identify the cause of problems, and find corresponding solutions. Most importantly, if you have a good understanding of the source code, you will be aware of potential issues in the production environment and be able to avoid them in advance. When solving problems, reading the source code can be an efficient shortcut that maximizes results with minimal effort.

As for the time cost, you can think of the time spent reading the source code as being spread across all the problems you will solve later. Seen this way, reading the source code is a cost-effective investment.

3. You can also participate in the Kafka open-source community and become a code contributor.

In the community, you can collaborate with Kafka source code contributors from around the world, share and learn from each other. It’s an exciting thing to think about. Especially when your code is adopted by the community, Kafka users worldwide will use the code you have written. It’s truly thrilling, isn’t it?

In summary, reading the source code brings many benefits. It sharpens your coding skills, improves your architectural abilities, and helps you solve practical problems efficiently. There is much to gain and nothing to lose.

How to Master the Core Source Code in the Shortest Time?

The Kafka codebase has over 500,000 lines. If we were to start reading it directly, we would surely be at a loss.

After all, reading that much code from start to finish is clearly not efficient. To avoid giving up soon after getting started, we need to read the core source code in the most efficient way.

Generally speaking, there are two methods for reading the source code of a large project.

Top-Down Approach: Start from the top or outermost layer of the code and go deeper step by step. In simple terms, start from the main function and gradually go deeper into each layer until you reach the lowest level of code. The benefit of this method is that you traverse the complete top-level functionality path, which is very helpful for understanding the overall flow of each feature.
Bottom-Up Approach: The opposite of the top-down approach, it involves independently reading and understanding the code and implementation mechanisms of each component, and then continuously extending upwards and finally assembling them. This method does not trace back along the functionality dimension, but instead helps you master the underlying base component code.

Both of these methods have their own merits. However, while studying the Kafka source code, I found that combining the two is the most efficient. This means first understanding the purpose of the smallest unit components, and then connecting them to grasp what the combined components do.

How to accomplish this? First, you need to determine the smallest unit components. I primarily look at the package structure in the Kafka source code, such as controller, log, server, etc. These packages are basically divided according to the components. The priority order I set for these components is “log->network->controller->server->coordinator->…”, because the later components frequently call the earlier components.

Once you have a clear understanding of the source code structure of a single component, you can try switching to the top-down approach, starting from a major functional point and gradually diving into the source code of each underlying component. Thanks to the previous accumulation of knowledge, you will be very familiar with the basic code encountered during this descent, which will give you a great sense of accomplishment. Compared to using either the top-down or bottom-up approach alone, this mixed method combines the advantages of both.

As for how to choose major functional points, I suggest starting with one of Kafka’s command-line tools. Work out how each step of the tool is implemented, revisit the principles of the individual components as you descend, and piece those components together.

With each repetition of this process, you will gain a clearer understanding of the interaction logic between various components and become a master of the source code!

Now that we know the approach, we can start studying the Kafka source code. Before diving into the details, let’s take a look at the overall picture of the Kafka source code and find the core source code.

In terms of functionality, the Kafka source code can be divided into four major modules.

Server-side Source Code: Implements the Kafka architecture and various excellent features.
Java Client Source Code: Defines the interaction mechanism with the broker side and provides common support code for the broker-side components.
Connect Source Code: Used to achieve high-performance data transfer between Kafka and external systems.
Streams Source Code: Used to implement real-time stream processing functionality.

As can be seen, the server-side source code is the foundation for understanding the underlying architecture of Kafka, especially the system’s operational principles. The other three modules all have strong dependencies on it. Therefore, the most valuable code in Kafka is undoubtedly the server-side code! Learning this part of the code will give you the highest return on investment.

How is the column designed?

Well, let’s get started. In this column, based on my understanding of the structure of the server-side source code, I have carefully selected the following parts of the code for you.

These parts are all valuable components, and they are also the “high-risk areas” behind many real production issues. For example, the code logic of the Kafka log segment is the “culprit” behind many production exceptions. Mastering this code can greatly reduce the time it takes you to locate problems.

I have divided the server-side source code into 7 modules by functionality. Each module is further divided into multiple subsections that provide detailed source code analysis at the component level. The mind map below shows the key points of each module.

Mind map

Enriched flowchart + detailed explanation

When reading source code, we often make one of two mistakes. One is to dive straight into the lowest level and read line by line, getting stuck in the details. The other is to stay at too high a level, which is hardly different from not learning at all.

To help you learn efficiently, I have given up on trying to cover everything exhaustively. Instead, I combine “flowcharts + code comments” to explain the key content in detail, and I highlight the important points based on my practical experience.

Before reading the source code, you can use the images to get a rough understanding of the implementation logic of each method. For key content, I will provide detailed explanations in the form of comments. At the same time, I have also created mind maps to help you summarize and review.

Real case explanations, solving your practical problems

Many people read source code, but don’t understand what scenarios the source code can be applied to and what problems it can solve. In fact, many of the issues I have encountered in production environments cannot be resolved solely by relying on official documentation or search engines. Only by reading the source code and truly understanding the implementation principles can you find solutions.

In order to help you apply what you have learned, I will share a large number of real cases in this column. It will not only help you avoid pitfalls in advance, but also help you accumulate solutions to common problems, some of which may even be “secret techniques” not documented elsewhere.

Communicating the latest development trends in the community

This is the most interesting part of the column. The Kafka source code we are studying evolves every day. To master Kafka, you need to know the community’s plans for future updates and major feature improvements.

I will share the latest developments on specific topics. I hope to show you a vivid, lively community rather than just a series of cold lines of code, so that you truly feel involved in it. Don’t underestimate this feeling. Sometimes it can be the most powerful motivation carrying you through the journey of learning source code.

Extra-curricular extension

In addition to the above, I will also share some additional topics with you. For example, specific methods to become a code contributor to the Apache Kafka community, practical learning materials, explanations of classic interview questions, etc. I hope you don’t miss out on this part of the content.

Extensions

Finally, I would like to say a few words about the Scala language. After all, the Broker-side source code we are about to study is based entirely on Scala.

However, this part of the source code does not make extensive use of the more advanced features of Scala. If you have a foundation in the Java language, you don’t need to worry about the language issue, as they have many very similar features.

Even if you are not familiar with the Scala language, it doesn’t matter. You don’t need to fully and systematically learn this language. Just having a simple understanding of the basic functional programming style and a few key features, such as collection traversal and pattern matching, is enough.

Of course, in order not to affect your understanding of the source code covered in this column, I will take you in-depth into the Scala language in the “Introduction” lesson. Also, when encountering more challenging language features in Scala in the column, I will provide specific explanations to you. So, you don’t have to worry about the language issue at all.

Okay, now let’s officially embark on the journey of learning Apache Kafka source code analysis. As the saying goes, “No matter how small the steps, efforts won’t be in vain, and success will be achieved in the end.” Reading source code is a “difficult job,” and I hope you don’t give up easily. After all, mastering the source code puts you ahead of many others.

Finally, I am honored to meet you here and learn and communicate with you. You are also welcome to leave me a message to share your views and questions about Kafka source code analysis.