06 Kafka Online Cluster Deployment Solutions #

In the previous columns, I walked through Apache Kafka from several angles: its positioning, its version history, and how its features have evolved. I hope those pieces gave you a clear sense of what Kafka is used for and how to choose a Kafka version for a real production environment, so that you can get started with Kafka faster.

Now let’s look at how to build a Kafka cluster for a production environment. A cluster necessarily means multiple Kafka nodes; a single-machine pseudo-cluster is only good for everyday testing and cannot meet real online production needs. A real online environment requires weighing many factors and forming a plan around your own business requirements. Below, I will discuss four of them in turn: the operating system, disks, disk capacity, and bandwidth.

Operating System #

First, which operating system should Kafka be installed on? You may wonder: Kafka is a big data framework in the JVM ecosystem, and Java is a cross-platform language, so does it really matter which operating system Kafka runs on? It does, and the differences are significant!

As you know, Kafka is written in Scala and Java and compiles to ordinary .class files, so in theory deploying it should be the same on any operating system. In practice, though, the differences between operating systems have a significant impact on Kafka clusters. The three common choices today are Linux, Windows, and macOS. Production deployments on Linux are by far the most common, though Kafka clusters on Windows servers do exist. macOS does have macOS Server, but I doubt anyone (especially users in China) really runs a production environment on Mac servers.

When considering the compatibility between the operating system and Kafka, Linux is undoubtedly more suitable for deploying Kafka than the other two, especially Windows. Although this conclusion may not be surprising to you, you still need to understand the specific reasons. Linux outperforms in the following three aspects.

  • Usage of I/O models
  • Efficiency of data network transmission
  • Community support

Let me explain. First, let’s look at I/O models. What is an I/O model? You can think of an I/O model as the method by which the operating system executes I/O instructions.

There are usually five mainstream I/O models: blocking I/O, non-blocking I/O, I/O multiplexing, signal-driven I/O, and asynchronous I/O. Each has its typical use cases. For example, the blocking and non-blocking modes of Java’s Socket correspond to the first two models; the select function on Linux belongs to the I/O multiplexing model; and the famous epoll system call falls somewhere between the third and fourth models. As for the fifth model, it is rarely supported on Linux, but Windows provides I/O completion ports (IOCP), which belong to this category.

You don’t need to know the implementation details of each model. In general, the later models in this list are more advanced than the earlier ones; for example, epoll outperforms select. Knowing this much is enough to follow the rest of the discussion.

So what do I/O models have to do with Kafka? The Kafka client uses Java NIO’s Selector under the hood. On Linux, Selector is implemented on top of epoll, while on Windows it falls back to select. Deploying Kafka on Linux therefore gives you more efficient I/O performance out of the box.
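You can see the same platform-dependent selection in Python’s standard library, whose `selectors` module mirrors the behavior described above for Java NIO: it picks epoll on Linux and falls back to plain select on Windows. A minimal sketch (illustrative, not Kafka’s actual code):

```python
import selectors
import socket

# selectors.DefaultSelector picks the most efficient mechanism the
# platform offers: an epoll-based selector on Linux, kqueue on
# macOS/BSD, and a plain select()-based one on Windows.
sel = selectors.DefaultSelector()
backend = type(sel).__name__
print(backend)  # e.g. "EpollSelector" on Linux

# Register one end of a socket pair and wait until it is readable.
r, w = socket.socketpair()
sel.register(r, selectors.EVENT_READ)
w.send(b"ping")
ready = sel.select(timeout=1.0)
msg = ready[0][0].fileobj.recv(4) if ready else None
print(msg)  # b'ping'

sel.close()
r.close()
w.close()
```

The application code is identical on every platform; only the kernel mechanism underneath changes, which is exactly why the same Kafka binary performs differently on Linux and Windows.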

The second difference is the efficiency of network transmission. As you know, the messages Kafka produces and consumes travel over the network, and where are they stored? On disk, of course. Kafka therefore moves large amounts of data between disk and network. If you are familiar with Linux, you have probably heard of zero-copy: during transfers between disk and network, it avoids expensive data copies between kernel space and user space, enabling fast data transmission. Linux has long implemented such zero-copy mechanisms, but on Windows, unfortunately, you had to wait until Java 8 update 60 to “enjoy” this benefit. In summary, deploying Kafka on Linux lets you take full advantage of the fast data transfer that zero-copy technology provides.
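On POSIX systems, the kernel facility behind zero-copy is exposed to applications as sendfile(2), which Python wraps as `os.sendfile`. A minimal illustration (the payload size and socket pair are just for the demo; there is no `os.sendfile` on Windows):

```python
import os
import socket
import tempfile

# os.sendfile wraps the kernel's sendfile(2): bytes move from the
# file's page cache directly into the socket buffer, skipping the
# usual round trip through a user-space buffer.
payload = b"x" * 65536

with tempfile.NamedTemporaryFile() as f:
    f.write(payload)
    f.flush()
    server, client = socket.socketpair()
    # Transfer the whole file without ever reading it into this process.
    sent = os.sendfile(server.fileno(), f.fileno(), 0, len(payload))
    server.close()
    received = client.recv(len(payload), socket.MSG_WAITALL)
    client.close()

print(sent, len(received))  # 65536 65536
```

This is the same pattern Kafka relies on when serving log segments to consumers: the broker hands the kernel a file descriptor and a socket, and the data never passes through the JVM heap.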

The last factor is the community support. Although this is not an obvious difference, if you are not familiar with it, it may have a greater impact on you than the previous two factors. In simple terms, the community currently does not make any promises regarding Kafka bugs discovered on the Windows platform. Although they verbally promise to do their best to solve the issues, based on my experience, Windows bugs are generally not fixed. Therefore, deploying Kafka on Windows is only suitable for personal testing or functional verification. Never use it in a production environment.

Disks #

If you were to ask which resource matters most to Kafka performance, disks would without doubt rank near the top. When planning disk configuration for a Kafka cluster, a common question is whether to choose ordinary mechanical hard drives or solid-state drives (SSDs). The former are cheaper and larger but more prone to failure; the latter perform much better but cost more. My advice is to use ordinary mechanical hard drives.

It is true that Kafka uses disks heavily, but most of its operations are sequential reads and writes, which largely sidesteps the biggest weakness of mechanical hard drives: slow random access. From that perspective, SSDs do not offer a decisive performance advantage, while mechanical drives win easily on cost per byte. Their main shortcoming, a higher failure rate, is handled by Kafka itself at the software level. Ordinary mechanical hard drives are therefore the cost-effective choice.

Another frequently discussed topic regarding disk selection is whether to use a redundant array of independent disks (RAID) or not. There are two main advantages of using RAID:

  • Providing redundant disk storage space
  • Balancing the workload

Both of these advantages are attractive for any distributed system. In Kafka’s case, however, Kafka implements its own redundancy mechanism (replication) to provide high reliability, and through partitions it achieves load balancing at the software level, so the advantages of RAID are much less significant. That is not to say RAID is bad; many large companies do use RAID for Kafka’s underlying storage. But as Kafka’s own storage solutions have become more convenient and reliable, RAID matters less and less in production environments. Taking all of this into account, here is my recommendation:

  • Companies that prioritize cost-effectiveness can skip deploying RAID and use regular disks to create storage space.
  • Mechanical hard drives are fully capable of handling Kafka in a production environment.
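The two software-level mechanisms mentioned above correspond to ordinary broker settings. A sketch of the relevant lines in `server.properties` (the paths and values here are illustrative assumptions, not required values):

```properties
# Spread partitions across several plain disks (JBOD) instead of RAID;
# Kafka distributes new partitions across these directories.
log.dirs=/data/kafka-logs-1,/data/kafka-logs-2,/data/kafka-logs-3

# Redundancy comes from replication rather than RAID mirroring.
default.replication.factor=2
```

With a layout like this, losing one disk costs you only the replicas stored on it, and Kafka re-replicates that data from other brokers rather than relying on a RAID rebuild.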

Disk Capacity #

How much storage space does a Kafka cluster actually need? This is a classic planning question. Kafka stores messages on disk and deletes them automatically after a retention period. That period is configurable, so the real question is: given your own business scenarios and retention requirements, how should you size the cluster’s storage?

Let me give you a simple example to illustrate how to think about this problem. Let’s say your company needs to send 100 million messages to the Kafka cluster every day for a specific business. Each message is saved in duplicate to prevent data loss, and the default retention time for messages is two weeks. Now, let’s assume that the average size of the messages is 1KB. Can you tell me how much disk space your Kafka cluster needs to reserve for this business?

Let’s do the calculation. 100 million messages of 1KB each, kept in two copies, take up 100 million * 1KB * 2 / 1000 / 1000 = 200GB per day. Besides message data, a Kafka cluster stores other data such as indexes, so reserve an extra 10%, bringing the daily figure to 220GB. Retaining the data for two weeks means 220GB * 14, which is approximately 3TB overall. Kafka supports compression; assuming a compression ratio of 0.75, the final storage space you need to plan is 0.75 * 3 = 2.25TB.

In conclusion, when planning disk capacity, you need to consider the following elements:

  • Number of new messages
  • Message retention time
  • Average message size
  • Number of replicas
  • Compression enabled or not
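The checklist and the worked example above can be folded into a small helper. This is a sketch under the article’s own assumptions (decimal 1000-based unit conversion, 10% index overhead); the function name and defaults are mine:

```python
def kafka_disk_gb(msgs_per_day, msg_kb, replicas, retention_days,
                  index_overhead=0.10, compression_ratio=1.0):
    """Estimate disk capacity (GB) for one Kafka workload.

    compression_ratio is compressed/uncompressed size
    (1.0 = no compression; the article's example uses 0.75).
    """
    # Raw message data written per day, in GB (decimal units).
    daily_gb = msgs_per_day * msg_kb * replicas / 1000 / 1000
    # Add index overhead, multiply by retention, then apply compression.
    total_gb = daily_gb * (1 + index_overhead) * retention_days
    return total_gb * compression_ratio

# The article's scenario: 100M 1KB messages/day, 2 copies, 14 days.
gb = kafka_disk_gb(100_000_000, 1, 2, 14, compression_ratio=0.75)
print(f"{gb:.0f} GB (~{gb / 1000:.2f} TB)")  # 2310 GB (~2.31 TB)
```

Note that without the article’s intermediate rounding of 3,080GB down to 3TB, the exact figure comes out slightly higher, about 2.31TB instead of 2.25TB; for capacity planning either number is close enough.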

Bandwidth #

For frameworks like Kafka, which heavily rely on network data transmission, bandwidth can easily become a bottleneck. In fact, in the real cases I have encountered, insufficient bandwidth resources account for at least 60% of performance issues with Kafka. If your environment involves cross-data center transmission, the situation could be even worse.

Unless you’re extremely wealthy, I would assume that you and I both use regular Ethernet networks, which primarily offer two types of bandwidth: 1Gbps (Gigabit per second) and 10Gbps (10 Gigabits per second). Specifically, 1Gbps Ethernet is commonly found in most company networks. Let me give you an example using a 1Gbps network to illustrate how to plan your bandwidth resources.

Rather than planning bandwidth resources, what you really need to plan is the number of Kafka servers required. Let’s assume your company’s data center environment has a 1Gbps network. Now, you have a business scenario with a target or Service Level Agreement (SLA) to process 1TB of business data within 1 hour. So, how many Kafka servers do you need to accomplish this task?

Let’s do the math. A 1Gbps link moves 1Gb of data per second. Assume each Kafka server runs on a dedicated machine, with no other services co-located on it; after all, mixing other services onto Kafka machines is not recommended in real environments. Even so, you can typically only assume that Kafka will use up to 70% of the bandwidth, because some must be left for other applications and system processes.

Based on practical experience, going beyond the 70% threshold may result in network packet loss, so 70% is a reasonable ceiling. That means each Kafka server can use at most around 700Mbps of bandwidth.

Wait a moment: that is only the maximum bandwidth it can use. You cannot let a Kafka server routinely run at that level. Typically, you should reserve an additional two-thirds of it as headroom, which leaves each server roughly 240Mbps for routine use (700Mbps / 3 ≈ 233Mbps, rounded up). Note that this two-thirds reserve is actually quite conservative; you can shrink it based on your own machines’ utilization.

Now, with 240Mbps per server, we can calculate the number of servers required to process 1TB of data within 1 hour. The target works out to roughly 2,330Mb per second (1TB ≈ 8,388,608Mb spread over 3,600 seconds); divided by 240, that is approximately 10 servers. If the data must also carry two extra replicas, multiply the total by 3, which gives 30 servers in all.
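The whole estimation method can be captured in a few lines. The function name and parameter defaults below are mine; the 240Mbps per-server budget and the factor-of-3 replication multiplier come straight from the example above:

```python
import math

def kafka_servers(data_tb, hours, per_server_mbps=240, copies=1):
    """Servers needed to move data_tb terabytes in `hours` hours.

    per_server_mbps follows the article's derivation: a 1Gbps NIC at
    70% peak share, of which only ~1/3 is used routinely (~240Mbps).
    copies is the total number of data copies (3 in the example).
    """
    # Aggregate throughput required, in megabits per second
    # (binary TB -> Mb, matching the ~2,330Mb/s figure above).
    need_mbps = data_tb * 1024 * 1024 * 8 / (hours * 3600)
    return math.ceil(need_mbps / per_server_mbps) * copies

print(kafka_servers(1, 1))            # 10
print(kafka_servers(1, 1, copies=3))  # 30
```

Because all the inputs are parameters, you can re-run the estimate whenever your SLA, replication factor, or network hardware changes.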

There you go, it’s quite simple. Using this method to evaluate the number of servers in your production environment is reasonable, and this method can be dynamically adjusted as your business requirements change.

Summary #

As the saying goes, “Before the troops move, the food and forage go ahead.” Instead of blindly setting up a Kafka environment and struggling to adjust it later, it is better to think carefully from the beginning about the cluster environment your actual business needs. When weighing a deployment scheme, consider the overall picture rather than a single dimension. I believe that after today’s discussion, you have a clear idea of how to plan a Kafka production environment. Let me summarize today’s key points:

  • Operating system: deploy production Kafka on Linux, for its epoll-based I/O model, zero-copy data transfer, and better community support.
  • Disks: ordinary mechanical hard drives without RAID are usually sufficient, because Kafka’s replication and partitioning provide redundancy and load balancing in software.
  • Disk capacity: plan around the volume of new messages, retention time, average message size, number of replicas, and whether compression is enabled.
  • Bandwidth: derive the server count from your throughput target, assuming Kafka can use about 70% of the link at peak and roughly one-third of that routinely.

Open Discussion #

Do you have any questions about the assessment method I presented today? Can you think of any improvements?

Feel free to write down your thoughts or questions, and let’s discuss together. If you feel like you’ve gained something from this, please feel free to share the article with your friends.