
35 Cross-Cluster Backup Solutions: MirrorMaker #

Hello, I’m Hu Xi. Today, I want to share with you the topic of MirrorMaker, a cross-cluster data mirroring tool for Kafka.

In general, a single Kafka cluster is enough to serve our business. In some scenarios, however, we need multiple Kafka clusters working at the same time. For example, to facilitate disaster recovery, you can deploy separate Kafka clusters in two data centers; if one data center fails, you can easily redirect traffic to the other. Another scenario is when you want to provide low-latency message services to geographically nearby customers while your main data center is far away from them. In that case, you can deploy a Kafka cluster closer to those customers to serve them with low latency.

To meet these requirements, in addition to deploying multiple Kafka clusters, you need a tool or framework to help you copy or mirror data between clusters.

It is worth noting that we usually refer to copying data between different nodes within a single cluster as backup, and copying data between clusters as mirroring.

Today, I will focus on introducing the MirrorMaker tool provided by the Apache Kafka community, which can help us copy messages or data from one cluster to another.

What is MirrorMaker? #

Essentially, MirrorMaker is a program that acts as a consumer and a producer. The consumer is responsible for consuming data from the source cluster, while the producer is responsible for sending messages to the target cluster. The entire mirroring process is depicted in the following diagram:

MirrorMaker connects the source cluster with the target cluster and keeps their messages synchronized in real time. Note that you can run multiple MirrorMaker instances to connect different upstream and downstream clusters.

Let’s take a look at the following diagram. It illustrates a scenario where three clusters are deployed: the source cluster on the left handles the primary business processing, the target cluster in the top right corner is used for data analysis, and the target cluster in the bottom right corner serves as a hot backup for the source cluster.

Running MirrorMaker #

Kafka provides a command-line tool, kafka-mirror-maker.sh, out of the box. Its common usage is to specify the consumer configuration file, the producer configuration file, the number of streams, and a regular expression matching the topics to be mirrored. For example, the following is a typical MirrorMaker invocation.

$ bin/kafka-mirror-maker.sh --consumer.config ./config/consumer.properties --producer.config ./config/producer.properties --num.streams 8 --whitelist ".*"

Now let me explain the meaning of each parameter in this command.

  • The consumer.config parameter specifies the path to MirrorMaker’s consumer configuration file. The key setting is bootstrap.servers, which points at the source cluster that MirrorMaker reads messages from. Because MirrorMaker may create multiple consumer instances internally and relies on the consumer group mechanism, you also need to set the group.id parameter. In addition, I recommend configuring auto.offset.reset=earliest; otherwise, MirrorMaker will only copy messages that arrive at the source cluster after it starts.

  • The producer.config parameter specifies the path to the producer configuration file used internally by MirrorMaker. The Kafka Java producer generally works well with its defaults and needs little tuning. The one parameter you must set explicitly is, again, bootstrap.servers, which indicates the target cluster that the copied messages are sent to.

  • The num.streams parameter. Personally, I find the name of this parameter misleading. The first time I saw it, I even assumed that MirrorMaker was built on the Kafka Streams component. It is not. This parameter simply tells MirrorMaker how many KafkaConsumer instances to create. MirrorMaker runs them in a multi-threaded fashion: it creates and starts multiple background threads, each maintaining a dedicated consumer instance. In practice, you can tune the number of threads based on the capacity of your machine.

  • The whitelist parameter. As shown in the command, this parameter accepts a regular expression. All topics that match this regular expression will be automatically mirrored. In this command, I specified “.*”, which means I want to synchronize all topics on the source cluster.
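To get a feel for how a whitelist pattern selects topics, here is a small sketch in Python. Note that MirrorMaker actually evaluates the pattern as a Java regular expression on the broker-reported topic names; the topic names below are made up for illustration, and Python’s `re` is used only as a stand-in with the same matching idea.

```python
import re

# Hypothetical topic names; MirrorMaker mirrors every topic whose
# name matches the whitelist pattern. ".*" would match all of them.
whitelist = re.compile(r"orders.*")  # mirror only topics starting with "orders"
topics = ["orders", "orders-eu", "payments", "test"]

# fullmatch mimics "the whole topic name must match the pattern"
mirrored = [t for t in topics if whitelist.fullmatch(t)]
print(mirrored)  # ['orders', 'orders-eu']
```

With `--whitelist ".*"`, as in the command above, every topic on the source cluster is selected.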

MirrorMaker Configuration Example #

Now, let me demonstrate the usage of MirrorMaker in a test environment.

The demonstration goes roughly as follows: first, we start two Kafka clusters, both single-broker pseudo-clusters, listening on ports 9092 and 9093 respectively; then we start MirrorMaker to mirror messages from the 9092 cluster to the 9093 cluster in real time; finally, we start additional consumers to verify that the messages were copied successfully.

Step 1: Start two Kafka clusters #

The startup logs are shown below:

[2019-07-23 17:01:40,544] INFO Kafka version: 2.3.0 (org.apache.kafka.common.utils.AppInfoParser)
[2019-07-23 17:01:40,544] INFO Kafka commitId: fc1aaa116b661c8a (org.apache.kafka.common.utils.AppInfoParser)
[2019-07-23 17:01:40,544] INFO Kafka startTimeMs: 1563872500540 (org.apache.kafka.common.utils.AppInfoParser)
[2019-07-23 17:01:40,545] INFO [KafkaServer id=0] started (kafka.server.KafkaServer)

[2019-07-23 16:59:59,462] INFO Kafka version: 2.3.0 (org.apache.kafka.common.utils.AppInfoParser)
[2019-07-23 16:59:59,462] INFO Kafka commitId: fc1aaa116b661c8a (org.apache.kafka.common.utils.AppInfoParser)
[2019-07-23 16:59:59,462] INFO Kafka startTimeMs: 1563872399459 (org.apache.kafka.common.utils.AppInfoParser)
[2019-07-23 16:59:59,463] INFO [KafkaServer id=1] started (kafka.server.KafkaServer)

Step 2: Start the MirrorMaker tool #

Before starting the MirrorMaker tool, we need to prepare the consumer configuration file and the producer configuration file mentioned earlier. The contents of these files are as follows:

consumer.properties:
bootstrap.servers=localhost:9092
group.id=mirrormaker
auto.offset.reset=earliest

producer.properties:
bootstrap.servers=localhost:9093

Now, let’s run the command to start the MirrorMaker tool.

$ bin/kafka-mirror-maker.sh --producer.config ../producer.properties --consumer.config ../consumer.properties --num.streams 4 --whitelist ".*"
WARNING: The default partition assignment strategy of the mirror maker will change from 'range' to 'roundrobin' in an upcoming release (so that better load balancing can be achieved). If you prefer to make this switch in advance of that release add the following to the corresponding config: 'partition.assignment.strategy=org.apache.kafka.clients.consumer.RoundRobinAssignor'

Please read the warning in the command output carefully. It means that in a future release, MirrorMaker’s internal consumers will use the round-robin strategy to assign partitions to consumer instances, whereas the current default is still the range strategy. The range strategy sorts the partitions and splits them into contiguous ranges, assigning each consumer one range.

The round-robin strategy was introduced later than the range strategy, and in general we can assume that a partition assignment strategy the community introduces later achieves a more balanced distribution than its predecessor. The last part of the warning reminds us that if we want to adopt the round-robin strategy ahead of that release, we need to manually add partition.assignment.strategy=org.apache.kafka.clients.consumer.RoundRobinAssignor to the consumer.properties file.
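The difference between the two strategies can be illustrated with a simplified sketch. This is not Kafka’s actual implementation (the real assignors also handle multiple topics, consumer ordering, and more), just the core idea of how each strategy spreads one topic’s partitions across consumers:

```python
def range_assign(partitions, consumers):
    # range: each consumer gets a contiguous block of partitions;
    # the first consumers absorb the remainder, so they may get more
    n, k = len(partitions), len(consumers)
    per, extra = divmod(n, k)
    out, i = {}, 0
    for idx, c in enumerate(consumers):
        take = per + (1 if idx < extra else 0)
        out[c] = partitions[i:i + take]
        i += take
    return out

def roundrobin_assign(partitions, consumers):
    # round-robin: deal partitions out one at a time, like cards
    out = {c: [] for c in consumers}
    for idx, p in enumerate(partitions):
        out[consumers[idx % len(consumers)]].append(p)
    return out

parts = [f"test-{p}" for p in range(5)]
print(range_assign(parts, ["c0", "c1"]))
# {'c0': ['test-0', 'test-1', 'test-2'], 'c1': ['test-3', 'test-4']}
print(roundrobin_assign(parts, ["c0", "c1"]))
# {'c0': ['test-0', 'test-2', 'test-4'], 'c1': ['test-1', 'test-3']}
```

With many topics that each have a few partitions, the range strategy’s “first consumers take the extras” behavior compounds per topic, which is why round-robin tends to balance better.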

Step 3: Verify if the messages are successfully copied #

Okay, after starting MirrorMaker, we can send and consume some messages to/from the source cluster, and then verify if all topics are successfully synchronized to the target cluster.

Suppose we have created a 4-partition topic named “test” on the source cluster and used the kafka-producer-perf-test script to send 5 million test messages to it. We can now use the following two commands to check whether the target cluster has the “test” topic and whether the messages were mirrored successfully.

$ bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9093 --topic test --time -2
test:0:0

$ bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9093 --topic test --time -1
test:0:5000000

The --time values -2 and -1 request the earliest and latest offsets of a partition, respectively. The difference between the two is the number of messages currently stored in the partition. In this example, the difference is 5 million, which means the “test” topic currently holds 5 million messages; in other words, MirrorMaker has successfully synchronized all 5 million messages to the target cluster.
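The arithmetic can be sketched as a small helper that parses the GetOffsetShell output format shown above (each line is "topic:partition:offset") and subtracts earliest from latest per partition:

```python
def parse_offsets(lines):
    # "test:0:5000000" -> {("test", 0): 5000000}
    out = {}
    for line in lines:
        topic, partition, offset = line.strip().rsplit(":", 2)
        out[(topic, int(partition))] = int(offset)
    return out

earliest = parse_offsets(["test:0:0"])
latest = parse_offsets(["test:0:5000000"])

# messages currently stored = latest offset - earliest offset
counts = {tp: latest[tp] - earliest[tp] for tp in latest}
print(counts)  # {('test', 0): 5000000}
```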

At this point, you must be wondering why, even though we created a 4-partition topic on the source cluster, it became a single partition on the target cluster.

The reason is simple. During the mirroring process, if MirrorMaker finds that the target cluster does not have the topic to be synchronized, it will automatically create the topic based on the default values of the broker-side parameters num.partitions and default.replication.factor. In this example, we have not created any topics on the target cluster, so when mirroring starts, MirrorMaker automatically creates a topic named “test” with one partition and one replica.

In real-world scenarios, I recommend that you create all the topics to be synchronized on the target cluster with the same specifications as those on the source cluster. Otherwise, serious problems may occur, such as messages originally located in one partition being mirrored to other partitions. If your message processing logic depends on such partition mapping, you will inevitably encounter issues.
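For example, you could pre-create the topic on the target cluster before starting MirrorMaker. The following is a sketch using the kafka-topics.sh script (shown for illustration only; it assumes the target cluster on port 9093 is running, and uses replication factor 1 because the demo target is a single broker):

```shell
# Pre-create "test" on the target cluster with the same partition count
# as on the source cluster (4 partitions in this demo)
$ bin/kafka-topics.sh --bootstrap-server localhost:9093 --create \
    --topic test --partitions 4 --replication-factor 1
```

With the topic created in advance, MirrorMaker’s auto-creation with broker defaults never kicks in, and the source-to-target partition mapping is preserved.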

In addition to regular Kafka topics, MirrorMaker by default also synchronizes internal topics, such as the offsets topic (__consumer_offsets) that we mentioned frequently earlier in this column. When mirroring the offsets topic, if MirrorMaker finds that it does not yet exist on the target cluster, it determines the topic’s specification from the broker-side parameters offsets.topic.num.partitions and offsets.topic.replication.factor, whose defaults are 50 partitions and 3 replicas per partition.

Before version 0.11.0.0, Kafka did not strictly enforce the offsets.topic.replication.factor parameter. That is, if you set it to 3 but fewer than 3 brokers were alive, the offsets topic would still be created, with a replica count equal to the smaller of the configured value and the number of alive brokers.

This flaw was fixed in version 0.11.0.0: Kafka now strictly enforces the value you configure, and if the number of alive brokers is smaller than that value, it throws an exception and topic creation fails. Therefore, when using MirrorMaker, you must ensure these settings are reasonable.

Other Cross-Cluster Mirroring Solutions #

Now that we have covered the main functions of MirrorMaker, you can see that running it is quite simple, but its functionality is limited and its maintenance cost is relatively high: managing the mirrored topics is inconvenient, and building mirroring pipelines with it is difficult.

For these reasons, many companies in the industry choose to develop their own cross-cluster mirroring tools. Let me briefly introduce a few of them.

  1. Uber’s uReplicator Tool

Uber once used MirrorMaker as well, but discovered some obvious shortcomings along the way. For example, because MirrorMaker consumes via the consumer group mechanism, it inevitably runs into many rebalancing problems.

To address this, Uber developed its own tool, uReplicator. It uses Apache Helix as a centralized topic-partition management component and replaces MirrorMaker’s consumer with a rewritten one that takes its partition assignments from Helix, thereby avoiding the various rebalancing problems.

Here I have a small observation: the community is currently putting great effort into optimizing the consumer group mechanism to mitigate the many problems caused by rebalancing, yet third-party framework developers tend to avoid the group mechanism and build their own machinery for maintaining partition assignments. All of this suggests there is still plenty of room for improvement in Kafka’s consumer groups.

In addition, Uber wrote a blog post detailing the design principles of uReplicator, the defects it found in the community’s MirrorMaker, and how uReplicator addresses them. I highly recommend reading it.

  2. Brooklin MirrorMaker Developed by LinkedIn

LinkedIn built Brooklin MirrorMaker to overcome the difficulty of building pipelines with the existing MirrorMaker, and optimized its performance along the way. Today, Brooklin MirrorMaker has completely replaced the community version of MirrorMaker at LinkedIn. If you want to understand how it achieves this, I recommend reading LinkedIn’s blog post about it.

  3. Replicator Developed by Confluent

This tool provides an enterprise-grade cross-cluster mirroring solution and is reputed to be the most capable product on the market. It makes migrating Kafka topics between clusters easy, and it can automatically create topics on the target cluster with the same configuration as on the source cluster, which greatly simplifies operations and management. The trade-off is that Replicator is a paid tool. If your company has the budget and you care about the quality of data migration across multiple clusters or even multiple data centers, Confluent’s Replicator is worth considering.

Summary #

Alright, let’s summarize what we’ve learned about MirrorMaker today. It is a cross-cluster mirroring solution provided by the Apache Kafka community, mainly designed to replicate or mirror Kafka messages from one cluster to another in real-time. MirrorMaker can be applied to many practical scenarios, such as data backup and active-passive clusters. MirrorMaker itself has simple functionality and flexible application, but it also comes with disadvantages like high operational costs and poor performance. Therefore, some vendors in the industry have developed their own mirroring tools. You can choose the appropriate tool based on your specific business needs to help you accomplish cross-cluster data backup.

Open Discussion #

Today, I only demonstrated the basic usage of MirrorMaker: moving messages as-is. If we want to process messages before they are mirrored, for example to modify the message content, how should we implement it? (Hint: specify the --message.handler parameter of the kafka-mirror-maker script.)

Feel free to share your thoughts and answers, and let’s discuss together. If you find it helpful, please feel free to share this article with your friends.