
Bonus Lesson - How Should Data Migration Be Done #

Hello, I’m Tang Yang.

In the article “Database Optimization Plan (Part 2): How to Achieve Sharding When the Amount of Data Increases,” I mentioned that because MySQL, unlike MongoDB, does not support automatic sharding, you need to work out how to migrate the data whenever storage limits force you to split one MySQL database into multiple databases, or to split existing databases into even more of them.

In practice, data migration is not needed only when splitting a database; many other scenarios call for a migration plan as well. For example, your boss may want to move the application from a self-built data center to the cloud, in which case you have to migrate all of the data in the self-built data center, including MySQL, Redis, message queues, and other components, to the cloud. This is a monumental task for any company, so you need to prepare a comprehensive migration plan before starting.

“Data migration” is important but tedious work, and it is a focal point for both developers and operations colleagues. While updating the course, I have noticed many students, such as @每天晒白牙, @枫叶11, @撒旦的堕落, etc., asking in the comments how to perform data migration. Therefore, I have planned this bonus lesson to help you master data migration methods from both the database and the cache perspectives. I will also highlight the key points to watch during the migration process, so that you can avoid pitfalls as much as possible.

How to Smoothly Migrate Data in the Database #

You might think that data migration is simply copying data from one database to another: you can replicate the data in real time through MySQL master-slave synchronization, or export the data from the source database with the mysqldump tool and import it into the new database. What’s so complicated about that?

In fact, these two methods only support migrating from one single database to another single database; they cannot handle scenarios where the data of one database needs to be split across multiple databases or tables. Moreover, even for a single-database-to-single-database migration, the process needs to meet the following goals:

  1. The migration should be online, which means that data can still be written during the migration.

  2. Data integrity should be ensured, which means that the data in the new database should be consistent with the data in the old database after the migration.

  3. The migration process should be reversible, so that if any problem occurs during the migration, it is possible to immediately roll back to the source database without affecting the system’s availability.

If you simply use binlog synchronization and, once synchronization is complete, modify the code to make the new database the master, the plan does not meet the rollback requirement: if problems are discovered after the switch, incremental data will already have been written to the new database but not to the old one, and you will no longer be able to switch back to the old database.

Generally, we have two solutions for database migration.

“Dual-write” solution #

I call the first solution “dual-write.” It is actually quite simple and can be broken down into the following steps:

  1. Configure the new database as a slave of the source database to synchronize the data. If the data needs to be synchronized into multiple databases or tables, you can use a third-party tool (such as the open-source Canal) to capture the incremental binlog, and then write the changes into the new databases and tables according to your sharding logic.

  2. At the same time, modify the business code so that data is written not only to the old database but also to the new one. For performance, the write to the new database can be asynchronous; the request only needs the write to the old database to succeed. Note, however, that any write that fails against the new database must be recorded in a separate log, so the data can be repaired later and consistency between the new and old databases maintained (a sketch of this write path follows the list).

  3. Then we can start verifying the data. Because the database holds a large amount of data, a full verification is usually impractical; instead, randomly sample a portion of the data and check that it is consistent between the two databases. How much you sample depends on the overall data volume (a sketch of such a sampling check also follows the list).

  4. If everything goes smoothly, we can switch the read traffic to the new database. As there may be unknown impacts on the system from switching all read traffic at once, it’s best to use a gradual approach for switching, such as initially switching 10% of the traffic, and if there are no issues, gradually increasing to 50% and finally 100%.

  5. Because dual-write is in place, if any problem occurs during the switch, you can redirect the read and write traffic back to the old database at any time to keep the system stable.

  6. After observing the migration for a few days and confirming there are no issues, change the dual-write to write only to the new database; the data migration is then complete. Of all these steps, the one most prone to problems is data validation. I therefore suggest writing a data validation tool or script before starting the migration, and only proceeding with the formal migration after it has been thoroughly tested in the test environment.
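To make step 2 concrete, here is a minimal sketch of the dual-write path. It assumes hypothetical `old_db` and `new_db` handles that expose an `execute(sql, params)` method; the write to the new database runs on a background thread, and failures are logged so they can be repaired later.

```python
import logging
import queue
import threading

logger = logging.getLogger("dual_write_repair")  # failed new-db writes are recorded here


class DualWriter:
    """Write to the old database synchronously and to the new database asynchronously."""

    def __init__(self, old_db, new_db):
        # old_db / new_db are hypothetical handles exposing execute(sql, params).
        self.old_db = old_db
        self.new_db = new_db
        self._tasks = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def write(self, sql, params):
        # The request only depends on the old-database write succeeding.
        self.old_db.execute(sql, params)
        # The new database is written asynchronously so it adds no request latency.
        self._tasks.put((sql, params))

    def _drain(self):
        while True:
            sql, params = self._tasks.get()
            try:
                self.new_db.execute(sql, params)
            except Exception:
                # Record the failed write so it can be replayed later, keeping
                # the new database consistent with the old one.
                logger.exception("new-db write failed: %s %s", sql, params)
```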
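And here is a sketch of the sampling-based validation mentioned in steps 3 and 6, assuming pymysql and a table with an integer primary key `id`; the table name, connection parameters, and sample size are all illustrative.

```python
import random

import pymysql


def fetch_row(conn, table, pk):
    with conn.cursor(pymysql.cursors.DictCursor) as cur:
        cur.execute(f"SELECT * FROM {table} WHERE id = %s", (pk,))
        return cur.fetchone()


def sample_check(old_conn, new_conn, table, max_id, samples=1000):
    """Compare randomly sampled rows between the old and new databases."""
    mismatched = []
    for pk in random.sample(range(1, max_id + 1), samples):
        if fetch_row(old_conn, table, pk) != fetch_row(new_conn, table, pk):
            mismatched.append(pk)
    return mismatched


# Usage (hosts and credentials are placeholders):
# old = pymysql.connect(host="old-db", user="u", password="p", database="app")
# new = pymysql.connect(host="new-db", user="u", password="p", database="app")
# print(sample_check(old, new, "orders", max_id=10_000_000))
```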

If you are migrating data from a self-built data center to the cloud, you can also use this solution. However, there is one important factor to consider: the bandwidth and latency of the dedicated line between the self-built data center and the cloud. You should minimize read operations that cross the dedicated line, so when switching read traffic, make sure the application servers in the self-built data center read the database in the same data center, and the application servers in the cloud read the database in the cloud. Then, to finish the migration, you only need to shut down the application servers in the self-built data center and switch the write traffic over to the new database.

[Figure: the dual-write migration process]

This solution is relatively generic and can be used whether you are migrating data in MySQL, Redis, or even a message queue. You can directly use it in your actual work.

The advantage of this approach is that the migration can be rolled back at any time, which minimizes the risk. The disadvantages are that it takes a relatively long time and requires changes to the application code.

Cascading Synchronization Solution #

This solution is also relatively simple and is suited to migrating data from a self-built data center to the cloud. When migrating to the cloud, the biggest concern is that the cloud environment may differ from the self-built data center in parameter configuration or hardware, causing problems for the database once it runs in the cloud.

Therefore, we use cascading synchronization: the new database in the cloud replicates from the old database, and a backup database in the self-built data center replicates from the new database, so that the self-built data center always keeps a database we can roll back to. The specific steps are as follows:

  1. Configure the new database as a slave database of the old database for data synchronization.

  2. Configure a backup database as a slave database of the new database for data backup.

  3. Once the data in the three databases is consistent, switch the read traffic to the new database.

  4. Suspend write operations and switch the write traffic to the new database; since writes have to be paused, schedule this step during the business’s off-peak period (a sketch of this cutover follows the list).
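A minimal sketch of the cutover in step 4, assuming pymysql: writes are paused via a hypothetical application-level hook, the script waits until the new database (a replica of the old one) reports zero replication lag, and then the write traffic is switched. `SHOW SLAVE STATUS` / `Seconds_Behind_Master` are the classic MySQL replication fields; newer MySQL versions use `SHOW REPLICA STATUS` instead.

```python
import time

import pymysql


def replication_lag(conn):
    """Return Seconds_Behind_Master of a replica, or None if replication is not running."""
    with conn.cursor(pymysql.cursors.DictCursor) as cur:
        cur.execute("SHOW SLAVE STATUS")
        status = cur.fetchone()
    return status["Seconds_Behind_Master"] if status else None


def cut_over(new_db_conn, pause_writes, switch_writes_to_new):
    # pause_writes / switch_writes_to_new are hypothetical hooks into your application,
    # e.g. flipping a configuration flag that the services watch.
    pause_writes()                           # writes are briefly suspended
    while replication_lag(new_db_conn) != 0:
        time.sleep(0.5)                      # wait for the new database to catch up
    switch_writes_to_new()                   # point the write traffic at the new database
```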

[Figure: the cascading synchronization migration process]

The rollback plan for this approach is relatively simple. You can switch the read traffic back to the backup database, then suspend write operations and switch the write traffic back to the backup database. This means that all traffic is now directed to the backup database, effectively returning to the self-built data center environment.

The cascading migration approach described above can be applied to scenarios where MySQL is migrated from a self-built data center to the cloud, as well as scenarios where Redis or similar databases are migrated. If you have similar requirements, you can directly apply this approach.

The advantage of this approach is its simplicity and ease of implementation, with minimal impact on the business. The disadvantage is that writes have to be paused briefly during the switch; if the switch is performed during off-peak hours, though, the impact on the business can be kept very small.

How to Preheat the Cache during Data Migration #

Furthermore, when migrating data from a self-built data center to the cloud, we also need a plan for migrating the cache. You may say: “A cache is only an intermediate storage layer anyway; I just need to deploy empty cache nodes in the cloud, and requests in the cloud will miss the cache, fall through to the cloud database, and repopulate the cache, without affecting the business.”

You are correct, but you also need to consider the cache hit rate.

If you deploy empty cache nodes, every request will fall through to the database, which may crash under the load and leave your service unavailable. The key to cache migration, therefore, is keeping the cache warm.

As I mentioned earlier, Redis data migration can be done using dual-write or cascading synchronization. So here, I will not consider Redis cache synchronization; instead, I will use Memcached as an example to illustrate.

Preheating the Cache with Replication Group #

In the article “Cache Usage Guide (Part 2): How to Achieve Cache High Availability?”, I mentioned that to ensure cache availability, we can deploy multiple replication groups so that as many requests as possible are absorbed before they reach the database.

Writes go to the master, the slave, and every replication group; reads first hit a replication group and, on a miss, load the data from the master-slave cache and write it back to the replication group. We can therefore deploy a replication group in the cloud and have the application servers in the cloud read from it; when that replication group misses, it loads the data from the master-slave cache deployed in the self-built data center and writes it back to the replication group in the cloud.

img

Once the replication group deployed in the cloud is warm enough, that is, its cache hit rate reaches at least 90%, you can point the master-slave cache in the cloud data center at this replication group, and the migration is complete.
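A minimal sketch of this read path, assuming pymemcache and hypothetical node addresses: reads try the cloud replication group first, fall back to the master-slave cache in the self-built data center on a miss, and write the value back to warm the cloud cache.

```python
from pymemcache.client.base import Client

# Hypothetical addresses; in practice each side may be several nodes behind consistent hashing.
cloud_replica = Client(("cloud-mc-replica", 11211))   # replication group in the cloud
idc_master_slave = Client(("idc-mc-master", 11211))   # master-slave cache in the self-built IDC


def read_through(key):
    """Read from the cloud replication group, falling back across the dedicated line."""
    value = cloud_replica.get(key)
    if value is not None:
        return value
    value = idc_master_slave.get(key)                  # crosses the dedicated line
    if value is not None:
        cloud_replica.set(key, value, expire=3600)     # write back to warm the cloud cache
    return value
```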

This approach is simple enough, but it has a fatal flaw: whenever a request from the cloud misses the replication group in the cloud, it falls through to the master-slave cache deployed in the self-built data center, and that round trip has to cross the dedicated line.

Not only does this consume a large amount of dedicated-line bandwidth, but the dedicated line’s latency is also much higher than a normal cache read: even between data centers in the same city, latency typically reaches 2ms-3ms. Since a single frontend request may access the cache dozens of times or more, this adds tens or even hundreds of milliseconds of delay, which greatly hurts the interface’s response time. For that reason, we rarely use this approach in real projects.

However, this approach does give us an idea: with the right design, the cache can be preheated automatically while the system is running. So we made some modifications to the replication group solution to minimize the dedicated-line bandwidth it consumes.

Modifying the Replication Group Solution to Preheat the Cache #

The modified solution changes how the cache is read and written, with the following steps:

  1. Deploy multiple Memcached replication groups in the cloud. When the self-built data center receives a write request, it first writes to the cache nodes in the self-built data center and then asynchronously writes to the Memcached nodes deployed in the cloud.

  2. When processing read requests in the self-built data center, route a certain share of the traffic, for example 10%, to the cloud cache nodes first. These reads will still fall through to the cache nodes in the self-built data center over the dedicated line when they miss, but the amount of such traffic is controllable (a sketch of this read/write path follows the list).

  3. When the cache hit rate of the cloud cache nodes reaches over 90%, you can deploy application servers in the cloud and have them fully use the cloud cache nodes.
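A minimal sketch of the modified read/write path described above, again assuming pymemcache; the node addresses and the 10% ratio are illustrative.

```python
import random
import threading

from pymemcache.client.base import Client

idc_cache = Client(("idc-mc", 11211))        # cache nodes in the self-built data center
cloud_cache = Client(("cloud-mc", 11211))    # Memcached replication group in the cloud
CLOUD_READ_RATIO = 0.10                      # share of reads routed to the cloud first


def write(key, value, expire=3600):
    # Step 1: write the self-built data center's cache synchronously,
    # then copy the value to the cloud nodes asynchronously.
    idc_cache.set(key, value, expire=expire)
    threading.Thread(
        target=cloud_cache.set, args=(key, value), kwargs={"expire": expire}, daemon=True
    ).start()


def read(key):
    # Step 2: send a controllable share of reads to the cloud nodes first.
    if random.random() < CLOUD_READ_RATIO:
        value = cloud_cache.get(key)
        if value is not None:
            return value
        value = idc_cache.get(key)                      # miss: fall back over the dedicated line
        if value is not None:
            cloud_cache.set(key, value, expire=3600)    # warm the cloud cache
        return value
    return idc_cache.get(key)
```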

[Figure: the modified replication group solution]

Using this approach, we can migrate the cache data while keeping dedicated-line bandwidth usage and request latency under control. You can use this approach directly in your own projects as well.

Summary of the Course #

The data migration solutions above are the ones I use most often in real projects, and they have been battle-tested. I hope that through this course you can apply these solutions to your projects to solve practical problems. At the same time, let me emphasize the key points of this course once more:

The dual-write solution is a common solution for database and Redis migration, and you can directly use it in your work. The most important aspect of the dual-write solution is to ensure data consistency through data validation, so that you can roll back at any time during the migration process.

If you need to migrate data from a self-built data center to the cloud, consider the cascading synchronization solution. It causes a brief pause in writes, so the switch should be performed during off-peak hours.

The focus of cache migration is ensuring the cache hit rate in the cloud. You can use the improved replication group approach: when writing to the cache, asynchronously write the data to the cloud replication group as well, and when reading, send a small share of the traffic to the cloud replication group. This way you can migrate part of the data to the cloud replication group while minimizing the extra latency introduced by falling back over the dedicated line to the self-built data center.

If you are the project leader, you must develop a thorough plan for the migration. If it is a database migration, data validation is the issue you should spend the most time on.

If you are migrating from a self-built data center to the cloud, the dedicated line’s bandwidth will be a bottleneck during the migration. Before migrating, review carefully which calls will need to cross the dedicated line, how much bandwidth they will consume, and whether the line’s bandwidth and latency can meet the requirements. In your plan, keep service, cache, and database calls within the same data center as much as possible during the migration, so that dedicated-line bandwidth is consumed as little as possible.