08 Database Optimization Solution I - How to Do Master-Slave Separation When Query Requests Increase #

Hello, I am Tang Yang.

In the previous lesson, we solved the problem of database connection reuse by using connection pooling technology. At this point, although the overall architecture of your vertical e-commerce system has not changed, the process of interacting with the database has changed. We have added a database connection pool between your web application and the database, reducing the cost of frequent connection creation. Based on the test results from the previous lesson, the performance has improved by 80%. The current architecture diagram is as follows:

[Figure: the current architecture, with a connection pool between the web application and the database]

At this point, your database is still deployed as a single server. According to benchmark results from some cloud vendors, MySQL 5.7 running on a 4-core 8GB machine can support roughly 500 TPS (transactions per second) and 10,000 QPS (queries per second). However, the operations manager says the company is preparing for the Double Eleven (November 11th) promotion and will keep investing in omnichannel promotion, which will undoubtedly bring a surge in query volume. So today we will discuss how to use master-slave separation to handle growing query traffic.

Master-Slave Replication #

In fact, the access model of most systems involves more reads than writes, and the gap between read and write request volumes can be several orders of magnitude.

This is easy to understand. The number of requests to view moments on social media is definitely greater than the number of requests to post them; likewise, on a platform like Taobao, a product is viewed far more often than it is ordered. Therefore, we first consider how the database can withstand higher query pressure, and the first step is to separate read traffic from write traffic, so that the read traffic can be scaled out independently. This is what we call master-slave read-write separation.

In essence, it is a traffic-separation problem, much like traffic control on a road: of four lanes, three are reserved for leaders and foreign guests and the remaining one for everyone else, so the priority traffic is guaranteed to get through first. That is the principle behind it.

This method is itself common practice, and even in large projects it is an effective way to absorb unexpected spikes in database read traffic.

In my current project, we have run into cases where a sudden surge of front-end traffic overloaded the slave databases. When that happens, the DBA team first adds more slave databases so that read traffic is spread across more nodes and the load on each one drops; the development team then considers other solutions for absorbing the traffic above the database layer.

Two Key Points of Master-Slave Read-Write Separation #

In general, in a master-slave read-write separation setup, we replicate the data of one database onto one or more other database servers. The original database is called the master, and it mainly handles data writes; the replication targets are called slaves, and they mainly serve data queries. As you can see, master-slave read-write separation has two key points:

  1. One is data replication, which we call master-slave replication;
  2. The other is how, once master and slave are separated, we shield the application from the resulting changes in database access, so that developers can write code as if they were still using a single database.

Next, let’s take a look at each one.

1. Master-Slave Replication #

Let me first explain master-slave replication using MySQL as an example.

MySQL’s master-slave replication depends on the binlog, which records all changes to MySQL and saves them in binary log files on disk. Master-slave replication is simply the process of transferring the binlog data from the master to the slaves. This process is generally asynchronous: operations on the master do not wait for the binlog to be synchronized to the slaves.

The replication process works as follows. When a slave connects to the master, it creates an IO thread that requests updated binlog entries from the master and writes whatever it receives into a local log file called the relay log; the master, for its part, creates a log dump thread to send the binlog to that slave. The slave also creates an SQL thread, which reads the relay log and replays its contents locally, eventually bringing the slave into consistency with the master. This is the classic way master-slave replication is done.

In this approach, using a dedicated log dump thread makes binlog shipping asynchronous, so replication does not slow down the master’s update path. On its side, the slave does not apply the received entries to its storage directly but first lands them in the relay log; this avoids a time-consuming synchronous write into the slave’s actual storage, which would otherwise widen the lag between slave and master.

[Figure: the MySQL master-slave replication process]
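
To make the process concrete, here is a minimal sketch of wiring a slave to a master through JDBC. The hostnames, credentials, and binlog coordinates are placeholders; CHANGE MASTER TO and START SLAVE are the standard replication commands in MySQL 5.7 (MySQL 8.0 renames them to CHANGE REPLICATION SOURCE TO and START REPLICA).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ReplicationSetup {
    public static void main(String[] args) throws Exception {
        // Connect to the slave instance; host and credentials are placeholders.
        try (Connection slave = DriverManager.getConnection(
                "jdbc:mysql://slave-host:3306/", "admin", "secret");
             Statement stmt = slave.createStatement()) {

            // Point the slave at the master's binlog. The file name and
            // position would come from SHOW MASTER STATUS on the master.
            stmt.execute("CHANGE MASTER TO "
                    + "MASTER_HOST='master-host', MASTER_PORT=3306, "
                    + "MASTER_USER='repl', MASTER_PASSWORD='repl-password', "
                    + "MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=4");

            // Start the IO thread (pulls binlog into the relay log) and
            // the SQL thread (replays the relay log on the slave).
            stmt.execute("START SLAVE");
        }
    }
}
```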

You will notice that, for performance reasons, a write on the master returns without waiting for master-slave synchronization to complete. In extreme cases, for example if the master’s binlog has not yet been flushed to disk when a disk failure or power outage occurs, the binlog is lost and master and slave end up inconsistent. The probability of this happening is very low, however, and it is acceptable for internet projects.

After implementing master-slave replication, we can write to the master database during write operations and read from the slave database during read operations. This ensures that write requests that may lock the table or record do not affect the execution of read requests. Additionally, in cases where there is a high volume of read traffic, we can deploy multiple slave databases to handle the read traffic together. This is known as “one master, multiple slaves” deployment, which can be used in your vertical e-commerce project to resist higher concurrent read traffic. Furthermore, the slave database can also be used as a backup database to avoid data loss in the event of a master database failure.
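
As a minimal sketch of what this looks like from application code, assuming two javax.sql.DataSource instances are already configured for the master and one slave (the DAO and table names are illustrative):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import javax.sql.DataSource;

public class OrderDao {
    private final DataSource master; // handles writes
    private final DataSource slave;  // handles reads

    public OrderDao(DataSource master, DataSource slave) {
        this.master = master;
        this.slave = slave;
    }

    // Writes go to the master so they enter the binlog and replicate out.
    public void createOrder(long orderId, long productId) throws Exception {
        try (Connection conn = master.getConnection();
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO orders (id, product_id) VALUES (?, ?)")) {
            ps.setLong(1, orderId);
            ps.setLong(2, productId);
            ps.executeUpdate();
        }
    }

    // Reads go to a slave, keeping query traffic off the master.
    public boolean orderExists(long orderId) throws Exception {
        try (Connection conn = slave.getConnection();
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT 1 FROM orders WHERE id = ?")) {
            ps.setLong(1, orderId);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();
            }
        }
    }
}
```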

Now, you might ask: can I just keep adding slave databases to handle a large amount of concurrent read traffic? In reality, no. As the number of slaves grows, more IO threads from those slaves connect in, and the master must create an equal number of log dump threads to serve the replication requests. This consumes considerable resources on the master and is also limited by the master’s network bandwidth. Therefore, in practice, one master generally feeds at most 3 to 5 slaves.

Of course, master-slave replication also has disadvantages. Besides the deployment complexity it adds, it introduces a certain replication lag, and that lag can sometimes affect the business. Let me give you an example.

In the flow for publishing a post, some operations are synchronous, such as updating the database, while others are asynchronous, such as forwarding the post to the moderation system. So after updating the master database, we write the post ID into a message queue; a queue processor then picks it up, fetches the post from the slave database, and sends it to the moderation system. If replication from master to slave lags, the post may not yet exist on the slave, and the whole flow fails.

[Figure: the publishing flow — master database, message queue, queue processor, and slave database]

There are many ways to solve this problem; the key idea is to avoid querying the slave database for freshly written data. For the example above, there are three possible solutions:

The first solution is data redundancy. Instead of sending only the post ID to the message queue, send the full post content the queue processor needs, so it never has to read the data back from the database.

The second solution is to use a cache. While synchronously writing the post to the database, I also write the post data into a cache such as Memcached; the queue processor then checks the cache first when fetching the post, which keeps the data it sees consistent.

The final solution is to query the master database: I can point the queue processor at the master instead of the slave. This approach must be used with caution, though, and only when the query volume is small and comfortably within the master’s capacity; otherwise it puts significant load on the master database.

I would prioritize the first solution because it is simple enough. However, it may result in larger individual messages, increasing bandwidth and time for message sending.
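
Here is a sketch of the data-redundancy option: the producer serializes the whole post into the queue message instead of just the ID. The QueueClient interface stands in for a real message-queue client, and Jackson (assumed to be on the classpath) handles the JSON; all names here are illustrative.

```java
import com.fasterxml.jackson.databind.ObjectMapper;

// Full post payload sent to the queue, so the consumer never has to
// re-query a possibly lagging slave. Field names are illustrative.
class PostMessage {
    public long postId;
    public long authorId;
    public String content;

    public PostMessage(long postId, long authorId, String content) {
        this.postId = postId;
        this.authorId = authorId;
        this.content = content;
    }
}

public class PostPublisher {
    // Stand-in for a real message-queue client (Kafka, RocketMQ, ...).
    interface QueueClient {
        void send(String topic, String body);
    }

    private final ObjectMapper mapper = new ObjectMapper();
    private final QueueClient queue;

    public PostPublisher(QueueClient queue) {
        this.queue = queue;
    }

    public void publish(PostMessage post) throws Exception {
        // Serialize the whole post, not just its ID: the message is larger,
        // but the moderation consumer is immune to replication lag.
        queue.send("post-moderation", mapper.writeValueAsString(post));
    }
}
```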

The caching solution is better suited to scenarios that create new data. When existing data is updated, updating the cache first can produce inconsistency. For example, suppose two threads update the same record: thread A sets the cache to 1, then thread B sets the cache to 2; thread B writes 2 to the database, and finally thread A writes 1 to the database. Now the database holds 1 while the cache holds 2, and the two disagree.

Finally, I would not use the third solution unless absolutely necessary. It requires exposing an interface that reads from the master, and during team development it is hard to guarantee that other team members will not misuse it; if the master is overwhelmed by a flood of read requests and crashes, the impact on the whole system is severe.

Therefore, for these three solutions, you need to make a choice and select the one that suits the actual project situation.

In addition, master-slave lag is an easy issue to overlook when troubleshooting. We sometimes hit strange cases where data cannot be read from the slave and start suspecting that some code path is deleting what was just written, only to find the data readable again a little later. This is usually the replication lag at work. For that reason, we generally monitor the lag and alert when a slave falls behind the master by more than some threshold of milliseconds; lag on the order of seconds is considered abnormal and demands immediate attention.
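
Such a check can be built on MySQL’s SHOW SLAVE STATUS command, whose Seconds_Behind_Master column reports the replication lag (and is NULL when replication has stopped). A minimal sketch, with the host, credentials, and threshold as assumptions:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ReplicationLagMonitor {
    // Alert threshold in seconds; pick what fits your business.
    private static final long MAX_LAG_SECONDS = 1;

    public static void main(String[] args) throws Exception {
        try (Connection slave = DriverManager.getConnection(
                "jdbc:mysql://slave-host:3306/", "monitor", "secret");
             Statement stmt = slave.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW SLAVE STATUS")) {

            if (!rs.next()) {
                System.err.println("ALERT: this instance is not a slave");
                return;
            }
            long lag = rs.getLong("Seconds_Behind_Master");
            if (rs.wasNull()) {
                // NULL means the IO or SQL thread has stopped.
                System.err.println("ALERT: replication is not running");
            } else if (lag > MAX_LAG_SECONDS) {
                System.err.println("ALERT: slave lag is " + lag + "s");
            }
        }
    }
}
```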

2. How to Access the Database #

We have already used the master-slave replication technology to replicate data to multiple nodes and achieve database read-write separation. As a result, the way we use the database has changed. Previously, we only needed to use one database address, but now we need to use one master database address and multiple slave database addresses, and we need to distinguish between write operations and read operations. If we combine it with the content discussed in the next section, “Sharding”, the complexity will increase even further. In order to reduce complexity, the industry has developed many database middleware solutions to solve the problem of accessing databases, and these middleware can be divided into two categories.

The first category, represented by Taobao’s TDDL (Taobao Distributed Data Layer), is embedded in the application code. You can think of it as a proxy over data sources: its configuration manages multiple data sources, each corresponding to one database, which may be a master or a slave. When a database request arrives, the middleware routes the SQL statement to the appropriate data source for execution and returns the result.

The advantage of this type of middleware is that it is simple and easy to use, with no extra deployment cost since it is embedded in the application, so it suits small teams with limited operations capability. The drawback is poor multi-language support: the mainstream solutions in this category, TDDL and the earlier NetEase DDB among them, are all written in Java and cannot be used from other languages. In addition, version upgrades depend on each application updating its dependency, which makes them harder to roll out.
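
To show what this category does at its core, here is a hand-rolled, minimal sketch of an embedded router in the spirit of TDDL: one facade over a master and several slaves, with reads spread round-robin across the slaves. Real middleware adds SQL parsing, failover, and configuration management on top; all class and method names here are illustrative.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import javax.sql.DataSource;

// A toy version of what embedded middleware does internally: the
// application asks for a connection and the router picks the node.
public class ReadWriteRouter {
    private final DataSource master;
    private final List<DataSource> slaves;
    private final AtomicInteger counter = new AtomicInteger();

    public ReadWriteRouter(DataSource master, List<DataSource> slaves) {
        this.master = master;
        this.slaves = slaves;
    }

    public Connection getConnection(boolean forWrite) throws SQLException {
        if (forWrite || slaves.isEmpty()) {
            return master.getConnection();
        }
        // Round-robin across slaves to spread the read load evenly;
        // floorMod keeps the index valid if the counter overflows.
        int idx = Math.floorMod(counter.getAndIncrement(), slaves.size());
        return slaves.get(idx).getConnection();
    }
}
```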

The second category consists of separately deployed proxies, and the options here are more varied: Cobar, an early Alibaba open-source project; Mycat, developed on top of Cobar; Atlas from 360; and DBProxy from Meituan, which is based on Atlas.

These middleware solutions are deployed on separate servers, and the business code uses them as if they were using a single database. In fact, they internally manage many data sources. When there is a database request, they will do necessary rewriting of the SQL statement and then send it to the designated data source.

They generally use the standard MySQL communication protocol, so they support multiple programming languages well. Because they are deployed separately, it is also convenient to maintain and upgrade them, making them more suitable for medium to large teams with a certain level of operation capabilities. The disadvantage is that all SQL statements need to cross the network twice: from the application to the proxy layer and from the proxy layer to the data source, resulting in some performance overhead.

[Figure: the two categories of database middleware — embedded and proxy]
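
Because the proxies speak the standard MySQL protocol, the application connects to one exactly as it would to a single MySQL server; only the host and port in the JDBC URL change. A minimal sketch, with the proxy address as a placeholder (8066 happens to be Mycat’s default service port):

```java
import java.sql.Connection;
import java.sql.DriverManager;

public class ProxyConnectionExample {
    public static void main(String[] args) throws Exception {
        // The URL points at the proxy, not at any real MySQL node;
        // the proxy routes each statement to the master or a slave.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://proxy-host:8066/ecommerce", "app", "secret")) {
            System.out.println("connected: " + conn.getMetaData().getURL());
        }
    }
}
```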

You may already be familiar with these middleware solutions. What I want you to pay attention to, though, is that whatever middleware you use, you must understand it deeply; otherwise, when a problem arises that you cannot solve quickly, the result can be disastrous.

In one of my previous projects, we used a self-developed sharding component. We later discovered that it would occasionally create redundant connections to the database, so after discussing it, the team decided to replace it with Sharding-JDBC. We expected a simple component swap, but after going live we hit two problems: occasionally, due to incorrect usage, sharding did not take effect and queries scanned all databases and tables; and some queries intermittently took seconds to execute. Because we lacked sufficient knowledge of Sharding-JDBC, we could not resolve these issues quickly, so we had no choice but to switch back to the original component and only performed the migration after tracking down the root causes.

Summary of the Course #

In this lesson, I have introduced how we can handle increased database query traffic through master-slave replication and a one-master-multiple-slaves deployment. Beyond mastering master-slave replication itself, you also need to understand the problems that master-slave separation introduces and their solutions. The key points I want you to take away are as follows:

  1. Master-slave read-write separation with a one-master-multiple-slaves deployment can absorb sudden increases in database read traffic; it is a way of scaling the database horizontally.

  2. After read-write separation, master-slave lag is a critical metric to monitor, because it can mean that data cannot be read back immediately after it is written.

  3. The industry has many solutions to shield the details of accessing the database after master-slave separation, allowing developers to access it as if it were a single database. These solutions include embedded solutions like TDDL and Sharding-JDBC, as well as independent proxy solutions like Mycat.

In fact, master-slave replication generalizes to replicating data between storage nodes of any kind, which provides data redundancy and backup and also improves horizontal scalability for reads. When you apply master-slave replication, you usually need to consider two issues:

  1. The trade-off between the consistency of the master and slave and the performance of write operations. If you want to ensure that all slave nodes successfully write data, the write performance will be affected. If you only write to the master node and return success, there may be data synchronization failures in the slave nodes, resulting in inconsistent master and slave. However, in Internet projects, we generally prioritize performance over strong data consistency.

  2. The issue of master-slave lag, which lies behind many puzzling cases of data that cannot be read back. When you run into such a problem, check the replication lag first.

Many components we use employ this technology. For example, Redis implements read-write separation through master-slave replication; Elasticsearch replicates index shards stored in multiple nodes; and files written to HDFS are replicated to multiple DataNodes. However, different components have different requirements for replication consistency and delay, so they adopt different solutions. But the design philosophy behind this is universal and something you need to understand so that you can apply it when learning about other storage components.