
28 What Are the Pitfalls of Read-Write Separation? #

In the previous article, I introduced the one-master-multiple-slaves structure and the switching process. Today, let's continue with a common application of that architecture: read-write separation, and how to handle the problems caused by master-slave delay.

The structure of one master and multiple slaves mentioned in the previous article is actually the basic structure of read-write separation. Here, I’ll paste the diagram again for your convenience.

img

Figure 1: Basic structure of read-write separation

The main goal of read-write separation is to reduce the load on the master database. In the structure shown in Figure 1, the client actively performs load balancing, and in this mode, the database connection information is usually placed in the client’s connection layer. In other words, the client chooses the backend database for querying.

Another architecture is to have a middle proxy layer between MySQL and the client, where the client only connects to the proxy, and the proxy determines the distribution route of the request based on the request type and context.

img

Figure 2: Read-write separation architecture with proxy

Next, let's look at the characteristics of these two read-write separation architectures: direct client connection, and connection through a proxy.

  1. In the direct-connection scheme, there is no forwarding layer, so query performance is slightly better, the overall architecture is simpler, and troubleshooting is easier. However, the client must understand the backend deployment details: whenever there is a master-slave switch or a database migration, the client becomes aware of it and must update its connection information. You may think this is too burdensome for the client, with a lot of redundant information and an ugly architecture. In practice, though, most deployments that use this architecture also run a component such as ZooKeeper to manage the backend metadata, so the business side can still focus on business logic.
  2. The architecture with a proxy is more user-friendly for the client. The client does not need to pay attention to the backend details, and the connection maintenance and backend information maintenance are all done by the proxy. However, in this case, higher requirements are placed on the backend maintenance team. Moreover, the proxy also needs to have a highly available architecture. Therefore, the overall complexity of the architecture with a proxy is relatively higher.

After understanding the advantages and disadvantages of these two schemes, the specific choice depends on the capabilities provided by the database team. However, currently, the trend is moving towards the direction of the architecture with a proxy.

However, regardless of which architecture is used, you will encounter the problem we are going to discuss today: due to the possible delay between the master and the slave, if the client performs a query immediately after completing an update transaction, and the query selects the slave, it may read the state just before the transaction update.

In this article, we will refer to this phenomenon of “reading an outdated state on the slave” as “stale read”.

Earlier, we mentioned several possible reasons for master-slave delay and corresponding optimization strategies, but master-slave delay cannot be completely avoided.

In any architecture, the client expects the data results from the slave to be the same as those from the master.

Next, let’s discuss how to handle the issue of stale reads.

Here, I will summarize the solutions covered in this article, to help you better grasp the full picture. They are:

  • forcing reads to the primary database;
  • the sleep solution;
  • judging whether there is primary-standby delay;
  • combining with semi-sync;
  • waiting for the master position;
  • waiting for the GTID.

Forcing reads to the primary database #

This solution essentially classifies query requests. In general, we can divide them into the following two categories:

  1. For the requests that must obtain the latest results, they are forced to be sent to the primary database. For example, on a trading platform, after the seller publishes a product, they immediately return to the main page to see if the product was successfully published. In this case, this request needs to obtain the latest results, so it must be sent to the primary database.
  2. For the requests that can read old data, they are sent to the secondary database. On this trading platform, even if the buyer sees the latest published product a few seconds later, it is acceptable. Therefore, these requests can be sent to the secondary database.
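As a rough sketch, this classification can be as simple as a lookup table in the service's data-access layer. The request names and function below are made up for illustration; which requests count as "must be fresh" is a business decision, not something the database decides:

```python
# Hypothetical request classification for the force-to-primary solution.
MUST_BE_FRESH = {"seller_view_own_listing"}  # e.g. seller checks a just-published product

def choose_backend(request_type: str) -> str:
    """Send fresh reads to the primary, everything else to a replica."""
    return "primary" if request_type in MUST_BE_FRESH else "replica"

assert choose_backend("seller_view_own_listing") == "primary"
assert choose_backend("buyer_browse_listings") == "replica"
```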

You may think this solution looks timid and hacky, but in fact it is the most commonly used one.

Of course, the biggest problem with this solution is that sometimes you will meet the requirement that "no query may return a stale read", as in some financial businesses. In that case, you have to give up read-write separation entirely, putting all read and write pressure on the primary database, which is equivalent to giving up scalability.

Therefore, the topic we discuss next is: in scenarios that do use read-write separation, what solutions exist for handling stale reads, and what are the pros and cons of each.

The sleep solution #

After the primary database is updated, sleep for a while before reading from the replica. Concretely, before the query, execute a command similar to select sleep(1).

The assumption of this plan is that in most cases, the delay between the primary and backup databases is within 1 second, so by sleeping for a while, there is a high probability of obtaining the latest data.
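The idea fits in a few lines. In this sketch, run_query is an assumed helper that sends a statement to the replica over an already-open connection; it is not a real driver API:

```python
import time

def read_after_update(run_query, sql: str, delay: float = 1.0):
    """Naive sleep solution: wait `delay` seconds after the update commits,
    hoping replication has caught up, then read from the replica."""
    time.sleep(delay)      # plays the role of running "select sleep(1)" first
    return run_query(sql)
```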

Your first impression of this solution is probably "unreliable", and you may object that executing a sleep statement before every query hardly looks rigorous.

However, this idea can indeed solve the problem to some extent. In order to make it more reliable, we can try a different approach.

Taking the example of a seller publishing a product, after the product is published, use Ajax (Asynchronous JavaScript + XML) to directly display the client’s input content as a “new product” on the page, instead of actually querying the database.

In this way, the seller can confirm that the product has been successfully published through this display. When the seller refreshes the page to view the product, some time has passed, which achieves the purpose of waiting for a while and solves the problem of expired reads.

In other words, the sleep solution does solve the stale-read problem in scenarios like this one. Strictly speaking, though, the solution is imprecise, in two respects:

  1. If the query could have obtained the correct result on the replica within 0.5 seconds, it still waits a full second.
  2. If the delay exceeds 1 second, stale reads still occur.

Reading this far, you may feel like asking "are you kidding me?" The improved version does solve stale reads in Ajax-like scenarios, but it still looks anything but reliable. Don't worry: next, I will introduce some more precise solutions.

Judging that there is no primary-standby delay #

There are generally three ways to make sure the standby database has no delay before reading from it.

In the previous article, we learned that the seconds_behind_master value in the show slave status result measures the length of the primary-standby delay.

Therefore, the first method to ensure no delay between the primary and standby databases is to check whether seconds_behind_master is already equal to 0 before executing a query request on the standby database. If it is not equal to 0, the query request must wait until this parameter becomes 0 before it can be executed.

The unit of seconds_behind_master is seconds. If you feel that the precision is not enough, you can also use the method of comparing the positions and GTIDs to ensure no delay between the primary and standby databases, which are the second and third methods we will discuss next.
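A minimal sketch of this first method, assuming we already have the show slave status row parsed into a dict (the function name is illustrative):

```python
def caught_up(slave_status: dict) -> bool:
    """First method: treat the replica as usable only when
    Seconds_Behind_Master is exactly 0. A None value means the
    replication threads are not running."""
    return slave_status.get("Seconds_Behind_Master") == 0

# A query routine would poll this before reading from the replica:
assert caught_up({"Seconds_Behind_Master": 0})
assert not caught_up({"Seconds_Behind_Master": 2})
assert not caught_up({"Seconds_Behind_Master": None})
```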

As shown in Figure 3, it is a partial screenshot of the show slave status result.

img

Figure 3. show slave status result

Now, based on this result, let’s see specifically how to ensure no delay between the primary and standby databases by comparing positions and GTIDs.

The second method is to compare positions to ensure no delay between the primary and standby databases:

  • Master_Log_File and Read_Master_Log_Pos indicate the latest position read from the primary database;
  • Relay_Master_Log_File and Exec_Master_Log_Pos indicate the latest position executed by the standby database.

If Master_Log_File equals Relay_Master_Log_File and Read_Master_Log_Pos equals Exec_Master_Log_Pos, then all the logs the standby database has received have been executed.
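As a sketch, the position comparison again assumes the show slave status row is available as a dict:

```python
def positions_match(s: dict) -> bool:
    """Second method: the standby has executed everything it has
    received when both position pairs are identical."""
    return (s["Master_Log_File"] == s["Relay_Master_Log_File"]
            and s["Read_Master_Log_Pos"] == s["Exec_Master_Log_Pos"])

assert positions_match({
    "Master_Log_File": "binlog.000002", "Read_Master_Log_Pos": 120,
    "Relay_Master_Log_File": "binlog.000002", "Exec_Master_Log_Pos": 120,
})
assert not positions_match({
    "Master_Log_File": "binlog.000002", "Read_Master_Log_Pos": 154,
    "Relay_Master_Log_File": "binlog.000002", "Exec_Master_Log_Pos": 120,
})
```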

The third method is to compare the GTID sets to ensure no delay between the primary and standby databases:

  • Auto_Position=1 indicates that the GTID protocol is used in the primary-standby relationship;
  • Retrieved_Gtid_Set is the GTID set of all the logs received by the standby database;
  • Executed_Gtid_Set is the GTID set of all the logs already executed by the standby database.

If these two sets are the same, it also means that all the logs received by the standby database have been synchronized.
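The GTID-set comparison can be sketched the same way. Note this is deliberately simplified: it treats each set as a comma-separated list of per-server ranges, whereas in a real deployment you should let the server compare the sets, for example with select gtid_subset(retrieved, executed):

```python
def gtid_sets_equal(retrieved: str, executed: str) -> bool:
    """Third method, simplified: compare the two GTID sets after
    splitting them into per-server range strings."""
    def split(s: str) -> set:
        return {part.strip() for part in s.split(",") if part.strip()}
    return split(retrieved) == split(executed)

uuid = "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"  # illustrative server UUID
assert gtid_sets_equal(f"{uuid}:1-5", f"{uuid}:1-5")
assert not gtid_sets_equal(f"{uuid}:1-5", f"{uuid}:1-4")
```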

As can be seen, these two methods of comparing positions and comparing GTIDs are more accurate than checking whether seconds_behind_master is 0.

The method of checking whether the standby database has finished synchronizing before executing a query request, compared to the sleep method, indeed improves the accuracy, but it still does not reach the “accurate” level. Why is that?

Now let’s review the state of binlog in the primary and standby databases:

  1. The primary database executes the transaction, writes the binlog, and acknowledges the commit to the client.
  2. The binlog is sent from the primary to the standby database, which receives it.
  3. The standby database executes the binlog.

The logic we used to judge "no delay" is that all logs the standby database has received have been executed. But as the binlog states above show, there can still be logs for which the client has already received the commit acknowledgment while the standby database has not even received them yet.

This is shown in Figure 4.

img

Figure 4. The standby database has not received trx3

At this time, the primary database has completed the execution of three transactions, trx1, trx2, and trx3. Among them:

  1. trx1 and trx2 have been sent to the standby database and executed there.
  2. trx3 has been executed on the primary database and acknowledged to the client, but has not been sent to the standby database.

If you execute a query request on standby database B at this time, according to the logic above, the standby database will think that there is no synchronization delay, but trx3 cannot be queried. Strictly speaking, this is an example of a stale read.

So, is there a solution to this problem?

Combining with semi-sync #

To solve this problem, we need to introduce semi-synchronous replication, known as semi-sync.

Semi-sync has the following design:

  1. When a transaction is committed, the primary database sends the binlog to the standby database.
  2. After the standby database receives the binlog, it sends an ack back to the primary database to indicate that it has received it.
  3. Only after the primary database receives this ack can it return a confirmation to the client that the transaction is complete.

In other words, if semi-sync is enabled, it means that all transactions that have sent a confirmation to the client have been guaranteed that the standby database has received the log.

In the comment section of the previous article, a reader asked: if the primary database loses power and some binlogs have not yet been sent to the standby database, will data be lost?

The answer is, if using regular asynchronous replication, data loss is possible, but semi-sync can solve this problem.

In this way, by using semi-sync in combination with the previous positioning judgment, query requests executed on the standby database can avoid stale reads. However, the semi-sync + position-based approach only works for the one-master-one-replica scenario. In a one-master-multiple-replica scenario, the master only needs to wait for an acknowledgment from one replica before returning the confirmation to the client. At this point, there are two situations when executing a query request on a replica:

  1. If the query is executed on the replica that has acknowledged, it can ensure that it reads the latest data.
  2. However, if the query is executed on other replicas, which may not have received the latest logs, it can result in stale reads.
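This one-ack behavior can be modeled in a few lines. This is a toy model of the protocol, not how the server is actually implemented:

```python
def semi_sync_commit(event: str, replicas: list) -> str:
    """Toy model: the primary ships the binlog and acknowledges the client
    as soon as the FIRST replica acks; the others may still be behind."""
    first = replicas[0]              # assume this replica acks first
    first["relay_log"].append(event)
    return first["name"]             # the client now sees the commit as done

b1 = {"name": "B1", "relay_log": []}
b2 = {"name": "B2", "relay_log": []}
semi_sync_commit("trx3", [b1, b2])
# B1 can serve a fresh read of trx3; a query routed to B2 is a stale read.
```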

The position-based (or GTID-based) no-delay judgment has another potential issue. During peak update periods, the master's position or GTID set advances quickly, the two equality checks above are never satisfied, and the replica may be unable to answer query requests for a long time.

Going back to our original business logic: when we initiate a query request, we do not actually need to wait for the master and replica to be completely in sync.

Why is that? Let’s take a look at this timeline.

img

Figure 5. Continuous delay of one transaction on the master and replica

Figure 5 shows a bad case for the wait position approach. The dashed boxes under replica B in the figure represent the transactions in both the relay log and binlog. As we can see, from state 1 to state 4 in Figure 5, it always stays in a state of delaying one transaction.

Replica B is continuously delayed until state 4 and is lagging behind the master A. If we use the approach that must wait until there is no delay, the select statement cannot be executed until state 4.

However, in reality, the client initiates the select statement after updating trx1, and we only need to ensure that trx1 has been executed before executing the select statement. In other words, if the select statement is executed at state 3, it will return the expected result.

To summarize, the semi-sync combined with the check for no delay on the master and replica has two problems:

  1. In a one-master-multiple-replica scenario, there may be stale reads when executing query requests on certain replicas.
  2. In the case of continuous delay, it may result in excessive waiting.

Next, I will introduce the approach of waiting for the master’s position, which can solve these two problems.

Wait for Master’s Position Approach #

To understand the approach of waiting for the master’s position, I need to introduce a command:

select master_pos_wait(file, pos[, timeout]);

The logic of this command is as follows:

  1. It is executed on the replica.
  2. The parameters file and pos refer to the filename and position on the master.
  3. timeout is optional, and setting it to a positive integer N means that the function waits for a maximum of N seconds.

The normal return value of this command is a positive integer M, the number of binlog events the replica executed while waiting, from when the command started until it reached the position given by file and pos.

Of course, in addition to returning a normal positive integer M, this command also returns some other results, including:

  1. If the replica's replication thread encounters an exception during execution, NULL is returned.
  2. If the wait exceeds N seconds, -1 is returned.
  3. If it is found at the beginning of execution that this position has already been executed, 0 is returned.

To ensure that we can get the correct data when executing a select query after updating trx1 in the logic illustrated in Figure 5, we can use the following steps:

  1. After completing the trx1 transaction update, immediately execute show master status to get the current File and Position executed on the master.
  2. Choose a replica to execute the select query.
  3. Execute select master_pos_wait(File, Position, 1) on the replica.
  4. If the return value is a non-negative integer, execute the select query on this replica.
  5. Otherwise, execute the select query on the master.

This process is illustrated in the flowchart below.

img

Figure 6. Approach using master_pos_wait

Here, let’s assume that this select query can wait for a maximum of 1 second on the replica. So, if the value returned by master_pos_wait is a non-negative integer (indicating that the waiting is within 1 second), it ensures that the query executed on the replica contains the data of trx1.

Step 5, executing the query on the master, is a common fallback mechanism in this type of approach. Because the replica’s delay time is not controllable and cannot wait indefinitely, if the wait times out, it should be abandoned and the query should be executed on the master.
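Steps 1 to 5 can be sketched as follows. Here exec_primary and exec_replica stand for sending a statement over an already-open connection and returning the result; they are assumed helpers, not a real driver API:

```python
def query_with_pos_wait(exec_primary, exec_replica, sql: str, timeout: int = 1):
    """Wait-for-master-position flow: read the replica only if it reaches
    the master's current binlog position within `timeout` seconds."""
    status = exec_primary("show master status")                   # step 1
    binlog_file, binlog_pos = status["File"], status["Position"]
    ret = exec_replica(                                           # step 3
        f"select master_pos_wait('{binlog_file}', {binlog_pos}, {timeout})")
    if ret is not None and ret >= 0:                              # step 4
        return exec_replica(sql)
    return exec_primary(sql)                                      # step 5: fall back
```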

You may ask: if all the replicas are delayed for more than 1 second, won't all the query pressure land on the master? That is indeed the case.

However, under the requirement of no stale reads, there are only two choices: either time out and abandon the query, or fall back to querying the master. Which one to choose depends on the business side implementing a sensible throttling strategy.

Wait for GTID Approach #

If your database is in GTID mode, there is also an approach to wait for GTID.

For this, MySQL provides a similar command:

select wait_for_executed_gtid_set(gtid_set, 1);

The logic of this command is:

  1. Wait until the GTID set executed by this database contains the passed gtid_set, then return 0;
  2. If the wait times out, return 1.

In the wait-position scheme, after executing the transaction we still need to actively run show master status on the primary database. Starting with MySQL 5.7.6, the server can return the transaction's GTID to the client when the update commits, which removes one round trip from the wait-GTID scheme.

At this time, the execution flow of waiting for GTID becomes:

  1. After the trx1 transaction is completed, obtain the GTID of this transaction directly from the returned package, denoted as gtid1;
  2. Select a slave to execute the query statement;
  3. Execute select wait_for_executed_gtid_set(gtid1, 1) on the slave database;
  4. If the return value is 0, execute the query statement on this slave database;
  5. Otherwise, execute the query statement on the primary database.

Similarly to the wait master position scheme, whether to directly query the primary database after a timeout needs to be considered for flow control by the business development team.

I have drawn a flowchart for this process.

img

Figure 7 wait_for_executed_gtid_set scheme
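The five steps above can be sketched in the same style as the wait-position flow. As before, exec_primary and exec_replica are assumed helpers for sending a statement over an open connection, not a real driver API:

```python
def query_with_gtid_wait(exec_primary, exec_replica, sql: str,
                         gtid: str, timeout: int = 1):
    """Wait-for-GTID flow: read the replica only if it has executed the
    transaction's GTID within `timeout` seconds."""
    ret = exec_replica(                                           # step 3
        f"select wait_for_executed_gtid_set('{gtid}', {timeout})")
    if ret == 0:                                                  # step 4
        return exec_replica(sql)
    return exec_primary(sql)                                      # step 5: fall back
```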

In step 1 above, we obtain the transaction's GTID directly from the returned packet. The question is: how can MySQL be made to include the GTID in the packet it returns after a transaction?

You only need to set the parameter session_track_gtids to OWN_GTID, and then parse the GTID value out of the returned packet with the API function mysql_session_track_get_first.

In the first article of this column, when I introduced mysql_reset_connection, a student in the comments asked how to use this type of interface.

Here, I will answer again. In fact, MySQL does not provide SQL usage for this type of interface; it is provided for programming APIs (https://dev.mysql.com/doc/refman/5.7/en/c-api-functions.html).

For example, to make the client display the returned GTID after the transaction is committed, I made some modifications to the MySQL client code as shown below:

img

Figure 8 Display GTID of update transaction - code

In this way, you can see that after the statement is executed, the value of GTID is displayed.

img

Figure 9 Display GTID of update transaction - effect

Of course, this is just an example. When you want to use this scheme, you should call the function mysql_session_track_get_first in your client code.

Summary #

In today’s article, I introduced to you the reasons for possible stale reads in the case of read-write separation with one primary and multiple replicas, as well as several solutions.

Some of these schemes may seem compromised, and some may not seem reliable, but they all have practical application scenarios, so you need to choose according to business requirements.

Even the last two schemes, wait-position and wait-GTID, seem more reliable but still require trade-offs: if all the replicas fall behind, do we let all the requests fall on the primary database, risking a sudden load spike that brings it down?

In fact, in practical applications, these schemes can be mixed.

For example, first classify the requests on the client side, distinguish which requests can accept stale reads and which requests cannot; then use the wait GTID or wait position scheme for statements that cannot accept stale reads.

However, in essence, stale reads are caused by one write and multiple reads. In practical applications, there may be other database schemes that do not require waiting and can be horizontally scaled, but this is often achieved by sacrificing write performance, that is, a trade-off between read and write performance.

Finally, let me leave you with a question.

Assume that your system adopts the wait GTID scheme we introduced in this article. Now suppose you need to perform a DDL on a large table on the primary database, what might happen? How would you avoid this situation?

You can write your analysis and solution design in the comments, and I will discuss this question with you in the next article. Thank you for reading, and feel free to share this article with more friends.

Answer to the Previous Question #

The question I posed last time was, in GTID mode, what should be done if a new replica connects to the primary database but the required binlog is no longer available?

@某、人 provided a detailed analysis in response, and I have made some slight modifications to the answer.

  1. If the business allows for inconsistencies between the primary and replica, you can execute show global variables like 'gtid_purged' on the primary database to get the GTID set that has been deleted from the primary database, let’s call it gtid_purged1. Then, execute reset master on the replica, followed by set global gtid_purged = 'gtid_purged1', and finally execute start slave. The replica will start synchronizing from the binlog that still exists on the primary database. The missing part of the binlog on the replica may result in data loss and inconsistent primary and replica.
  2. If data consistency is required between the primary and replica, it is best to rebuild the replica.
  3. If another replica has a complete set of binlogs, you can first connect the new replica to this replica that has preserved the complete set of binlogs. After catching up with the log, if necessary, connect it back to the primary database.
  4. If the binlog is backed up, you can apply the missing binlog on the replica first, and then execute start slave.