
25 How MySQL Ensures High Availability #

In the previous article, I introduced the basic concept of binlog. In a master-slave relationship, each slave receives the binlog from the master and executes it.

Under normal circumstances, as long as all binlogs generated by updates executed on the master can be transmitted to the slave and executed correctly, the slave can reach a consistent state with the master. This is called eventual consistency.

However, eventual consistency alone is not enough for MySQL to provide high availability. Why is that? Today, I will analyze this question with you.

Here, let me present again the diagram of the master-slave switch process in the double M structure mentioned in the previous article.


Figure 1 Master-Slave Switch Process - Double M Structure

Replication Delay #

Master-slave switch may be an active operation, such as software upgrade or planned downtime of the machine where the master is located. It can also be a passive operation, such as a power outage on the machine where the master is located.

Next, let’s take a look at the scenario of active switch.

Before explaining the detailed steps of the active switch process, I need to explain a concept to you first: "replication delay". The time points related to data synchronization mainly include the following:

  1. Master server A completes the execution of a transaction and writes it into the binlog. We will refer to this moment as T1;
  2. Then it is passed to the slave server B. We will refer to the moment when slave server B receives this binlog as T2;
  3. Slave server B completes the execution of this transaction. We will refer to this moment as T3.

The so-called replication delay refers to the time difference between the completion of execution on the slave and the completion of execution on the master for the same transaction, that is, T3-T1.

You can execute the “show slave status” command on the slave server, and its return result will display “seconds_behind_master”, which indicates the number of seconds the slave server is delayed.

The calculation method for “seconds_behind_master” is as follows:

  1. The binlog of each transaction contains a timestamp field, which records the time when it was written on the master server;
  2. The slave server extracts the value of the timestamp field for the currently executing transaction and calculates the difference between it and the current system time to obtain “seconds_behind_master”.

From this, we can see that the seconds_behind_master parameter actually computes T3-T1. So we can use seconds_behind_master as the value of the replication delay, and the precision of this value is in seconds.
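
For reference, this is roughly how you would read the value on the slave; the output below is abridged to the single field discussed here, with all other fields omitted:

mysql> show slave status\G
        ...
        Seconds_Behind_Master: 0
        ...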

You may ask: if the system clocks of the master and slave servers are not synchronized, will that make the replication delay value inaccurate?

Actually, it won’t. When the slave server connects to the master server, it acquires the current system time of the master server by executing the “SELECT UNIX_TIMESTAMP()” function. If at this time it discovers that the system time of the master server is different from its own, the slave server will automatically subtract this difference when calculating seconds_behind_master.
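
To make this concrete, here is a hedged sketch of the idea; the bookkeeping itself happens inside the replication code, and the statement below is simply the probe the slave issues:

-- executed against the master when the replication connection is established;
-- the slave compares the result with its own UNIX_TIMESTAMP(), remembers the
-- difference, and subtracts that difference from every later
-- seconds_behind_master calculation
mysql> SELECT UNIX_TIMESTAMP();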

It should be noted that when the network is normal, the time required for the log to be transferred from the master to the slave is very short, that is, the value of T2-T1 is very small. This means that under normal network conditions, the main source of replication delay is the time difference between the slave server receiving the binlog and completing the execution of the transaction.

Therefore, the most direct manifestation of replication delay is that the speed at which the slave consumes relay logs is slower than the speed at which the master generates binlogs. Next, I will analyze with you the possible reasons for this.

Sources of Replication Delay #

First of all, under certain deployment conditions, the machine where the slave is located may have worse performance than the machine where the master is located.

The usual thinking behind such a deployment is that, since the slave serves no requests, a slightly inferior machine is good enough. Some teams, for example, spread 20 master databases across 4 machines while concentrating all the slave databases on a single machine.

In fact, update requests put the same IOPS pressure on the master and the slave. Therefore, when deploying this way, the slave is generally set to a "not double 1" mode.
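
For context, "double 1" generally refers to setting both sync_binlog=1 and innodb_flush_log_at_trx_commit=1; a "not double 1" slave relaxes one or both of these. A minimal sketch of such a relaxed setting follows; whether these exact values are acceptable depends on how much crash safety you are willing to trade on the slave:

-- "not double 1": trade some crash safety on the slave for lower IOPS pressure
set global sync_binlog = 0;                     -- do not fsync the binlog at every commit
set global innodb_flush_log_at_trx_commit = 2;  -- flush the redo log only to the OS cache at commit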

However, in practice, applying updates also triggers a large number of read operations. So when the multiple slaves on that one machine compete for resources, replication delay may result.

Of course, this type of deployment is rare now. Because the master-slave relationship may switch, and the slave server may become the master server at any time, it is now more common to use machines of the same specifications for the master and slave servers and deploy them symmetrically.

Follow-up question: However, even with symmetric deployment, replication delay may still occur. Why is that?

This brings us to the second common possibility: the slave server is under heavy pressure. The usual reasoning is that, since the master handles the write traffic, the slave can take on some of the read traffic. In addition, analytical queries needed by operational back-office systems must not affect normal business, so they can only run on the slave.

I have indeed seen many such cases. Because the master directly affects the business, people tend to use it with restraint, but they neglect pressure control on the slave. As a result, queries on the slave consume a large amount of CPU, slowing down synchronization and causing replication delay.

In this case, we generally proceed as follows:

  1. One master and multiple slaves. In addition to the backup, you can have multiple slave databases to share the reading pressure.
  2. Output the binlog to an external system, such as Hadoop, to enable the external system to provide statistical query capabilities.

Of these, the one-master-multiple-slaves approach is used in most cases, because a database system must also be able to take regular full backups, and slave databases are well suited for this purpose.

Note: Here I need to explain that the concepts of slave databases and backup databases are actually very similar. In this column, for convenience of description, I refer to the database that will be selected as the new master database during the HA process as a backup database, and the others are referred to as slave databases.

Follow-up question 2: With one master and multiple slaves ensuring that the backup database is under no more pressure than the master, what other situations can still cause replication delay between the master and the backup?

This brings us to the third possibility, which is a large transaction.

Large transactions are easy to understand: a transaction must finish executing and be written to the binlog on the master before it can be transmitted to the backup database. So if a statement on the master takes 10 minutes to run, that transaction is likely to cause a 10-minute delay on the backup database.

I don’t know if your company’s DBA has told you not to delete too much data at once using a delete statement. In fact, this is a typical large transaction scenario.

For example, with archival-type data, if historical rows are not cleaned up regularly and you wait until the disk space is almost full, developers end up having to delete a large amount of historical data in one go. At the same time, because they want to avoid affecting the business during peak hours (at least having this awareness is good), they run these large delete operations at night.

As a result, the on-call DBA receives a replication delay alert in the middle of the night. The DBA team then asks that, for future deletions, you control the amount of data deleted per transaction and split the work into multiple smaller deletes, as sketched below.
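
The sketch below shows one common way to do this; the table name, column, and batch size are hypothetical, and each statement is committed as its own small transaction:

-- instead of one huge delete such as:
--   delete from ops_log where create_time < '2019-01-01';
-- delete in small batches and repeat until no rows are affected:
delete from ops_log where create_time < '2019-01-01' limit 1000;
-- rerun this (for example from a script or a scheduled job) until it reports "0 rows affected"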

Another typical large-transaction scenario is DDL on a large table. I mentioned this scenario in a previous article; the countermeasure is to use a planned DDL approach such as gh-ost (you can review the relevant content in the 13th article, "Why Doesn't the Table File Size Change When Half of the Table Data is Deleted?").

Follow-up question 3: If there are no large transactions on the master database, what other factors can cause replication delay between the master and the backup?

Another major reason for the delay between the master and the backup is the parallel replication capability of the backup database. I will discuss this topic in detail in the next article.

In fact, there are quite a few other situations that can cause delays between the master and the backup. If you have encountered other scenarios, feel free to leave me a comment in the comment section, and we can analyze and discuss them together.

Because replication delay exists, there are correspondingly different strategies for performing the master-slave switch.

Reliability-first Strategy #

In the double M structure shown in Figure 1, the detailed process of switching from state 1 to state 2 is as follows (a command-level sketch follows the list):

  1. Check the value of seconds_behind_master on backup database B. If it is less than a certain value (for example, 5 seconds), proceed to the next step; otherwise, keep retrying this step.
  2. Set master database A to read-only state by setting read_only to true.
  3. Check the value of seconds_behind_master on backup database B until this value becomes 0.
  4. Set backup database B to read-write state by setting read_only to false.
  5. Switch the business requests to backup database B.
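
A command-level sketch of these five steps might look like the following; this only illustrates the sequence and is not the exact command set of any particular HA tool:

-- step 1, on backup database B: check the delay, retry until it is below the threshold
show slave status\G                 -- look at Seconds_Behind_Master

-- step 2, on master database A: stop accepting writes
set global read_only = on;

-- step 3, on backup database B: wait until Seconds_Behind_Master reaches 0
show slave status\G

-- step 4, on backup database B: start accepting writes
set global read_only = off;

-- step 5: redirect business connections to B (done outside MySQL, e.g. in the proxy or DNS layer)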

This switch process is usually completed by a dedicated HA system, which we temporarily refer to as the reliability-first process.


Figure 2 MySQL reliability-first master-slave switch process

Note: SBM is an abbreviation for the seconds_behind_master parameter.

As you can see, there is downtime in this switch process: after step 2, both master database A and backup database B are in a read-only state, which means the system cannot accept writes until step 5 is completed.

Within this unwritable window, the most time-consuming part is step 3, which may take several seconds. This is why the check in step 1 is needed first, to ensure that seconds_behind_master is already small enough.

Imagine that the replication delay is already 30 minutes and the switch is performed directly without this check: the system would then be unavailable for 30 minutes, which is generally unacceptable to the business.

Of course, the system’s downtime is determined by this data reliability-first strategy. You can also choose an availability-first strategy to reduce the downtime to almost 0.

Availability-first Strategy #

If I forcibly move steps 4 and 5 to the very beginning, that is, without waiting for the master and backup to finish synchronizing, I switch the connections to backup database B right away and make B writable, then the system has almost no downtime.

We temporarily refer to this switch process as the availability-first process. The cost of this switch process is the possibility of data inconsistency.

Next, I will share an example of data inconsistency caused by the availability-first process. Let’s assume there is a table t:

mysql> CREATE TABLE `t` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `c` int(11) unsigned DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

insert into t(c) values(1),(2),(3);

This table defines an auto-increment primary key id. After the data is initialized, there are 3 rows of data on both the master and backup databases. Next, the business user executes two insert statements on table t in succession:

insert into t(c) values(4);
insert into t(c) values(5);

Suppose the master now has a large volume of updates to other tables, causing a 5-second replication delay to the backup database. After the insert statement with c=4 is executed, a master-backup switch is initiated.

Figure 3 depicts the switch process and data results when the availability-first strategy and binlog_format=mixed are used.


Figure 3 Availability-First Strategy with binlog_format=mixed

Now, let’s analyze the switch process together:

  1. In step 2, master database A executes the insert statement, inserting the row (4,4), and then the master-backup switch begins.
  2. In step 3, because there is a 5-second replication delay, backup database B has not yet had a chance to apply the relay log for "insert c=4" when it receives the client command "insert c=5".
  3. In step 4, backup database B inserts the row (4,5) and sends this binlog to master database A.
  4. In step 5, backup database B applies the relay log for "insert c=4" and inserts the row (5,4). Meanwhile, the "insert c=5" statement that was executed directly on backup database B is sent to master database A, where it inserts a new row (5,5).

The end result is that master database A and backup database B each end up with two rows of inconsistent data. As you can see, this data inconsistency is caused by the availability-first process.

So, what happens if I still use the availability-first strategy but set binlog_format=row?

Because the row format records the values of all fields of the newly inserted row in the binlog, there is only one inconsistent row in the end. Moreover, the apply threads on both sides report a duplicate-key error and stop: unlike in the mixed-format case, neither a (5,4) row on backup database B nor a (5,5) row on master database A is ever produced.
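
If you want to see this state on either node, here is a hedged sketch of what to look for; the output is abridged, and 1062 is MySQL's duplicate-key error code:

show slave status\G
-- abridged output on the node whose apply thread has stopped:
--     Slave_SQL_Running: No
--        Last_SQL_Errno: 1062
--        Last_SQL_Error: ... Duplicate entry '4' for key 'PRIMARY' ...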

In Figure 4, I have depicted the detailed process, and you can analyze it further.


Figure 4 Availability-First Strategy with binlog_format=row

From the above analysis, you can draw some conclusions:

  1. When using the row format for binlog, data inconsistency issues are easier to discover. When using the mixed or statement format for binlog, data inconsistency can often go unnoticed. If you discover data inconsistency after a long time, it is likely that the data inconsistency is no longer traceable, or it has led to even more logical inconsistencies in the data.
  2. The availability-first strategy for main-standby switching can lead to data inconsistency. Therefore, in most cases, I recommend using the reliability-first strategy. After all, for data services, data reliability generally takes precedence over availability.

However, nothing is absolute. Are there scenarios where data availability has higher priority than reliability?

The answer is yes.

I encountered a scenario like this before:

  • One of the databases was used to record operational logs. If its data became inconsistent, the inconsistency could be repaired afterward from the binlog, and this temporary inconsistency did not cause any business problems.
  • At the same time, the business system depended on writes to this log: if the database could not be written to, online business operations could not execute.

In this case, you may need to choose to forcefully switch first and then repair the data afterward.

Of course, in the retrospective afterward, we identified an improvement: the business logic should not depend on writes to this log. In other words, the log-writing module should be able to degrade gracefully, for example by writing to a local file or to another temporary database instead.

In this case, the reliability-first strategy can be used again.

Next, let's look at what happens when an abnormal switch is performed under the reliability-first approach. Suppose the replication delay between master database A and backup database B is 30 minutes, and at this moment master database A loses power, so the HA system has to switch B to be the new master. For an active switch, we said earlier that we could wait until the delay dropped below 5 seconds before starting; here, there is no such choice.


Figure 5: Reliability-first strategy, master database unavailable

If we follow the reliability-first strategy, we must wait until seconds_behind_master on backup database B reaches 0 before switching. But the situation is now more severe than before: it is no longer just a read-only, unwritable system; the system is completely unavailable, because after master database A loses power, our connections have not yet been switched over to backup database B.

You may ask: can't we switch the connections to backup database B right away, but keep B read-only?

That won’t work either.

At this point, the relay log has not yet been fully applied. If we perform the master-backup switch right away, client queries will not see transactions that were already executed on the old master, which looks like "data loss".

Although this data reappears as the relay log continues to be applied, for some businesses it is unacceptable to ever read a state of "temporary data loss".

By now you should see that, under the premise of guaranteeing data reliability, the availability of a MySQL high availability system depends on the replication delay: the smaller the delay, the less time the service needs to recover when the master fails, and the higher the availability.

Summary #

In this article, I first introduced the basics of MySQL high availability, namely the logic behind the master-backup switch. I then discussed several situations that can cause replication delay and the corresponding directions for improvement.

Because replication delay exists, there are different choices of switch strategy, so I went on to analyze the differences between the reliability-first and availability-first strategies.

In practical applications, I recommend the reliability-first strategy. After all, keeping the data correct should be the bottom line of a database service. On that basis, system availability can be improved by reducing the replication delay.

Finally, I’ll leave you with a question.

Most database monitoring systems now include replication delay monitoring, which works by executing "show slave status" on the backup database and collecting the value of seconds_behind_master.

Suppose you see a delay monitoring graph for a backup database you maintain, similar to Figure 6: a line segment sloping upward at a 45° angle. What do you think could cause this, and how would you confirm it?


Figure 6: Slave delay

You can write your analysis in the comments, and I will discuss this question with you at the end of the next article. Thank you for listening, and feel free to share this article with more friends for reading.

Answer to the previous question #

The question I left you with last time was about the circumstances under which a bidirectional replication (double M) structure would encounter circular replication.

One scenario is this: an update transaction is executed on the master, and then the command "set global server_id=x" is used to change its server_id. When the log is passed back around, the node finds that the server_id in the log differs from its own, so it can only execute it, and circular replication begins.

Another scenario involves three nodes, as shown in Figure 7: trx1 is executed on node B, so the server_id in the binlog is B's. The binlog is passed to node A, and then A and A' set up a bidirectional replication (double M) structure. Because the server_id in the log belongs to neither A nor A', the log keeps being applied and passed back and forth, resulting in circular replication.


Figure 7: Circular replication in three-node structure

This type of circular replication can occur during database migration.

If circular replication occurs, you can execute the following commands on node A or A':

stop slave;
CHANGE MASTER TO IGNORE_SERVER_IDS=(server_id_of_B);
start slave;

This way, when this node receives logs carrying B's server_id, it no longer executes them. After a while, execute the following commands to change the value back:

stop slave;
CHANGE MASTER TO IGNORE_SERVER_IDS=();
start slave;