24 How MySQL Ensures Master-Slave Consistency #

In previous articles, I have mentioned binlog multiple times. As we all know, binlog can be used for archiving and for primary-standby synchronization. But what exactly is its content? And why does executing binlog on the standby allow it to stay consistent with the primary? Today, I will formally introduce it to you.

There is no exaggeration in saying that binlog has played a crucial role in making MySQL the most popular open-source database today.

At the beginning, MySQL was favored by developers because of its easy-to-learn and convenient high availability architecture. Almost all of its high availability architectures directly rely on binlog. Although these high availability architectures have become increasingly complex, they all evolved from the most basic one primary and one standby.

In this article, I will mainly introduce the basic principles of primary-standby. By understanding the underlying design principles, you can also learn from these design ideas from a business development perspective.

Basic Principles of MySQL Primary-Standby #

Figure 1 shows the basic process of primary-standby switching.

Figure 1 Process of Primary-Standby Switching

In state 1, the client’s read and write operations directly access node A, while node B is the standby of A, only synchronizing the updates from A and executing them locally. This ensures that nodes B and A have the same data.

When a switch is needed, it transitions to state 2. At this time, the client’s read and write operations all access node B, while node A becomes the standby of B.

In state 1, although node B is not accessed directly, I still recommend setting node B (the standby) in read-only mode. There are several considerations for doing this:

Sometimes, some operational queries are placed on the standby for querying purposes, setting it to read-only can prevent accidental operations.
Prevent bugs in the switch logic, such as dual writes during the switching process, causing inconsistency between the primary and standby.
Use the read-only status to determine the role of the node.

You may wonder, if I set the standby to read-only, how can it stay synchronized with the primary?

You don’t have to worry about this issue. This is because the read-only setting is ineffective for super users, and the thread used for synchronization updates has superuser privileges.

Next, let’s take a look at the internal process of the line from node A to B. Figure 2 shows the complete flowchart of an update statement being executed on node A and synchronized to node B.

Figure 2 Flowchart of Primary-Standby Process

Figure 2 contains the details about the writing mechanism of binlog and redo log that I mentioned in the previous article. You can see that after the primary receives an update request from the client, it processes the internal transaction update logic and writes to binlog at the same time.

A long connection is maintained between standby B and primary A. Internal to primary A, there is a thread specifically for serving the long connection of standby B. The complete process of synchronizing a transaction log is as follows:

On standby B, use the “change master” command to set the IP, port, username, password of primary A, as well as the position from which to request binlog. This position includes the file name and log offset.
Execute the “start slave” command on standby B. This causes the standby to start two threads, namely the io_thread and sql_thread shown in the figure. The io_thread is responsible for establishing a connection with the primary.
After primary A verifies the username and password, it begins to read binlog from the local storage based on the position passed by standby B, and sends it to B.
After standby B receives binlog, it writes it to a local file, known as a relay log.
The sql_thread reads the relay log, parses the commands within the log, and executes them.

It should be noted that with the introduction of multi-threaded replication solutions, the sql_thread has evolved into multiple threads, which is not directly related to the principles we are discussing today, so I won’t go into details for now.

Now that we have analyzed the logic of this long connection, let’s look at another question: what exactly is the content of binlog, and why can the standby execute it directly?

Comparison of the Three Formats of binlog #

In the [15th Q&A article], I mentioned that there are two formats of binlog, statement-based and row-based. You may also come across a third format in other materials, called mixed format, which is actually a combination of the first two formats.

To describe the differences between the three formats of binlog, I created a table and initialized a few rows of data.

mysql> CREATE TABLE `t` (
  `id` int(11) NOT NULL,
  `a` int(11) DEFAULT NULL,
  `t_modified` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `a` (`a`),
  KEY `t_modified`(`t_modified`)
) ENGINE=InnoDB;

insert into t values(1,1,'2018-11-13');
insert into t values(2,2,'2018-11-12');
insert into t values(3,3,'2018-11-11');
insert into t values(4,4,'2018-11-10');
insert into t values(5,5,'2018-11-09');

If you want to delete a row of data in a table, let’s take a look at how the delete statement’s binlog is recorded.

Please note that the following statement contains comments. If you use the MySQL client to conduct this experiment, be sure to add the -c parameter, otherwise the client will automatically remove the comments.

mysql> delete from t /*comment*/  where a>=4 and t_modified<='2018-11-10' limit 1;

When binlog_format=statement, the binlog records the original text of the SQL statement. You can use the command mysql> show binlog events in 'master.000001'; to view the contents of the binlog.

Figure 3 shows an example of a binlog in statement format.

Now, let’s take a look at the output of Figure 3.

The first line SET @@SESSION.GTID_NEXT='ANONYMOUS' can be ignored for now, we will mention it in the article about master-slave switching.
The second line is a BEGIN and corresponds to the COMMIT on the fourth line, indicating that it is a transaction in between.
The third line is the actual executed statement. We can see that before the actual delete command, there is a USE 'test' command. This command is not executed by us, but is automatically added by MySQL based on the database where the table to be operated on is located. This ensures that when the log is transmitted to the backup server for execution, it can update the table t in the test database correctly regardless of the current working thread’s database. The delete statement after the USE 'test' command is the original SQL statement we inputted. We can see that the binlog “faithfully” records the SQL command, even including the comments.
The last line is a COMMIT. You can see that it contains xid=61. Do you remember what this XID is used for? If you have a fuzzy memory, you can review the relevant content in the 15th article.

To illustrate the difference between statement and row formats, let’s take a look at the execution effect of this delete command:

Figure 4 shows the warnings generated by running this delete command.

We can see that running this delete command produced a warning because the current binlog is set to statement format and the statement contains a LIMIT clause, so this command may be unsafe.

Why is that? This is because when DELETE has a LIMIT clause, it may cause data inconsistency between the master and the slave. For example, in the above example:

If the delete statement uses the index a, it will find the first row that meets the condition based on the index a, which means it will delete the row where a=4.
However, if it uses the index t_modified, it will delete the row where t_modified='2018-11-09', which means it will delete the row where a=5.

Since the statement format records the original text of the statement in the binlog, it is possible that when executing this SQL statement on the master, the index a is used, while on the slave, the index t_modified is used. Therefore, MySQL considers such writing to be risky.

So, if I change the binlog format to binlog_format='row', will there be no such problem? Let’s take a look at the content of the binlog in this case.

Figure 5 shows an example of a binlog in row format.

As we can see, compared to the binlog in statement format, the BEGIN and COMMIT before and after are the same. However, the binlog in row format does not contain the original text of the SQL statement, but replaces it with two events: Table_map and Delete_rows.

The Table_map event is used to indicate the table to be operated on next is the table t in the test database.
The Delete_rows event is used to define the deletion action.

In fact, we can’t see the detailed information through Figure 5. We still need to use the mysqlbinlog tool and the following command to parse and view the contents of the binlog. Because the information in Figure 5 shows that the binlog of this transaction starts from position 8900, we can specify the start position of the log with the --start-position parameter.

mysqlbinlog -vv data/master.000001 --start-position=8900;

Detailed Information on Example of binlog Format in row Style #

From this diagram, we can see several pieces of information:

server id 1: This indicates that the transaction was executed on the database with server_id=1.
Each event has a CRC32 value. This is because I have set the binlog_checksum parameter to CRC32.
The Table_map event is the same as what we saw in figure 5. It displays the table that will be opened next, which is mapped to the number 226. In this SQL statement, we are only operating on one table. But what if we need to work with multiple tables? Each table will have a corresponding Table_map event that will be mapped to a separate number to differentiate operations on different tables.
In our mysqlbinlog command, we used the -vv parameter to parse and display all the content. So, you can see the values of each field in the result (e.g., @1=4, @2=4, etc.).
The default configuration for binlog_row_image is FULL. Therefore, the Delete_event contains the values of all fields of the deleted rows. If binlog_row_image is set to MINIMAL, then only necessary information will be recorded. In this example, it would be only the information that id=4 was deleted.
The final Xid event is used to indicate that the transaction has been correctly committed.

As you can see from the diagram, when the binlog_format is set to row format, the binlog records the actual primary key id of the deleted row. Therefore, when the binlog is replicated to the standby database, the row with id=4 will definitely be deleted, avoiding any inconsistencies between the primary and standby databases.

Why do we have mixed binlog format? #

Based on the information above, let’s discuss the following question: Why do we have the mixed binlog format? The reasoning is as follows:

Because some statement format binlogs may cause inconsistencies between the primary and standby databases, it is necessary to use row format.
However, row format has the disadvantage of taking up more space. For example, if you delete 100,000 rows with a statement, it will only occupy a few tens of bytes of space in the binlog. But if you use row format, you need to write all these 100,000 records to the binlog. This not only takes up more space but also consumes IO resources, affecting performance.
Therefore, MySQL came up with a compromise solution, which is the mixed format of the binlog. Mixed format means that MySQL itself will determine whether a SQL statement may cause inconsistencies between the primary and standby databases. If there is a possibility, row format will be used; otherwise, statement format will be used.

In other words, the mixed format can take advantage of the benefits of the statement format while avoiding the risk of inconsistent data.

Therefore, if the binlog format of your online MySQL is set to statement, then it can be considered an unreasonable setting. At least, you should set the binlog format to mixed.

For example, in our case, when set to mixed, it will be recorded as row format. If the executed statement removes the limit 1, it will be recorded as statement format.

Of course, I have to say that now there are more and more scenarios requiring MySQL’s binlog format to be set to row. There are many reasons for this, and I’ll give you one benefit that can be directly seen: data recovery.

Next, let’s look at the issue of data recovery from the perspectives of delete, insert, and update SQL statements.

From figure 6, you can see that even if I execute a delete statement, the row format binlog will still save the complete information of the deleted rows. So, if you find that you have deleted the wrong data after executing a delete statement, you can directly convert the delete statement recorded in the binlog to an insert statement to restore the mistakenly deleted data.

If you make a mistake with an insert statement, it’s even more straightforward. In row format, the binlog of the insert statement will record all field information. This information can be used to accurately locate the inserted row. At this point, you can simply convert the insert statement to a delete statement and delete the mistakenly inserted row of data.

If an update statement is executed, the binlog will record the data of the entire row before modification and the data of the entire row after modification. Therefore, if you mistakenly execute an update statement, you only need to swap the information of these two rows before and after this event, and then execute it in the database to restore this update operation.

In fact, there are cases where it is necessary to roll back data to its state before the execution of delete, insert, or update statements. MariaDB’s Flashback tool uses the principles described above to roll back data.

Although mixed format binlogs are not widely used now, I would like to use the mixed format again to illustrate a problem. Let’s take a look at this SQL statement:

mysql> insert into t values(10,10, now());

If we set the binlog format to mixed, do you think MySQL will record it in row format or statement format?

Don’t rush to answer. Let’s see the effect of this statement executed.

Figure 7 Mixed Format and NOW()

As you can see, MySQL actually used the statement format. You must be wondering, if this binlog is transmitted to the standby server after 1 minute, won’t the data between the primary and standby be inconsistent?

Next, let’s use the mysqlbinlog tool to look at it:

Figure 8 TIMESTAMP Command

From the results in the figure, we can see that when recording the event, the binlog added an extra command: SET TIMESTAMP=1546103491. This command specifies the return time of the following now() function.

Therefore, whether this binlog is executed by the standby after 1 minute or used to restore a backup of this database after 3 days, the value of the row inserted by this insert statement remains constant. In other words, through this SET TIMESTAMP command, MySQL ensures the consistency of data between the primary and standby databases.

I have seen someone replay binlog data by parsing the log using mysqlbinlog and directly copying the statement statements to be executed. Now you know that this method is risky. This is because the execution results of some statements depend on the context commands, and the direct execution results are likely to be incorrect.

Therefore, the standard practice for data recovery using binlog is to parse it using the mysqlbinlog tool and then send the entire parsed result to MySQL for execution. Similar to the following command:

mysqlbinlog master.000001 --start-position=2738 --stop-position=2973 | mysql -h127.0.0.1 -P13000 -u$user -p$pwd;

This command means that the content between the 2738th byte and the 2973rd byte in the master.000001 file is parsed and executed by MySQL.

Circular Replication Problem #

Based on your understanding of the basic concepts of binlog in MySQL above, you now know that the characteristics of binlog ensure that executing the same binlog on the standby database can achieve the same state as the master database.

Therefore, we can assume that, under normal circumstances, the data in the primary and standby databases are consistent. In other words, the content of nodes A and B in Figure 1 is consistent. In fact, the structure I drew in Figure 1 is M-S structure, but in actual production, the more commonly used structure is dual M structure, as shown in Figure 9.

Figure 9 MySQL Dual M Structure for Primary/Secondary Switchover Process

Comparing Figure 9 with Figure 1, you can see that the only difference between the dual M structure and the M-S structure is an additional line, namely: nodes A and B are always in a master/standby relationship. In this way, there is no need to modify the master/standby relationship when switching.

However, the dual M structure also has a problem that needs to be solved.

When the business logic updates a statement on node A, it then sends the generated binlog to node B, and node B will also generate a binlog after executing this update statement (I suggest setting the log_slave_updates parameter to on, indicating that the standby database generates binlog after executing relay log).

So, if node A is also the standby database of node B, it means that the newly generated binlog of node B is executed again, and the update statement is continuously executed between nodes A and B, resulting in circular replication. How to solve this?

From Figure 6 above, you can see that MySQL records the server id of the instance where this command was first executed in binlog. Therefore, we can use the following logic to solve the circular replication problem between the two nodes:

The server ids of the two databases must be different. If they are the same, they cannot be set as master/standby relationship;
When a standby database receives binlog and replays it, it generates a new binlog with a server id that is the same as the original binlog;
Each database, after receiving the logs sent by its own master database, first checks the server id. If it is the same as its own, it means that this log was generated by itself, so it directly discards this log.

Following this logic, if we set up a dual M structure, the execution flow of logs will be as follows:

Transactions updated from node A will have server id of A recorded in the binlog;
After being executed once on node B, the binlog generated by node B will also have server id of A;
When it is sent back to node A, A will determine that this server id is the same as its own, so it will not process this log again. Therefore, the infinite loop is broken here.

Summary #

In this article, I introduced the format and basic mechanism of MySQL binlog, which serves as background knowledge for the subsequent series of articles on read-write separation and other topics. I hope you can digest and understand it carefully.

Binlog plays an important role in various high availability solutions for MySQL. What I introduced today can be considered as the foundation for all MySQL high availability solutions. Based on this, more complex solutions such as multiple nodes, semi-synchronous replication, and MySQL Group Replication have been evolved.

I also explained the advantages, disadvantages, and design considerations of different formats of MySQL binlog. I hope that when you develop systems, you can also learn from these design ideas.

Finally, let me leave you with a question.

When it comes to the circular replication problem, we mentioned that MySQL breaks the infinite loop by comparing server id. However, this mechanism is not perfect, and in some scenarios, it is still possible to encounter infinite loops.

Can you design such a scenario? And how would you solve it?

Feel free to write your design and analysis in the comments section, and I will discuss this question with you in the next article. Thank you for reading, and feel free to share this article with more people to read together.

Last Issue Question #

The question I left for you in the last article was when would you set the production database to “non-double-1”. The scenarios I currently know are as follows:

Business peak hours. Generally, if there are predictable peak hours, the DBA will have a plan to set the primary database as “non-double-1”.
Standby database delay, in order to catch up with the primary database as soon as possible. @永恒记忆 and @Second Sight mentioned this scenario.
Restoring a replica of the primary database using a backup and applying binlog. This is similar to the previous scenario.
When importing data in batches.

In general, the production database is modified to use the “non-double-1” configuration, which sets innodb_flush_logs_at_trx_commit=2 and sync_binlog=1000.