
15 Q&A Article - Log and Index Related Issues #

As of today, before this Q&A article, the MySQL 实战 column has published 14 articles, and many readers have left high-quality comments on them. For each article, there are enthusiastic students summarizing the key points, as well as many students raising good questions and helping answer other students' questions.

As I browse and reply to these comments, I am greatly encouraged, and I do my best to help you solve problems and join the discussions. It is fair to say that your comments have enlivened the whole series and raised its overall quality. Thank you for that.

I have directly replied to most of the comments in the comment section. For questions that require further explanation, I have written them down in my notebook. These questions will serve as the material for today’s Q&A article.

So far, I have collected 47 questions, which is too many to cover in a single article. So I have selected a few closely related ones and combined them, hoping to resolve some of your doubts about logs and indexes. We will get to the remaining questions gradually in later articles.

Questions regarding Logs #

In the second article, "The Log System: How Is an SQL Update Statement Executed?", I used proof by contradiction to explain how the binlog (archive log) and the redo log work together for crash recovery, and showed that without two-phase commit, MySQL could run into problems such as inconsistent data between the primary and its replicas.

In the comments below that article, many students asked how MySQL guarantees data integrity if an abnormal restart happens at different moments during the two-phase commit.

Now, let’s start with this question.

I’ll show you the two-phase commit diagram once again to help you understand the following content.

Figure 1: Two-Phase Commit Diagram

First, let me clarify a misunderstanding. Some students asked why this diagram includes a “commit” statement when it’s supposed to be the execution flow of an update statement.

This confusion comes from conflating two different meanings of "commit":

  • The "commit statement" refers to the command in MySQL syntax used to commit a transaction. It is typically paired with begin or start transaction.
  • The "commit step" in the diagram refers to one small step in the transaction commit process, and it is the final step. When it completes, the transaction is committed.
  • Executing the "commit statement" includes performing the "commit step."

In our example, since we did not explicitly start a transaction, the update statement itself is a transaction. When it finishes executing and needs to commit, it performs the "commit step."
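To make the distinction concrete, here is a minimal sketch (the table t and its column a are hypothetical):

-- Explicit transaction: the client issues the "commit statement".
begin;
update t set a = a + 1 where id = 1;
commit; -- the commit statement; internally it ends with the "commit step"

-- Implicit transaction: with autocommit enabled, a single update is its own
-- transaction, so the "commit step" runs automatically when the statement ends.
update t set a = a + 1 where id = 1;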

Now, let’s analyze what happens when MySQL experiences an abnormal restart at different moments during the two-phase commit.

If a crash occurs at moment A in the diagram, that is, after the redo log has been written in the prepare phase but before the binlog is written, the transaction is rolled back during crash recovery. Since the binlog has not been written yet, nothing is transmitted to the replica either. I believe everyone can understand this case.

The main confusion is about moment B: the binlog has been written, but a crash occurs before the redo log is committed. What happens during crash recovery?

Now, let’s take a look at the rules for crash recovery.

  1. If the transaction in the redo log is complete, that is, it already has the commit flag, commit it directly.
  2. If the redo log contains only a complete prepare record, check whether the transaction's corresponding binlog exists and is complete:
     a. if yes, commit the transaction;
     b. otherwise, roll it back.

In the case of a crash at moment B, it corresponds to situation 2(a), and the transaction will be committed during crash recovery.

Now, let’s further explore this question.

Follow-up Question 1: How does MySQL know if the binlog is complete? #

Answer: The binlog of a transaction has a complete format:

  • For binlogs in statement format, there will be a COMMIT statement at the end;
  • For binlogs in row format, there will be an XID event at the end.

In addition, starting from MySQL 5.6.2, the binlog-checksum parameter was introduced to verify the correctness of binlog content. If the middle of a binlog is corrupted because of a disk problem, MySQL can detect it from the checksum. So MySQL does have ways to verify the completeness of a transaction's binlog.
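If you want to check these settings on your own instance, both are exposed as system variables:

-- CRC32 means a checksum is written with each binlog event; NONE disables it
SHOW VARIABLES LIKE 'binlog_checksum';
-- statement / row / mixed
SHOW VARIABLES LIKE 'binlog_format';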

Follow-up Question 2: How are redo log and binlog associated? #

Answer: They share a common data field called XID. During crash recovery, the redo log is scanned in order:

  • If a record is found that has both a prepare and a commit, the transaction is committed directly;
  • If a record is found that has only a prepare and no commit, the XID is used to look up the corresponding transaction in the binlog.
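You can observe these Xid events yourself with SHOW BINLOG EVENTS (the binlog file name below is a placeholder; pick one listed by SHOW BINARY LOGS):

-- In row format, each committed transaction ends with an Xid event, whose Info
-- column reads something like "COMMIT /* xid=... */".
SHOW BINLOG EVENTS IN 'binlog.000001' LIMIT 20;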

Follow-up Question 3: Why is it designed so that a redo log in the prepare stage plus a complete binlog is enough to recover the transaction after a restart? #

Answer: This question is actually related to the data consistency between primary and replica that we discussed in the proof by contradiction. If MySQL crashes at moment B, after the binlog has been written, the binlog already exists and will be used by the replica (or by a database restored from this binlog).

Therefore, the transaction must also be committed on the primary. Adopting this strategy ensures that the primary and the replica stay consistent.

Follow-up Question 4: If that’s the case, why bother with two-phase commit? Just finish writing the redo log first and then write the binlog. When recovering from a crash, both logs must be complete. Isn’t the logic the same? #

Answer: In fact, two-phase commit is a classic problem in distributed systems and is not unique to MySQL.

If I have to give an example to explain the necessity of doing this, it would be the problem of transaction durability.

For the InnoDB engine, once the redo log is committed, the transaction cannot be rolled back (if it is still allowed to roll back, it may overwrite the updates of other transactions). However, if the redo log is directly committed and then the binlog fails to be written, InnoDB cannot roll back, and the data and binlog will be inconsistent.

The purpose of two-phase commit is to give everyone a chance to say “I’m okay” and then commit together.

Follow-up Question 5: Without introducing two logs, is there a need for two-phase commit? Can we just use binlog to support crash recovery and archiving? #

Answer: The suggestion is to keep only the binlog and change the commit process to: … -> "update data in memory" -> "write binlog" -> "commit transaction." Could this also provide crash recovery?

The answer is no.

If we talk about historical reasons, it is because InnoDB is not the native storage engine of MySQL. The native engine of MySQL is MyISAM, which was not designed to support crash recovery.

InnoDB was already an engine that provided crash recovery and transaction support before it became a MySQL plugin and joined the MySQL engine family.

After InnoDB was integrated into MySQL, the natural thought was: since binlog has no crash recovery capability, just keep using InnoDB's existing redo log.

As for implementation reasons, there are many. Following the suggestion of using only binlog for crash recovery, I have drawn a diagram of the process without a redo log.

Figure 2: Crash Recovery with Only Binlog

Even with this process, binlog still cannot support crash recovery. Let me point out one capability it lacks: binlog cannot recover "data pages."

Suppose MySQL crashes at the position marked in the diagram, when binlog2 has been written but the whole transaction 2 has not yet committed. Upon restart, the engine's internal transaction 2 is rolled back, and applying binlog2 can make up for it. But for transaction 1, the system already considers it committed, so binlog1 will not be applied again.

However, the InnoDB engine uses write-ahead logging (WAL): when executing a transaction, once the data pages in memory are updated and the log is written, the transaction is complete. If a crash occurs afterward, recovering the data pages relies on the log.

In other words, if a crash occurs at the marked position, transaction 1 may be lost, and the loss is at the data page level. Since binlog does not record the details of data page updates, it cannot bring those pages back.

You might suggest optimizing binlog so that it records data page changes too. That could work, but in effect it would just be re-creating the redo log.

Therefore, at least with the current capabilities of binlog, it cannot support crash recovery.

Follow-up Question 6: Can we use only redo log, without binlog? #

Answer: From the perspective of crash recovery alone, yes. You can turn off binlog, which removes the need for two-phase commit, and the system is still crash-safe.
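A quick way to check whether binlog is enabled on an instance (it is controlled at server startup, for example by the log-bin option):

-- ON means binlog is enabled, OFF means it is not
SHOW VARIABLES LIKE 'log_bin';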

However, if you take a look at the usage scenarios of various companies in the industry, you will find that binlog is always enabled in production databases. This is because binlog has functionalities that cannot be replaced by redo log.

One such functionality is archiving. The redo log is written in a circle: when it reaches the end, it wraps around to the beginning and overwrites old entries. It cannot retain historical logs, so it cannot serve the archiving function.

Another reason is that the MySQL system depends on binlog. Binlog has been a core MySQL feature from the beginning and is used in many places. One of them: binlog replication is the foundation of MySQL's high availability.

Many companies also have heterogeneous systems (such as some data analysis systems) that rely on consuming MySQL’s binlog to update their own data. If binlog is turned off, these downstream systems will not be able to receive updates.

In conclusion, because many system mechanisms, including MySQL high availability, rely on binlog, "redo log only" is not an option for now. As you can see, the surrounding ecosystem matters a great deal.

Follow-up Question 7: How big should the redo log be usually? #

Answer: If the redo log is too small, it fills up quickly, and MySQL is then forced to flush constantly to free redo log space. In that state, the write-ahead logging (WAL) mechanism cannot function properly.

Therefore, if you have several TB available on the disk, it is advisable not to be too stingy and directly set the redo log to 4 files, each file being 1GB in size.
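To check what an instance is currently running with (both are startup parameters, typically set in the configuration file):

-- With 4 files of 1GB each, innodb_log_file_size shows 1073741824
-- and innodb_log_files_in_group shows 4
SHOW VARIABLES LIKE 'innodb_log_file%';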

Follow-up Question 8: When a running instance is writing data to disk, is it updated from the redo log or the buffer pool? #

Answer: This is actually a very good question that involves the question of “what is inside the redo log.”

In reality, the redo log does not record the complete data of data pages, so it does not have the ability to update disk data pages on its own. Therefore, there is no situation where “the final data written to disk is updated by the redo log.”

  1. On a normally running instance, after a data page is modified, it differs from the data page on disk and is called a dirty page. Writing data to disk eventually means writing the in-memory data page back to disk. This process has nothing to do with the redo log.
  2. In the crash recovery scenario, if InnoDB determines that a data page may have lost updates, it reads the page into memory and lets the redo log update the in-memory content. After the update, the page becomes a dirty page, and we are back to the first situation.
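If you want to observe situation 1, InnoDB exposes a counter for the dirty pages currently in the buffer pool:

-- Pages modified in memory but not yet written back to disk
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';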

Follow-up Question 9: What is the redo log buffer? Is memory modified first, or is the redo log file written first? #

Answer: We can answer both questions together.

During the update process of a transaction, the log needs to be written multiple times. For example, consider the following transaction:

begin;
insert into t1 ...
insert into t2 ...
commit;

In this transaction, records are inserted into two tables. While the inserts execute, the redo log they generate has to be stored somewhere, but it must not be written directly to the redo log file before the transaction commits.

That is what the redo log buffer is for: a piece of memory that temporarily holds redo logs. When the first insert executes, the data pages in memory are modified and the log is written into the redo log buffer.

The actual write of the log to the redo log files (with names like ib_logfile followed by a number) happens when the commit statement executes.

(Here, "actively flushing to disk" is avoided during transaction execution in order to reduce unnecessary I/O consumption. However, "passive writes to disk" may still occur, for example when memory is insufficient or when other transactions commit. We will discuss this in more detail in the 22nd article, "What Are the Methods to Improve MySQL Performance?")

When executing a single update statement, InnoDB starts a transaction on its own and commits it when the statement has completed. The process is the same as above, but compressed into a single statement.
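Two related knobs you can inspect (their values are instance-specific):

-- Size of the redo log buffer, in bytes
SHOW VARIABLES LIKE 'innodb_log_buffer_size';
-- 1 (the default) flushes the redo log to disk on every transaction commit
SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';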

The questions above tie together the issues that have been raised about redo log and binlog into one consolidated answer. If you still have questions, feel free to add them in the comments.

Business Design Issue #

Next, I want to share a question raised by @ithunter in the comments of the 8th article, "Are Transactions Isolated or Not?". I find the question interesting and practical, and others may run into similar situations, so I will answer and share it here.

Here is the question (slightly modified for better understanding):

In our business, there is a requirement that if two users A and B follow each other, they become friends. The design uses two tables: a "like" table and a "friend" table. The "like" table has the fields user_id and liker_id, on which I created a composite unique index, uk_user_id_liker_id. The logic of the statements is as follows:

Taking A following B as an example:

  1. First, check if B has followed A: select * from like where user_id = B and liker_id = A;

  2. If yes, they become friends: insert into friend;

  3. If no, it is a one-way following relationship: insert into like;

However, if A and B follow each other at the same time, they may end up not becoming friends: in step 1, neither of them sees the other's follow yet. Even an exclusive lock won't help, because the record does not exist. In terms of MySQL locking, is there a way to handle this situation?

First of all, I want to commend the way this question was asked. Although Geek Time's comment section currently does not allow follow-up replies, if you can state the question and its context clearly in one message, that works fine. I hope you will use this message as a model when asking questions in the comments.

Next, I will simulate the tables mentioned by @ithunter to facilitate our discussion.

CREATE TABLE `like` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `user_id` int(11) NOT NULL,
  `liker_id` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `uk_user_id_liker_id` (`user_id`,`liker_id`)
) ENGINE=InnoDB;

CREATE TABLE `friend` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `friend_1_id` int(11) NOT NULL,
  `friend_2_id` int(11) NOT NULL,
  UNIQUE KEY `uk_friend` (`friend_1_id`,`friend_2_id`),
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

Although the question did not mention the index structure of the friend table, I assume there are indexes on friend_1_id and friend_2_id. For ease of description, I added a unique index.

By the way, “like” is a keyword, and I generally do not recommend using keywords as database names, table names, column names, or index names.

Let me translate the question. In a concurrent scenario, if two people simultaneously follow each other, they may not be able to become friends.

Now, I will list the execution statements of these two transactions in the form of the moment sequence table that you are already familiar with:

| Session 1 | Session 2 |
| --- | --- |
| begin; | begin; |
| select * from `like` where user_id = B and liker_id = A; | select * from `like` where user_id = A and liker_id = B; |
| if the result exists, insert into friend; otherwise, insert into `like`; | if the result exists, insert into friend; otherwise, insert into `like`; |
| commit; | commit; |

Since there is no prior follow relationship between A and B, the select statements in both transactions return empty results.

Therefore, the logic in session 1 is “since B does not follow A, just insert a one-way follow relationship”. Session 2 follows the same logic.

This result is a bug for the business because, according to the business requirements, after both logics have been executed, a record should be inserted into the friend table.

The question mentions that "even if an exclusive lock is used, it won't work because the record does not exist." However, there is another way to solve this problem.

First, add a field called relation_ship to the "like" table. Make it an integer that can take the values 1, 2, or 3 (a DDL sketch follows the list below).

  • When the value is 1, it means user_id follows liker_id.
  • When the value is 2, it means liker_id follows user_id.
  • When the value is 3, it means they follow each other.
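Here is a minimal sketch of the schema change (the tinyint type and the default value are my assumptions; the column name and its meaning come from the design above):

ALTER TABLE `like`
  ADD COLUMN `relation_ship` tinyint NOT NULL DEFAULT 0
  COMMENT '1: user_id follows liker_id; 2: liker_id follows user_id; 3: mutual';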

Then, when A follows B, modify the logic as follows:

In the application code, compare the values of A and B. If A is smaller than B, execute the following logic:

mysql> begin; /* Start the transaction */
insert into `like`(user_id, liker_id, relation_ship) values(A, B, 1) on duplicate key update relation_ship=relation_ship | 1;
select relation_ship from `like` where user_id=A and liker_id=B;
/* In the code, check the returned value of relation_ship,
if it is 1, end the transaction and execute commit
if it is 3, execute the following two statements:
*/
insert ignore into friend(friend_1_id, friend_2_id) values(A,B);
commit;

If A > B, execute the following logic:

mysql> begin; /* Start the transaction */
insert into `like`(user_id, liker_id, relation_ship) values(B, A, 2) on duplicate key update relation_ship=relation_ship | 2;
select relation_ship from `like` where user_id=B and liker_id=A;
/* In the code, check the returned value of relation_ship,
   if it is 2, end the transaction and execute commit
   if it is 3, execute the following two statements:
*/
insert ignore into friend(friend_1_id, friend_2_id) values(B,A);
commit;

In this design, the data in the "like" table always satisfies user_id < liker_id, so both follow directions map to the same row. When adding a relationship to the "like" table, if the reverse relationship already exists, there will be a row lock conflict.

The insert ... on duplicate key update statement ensures that, after it executes, the transaction holds the row lock on that row. The subsequent select that reads relation_ship is therefore a read protected by the row lock.

The "|" operator is a bitwise OR. Together with the ignore option on the last insert statement, it makes repeated invocations of this logic idempotent.

This way, even if both parties follow each other "simultaneously," the final state in the database is a row in the "like" table for A and B with relation_ship = 3, plus a record for A and B in the "friend" table.
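To see the bitwise OR at work, here is a sketch with concrete values, A = 1 and B = 2 (run the two inserts from two different sessions to reproduce the concurrent case):

-- Session 1: A (=1) follows B (=2); since A < B, the first branch inserts (1, 2, 1).
insert into `like`(user_id, liker_id, relation_ship) values(1, 2, 1)
on duplicate key update relation_ship = relation_ship | 1;

-- Session 2: B (=2) follows A (=1); the actor is larger, so the other branch targets
-- the same row (1, 2). It blocks on the row lock until session 1 commits, and the
-- update then turns relation_ship into 1 | 2 = 3.
insert into `like`(user_id, liker_id, relation_ship) values(1, 2, 2)
on duplicate key update relation_ship = relation_ship | 2;

select relation_ship from `like` where user_id = 1 and liker_id = 2; -- returns 3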

You might wonder: didn't we previously recommend avoiding unique indexes where possible, so why create one here? In this example, the business requirement is exactly "duplicate data will definitely be inserted, and the database must enforce uniqueness." In that case there is nothing more to discuss: create the unique index.

Summary #

This is the first Q&A article of the column.

I have selected comments from the previous 14 articles in the comments section to discuss related questions on logs and indexes. Due to the limited space, I will continue the discussion in the comments section.

Finally, although this is a Q&A article, there are still questions at the end.

I create a simple table t, insert a row, and then modify that row.

mysql> CREATE TABLE `t` (
`id` int(11) NOT NULL primary key auto_increment,
`a` int(11) DEFAULT NULL
) ENGINE=InnoDB;
insert into t values(1,2);

At this point, the “t” table contains a single row of data (1,2). Assume that I now execute:

mysql> update t set a=2 where id=1;

You will see the following result:

[Figure: output of the update statement, showing Rows matched: 1 and Changed: 0]

The result shows that the update matched one row (rows matched) but modified zero rows (Changed).

From the observable behavior alone, MySQL could handle this command in any of the following three ways:

  1. Updates are read-before-write. MySQL reads the data and finds that the value of “a” is already 2, so it does not update and returns directly, ending the execution.
  2. MySQL calls the “change to (1,2)” interface provided by the InnoDB engine, but the engine finds that the value is the same as before, so it does not update and returns directly.
  3. InnoDB faithfully performs the operation "set the value to (1,2)": it acquires the necessary locks and carries out the update.

Which do you think is the actual case? Can you verify your conclusion through experiments? Furthermore, you can think about why MySQL chooses this strategy.

Please share your verification method and thoughts in the comment section. I will discuss this question with you at the end of the next article. Thank you for reading, and feel free to share this article with more people.

Previous Question #

The previous question was about using a counting table to record the total number of rows in a business table, and incrementing the count when inserting data into the business table.

The logical implementation is to start a transaction and execute two statements:

  1. insert into the data table
  2. update the counting table by incrementing the count

From the perspective of the system’s concurrency capacity, how should these two statements be arranged?

Here, I will directly copy the answer from @阿建 for your reference:

Considering the performance of a concurrent system, the insert should come first, followed by the update to the counting table. The knowledge point is in "Row Locks: How to Reduce the Impact on Performance?". Updating the counting table involves row lock contention; by inserting first and updating later, we minimize the time the lock is held within the transaction and improve concurrency.

Some readers in the comments said the counting table update should come later because the counting table may store the counts of multiple business tables. If the counting table update were the first statement of the transaction while multiple business tables insert concurrently, the lock waits would be longer.

The conclusion of this answer is correct, but the reasoning is not entirely accurate. Even if one counting table records the counts of multiple business tables, we would certainly put a unique index on the table name field. The table structure would look like this:

CREATE TABLE `rows_stat` (
  `table_name` varchar(64) NOT NULL,
  `row_count` int(10) unsigned NOT NULL,
  PRIMARY KEY (`table_name`)
) ENGINE=InnoDB;

When updating the counting table, the statement locates the row by table name through the index and locks only that row. Inserts into different business tables update different rows of the counting table, so there is no row lock contention between them.
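As a minimal sketch of the recommended ordering, using the rows_stat table above and assuming a business table t(id, a) with an auto-increment primary key:

begin;
insert into t(a) values(10); -- business insert first: no contention on this row
update rows_stat set row_count = row_count + 1 where table_name = 't'; -- take the contended row lock as late as possible
commit; -- the lock on the rows_stat row is held only from the update to here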