23 How MySQL Ensures Data is Not Lost #

In today’s article, I will continue introducing methods for temporarily boosting performance during peak business periods. As you can tell from the title, “How MySQL Ensures Data Is Not Lost,” the methods I am going to discuss today are all related to data reliability.

In previous articles and Q&A sessions, I emphasized the WAL mechanism, and the conclusion is that as long as the redo log and binlog are guaranteed to be persisted to disk, data can be recovered after an abnormal restart of MySQL.

Some readers in the comment section further asked about the write process of the redo log and how to ensure that the redo log is truly written to the disk. So today, let’s take a look at the process of writing to the binlog and redo log in MySQL.

Writing Mechanism of Binlog #

In fact, the writing logic of the binlog is relatively simple: during the execution of a transaction, the log is first written to the binlog cache, and when the transaction is committed, the binlog cache is then written to the binlog file.

A transaction’s binlog cannot be split: no matter how large the transaction is, its binlog must be written in one piece. This raises the question of how the binlog cache is stored.

The system allocates one binlog cache per thread, and the binlog_cache_size parameter controls how much memory a single thread’s binlog cache may occupy. If a transaction’s binlog exceeds this size, the excess is temporarily spilled to disk.
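
If you want to see whether this spill is happening on your own instance, MySQL exposes both the parameter and a pair of status counters. A minimal sketch:

```sql
-- Per-thread binlog cache size (defaults to 32KB):
SHOW VARIABLES LIKE 'binlog_cache_size';

-- Binlog_cache_use counts transactions that used the cache;
-- Binlog_cache_disk_use counts those that spilled to a temporary
-- disk file because the cache was too small:
SHOW GLOBAL STATUS LIKE 'Binlog_cache%';
```

If Binlog_cache_disk_use keeps growing relative to Binlog_cache_use, the cache may be undersized for your transaction mix.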

When a transaction is committed, the executor writes the complete transaction in the binlog cache to the binlog, and then empties the binlog cache. The state is shown in Figure 1.


Figure 1 Binlog write-to-disk states

As you can see, each thread has its own binlog cache, but they share the same binlog file.

  • The “write” in the figure refers to writing the log to the file system’s page cache, which does not persist the data to the disk, so it is relatively fast.
  • The “fsync” in the figure is the operation that persists the data to the disk. In general, we consider fsync as the operation that consumes disk IOPS.

The timing of the write and fsync operations is controlled by the sync_binlog parameter:

  1. When sync_binlog is set to 0, it means that only the write operation is performed for each commit, without fsync.
  2. When sync_binlog is set to 1, it means that fsync is performed for each commit.
  3. When sync_binlog is set to N (N>1), it means that the write operation is performed for each commit, but fsync is only performed after accumulating N transactions.

Therefore, in scenarios where there is an IO bottleneck, setting sync_binlog to a relatively large value can improve performance. In actual business scenarios, considering the controllability of log loss, it is generally not recommended to set this parameter to 0. It is more common to set it to a value between 100 and 1000.

However, setting sync_binlog to N also carries a risk: if the host restarts abnormally, the binlog of the most recent N transactions will be lost.
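
For reference, sync_binlog is a dynamic global variable, so it can be adjusted at runtime. A minimal sketch (the value 100 is illustrative, not a recommendation):

```sql
-- Check the current setting:
SHOW VARIABLES LIKE 'sync_binlog';

-- fsync the binlog only once per 100 commits; on a host crash,
-- up to 100 transactions' binlog may be lost:
SET GLOBAL sync_binlog = 100;
```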

Writing Mechanism of Redo Log #

Next, let’s talk about the writing mechanism of the redo log.

In the Q&A article of this series [part 15], I introduced the redo log buffer. During the execution of a transaction, the generated redo log is first written to the redo log buffer.

Some readers then asked: every time content is generated in the redo log buffer, does it need to be persisted directly to disk?

The answer is no.

If MySQL restarts abnormally while the transaction is still executing, this part of the log is lost. But since the transaction has not been committed, losing this log does no harm.

That raises the opposite question: before a transaction commits, can some of the logs in the redo log buffer end up persisted to disk?

The answer is yes.

To answer this question, we need to talk about the three possible states of the redo log. These three states correspond to the three colored blocks in Figure 2.

Figure 2 MySQL redo log storage status

These three states are:

  1. It exists in the redo log buffer, physically in the MySQL process’s memory; this is the red part in the figure.
  2. It has been written to disk (write) but not yet persisted (fsync), physically in the file system’s page cache; this is the yellow part in the figure.
  3. It has been persisted to disk, i.e., it is on the hard disk; this is the green part in the figure.

Writing the log to the redo log buffer is fast, writing to the page cache is also fast, but persisting to disk is much slower.

In order to control the write strategy of the redo log, InnoDB provides the innodb_flush_log_at_trx_commit parameter, which can take three possible values:

  1. When set to 0, it means that the redo log is only left in the redo log buffer every time a transaction is committed;
  2. When set to 1, it means that the redo log is directly persisted to disk every time a transaction is committed;
  3. When set to 2, it means that the redo log is only written to the page cache every time a transaction is committed.

InnoDB has a background thread that, once every second, writes the logs in the redo log buffer to the file system’s page cache with write, and then persists them to disk with fsync.

Note that redo logs generated while a transaction is executing are also written directly into the redo log buffer, so they too can be persisted to disk by this background thread. In other words, the redo logs of an uncommitted transaction may already be persisted to disk.
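
For reference, innodb_flush_log_at_trx_commit is also a dynamic global variable. A minimal sketch of inspecting and setting it:

```sql
-- Check the current redo log flush policy:
SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';

-- 1 = write and fsync the redo log on every transaction commit,
-- the safest of the three values:
SET GLOBAL innodb_flush_log_at_trx_commit = 1;
```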

In fact, besides the background thread’s polling operation once every second, there are two other scenarios where the redo log of an uncommitted transaction is written to disk.

  1. One is when the space used by the redo log buffer is about to reach half of innodb_log_buffer_size; at that point the background thread proactively writes it to disk. Note that since the transaction has not committed, this disk write is only a write, with no fsync call, so the log merely lands in the file system’s page cache. (See the sketch after this list for how to inspect this buffer size.)
  2. The other is when a parallel transaction commits and, in passing, persists this transaction’s redo log buffer contents to disk. Suppose transaction A is halfway through execution and has already written some redo logs into the buffer. If another thread’s transaction B commits at this moment and innodb_flush_log_at_trx_commit is set to 1, then by the logic of that parameter, transaction B must persist all the logs in the redo log buffer to disk, and transaction A’s redo logs get persisted to disk along with them.
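
As mentioned in the first scenario above, the threshold for the proactive write depends on the buffer size, which you can inspect directly. A minimal sketch:

```sql
-- Size of the shared redo log buffer (16MB by default in recent
-- versions); the proactive write kicks in near half of this value:
SHOW VARIABLES LIKE 'innodb_log_buffer_size';
```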

It needs to be emphasized here that when we introduced two-phase commit, we mentioned that the redo log is prepared first, then the binlog is written, and finally the redo log is committed.

If innodb_flush_log_at_trx_commit is set to 1, then the redo log needs to be persisted once during the prepare phase, because there is a crash recovery logic that relies on the redo log of the prepare phase, together with the binlog, for recovery. (If your memory is a bit hazy, you can review the relevant content in the 15th article.)

Given the once-per-second background flush and this crash recovery logic, InnoDB concludes that the redo log does not need an fsync at the commit step; writing it to the file system’s page cache is enough.

Usually when we mention MySQL’s “double 1” configuration, it means that both sync_binlog and innodb_flush_log_at_trx_commit are set to 1. In other words, before a transaction is fully committed, it needs to wait for two flushes: one for the redo log (during the prepare phase), and one for the binlog.
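
A quick way to verify whether an instance is running with “double 1” is to read both variables in one statement. A minimal sketch:

```sql
-- Both values should be 1 for a "double 1" configuration:
SELECT @@global.sync_binlog AS sync_binlog,
       @@global.innodb_flush_log_at_trx_commit AS flush_at_commit;
```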

At this point, you may have a question: does this mean that if MySQL shows a TPS of 20,000 per second, it performs 40,000 disk flushes per second? Yet when I benchmarked the disk with a tool, it could only handle about 20,000 operations per second, so how can TPS reach 20,000?

To explain this problem, we need to use the mechanism of group commit.

Here, I need to introduce the concept of the log sequence number (LSN). The LSN increases monotonically and marks write positions in the redo log: each time a piece of redo log of length “length” is written, the LSN increases by “length”.

The LSN is also written into InnoDB data pages, to ensure that the same redo log is not applied to a data page twice. I will explain the relationship between the LSN, redo logs, and checkpoints in detail in a later article.
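
You can watch these LSN positions on a live instance: the LOG section of the InnoDB status output reports the current LSN, the flushed-to position, and the checkpoint. A minimal sketch (the trailing \G is the mysql client’s vertical-output terminator):

```sql
-- Look for the LOG section in the output, e.g.:
--   Log sequence number ...  (current LSN in the buffer)
--   Log flushed up to   ...  (LSN persisted to the redo log files)
--   Last checkpoint at  ...  (checkpoint LSN)
SHOW ENGINE INNODB STATUS\G
```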

Figure 3 shows three concurrent transactions (trx1, trx2, trx3) in the prepare phase: each has finished writing its redo log into the buffer and is about to persist it to disk, and their writes correspond to LSN values of 50, 120, and 160, respectively.


Figure 3 Redo log group commit

From the figure, we can see that

  1. trx1 is the first to arrive and will be selected as the leader of this group;
  2. When trx1 is about to start writing to disk, there are already three transactions in this group, so the LSN also becomes 160.
  3. When trx1 goes to write to the disk, the LSN it carries is 160. Therefore, when trx1 returns, all redo logs with LSN less than or equal to 160 have been persisted to the disk.
  4. At this point, trx2 and trx3 can return directly.

Therefore, in a group commit, the more members there are, the better the effect of saving disk IOPS. However, if there is only a single-threaded load test, then each transaction can only correspond to one persistence operation.

In a concurrent-update scenario, after the first transaction has finished writing to the redo log buffer, the later the fsync is called, the more group members there are likely to be, and the greater the IOPS savings.
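
One rough way to observe group commit on a running instance is to compare how fast the commit counter grows against the redo log fsync counter. A minimal sketch (interpret deltas between two samples, not absolute values):

```sql
-- Sample twice, a few seconds apart, under concurrent load; if group
-- commit is effective, Innodb_os_log_fsyncs grows far more slowly
-- than Com_commit:
SHOW GLOBAL STATUS WHERE Variable_name IN
  ('Com_commit', 'Innodb_os_log_fsyncs', 'Innodb_log_write_requests');
```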

To get more members into a single fsync, MySQL uses an interesting optimization: stalling for time. When I introduced two-phase commit, I drew a diagram for you; here it is again.


Figure 4 Two-phase Commit

In that figure, I drew “write binlog” as a single action. But in reality, writing the binlog is split into two steps:

  1. First, write the binlog from the binlog cache to the binlog file on the disk.
  2. Then, call fsync to persist it.

To enhance the effect of group commit, MySQL delays the fsync of redo log until after step 1. In other words, the above diagram becomes like this:


Figure 5 Refined Two-phase Commit

In this way, binlog can also be group committed. When performing step 4 in Figure 5, if the binlogs of multiple transactions have already been written, they will be persisted together, reducing the consumption of IOPS.

However, in general, step 3 is executed quickly, so the interval between writing and fsync of binlog is short, resulting in fewer binlogs that can be grouped together for persistence. Therefore, the effect of group commit for binlog is usually not as good as the effect of redo log.

If you want to improve the effect of group commit for binlog, you can use the binlog_group_commit_sync_delay and binlog_group_commit_sync_no_delay_count parameters.

  1. The binlog_group_commit_sync_delay parameter indicates the number of microseconds to delay before calling fsync.
  2. The binlog_group_commit_sync_no_delay_count parameter indicates how many transactions must accumulate before fsync is called.

These two conditions are ORed: as soon as either one is satisfied, fsync is called.

Therefore, when binlog_group_commit_sync_delay is set to 0, binlog_group_commit_sync_no_delay_count has no effect either.
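
Both parameters are dynamic global variables. A minimal sketch of turning on the deliberate wait (the values are illustrative, not recommendations):

```sql
-- Wait up to 10,000 microseconds (10ms) for more transactions to
-- join the group before fsyncing the binlog:
SET GLOBAL binlog_group_commit_sync_delay = 10000;

-- ...but fsync early once 20 transactions have accumulated:
SET GLOBAL binlog_group_commit_sync_no_delay_count = 20;
```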

Earlier, someone asked in the comments: the WAL mechanism is supposed to reduce disk writes, yet every transaction commit still writes the redo log and the binlog, so hasn’t the number of disk writes stayed the same?

Now you can understand that the WAL mechanism benefits from two aspects:

  1. Both redo logs and binlogs are sequentially written, and sequential writes are faster than random writes on disks.
  2. Group commit mechanism can significantly reduce disk IOPS consumption.

Having analyzed this far, let’s return to the question: if your MySQL is currently hitting a performance bottleneck, and the bottleneck is in IO, what methods can you use to improve performance?

For this question, the following three methods can be considered:

  1. Set the binlog_group_commit_sync_delay and binlog_group_commit_sync_no_delay_count parameters to reduce the number of disk writes for binlog. This method is based on “intentional waiting” and may increase the response time of statements, but there is no risk of data loss.
  2. Set sync_binlog to a value greater than 1 (commonly between 100 and 1000). The risk of doing so is that when the host loses power, binlog logs may be lost.
  3. Set innodb_flush_log_at_trx_commit to 2. The risk is that data may be lost if the host loses power. I do not recommend setting this parameter to 0: that leaves the redo log only in memory, so data is lost even if just MySQL itself crashes abnormally, which is too risky. Meanwhile, writing the redo log to the file system’s page cache is also very fast, so setting the parameter to 2 performs about the same as 0, but it does not lose data when MySQL crashes, making the risk smaller. (A combined sketch of these levers follows this list.)
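
Putting the three levers together, here is a minimal sketch of what method 3 looks like alongside the earlier settings, plus a check of the combined state (values are illustrative; weigh the data-loss risks above before using any of them in production):

```sql
-- Method 3: on commit, write the redo log only to the OS page cache.
-- Data survives a mysqld crash but may be lost on host power failure:
SET GLOBAL innodb_flush_log_at_trx_commit = 2;

-- Review all the levers discussed in this article in one place:
SELECT @@global.binlog_group_commit_sync_delay          AS sync_delay,
       @@global.binlog_group_commit_sync_no_delay_count AS no_delay_count,
       @@global.sync_binlog                             AS sync_binlog,
       @@global.innodb_flush_log_at_trx_commit          AS flush_at_commit;
```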

Summary #

In the second and fifteenth articles of this series, we analyzed how MySQL ensures crash safety when the redo log and binlog are complete. In today’s article, I focused on explaining how MySQL ensures that the redo log and binlog are “complete.”

I hope that the content of these three articles together will give you a clearer understanding of the concept of crash safety.

After the publication of the fifteenth article in this series, some students asked further log-related questions in the comments. To make review and study easier, I will answer these questions together here.

Question 1: After executing an update statement, when I use the hexdump command to directly view the content of the ibd file, why don’t I see any data changes?

Answer: This may be due to the write-ahead logging (WAL) mechanism. After the update statement executes, InnoDB guarantees only that the redo log has been written and memory has been updated; the data may not have been written back to the ibd file on disk yet.

Question 2: Why is the binlog cache maintained by each thread, while the redo log buffer is globally shared?

Answer: The main reason for MySQL’s design is that the binlog cannot be “interrupted”. The binlog of a transaction must be written continuously, so it needs to be written to the file only after the entire transaction is completed.

On the other hand, the redo log does not have this requirement. Generated logs can be written to the redo log buffer. The content in the redo log buffer can also “hitch a ride” and be written to disk when other transactions commit.

Question 3: During the execution of a transaction, if a crash occurs before the transaction is committed, will this lead to inconsistency between the primary and replica databases?

Answer: No. At this time, the binlog is still in the binlog cache and has not been sent to the replica database. After a crash, both the redo log and binlog are lost, and from a business perspective, this transaction has not been committed, so the data is consistent.

Question 4: Suppose a crash occurs after the binlog has been written to disk but before the reply is returned to the client, and the server restarts before the client reconnects. When the client later reconnects, it finds that the transaction has already been committed successfully. Isn’t that a bug?

Answer: No.

Imagine a more extreme situation where the entire transaction has been successfully committed, the redo log has been committed, and the replica has received and executed the binlog. However, the connection between the primary server and the client is interrupted, causing the reply for the successful transaction to not be returned, and the client will also receive a “network disconnected” exception. This can only be considered as a successful transaction and cannot be considered a bug.

In fact, the crash-safe mechanism of the database guarantees:

  1. If the client receives a message indicating a successful transaction, the transaction has definitely been persisted.
  2. If the client receives a message indicating a failed transaction (such as a primary key conflict or a rollback), the transaction has definitely failed.
  3. If the client receives a message indicating an “execution exception”, the application needs to reconnect and continue subsequent logic by querying the current state. At this time, the database only needs to ensure internal consistency (between data and logs, and between the primary and replica databases).

Finally, it’s time for today’s end-of-class question.

Today, I would like you to think about whether your production database is set to “double 1.” If it is normally set as such, have you ever changed it to “non-double 1” in any scenarios? And what was the basis for your decision?

Also, we all know that these settings may have drawbacks. If an exception occurs, what is your stop-loss plan?

Feel free to write down your understanding or experience in the comments section, and I will select interesting comments to share and analyze with you in the next article. Thank you for reading, and feel free to share this article with more friends.

Previous Question #

In the last article, I asked you to share your experience in “fire fighting” in production.

@Long mentioned several good scenarios in his comment.

  • Among them, the third question, “If a database is overwhelmed by client pressure and unable to respond, restarting the database is useless,” indicates that he has thought about this quite well. This is because after the restart, the business requests will still be sent, and since it is a restart, the buffer pool will be emptied, which may result in slower statement execution.
  • The fourth question he mentioned is also quite typical. Sometimes a table ends up with multiple single-column indexes (often because the operations engineers who created them did not fully understand indexing principles), which can lead the optimizer to choose the index merge algorithm. In practice, index merge is not very efficient; converting one of the indexes into a composite index is a better solution.