
02 Log System: How a SQL Update Statement Is Executed #

In the previous section, we learned about the execution process of a SELECT statement and introduced the processing modules involved in the process. As you may recall, the execution process of a SELECT statement typically involves modules such as the connector, analyzer, optimizer, executor, and finally the storage engine.

So, what is the execution process of an UPDATE statement?

You may have often heard your DBA colleagues say that MySQL can recover to any second within the last half month, which is quite impressive. This might have made you curious about how this is achieved.

Let’s start with an UPDATE statement on a table. Here is the creation statement for the table, which has a primary key ID and an integer field c:

mysql> create table T(ID int primary key, c int);

To increment the value of the row where ID=2 by 1, the SQL statement would be written as:

mysql> update T set c=c+1 where ID=2;

Earlier, I introduced the basic execution flow of SQL statements; here is that diagram again as a quick review. First of all, it can be said with certainty that an update statement goes through the same set of steps as a query statement.

img

MySQL Logical Architecture

Before executing the statement, you need to establish a connection to the database, which is the responsibility of the connector.

As mentioned earlier, whenever there is an update on a table, the query cache related to that table is invalidated, resulting in the removal of all cached results from table T. This is why we generally do not recommend using the query cache.

Next, the analyzer determines that this is an update statement through lexical and syntax analysis. The optimizer decides to use the ID index. Then, the executor is responsible for the actual execution, which involves locating the row and updating it.

Unlike the query process, the update process also involves two important logging modules, which are the main focus of today's discussion: the redo log and the binlog. If you have worked with MySQL, these terms are surely familiar, and they will keep coming up in later articles. That said, the redo log and binlog have many interesting design aspects, and these design ideas are well worth borrowing for your own programs.

Important Logging Module: Redo Log #

I don’t know if you still remember the story “Kong Yiji”. The tavern shopkeeper had a pink board used specifically to record customers’ credit. If there were only a few customers buying on credit, he could simply write their names and balances on the board. But if there were too many credit customers, the board would sometimes run out of space, so the shopkeeper would certainly also keep a dedicated account book for credit accounts. When someone wants to buy on credit or settle up, the shopkeeper generally has two approaches:

  • One approach is to directly pull out the account book and add or subtract the credit from it.
  • Another approach is to first record the credit on a pink board, and then calculate it when the store closes by pulling out the account book.

When the business is booming and the counter is busy, the shopkeeper will definitely choose the latter approach because the former operation is too cumbersome. First, you have to find the record of the total amount owed by this person. Just think about it, with dozens of densely packed pages, the shopkeeper has to find that name, maybe even bring out reading glasses to search slowly. After finding it, they have to take out the abacus to calculate, and finally write the result back into the account book.

The whole process is troublesome even to think about. In comparison, recording on the pink board first is far more convenient. Just think: if the shopkeeper had no pink board to help, he would have to flip through the account book for every single entry. Wouldn’t the efficiency be unbearably low?

The same problem exists in MySQL. If every update operation had to be written to disk immediately, with the disk first locating the corresponding record and then updating it, the I/O cost and lookup cost of the whole process would be very high. To solve this problem, the designers of MySQL adopted an approach similar to the shopkeeper’s pink board to improve update efficiency.

The combined use of the pink board and the account book is exactly the idea behind what is known as Write-Ahead Logging (WAL). The key point of WAL is to write the log first and write to the data file later, just as the shopkeeper writes on the pink board first and updates the account book when things are quiet.

Specifically, when a record needs to be updated, the InnoDB engine will first write the record to the redo log (pink board) and update the memory. At this point, the update is considered complete. At an appropriate time, the InnoDB engine will update this operation record to the disk. This update often happens when the system is relatively idle, just like what the shopkeeper does after closing.
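To make the idea concrete, here is a minimal Python sketch of the WAL pattern just described. The class and its fields are illustrative, not InnoDB's actual structures: an update is appended to a log and applied in memory first, and the expensive data-file write happens later in a batch.

```python
class WalStore:
    """Toy write-ahead-logging store: the 'pink board' plus the 'account book'."""

    def __init__(self):
        self.memory = {}     # in-memory copy of the data (fast to update)
        self.redo_log = []   # sequential log, cheap appends (the pink board)
        self.data_file = {}  # the slow, durable table (the account book)

    def update(self, key, value):
        # Write the log and update memory; the update now counts as complete.
        self.redo_log.append((key, value))
        self.memory[key] = value

    def flush(self):
        # Later, at an idle moment, apply the logged changes to the data file.
        for key, value in self.redo_log:
            self.data_file[key] = value
        self.redo_log.clear()

store = WalStore()
store.update(2, 1)                # like: update T set c=c+1 where ID=2 (c was 0)
assert store.data_file == {}      # the data file has not been touched yet
store.flush()
assert store.data_file == {2: 1}  # applied during the idle moment
```

The design choice mirrors the shopkeeper: sequential log appends are much cheaper than random data-file writes, so commits stay fast while durability is preserved.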

If there aren’t many credit transactions that day, the shopkeeper can wait until closing to organize them. But what if there are a lot of credit transactions one day and the pink board is filled up? At this time, the shopkeeper has to put down the work at hand, update some of the credit records from the pink board to the account book, and then erase these records from the pink board to make space for recording new accounts.

Similarly, the redo log in InnoDB is of fixed size. For example, it can be configured as a set of four files of 1GB each, so this “pink board” can record a total of 4GB of operations. Writing starts from the beginning; upon reaching the end of the last file (file 3), it wraps around to the beginning of file 0 and continues in a circle, as shown in the following image.

img

The write pos is the current write position. It moves forward as records are written, and when it reaches the end of file 3 it wraps back to the beginning of file 0. The checkpoint is the current erase position; it also moves forward and wraps around, and before a record can be erased, it must first be flushed to the data file.

The space between the write pos and checkpoint is the empty space on the “pink board” that can be used to record new operations. If the write pos catches up with the checkpoint, it means the “pink board” is full. At this point, no new updates can be executed. The shopkeeper has to stop and erase some records to advance the checkpoint.
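The write pos / checkpoint arithmetic is that of a ring buffer. Here is a minimal Python sketch, assuming the 4 × 1GB configuration from the text; the function name and byte-offset model are illustrative, not InnoDB internals.

```python
LOG_SIZE = 4 * 1024**3  # four 1 GB files: 4 GB of circular redo log

def free_space(write_pos, checkpoint):
    """Bytes still free between write_pos and checkpoint, with wrap-around."""
    used = (write_pos - checkpoint) % LOG_SIZE
    return LOG_SIZE - used

assert free_space(1024**3, 0) == 3 * 1024**3  # 1 GB written, 3 GB still free
assert free_space(100, 200) == 100            # writer has wrapped; nearly full
# As free_space approaches 0, write pos has caught up with checkpoint:
# no new updates can run until records are flushed and the checkpoint advances.
```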

With the redo log, InnoDB can guarantee that even if the database encounters an abnormal restart, previously committed records will not be lost. This ability is called crash-safe.

To understand the concept of crash-safe, think back to our earlier example of credit records. As long as the credit records are recorded on the pink board or written in the account book, even if the shopkeeper forgets, such as suddenly closing for a few days, the credit accounts can still be clarified through the account book and the pink board data when the business resumes.

Important Logging Module: Binlog #

As mentioned earlier, if we look at MySQL as a whole, it actually has two parts: the Server layer, which mainly handles MySQL’s functional aspects, and the engine layer, which is responsible for storage-related specifics. The pink board redo log we talked about earlier is a log unique to the InnoDB engine, while the Server layer also has its own log, called the binlog (binary log).

I’m sure you must be wondering: why are there two logs? Because InnoDB was not originally part of MySQL. MySQL’s built-in engine was MyISAM, which does not have crash-safe capability, and the binlog can only be used for archiving. InnoDB was brought into MySQL by another company as a plugin, and since relying on the binlog alone cannot provide crash safety, InnoDB uses its own logging system, the redo log, to achieve crash-safe capability.

These two types of logs have the following three differences:

  1. The redo log is specific to the InnoDB engine, while the binlog is implemented at the server layer of MySQL, so it can be used by all engines.
  2. The redo log is a physical log that records “what modifications were made to a certain data page,” while the binlog is a logical log that records the original logic of the statement, such as “increment the ‘c’ field of the row with ID=2 by 1.”
  3. The redo log is circular and the space is fixed, whereas the binlog can be appended. “Appending” here means that when the binlog file reaches a certain size, it switches to the next file without overwriting the previous logs.
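Difference 2 can be made concrete with two hypothetical records for the same update. The field names and formats below are purely illustrative, not the real on-disk formats:

```python
# Physical (redo log): which bytes changed on which data page.
redo_record = {
    "page_no": 42,                           # hypothetical page holding row ID=2
    "offset": 128,
    "new_bytes": (1).to_bytes(4, "little"),  # c's new value, 1, as raw bytes
}

# Logical (binlog): the original intent of the statement.
binlog_record = {
    "type": "UPDATE",
    "statement": "update T set c=c+1 where ID=2",
}

assert redo_record["new_bytes"] == b"\x01\x00\x00\x00"
```

The physical record only makes sense relative to the current data pages, which is why the redo log can be circular; the logical record is self-describing, which is why the binlog can be replayed on a fresh backup.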

With a conceptual understanding of these two logs, let’s take a look at the internal process of the executor and the InnoDB engine when executing this simple update statement.

  1. The executor first requests the engine to retrieve the row with ID=2. Since ID is the primary key, the engine directly searches for this row using a tree. If the data page containing the row with ID=2 is already in memory, it is returned directly to the executor; otherwise, it needs to be read from disk into memory before being returned.
  2. The executor receives the row data from the engine, adds 1 to the value, for example, if it was N, it becomes N+1, obtains the new row data, and then calls the engine interface to write this new data.
  3. The engine updates this new row data in memory and at the same time records this update operation in the redo log. At this point, the redo log is in the prepare state. Then it informs the executor that the execution is complete and the transaction can be committed at any time.
  4. The executor generates the binlog for this operation and writes the binlog to disk.
  5. The executor calls the engine’s transaction commit interface, and the engine changes the previously written redo log to the commit state, indicating that the update is complete.

Here is a flowchart of the execution process for this update statement. The light-colored boxes represent actions performed internally by InnoDB, while the dark-colored boxes represent actions performed in the executor.

img

Execution process of an update statement.
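The five steps above can be walked through in a toy Python sketch, assuming c was 0 for the row with ID=2. The dictionaries stand in for the engine's memory, redo log, and binlog, and are purely illustrative:

```python
memory_page = {2: 0}             # step 1: engine fetches row ID=2 (c is 0)
redo_log, binlog = [], []

new_value = memory_page[2] + 1   # step 2: executor computes N+1

# Step 3: engine updates memory and writes the redo log in "prepare" state.
memory_page[2] = new_value
redo_log.append({"row": 2, "c": new_value, "state": "prepare"})

# Step 4: executor generates and writes the binlog for the operation.
binlog.append("update T set c=c+1 where ID=2")

# Step 5: engine's commit interface flips the redo record to "commit".
redo_log[-1]["state"] = "commit"

assert memory_page[2] == 1
assert redo_log == [{"row": 2, "c": 1, "state": "commit"}]
assert binlog == ["update T set c=c+1 where ID=2"]
```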

You may have noticed that the last three steps seem a bit “complicated” as they separate the writing of the redo log into two steps: prepare and commit. This is known as “two-phase commit.”

Two-Phase Commit #

Why is there a need for “two-phase commit”? This is to ensure logical consistency between the two logs. To explain this, let’s go back to the question raised at the beginning of the article: How can we restore the database to any second within the past two weeks?

As mentioned earlier, the binlog records all logical operations and is written in append-only fashion. If your DBA promises that the database can be restored to any point within the past two weeks, then the backup system must keep all binlogs from the past two weeks, in addition to periodic backups of the entire database. How frequent these “periodic” backups are depends on the importance of the system: they can be daily or weekly.

When you need to restore to a specific second, for example, if you discover at 2 PM that a table was accidentally deleted at 12 PM and you need to recover the data, you can do the following:

  • First, find the most recent full backup, which could be the backup from last night if you’re lucky, and restore it to a temporary database.
  • Then, starting from the backup’s timestamp, retrieve the binlogs in sequence and replay them until just before the table was mistakenly deleted at noon.

Now your temporary database is identical to the online database before the deletion. You can extract the table data from the temporary database and restore it to the online database as needed.
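The recovery procedure above can be sketched in Python, assuming binlog events carry a timestamp and a replayable operation (the tuple format is illustrative, not the real binlog event format):

```python
full_backup = {"T": {2: 0}}                       # last night's full backup
binlog_events = [
    (9 * 3600,  ("update", "T", 2, 1)),           # 09:00 - c becomes 1
    (11 * 3600, ("update", "T", 2, 2)),           # 11:00 - c becomes 2
    (12 * 3600, ("drop_table", "T", None, None)), # 12:00 - the mistake
]

def restore(backup, events, stop_before):
    """Replay binlog events onto a copy of the backup, stopping at the error."""
    db = {table: dict(rows) for table, rows in backup.items()}
    for ts, (op, table, row, value) in events:
        if ts >= stop_before:
            break
        if op == "update":
            db[table][row] = value
        elif op == "drop_table":
            db.pop(table, None)
    return db

# Stop just before the noon DROP: table T is back, with its latest data.
temp_db = restore(full_backup, binlog_events, stop_before=12 * 3600)
assert temp_db == {"T": {2: 2}}
```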

Alright, now that we’ve discussed the data recovery process, let’s talk about why logs need “two-phase commit.” Let’s use a proof by contradiction to explain.

Since the redo log and the binlog are two independent logs, if we do not use two-phase commit, then either the redo log is written first and then the binlog, or the other way around. Let’s see what problems each ordering can cause.

Let’s continue with the earlier UPDATE statement as an example. Suppose the row with ID=2 currently has c=0, and suppose that during execution a crash occurs after the first log has been written but before the second. What would happen?

  1. Write the redo log first, then the binlog. Suppose the crash happens after the redo log is written but before the binlog is finished. As mentioned earlier, once the redo log is written, the data can still be recovered after a crash, so after recovery the value of c in this row is 1. But because the crash happened before the binlog was finished, this statement was never recorded in the binlog, so the archived binlog does not contain it. If you later use this binlog to restore a temporary database, the restored value of c will be 0, different from the value in the original database.
  2. Write the binlog first, then the redo log. If the crash happens after the binlog is written but before the redo log, the transaction is invalid after crash recovery, so the value of c in this row is 0. But the binlog has already recorded the entry “change c from 0 to 1”, so when the binlog is later used for recovery, this extra transaction is replayed and the restored value of c will be 1, again different from the original database.

As we can see, if we don’t use “two-phase commit,” the database’s state may be inconsistent with the state that would be recovered from its logs.
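The two failure orderings can be condensed into a toy model. The `recover` function below is illustrative: it returns the value of c on the crashed primary after redo-log recovery, and the value on a database rebuilt by binlog replay, with c starting at 0.

```python
def recover(redo_written, binlog_written):
    crashed_primary = 1 if redo_written else 0   # redo log restores the change
    binlog_replica = 1 if binlog_written else 0  # binlog replay re-applies it
    return crashed_primary, binlog_replica

# Case 1: redo written, crash before binlog -> primary has c=1, but a
# database restored from the binlog has c=0. Inconsistent.
assert recover(redo_written=True, binlog_written=False) == (1, 0)

# Case 2: binlog written, crash before redo -> the transaction is lost on
# the primary (c=0), but binlog replay adds it back (c=1). Inconsistent.
assert recover(redo_written=False, binlog_written=True) == (0, 1)

# Two-phase commit ensures recovery only counts the transaction as committed
# when both logs agree, so both sides end up with the same value.
assert recover(True, True) == (1, 1) and recover(False, False) == (0, 0)
```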

You might wonder if the probability of this happening is low, and if there are not many scenarios where we need to recover the temporary database due to mistakes.

Actually, that’s not the case. This process is needed not only for recovering from mistakes. When you need to scale out, that is, add standby databases to increase the system’s read capacity, the common practice is also full backup plus binlog replay, and this “inconsistency” would cause the primary and standby databases to diverge.

In summary, the redo log and binlog can both be used to represent the commit status of transactions, and the two-phase commit ensures logical consistency between these two states.

Conclusion #

Today, I introduced the two most important logs in MySQL: the redo log, a physical log, and the binlog, a logical log.

The redo log is used to ensure crash-safe capability. When the parameter innodb_flush_log_at_trx_commit is set to 1, it means that the redo log of each transaction is directly persisted to disk. I recommend setting this parameter to 1 so that data is not lost after an abnormal restart of MySQL.

When the parameter sync_binlog is set to 1, it means that the binlog of each transaction is persisted to disk. I also recommend setting this parameter to 1 to ensure that the binlog is not lost after an abnormal restart of MySQL.
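Both parameters can be set in the MySQL configuration file. The parameter names are the real MySQL options mentioned above; the snippet layout is a typical example of this “double 1” setting:

```ini
# my.cnf (excerpt): the "double 1" durability configuration
[mysqld]
innodb_flush_log_at_trx_commit = 1   # persist the redo log at every commit
sync_binlog = 1                      # persist the binlog at every commit
```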

I also introduced “two-phase commit,” which is closely related to the MySQL logging system. Two-phase commit is a commonly used solution for maintaining data logical consistency across systems. Even if you are not involved in database kernel development, you may encounter it in your daily development.

Finally, I’ll leave you with a thought-provoking question. Earlier, I mentioned that the backup period for periodic full backups “depends on the system’s importance, some are taken daily, and some are taken weekly.” In what scenarios would a daily backup be more advantageous than a weekly backup? Or in other words, which metric of this database system does it affect?