
10 Transaction Processing and Recovery - How to Ensure Data Integrity After a Database Crash #

In the previous lesson, we discussed the LSM tree, a storage engine commonly used in distributed databases. In this lesson, I will introduce what lies at the heart of a storage engine: transaction management. First, I will describe the characteristics of transactions, known as ACID, from a user’s perspective. Then I will briefly explain how the storage engine supports these properties through its various components.

To maintain these properties, the transaction manager must account for a variety of problems and failures, such as database process crashes and disk failures. In the section on database recovery, I will explain how the ACID properties are preserved in the face of such failures.

Since there is a lot of content in this part, I have divided it into two lessons. In the next lesson, I will continue with database isolation levels and concurrency control. These are gifts that databases give to application developers, making it much easier to keep concurrently accessed data consistent.

That’s the learning outline of this part. Now let’s start with an overview of transactions.

Overview of Transactions #

Transaction management is a relatively independent and important component of the storage engine in a database, which can ensure that a series of operations on the database appear as if they were performed in a single step. This greatly simplifies the development of database-oriented applications, especially in high-concurrency scenarios.

A typical example is a transfer: moving 100 yuan from Account A to Account B. In real life this is an atomic operation, since paper currency cannot be duplicated. In a computer system, however, it consists of two operations: deducting 100 yuan from Account A and adding 100 yuan to Account B, and each of them can run into trouble. For example, while the transfer is in progress, Account C might also withdraw 100 yuan from Account A (before the 100 yuan destined for Account B has been deducted); if the account then has insufficient funds, one of the two operations will fail. Or the database may crash midway through the two operations, and after restarting we find that Account A has lost 100 yuan while Account B has gained nothing, or vice versa.

In order to solve similar problems mentioned above, people have proposed the concept of transactions in databases, especially at the storage engine level. Now let’s talk about the classic ACID properties of transactions.

ACID #

A: Atomicity

Atomicity ensures that the operations within a transaction are indivisible: they either all succeed or all fail, with no partial success. Success is signaled by a commit operation at the end of the transaction, which marks the success of the transaction as a whole. Failure comes in two forms: an explicit rollback operation, or the database process crashing and exiting.

Atomicity is a guarantee the database provides to its users in order to simulate atomic operations from real life, such as the transfer above. Many operations that appear indivisible in real life turn out to be several separate operations once translated into computer terms, and atomicity is what makes them behave as a single operation again.
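As a concrete illustration, here is a minimal sketch using Python’s built-in sqlite3 module: the two balance updates of the transfer are wrapped in a single transaction, so they either both take effect or are both rolled back. The `accounts` table and the balances are assumptions made purely for illustration.

```python
import sqlite3

# Illustrative setup: an in-memory database with a hypothetical accounts table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('A', 200), ('B', 0)")
conn.commit()

try:
    # Both updates belong to one transaction: either both are applied...
    conn.execute("UPDATE accounts SET balance = balance - 100 WHERE name = 'A'")
    conn.execute("UPDATE accounts SET balance = balance + 100 WHERE name = 'B'")
    conn.commit()      # ...and made durable together here,
except Exception:
    conn.rollback()    # ...or neither is applied at all.

print(dict(conn.execute("SELECT name, balance FROM accounts")))  # {'A': 100, 'B': 100}
```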

C: Consistency

Consistency is controlled jointly by the user and the database; it is not a service the database provides on its own. It is first and foremost a business-level constraint. In the earlier example, when transferring 100 yuan from Account A to Account B, the business application must itself ensure that 100 yuan is deducted from A and 100 yuan is added to B; that part of consistency has nothing to do with the database. What the database guarantees, through the other ACID properties, is that these two correct actions produce the final correct result.

The consistency here is fundamentally different from the distributed consistency discussed in Module 1. If you want to understand the detailed comparison, please refer to “Module 05 | Consistency and CAP Theorem: Why Do We Need Distributed Consistency”. I will not go into too much detail here.

I: Isolation

One of the great things about transactions is that they can handle concurrent operations, meaning that different transactions can run without interfering with each other, as if no other transactions are happening. People who do concurrent programming have a deep appreciation for this, as handling concurrent operations requires a level of effort and experience that is completely different from handling serial operations. Isolation is precisely what turns actually concurrent operations into something that appears serial from the perspective of the user, greatly reducing the difficulty of use.

Of course, in practice such strict concurrency control can hurt performance. Databases therefore define multiple isolation levels that offer different degrees of concurrency: at lower isolation levels, one transaction’s changes may become visible to other transactions. I will explain this in detail in the “Isolation Levels” section.

D: Durability

Durability is relatively simple: once a transaction is committed, the modifications it made to the database are preserved. It is important to note that “preserved” not only means that other transactions can query the data, but more importantly, that the committed data can still be read in full after the database recovers from a system failure or process crash.

These are the four important characteristics of transactions. So, which components within the storage engine satisfy these characteristics? In the following sections, I will introduce a typical transaction management component.

Transaction Manager #

Transactions are primarily controlled by the transaction manager, which is responsible for coordinating, scheduling, and tracking the status and execution steps of each transaction. Of course, this is different from the transaction manager in distributed two-phase commit (2PC). I will explain more about distributed transactions in the next module.

Page Cache #

When it comes to the transaction manager, the first thing to mention is the page cache, or buffer pool. It sits between the disk and the other components of the storage engine: data is written to the cache first and then synchronized to the data files on disk. The cache is generally transparent to the other components; for writes in particular, the writing component believes the data has reached the disk when in reality it has only reached the cache. If a system failure occurs at this point, there is a risk of data loss, and the transaction recovery techniques introduced later in “How to Recover Transactions in a Database” exist to solve exactly this problem.

Caching primarily bridges the speed gap between memory and disk, improving database performance without any change to the algorithms themselves. However, memory is limited and cannot hold all the data on disk, so flushing is needed to write data back and free cache space. Flushing is generally performed asynchronously and periodically, which avoids blocking normal reads and writes.

During flushing, dirty pages (pages in the cache that have been modified) cannot be released immediately if they are still referenced by other objects; they can only be dropped from the cache once they are no longer referenced. Flushing also needs to be coordinated with commit log checkpoints to ensure durability.

When cache usage reaches a certain threshold, some old pages have to be evicted. Cache eviction algorithms decide which ones, and there are several to choose from: FIFO, LRU, clock replacement, and LFU. If you’re interested, you can dig deeper using these keywords.

Finally, there is data that we want to keep in the cache regardless of the eviction algorithm. In such cases we can “pin” it in the cache. For example, the upper-level nodes of a B-tree are generally small and are accessed by every query, and some frequently used metadata is likewise kept in the cache for a long time.
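Below is a minimal sketch of how a page cache might combine these ideas: LRU eviction, dirty pages flushed before eviction, and pinned pages that are never evicted. All class and method names are assumptions made for illustration, not the API of any real engine.

```python
from collections import OrderedDict

class PageCache:
    """Toy buffer pool: LRU eviction, dirty-page tracking, and pinning.
    The names here are illustrative assumptions, not a real engine's API."""

    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk                    # dict page_id -> data, stands in for the data files
        self.pages = OrderedDict()          # page_id -> [data, dirty]; order tracks recency
        self.pinned = set()

    def get(self, page_id):
        if page_id not in self.pages:
            self._make_room()
            self.pages[page_id] = [self.disk[page_id], False]  # load from "disk"
        self.pages.move_to_end(page_id)     # most recently used goes to the end
        return self.pages[page_id][0]

    def put(self, page_id, data):
        if page_id not in self.pages:
            self._make_room()
        self.pages[page_id] = [data, True]  # dirty: newer than the on-disk copy
        self.pages.move_to_end(page_id)

    def pin(self, page_id):
        self.pinned.add(page_id)            # e.g. upper-level B-tree nodes, hot metadata

    def flush(self):
        # A background flusher would call this periodically: write dirty pages back.
        for page_id, entry in self.pages.items():
            if entry[1]:
                self.disk[page_id] = entry[0]
                entry[1] = False

    def _make_room(self):
        # Evict the least recently used, unpinned page (assumes one exists).
        while len(self.pages) >= self.capacity:
            victim = next(p for p in self.pages if p not in self.pinned)
            data, dirty = self.pages.pop(victim)
            if dirty:
                self.disk[victim] = data    # a dirty victim must be flushed first

cache = PageCache(capacity=2, disk={"root": "node", "p1": "old1", "p2": "old2"})
cache.pin("root")
cache.get("root")
cache.put("p1", "new1")
cache.get("p2")            # cache is full: p1 (dirty, unpinned) is flushed and evicted
print(cache.disk["p1"])    # 'new1'
```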

Log Manager #

Next is the log manager, which maintains a history of operations on a set of data. If the data in the cache has not been flushed to the disk and the system crashes, the data can be recovered by replaying the logs. In addition, in rollback scenarios, these logs can be used to revert the data to its original state.

Lock Manager #

Lastly, we have the very important lock manager, which ensures that transactions accessing shared resources do not violate the integrity constraints of those resources, and that concurrent transactions behave as if they were executed serially. I will explain locks in more detail later.

These are the main components of transaction management. Next, let’s look at log management from the perspective of transaction recovery.

How to Recover Transactions in a Database #

A database system is a complex ecosystem consisting of a series of hardware and software components, each of which has the potential to cause various stability issues. When these components are combined into a database system, this potential is further magnified. The designers of the database must provide their own solutions to these potential stability issues and make some “commitments” for the database.

The commit log, also known as CommitLog or WAL (Write-Ahead Log), is an effective means to address these issues. This log records all operations of the database and is appended to a log file on the disk.

As mentioned earlier, writes in the database go to the cache first and are flushed to disk later. Before they are flushed, however, the same changes have already been saved on disk as records in the commit log. So while the data exists only in the cache and has not yet been flushed, the log is what guarantees its durability: if the database fails, the data can be recovered from the log.

So what characteristics does the commit log need in order to provide these guarantees? Let’s take a look.

Characteristics of the Commit Log #

First of all, the commit log is very similar to the disk files of the LSM tree introduced in the previous lesson: both are written sequentially and are immutable. The benefits are the same as well: sequential writes give high write performance, and immutability guarantees that the data can be safely read from beginning to end.

Records in the commit log are generally assigned a sequence number as a unique key, which is typically not an auto-incrementing counter but a timestamp. In addition, individual log records are very small, and some databases buffer them and write them to disk in batches. This means that, with default settings, the log alone may not be able to recover the database completely, because the most recent records may still be sitting in the buffer when a failure occurs. This is a performance trade-off: most databases expose parameters that control how the log buffer is flushed, letting users balance performance against the completeness of recovery.

When a transaction commits, the database must ensure that all of its log records have been written to the commit log. In other words, having the transaction’s content fully written to the log is the key indicator that the transaction has completed.
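As a sketch of this rule, the toy write-ahead log below appends records sequentially and only treats a transaction as committed after its records have been forced to disk. The record format, class name, and file name are assumptions for illustration.

```python
import json
import os

class WriteAheadLog:
    """Toy commit log: records are appended sequentially; commit() returns only
    after the records are durable on disk. The record format is an assumption."""

    def __init__(self, path):
        self.file = open(path, "a", encoding="utf-8")
        self.lsn = 0                                  # log sequence number

    def append(self, txn_id, op):
        self.lsn += 1
        record = {"lsn": self.lsn, "txn": txn_id, "op": op}
        self.file.write(json.dumps(record) + "\n")    # buffered, not yet durable
        return self.lsn

    def commit(self, txn_id):
        self.append(txn_id, {"type": "commit"})
        self.file.flush()
        os.fsync(self.file.fileno())                  # only now is the transaction
                                                      # considered committed

# The change is logged first; only afterwards may the page cache apply it.
wal = WriteAheadLog("commit.log")
wal.append("t1", {"type": "update", "page": 42, "before": 3, "after": 5})
wal.commit("t1")
```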

In theory the log can grow forever, but in practice there is no point in keeping all of it: once data has been flushed from the cache to disk, the log records written before that flush are no longer needed and can be truncated to release space. The point up to which the log is truncated is called a checkpoint, and all dirty pages in the page cache from before the checkpoint must have been fully flushed to disk.
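Under the same toy assumptions as above, a checkpoint can be sketched as two steps: flush every dirty page first, and only then discard the log records at or before the checkpoint.

```python
def checkpoint(dirty_pages, disk, log_records, checkpoint_lsn):
    """Toy checkpoint. dirty_pages and the record layout are illustrative
    assumptions: records are dicts with at least an 'lsn' key."""
    # 1. Every change covered by records up to checkpoint_lsn must be on disk.
    for page_id, data in dirty_pages.items():
        disk[page_id] = data
    dirty_pages.clear()
    # 2. Records at or before the checkpoint are no longer needed for recovery
    #    and can be truncated to free space.
    return [rec for rec in log_records if rec["lsn"] > checkpoint_lsn]

records = [{"lsn": 1}, {"lsn": 2}, {"lsn": 3}]
remaining = checkpoint({"p42": "new"}, {}, records, checkpoint_lsn=2)
print(remaining)   # [{'lsn': 3}]: only records after the checkpoint survive
```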

In practice the log is usually implemented as a group of files. Records are written to the files sequentially, and once a file contains only old records from before the checkpoint, new records can overwrite it, avoiding the need to create new files. The files can also be placed on different disks to improve the availability of the logging system.

Physical Log (Redo Log) and Logical Log (Undo Log) #

The modification of data by a transaction is really a change of state, such as changing 3 to 5. Here we call 3 the before-image and 5 the after-image. We can then write the following two equations:

  1. Before-image + redo log = after-image
  2. After-image + undo log = before-image

The redo log stores the full history of page and data changes, so we call it a physical log. The undo log starts from an original state and records operations relative to that state, so it is called a logical log. Using redo and undo to convert the data forward or backward is the basis of the transaction-related algorithms that follow.
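Here is a tiny worked example of the two equations for the change from 3 to 5; the record shapes are assumptions chosen only to show the direction of each conversion.

```python
before_image = 3
redo_record = {"page": 42, "after": 5}              # physical: records the new value itself
undo_record = {"page": 42, "op": "add", "arg": -2}  # logical: the inverse of the original "+2"

def apply_redo(_value, redo):
    return redo["after"]                  # before-image + redo log = after-image

def apply_undo(value, undo):
    return value + undo["arg"]            # after-image + undo log = before-image

after_image = apply_redo(before_image, redo_record)   # 5
restored = apply_undo(after_image, undo_record)       # back to 3
assert (after_image, restored) == (5, 3)
```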

Steal and Force Strategies #

Two write strategies determine when redo and undo logs are needed: steal and force.

The steal strategy allows uncommitted cached data to be written to the database files, while no-steal does not. With steal, the data on disk may already have moved from the before-image to the after-image even though the transaction has not committed, so the undo log must cooperate: the overwritten data is recorded in the undo log so that, if the transaction rolls back, it can be restored to the before-image state.

The force strategy means that when a transaction commits, all of its changes must be flushed to disk; no-force does not require this. With no-force, the data on disk may still be in the before-image state even after the commit, so the redo log must cooperate: after a system failure, the committed changes that existed only in the cache can be replayed from the redo log, bringing the data forward to the after-image state.

It follows that most modern storage engines keep both an undo log and a redo log, which means they adopt the steal/no-force combination of strategies.
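A compact way to remember the relationship is the standard textbook matrix, written here as a small lookup table.

```python
# Which logs does crash recovery need for each combination of strategies?
#   steal    -> uncommitted changes may reach disk   -> undo log needed to roll them back
#   no-force -> committed changes may not reach disk -> redo log needed to replay them
LOGS_NEEDED = {
    ("no-steal", "force"):    [],
    ("no-steal", "no-force"): ["redo"],
    ("steal",    "force"):    ["undo"],
    ("steal",    "no-force"): ["undo", "redo"],  # the combination most modern engines use
}

print(LOGS_NEEDED[("steal", "no-force")])        # ['undo', 'redo']
```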

Now let’s look at a classic recovery algorithm that puts these logs to work.

ARIES Recovery Algorithm #

ARIES stands for Algorithms for Recovery and Isolation Exploiting Semantics.

The algorithm uses both the undo log and redo log to complete the recovery work after a database failure. The processing flow is divided into the following three steps.

  1. First, after the database restarts, it enters the analysis phase. It examines the state of dirty pages at the time of the crash to determine where in the redo log recovery should start, and at the same time collects undo information for the transactions that had not finished.
  2. Next comes the redo phase. Here, the changes that were in the page cache but had not been persisted to disk are recovered by replaying the redo log. Note that, in addition to committed data, some uncommitted data is also redone.
  3. Finally comes the undo phase. All the unfinished transactions identified during analysis are rolled back. To avoid re-executing undo work if the database crashes again during this phase, the storage engine records which undo operations have already been performed (a simplified sketch of the three phases follows this list).
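The sketch below compresses the three passes into a few lines over a toy log. It is only a sketch under heavy assumptions: real ARIES also maintains a dirty page table, compares per-page log sequence numbers, and writes compensation log records during undo, and the record shapes here are invented for illustration.

```python
def recover(log, disk):
    # 1. Analysis: find committed transactions; everything else is a "loser".
    committed = {rec["txn"] for rec in log if rec["type"] == "commit"}
    losers = {rec["txn"] for rec in log} - committed

    # 2. Redo: replay every update (including losers') to rebuild the crash-time state.
    for rec in log:
        if rec["type"] == "update":
            disk[rec["page"]] = rec["after"]

    # 3. Undo: roll back the losers, newest record first. A real engine would also
    #    log each undo step so that a crash during recovery does not repeat work.
    for rec in reversed(log):
        if rec["type"] == "update" and rec["txn"] in losers:
            disk[rec["page"]] = rec["before"]
    return disk

log = [
    {"type": "update", "txn": "t1", "page": "p1", "before": 3, "after": 5},
    {"type": "commit", "txn": "t1"},
    {"type": "update", "txn": "t2", "page": "p2", "before": 7, "after": 9},
    # ...crash before t2 commits
]
print(recover(log, {}))   # {'p1': 5, 'p2': 7}: t1 is kept, t2 is rolled back
```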

Although the ARIES algorithm was proposed many years ago, its concepts and execution flow still play an important role in modern storage engines.

So far we have discussed how databases recover data and keep it intact, which corresponds to the A and D in ACID (C, as mentioned earlier, is a constraint that is generally not considered a feature provided by the database alone). At the same time, note that recovery techniques built around the commit log also play an important role in databases that have no notion of transactions: the page cache is ubiquitous, and the main job of the commit log is to prevent the loss of data held in volatile main memory when power fails.

Summary #

That’s all for this lesson. We started with the ACID properties, introduced the main components that manage transactions, and then focused on how data durability is ensured, mainly through the commit log. The remaining topic, isolation, will be discussed in the next lesson.