12 Engine Extension: Current Trends in Distributed Storage Engines

This is the last lecture in the storage engine module. Having worked through the module, I believe you now have a comprehensive understanding of what storage engines are, how they are used, and how they work internally. In this lecture, we will first summarize the main content of Module 2 and answer some typical questions that have come up. Then I will introduce three important dimensions for evaluating storage engines. Finally, I will survey the storage engines that are currently popular in distributed databases.

Let’s start with a review of the content of this module.

Review of Storage Engines #

The storage engine is the core component of a database, bridging its physical model and its logical model. It is the main executor of data writes, query execution, high availability, and transactions. It is fair to say that understanding the storage engine means mastering the main functions of the database.

In this module, I first introduced the positioning of the storage engine in the entire database and pointed out that it is actually a part of the local execution module. Then, I introduced the implementation differences of different types of storage engines by comparing several sets of concepts such as memory vs. disk and row-oriented vs. column-oriented. Finally, I explained the characteristics of storage engines for distributed databases, which are memory-oriented, column-oriented, and hash-friendly.

In Lecture 8, I introduced the indexes of distributed databases. I emphasized that most of the data files in the storage engine are actually index structures. Then, I explored the typical read paths of distributed database storage engines with you and introduced some typical techniques on this path, such as index tables, in-memory skip lists, Bloom filters, and secondary indexes.

Next, I introduced an extremely popular storage engine in the field of distributed databases: the LSM tree. I explained its structure; its read, write, and update operations; and, most importantly, its merge (compaction) operation, the key operation that directly affects LSM tree performance. Finally, I introduced the RUM hypothesis, a classic trade-off law for database optimization.

Finally, we discussed the most essential concept of storage engines: transactions. I spent two lectures explaining the various aspects of transactions in detail. In short, a transaction is a promise the database makes to the user, namely ACID. To fulfill this promise, the database relies on several functional modules inside the storage engine. The most important is the transaction manager, which in turn needs components such as the page cache, the commit log, and the lock manager to cooperate. As a result, at the implementation level a transaction does not correspond to a single component; properties such as failure recovery and concurrency control emerge from the cooperation of these modules, because a transaction is defined by what the user ultimately experiences rather than by how it is built.

In the transaction part, we focused on two points: failure recovery and isolation levels. The former ensures that the database does not lose stored data, while the latter protects data integrity during concurrent reads and writes. At the same time, we need to distinguish transactions from the distributed consistency discussed in Module 1; please review Lecture 5 for details.

Regarding transactions, one student raised the following question, and I will answer it for you now.

After the memory data is flushed to the disk, there is also a “truncation” operation performed on the log. What is the value of this truncation?

This "truncation" is figurative: it means that the data before the truncation point has already been written to disk. During recovery, the database only needs to replay the log from the truncation point onward, which greatly speeds up recovery and also frees up log space. This truncation point is usually called a checkpoint. You can explore the related details on your own.
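To make the idea concrete, here is a minimal sketch in Go of a checkpoint truncating a write-ahead log. It is not modeled on any particular database; the types, names, and the in-memory "log" are all assumptions for illustration.

```go
package main

import "fmt"

// Record is one write-ahead-log entry; LSN is its log sequence number.
type Record struct {
	LSN   uint64
	Key   string
	Value string
}

// WAL is a hypothetical in-memory stand-in for a write-ahead log file.
type WAL struct {
	records    []Record
	checkpoint uint64 // highest LSN known to be durable in the data files
}

func (w *WAL) Append(r Record) { w.records = append(w.records, r) }

// Checkpoint records that everything up to lsn has been flushed to disk,
// then truncates the log: entries at or before lsn are no longer needed.
func (w *WAL) Checkpoint(lsn uint64) {
	w.checkpoint = lsn
	kept := make([]Record, 0, len(w.records))
	for _, r := range w.records {
		if r.LSN > lsn {
			kept = append(kept, r)
		}
	}
	w.records = kept // log space before the checkpoint is reclaimed
}

// Recover replays only the records that survived truncation.
func (w *WAL) Recover() {
	for _, r := range w.records {
		fmt.Printf("replay LSN %d: %s=%s\n", r.LSN, r.Key, r.Value)
	}
}

func main() {
	w := &WAL{}
	for i := uint64(1); i <= 5; i++ {
		w.Append(Record{LSN: i, Key: fmt.Sprintf("k%d", i), Value: "v"})
	}
	w.Checkpoint(3) // records with LSN 1-3 were flushed; drop them
	w.Recover()     // only LSN 4 and 5 need to be replayed
}
```

The point to notice is that recovery only replays records after the checkpoint, so a shorter log means a faster restart.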

Above, we briefly reviewed the basic knowledge of this module. Next, I will show you some highlights of contemporary distributed database storage engines. But before we start, we need to use a model to evaluate their characteristics.

The Golden Triangle for Evaluating Storage Engines #

The characteristics of storage engines vary greatly, each with its own features. However, overall, we can describe their behavior using three variables: the way cache is used, whether the data is mutable or immutable, and whether the stored data is ordered or unordered.

Caching Method #

Caching refers to the storage engine’s practice of first writing data into a segment in memory before writing it to the disk, in order to consolidate the data. This segment consists of a series of blocks, with the block being the smallest unit written to the disk. Ideally, the blocks written to the disk should be full, as this maximizes efficiency.
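As a rough illustration of "writing full blocks", here is a minimal Go sketch of a buffer that consolidates small writes and hands data to the disk only in fixed-size blocks. The 4 KB block size and all names are assumptions, and the disk write is simulated.

```go
package main

import "fmt"

const blockSize = 4096 // assumed block size; real engines vary

// BlockWriter accumulates writes in memory and emits only full blocks,
// so each disk write carries as much data as possible.
type BlockWriter struct {
	buf []byte
}

// Write appends data to the in-memory segment and flushes every
// complete block it can form.
func (b *BlockWriter) Write(data []byte) {
	b.buf = append(b.buf, data...)
	for len(b.buf) >= blockSize {
		writeBlockToDisk(b.buf[:blockSize])
		b.buf = b.buf[blockSize:]
	}
}

// writeBlockToDisk stands in for the actual I/O call.
func writeBlockToDisk(block []byte) {
	fmt.Printf("flushed one full block of %d bytes\n", len(block))
}

func main() {
	w := &BlockWriter{}
	for i := 0; i < 5; i++ {
		w.Write(make([]byte, 1500)) // small writes are consolidated
	}
	fmt.Printf("%d bytes still buffered, waiting to fill a block\n", len(w.buf))
}
```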

Most storage engines use caching, but in different ways. For example, WiredTiger uses its cache to buffer updates alongside B-tree nodes, offsetting the performance cost of random reads and writes, whereas LSM trees use the cache to construct ordered, immutable structures. The pattern of cache usage is therefore an important metric for evaluating storage engines.

Mutable/Immutable Data #

Whether the stored data is mutable or immutable is another dimension for evaluating the characteristics of storage engines. Immutable data is usually stored in the form of append-only logs, which allows for efficient writes. On the other hand, mutable data, represented by classic B-trees, emphasizes read performance. Therefore, mutability is generally considered an important metric for distinguishing between B-trees and LSM trees. However, a variant structure of B-trees called BW-Tree retains the characteristics of B-trees in terms of structure, but its data files are immutable.

Of course, immutable data does not mean the data never changes. The point is whether data is modified in place on the performance-critical write path. The merge (compaction) operation of an LSM tree merges and splits data files without blocking reads and writes, and some data may be discarded during the process.

Sorting #

The last variable is whether data is kept sorted in storage. Sorting is advantageous for range scans and enables operations such as BETWEEN queries. Range scans are also an effective tool for implementing secondary indexes and data classification. Both the LSM tree and the B+ tree introduced in this module store data in sorted order.

On the other hand, not sorting is generally an optimization for writes. If data is stored directly on the disk in the order in which it is written, without the need for reordering, the write performance will be excellent. The write operations of WiscKey and Bitcask, which we are going to introduce below, directly append data to the end of the file without sorting.

These are the three variables used to evaluate the characteristics of storage engines. I refer to them as the Golden Triangle because they are independent of each other and do not overlap, making it convenient to assess the characteristics of storage engines. Now let’s try to use this Golden Triangle to evaluate the characteristics of popular storage engines.

B-Tree Class #

As mentioned earlier, an important metric for evaluating storage engines is whether data can be modified in place, and B-trees are the representative class of mutable storage engines. They are the most commonly used data structure in both distributed and traditional databases. B-trees were developed to address the performance problems of search trees (such as binary search trees) on HDDs. Structurally they are shallow and wide: a lookup walks only a few nodes from root to leaf, and for B+ trees it is often possible to hold all non-leaf nodes in memory, so a lookup needs at most one disk read.
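To see why "shallow and wide" matters, here is a small back-of-the-envelope calculation in Go; the fanout of 100 and the key count of 100 million are assumed figures, not taken from any specific engine.

```go
package main

import "fmt"

// height returns how many levels a tree needs to index n keys when each
// node can point to up to fanout children.
func height(n, fanout int) int {
	levels, capacity := 0, 1
	for capacity < n {
		capacity *= fanout
		levels++
	}
	return levels
}

func main() {
	// With an assumed fanout of 100, 100 million keys fit in ~4 levels,
	// so a point lookup touches only a handful of pages.
	fmt.Println(height(100_000_000, 100)) // 4
	// A binary search tree (fanout 2) over the same keys is ~27 levels deep.
	fmt.Println(height(100_000_000, 2)) // 27
}
```

The wide fanout is what keeps the number of disk accesses per lookup so small compared with a binary search tree.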

Now let me introduce several typical storage engines with B-tree structures.

InnoDB #

InnoDB is currently the default storage engine for MySQL and has also been the default for MariaDB since version 10.2. In terms of the criteria above, InnoDB's B+ tree nodes are mutable, and the data in its leaf nodes is stored in sorted order. As data is continuously written, the B+ tree expands horizontally: existing nodes split into multiple nodes while the height stays the same. InnoDB's caching strategy anticipates this splitting by reserving a portion of its memory pages to accommodate potential node splits.

This reserved space is, in effect, a form of waste: it is space amplification. In terms of the RUM hypothesis, InnoDB sacrifices space to optimize reads and writes.

At the transaction level, InnoDB combines MVCC with various pessimistic locks to implement the full range of isolation levels.
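The following is a highly simplified Go sketch of the MVCC idea only, not InnoDB's actual undo-log format or read-view rules: each row keeps a chain of versions tagged with the transaction that created them, and a reader sees the newest version created at or before its snapshot.

```go
package main

import "fmt"

// Version is one entry in a row's version chain, tagged with the ID of
// the transaction that created it.
type Version struct {
	TrxID int
	Value string
}

// Row keeps its versions newest-first, standing in for a chain of
// historical versions kept alongside the current row.
type Row struct {
	versions []Version
}

// Write prepends a new version instead of overwriting the old one.
func (r *Row) Write(trxID int, value string) {
	r.versions = append([]Version{{trxID, value}}, r.versions...)
}

// ReadAt returns the newest version visible to a snapshot taken at
// snapshotTrxID, i.e. created by a transaction with an ID at or below it.
func (r *Row) ReadAt(snapshotTrxID int) (string, bool) {
	for _, v := range r.versions {
		if v.TrxID <= snapshotTrxID {
			return v.Value, true
		}
	}
	return "", false
}

func main() {
	row := &Row{}
	row.Write(10, "a")
	row.Write(20, "b")
	v, _ := row.ReadAt(15) // snapshot taken between the two writes
	fmt.Println(v)         // "a": the later write by trx 20 is invisible
	v, _ = row.ReadAt(25)
	fmt.Println(v) // "b"
}
```

Real engines combine such version chains with locks for writes, which is the pessimistic part the text mentions.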

WiredTiger #

WiredTiger is the default storage engine for MongoDB. It addresses the performance degradation MongoDB used to suffer when most of the data had to be held in memory and memory came under pressure.

WiredTiger uses a B-tree structure rather than InnoDB's B+ tree. This is primarily because MongoDB is a document database that stores related data together (you can think of a document as a relational row with extra columns attached). Such a database rarely performs joins, does not depend on range scans, and can usually retrieve all the data it needs in a single access. Although the latency of an individual B-tree lookup varies more from query to query, its average performance is better than that of a B+ tree.

So WiredTiger is a mutable structure, and since it does not perform sequential scans, its data is not kept sorted. How, then, does it use its cache? This is where it differs from InnoDB.

In the cache, each tree node is paired with an update buffer implemented as a skip list. Inserts and updates are written to this buffer instead of modifying the node directly. The advantage of this approach is that the skip list requires no reserved space and offers good concurrency. When flushing to disk, the data in the skip list is merged into the node's page.

From this we can see that WiredTiger trades some query performance for space utilization and write performance: when querying, the data in the skip list must be merged with the page data to obtain the latest values.
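Here is a rough Go sketch of this mechanism. An ordinary map stands in for the skip list, concurrency is ignored, and all names are mine; it only shows how writes go to a per-node buffer, reads merge the buffer with the page, and a flush folds the buffer back into the page.

```go
package main

import "fmt"

// Node is a hypothetical B-tree leaf: the on-page data plus an update
// buffer (a map here where WiredTiger uses a skip list).
type Node struct {
	page    map[string]string // data as last written to disk
	updates map[string]string // buffered inserts and updates
}

func newNode() *Node {
	return &Node{page: map[string]string{}, updates: map[string]string{}}
}

// Write never touches the page in place; it only goes to the buffer.
func (n *Node) Write(k, v string) { n.updates[k] = v }

// Read must merge the buffer with the page to see the latest value,
// which is the extra cost paid on the read path.
func (n *Node) Read(k string) (string, bool) {
	if v, ok := n.updates[k]; ok {
		return v, true
	}
	v, ok := n.page[k]
	return v, ok
}

// Flush merges the buffered updates into the page when writing to disk.
func (n *Node) Flush() {
	for k, v := range n.updates {
		n.page[k] = v
	}
	n.updates = map[string]string{}
}

func main() {
	n := newNode()
	n.page["a"] = "old"
	n.Write("a", "new")
	fmt.Println(n.Read("a")) // "new true": read merges buffer and page
	n.Flush()
	fmt.Println(n.page["a"]) // "new": buffer folded into the page
}
```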

BW-Tree #

BW-Tree is the core storage technology behind Microsoft's Azure Cosmos DB. It achieves a high-performance B-tree-like structure by combining software optimizations with hardware-conscious ones, and its storage layer is built on the LLAMA cache/storage subsystem. If you are interested, you can look it up yourself; here we will focus on the data structure optimization.

BW-Tree assigns a logical page ID to every node, and all modifications to a node are turned into appended log records, a sequential write pattern similar to an LSM tree's: writes and deletes are completed through log operations. This structure effectively alleviates the write amplification and space amplification problems of B-trees, and because the work is spread over many small logs, concurrency also improves.

When flushing to disk, data is written out from these logs, which turns random writes into sequential writes and improves flush efficiency. You will notice that BW-Tree, like the LSM tree, suffers from read amplification: a query must combine the underlying data with the log data, and if a log grows too long, reads slow down. For this, Cosmos uses a hardware solution that detects which parts of the same log file need to be merged and schedules them onto the same processing node, accelerating the convergence of the logs.
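Below is a minimal Go sketch of the delta-record idea behind this description. The names are mine, the map from page ID to page state is an assumption about how the indirection works, and the real Bw-tree performs these steps latch-free with compare-and-swap, which is omitted here. It shows updates appended as deltas, reads walking the delta chain (the read amplification mentioned above), and consolidation collapsing the chain.

```go
package main

import "fmt"

// Delta is one logged modification prepended to a page's chain.
type Delta struct {
	Key, Value string
	Deleted    bool
	Next       *Delta
}

// Page is a base page plus its delta chain.
type Page struct {
	base  map[string]string
	chain *Delta
}

// mapping is the indirection table from logical page ID to page state.
var mapping = map[int]*Page{}

// Update prepends a delta instead of modifying the base page, which is
// the sequential, log-like write pattern described above.
func Update(pid int, key, value string, deleted bool) {
	p := mapping[pid]
	p.chain = &Delta{Key: key, Value: value, Deleted: deleted, Next: p.chain}
}

// Read walks the delta chain before falling back to the base page; a long
// chain is exactly the read amplification discussed above.
func Read(pid int, key string) (string, bool) {
	p := mapping[pid]
	for d := p.chain; d != nil; d = d.Next {
		if d.Key == key {
			return d.Value, !d.Deleted
		}
	}
	v, ok := p.base[key]
	return v, ok
}

// Consolidate folds the chain into a new base page, shortening reads.
func Consolidate(pid int) {
	p := mapping[pid]
	merged := map[string]string{}
	for k, v := range p.base {
		merged[k] = v
	}
	// Apply deltas oldest-first so newer ones win.
	var apply func(d *Delta)
	apply = func(d *Delta) {
		if d == nil {
			return
		}
		apply(d.Next)
		if d.Deleted {
			delete(merged, d.Key)
		} else {
			merged[d.Key] = d.Value
		}
	}
	apply(p.chain)
	mapping[pid] = &Page{base: merged}
}

func main() {
	mapping[1] = &Page{base: map[string]string{"a": "1"}}
	Update(1, "b", "2", false)
	Update(1, "a", "", true)  // delete "a" via a delta
	fmt.Println(Read(1, "a")) // "false": the delete delta hides "a"
	Consolidate(1)
	fmt.Println(Read(1, "b")) // "2 true"
}
```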

These are the typical B-tree-like storage engines, each with its own characteristics, and their optimization methods for the same problem also provide us with many insights.

LSM Class #

I dedicated a full lecture of this module to the LSM tree. It is a typical immutable data structure, and it likewise uses the cache to turn random writes into sequential writes. When we discussed LSM trees, I mentioned that the data they store is ordered. In fact, there are currently two unordered structures that are also receiving more and more attention.

Classic Storage #

Classic LSM implementations include LevelDB and RocksDB, which was developed from it. As introduced earlier, their characteristic is to use the cache to convert random writes into sequential writes and then produce sorted, immutable data files. They are friendly to writes and to space, but sacrifice read performance.
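Here is a minimal Go sketch of that classic pattern. A plain map stands in for the memtable and a sorted slice for an SSTable; real engines such as RocksDB add a write-ahead log, levels, compaction, and Bloom filters, all omitted here. It shows random writes landing in memory and being flushed as one sorted, immutable file, and a read that must check the memtable and every table.

```go
package main

import (
	"fmt"
	"sort"
)

// Entry is one key-value pair in a flushed, immutable "SSTable".
type Entry struct{ Key, Value string }

// LSM is a toy engine: an in-memory memtable plus a list of sorted,
// immutable tables, newest last.
type LSM struct {
	memtable map[string]string
	tables   [][]Entry
}

func newLSM() *LSM { return &LSM{memtable: map[string]string{}} }

// Put is a purely in-memory operation, so writes stay fast.
func (db *LSM) Put(k, v string) { db.memtable[k] = v }

// Flush turns the memtable into one sorted, immutable table on "disk".
func (db *LSM) Flush() {
	keys := make([]string, 0, len(db.memtable))
	for k := range db.memtable {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	table := make([]Entry, 0, len(keys))
	for _, k := range keys {
		table = append(table, Entry{k, db.memtable[k]})
	}
	db.tables = append(db.tables, table)
	db.memtable = map[string]string{}
}

// Get shows the read amplification: check the memtable, then every
// table from newest to oldest.
func (db *LSM) Get(k string) (string, bool) {
	if v, ok := db.memtable[k]; ok {
		return v, true
	}
	for i := len(db.tables) - 1; i >= 0; i-- {
		t := db.tables[i]
		j := sort.Search(len(t), func(x int) bool { return t[x].Key >= k })
		if j < len(t) && t[j].Key == k {
			return t[j].Value, true
		}
	}
	return "", false
}

func main() {
	db := newLSM()
	db.Put("b", "2")
	db.Put("a", "1")
	db.Flush() // one sorted, immutable table written sequentially
	db.Put("a", "3")
	fmt.Println(db.Get("a")) // "3 true": newer data shadows older tables
}
```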

Bitcask #

Bitcask is a storage engine of the distributed key-value database Riak, and a typical unordered storage structure. It differs fundamentally from the classic LSM tree introduced earlier in that it has no in-memory table structure. In other words, it does not buffer writes at all and writes data directly to its data files.

As you can see, its write efficiency is very high and its memory footprint very small. But how do you query this "heap" of data? The answer is an in-memory structure called Keydir, which holds a reference to the latest version of each key. Old versions still reside in the data files but are no longer referenced by Keydir and are eventually removed by the garbage collector. Keydir itself is a hash table, rebuilt from the data files when the database starts up.
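The following minimal Go sketch shows the Keydir idea. A byte slice stands in for the append-only data file and the names are mine; it only illustrates that every write appends to the file while the in-memory hash table remembers where the latest version of each key lives.

```go
package main

import "fmt"

// location points at the latest version of a key inside the data file.
type location struct {
	offset int
	size   int
}

// Bitcask is a toy version of the engine: an append-only "file" plus the
// in-memory Keydir hash table.
type Bitcask struct {
	file   []byte
	keydir map[string]location
}

func newBitcask() *Bitcask { return &Bitcask{keydir: map[string]location{}} }

// Put appends the value to the end of the file (no sorting, no memtable)
// and points the Keydir at the new location; the old copy becomes garbage.
func (b *Bitcask) Put(key, value string) {
	off := len(b.file)
	b.file = append(b.file, value...)
	b.keydir[key] = location{offset: off, size: len(value)}
}

// Get needs exactly one file read, at the offset the Keydir remembers.
func (b *Bitcask) Get(key string) (string, bool) {
	loc, ok := b.keydir[key]
	if !ok {
		return "", false
	}
	return string(b.file[loc.offset : loc.offset+loc.size]), true
}

func main() {
	db := newBitcask()
	db.Put("user:1", "alice")
	db.Put("user:1", "bob")      // old "alice" bytes remain, unreferenced
	fmt.Println(db.Get("user:1")) // "bob true"
}
```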

This kind of lookup clearly improves on the read amplification problem of LSM trees, because each piece of data has exactly one reference into a disk file and no cached copies, so only one location needs to be read to retrieve it. The drawbacks are equally obvious: range queries are not supported, and if the data volume is large, startup can take quite a long time.

This structure optimizes writes, space, and point lookups, but sacrifices the ability to perform range queries.

WiscKey #

So, is there a structure that can take advantage of the high-speed writing and space utilization brought by unordered storage, and also support very useful range queries? The WiscKey structure is precisely an attempt to solve this problem.

Its defining characteristic is that keys and values are stored in two separate files. The keys are still organized as an LSM tree, which keeps them ordered and makes range scans possible. And because it uses an LSM tree, not all keys need to be held in memory, which solves Bitcask's slow-loading problem.

The value part is called the vLog (value log), and the data in it is unordered. This structure suits scenarios with few updates and deletions, because range scans now require random reads, and when updates and deletions are frequent, merging the vLog to reclaim space is inefficient: the merge must scan the keys first to determine a merge plan, a step that an ordinary LSM tree does not need.

WiscKey is very well suited to SSDs, because reading values requires random reads. Currently, Badger from dgraph.io is a mature implementation of this design.
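To close out this part, here is a minimal Go sketch of key-value separation. A plain map of keys stands in for the key LSM tree and a byte slice for the vLog, and this is not Badger's actual API; it only shows that each key stores a small pointer into the unsorted value log, so a range scan walks sorted keys but follows each pointer with a random read.

```go
package main

import (
	"fmt"
	"sort"
)

// pointer locates one value inside the vLog.
type pointer struct {
	offset int
	size   int
}

// WiscKey is a toy version of the design: a small key index (the real
// design keeps this in an LSM tree) plus an append-only value log.
type WiscKey struct {
	keys map[string]pointer
	vlog []byte
}

func newWiscKey() *WiscKey { return &WiscKey{keys: map[string]pointer{}} }

// Put appends the value to the vLog and records only a small pointer
// next to the key, keeping the key structure compact.
func (db *WiscKey) Put(key, value string) {
	off := len(db.vlog)
	db.vlog = append(db.vlog, value...)
	db.keys[key] = pointer{offset: off, size: len(value)}
}

// Scan walks keys in sorted order, then follows each pointer with a
// random read into the vLog; this is why SSDs suit the design.
func (db *WiscKey) Scan(from, to string) {
	sorted := make([]string, 0, len(db.keys))
	for k := range db.keys {
		if k >= from && k <= to {
			sorted = append(sorted, k)
		}
	}
	sort.Strings(sorted)
	for _, k := range sorted {
		p := db.keys[k]
		fmt.Printf("%s=%s\n", k, db.vlog[p.offset:p.offset+p.size])
	}
}

func main() {
	db := newWiscKey()
	db.Put("c", "3")
	db.Put("a", "1")
	db.Put("b", "2")
	db.Scan("a", "b") // sorted keys, random reads into the unsorted vLog
}
```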

Summary #

That brings us to the end of this lecture. I have reviewed the main content of Module 2 with you. It is a foundational module that prepares you for the upcoming distributed modules. At the same time, compared with traditional relational databases, the storage engines of distributed databases have their own characteristics, such as the LSM tree structure, which you should make sure to master.