09 Log-Structured Storage - Why Choose It as a Foundational Storage Layer #

In the previous lecture, we learned about the logical concepts and architecture of storage engines. Those concepts are highly abstract summaries of the characteristics of several storage engines, intended to give you a general understanding of database storage engines, especially distributed database storage engines, and to establish a knowledge framework.

However, abstract content alone, without concrete examples, is not enough to really understand these concepts. In this lecture, I will take the classic Log-Structured Merge Tree (LSM Tree) as a starting point to demonstrate the design characteristics of storage engines intuitively, and explain why this kind of storage engine is particularly well suited to distributed databases.

So, first, let’s start by introducing the structural characteristics of the LSM Tree.

Structure of LSM Tree #

The structure of the LSM Tree storage engine is implied in its name. “LS” stands for Log-Structured, indicating that it stores data in the form of a log. So what characterizes a log? If you know a little financial accounting, you will know that when canceling a record, accountants do not erase it with an eraser; instead, they write an offsetting entry equal to the original amount. This is the typical pattern of log-structured storage.

Log-structured storage is very friendly to writes. Unlike structures such as B-trees, which require random writes, a log-structured store writes sequentially. This matters because a conventional HDD is a rotating mechanical device: write latency comes mainly from platter rotation and the movement of the read-write arm. Writing data sequentially avoids most of that mechanical movement and therefore greatly speeds up writes on such disks.
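
To make this concrete, here is a minimal sketch in Python of the append-only pattern; the class and file layout are hypothetical, not any particular engine’s format. Every change, including a delete, becomes a sequential append rather than an in-place update.

```python
import json

class AppendOnlyLog:
    """Minimal log-structured store: every change is appended, never overwritten in place."""

    def __init__(self, path):
        # Append mode means the file only ever grows at its end,
        # so the disk sees sequential writes instead of random ones.
        self._file = open(path, "a", encoding="utf-8")

    def put(self, key, value):
        self._append({"key": key, "value": value})

    def delete(self, key):
        # A delete is just another appended record (an "offsetting entry"),
        # not an erase of the original data.
        self._append({"key": key, "deleted": True})

    def _append(self, record):
        self._file.write(json.dumps(record) + "\n")
        self._file.flush()
```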

The “M” in LSM Tree indicates that the structure relies on merge operations to form the final readable structure. Thanks to merging, a read does not have to hunt down every change ever made to a record, which speeds up reads; merging multiple records into one final result also saves storage space. Merging has many advantages, but it is not free: it consumes CPU and additional storage space.

Now let’s start a detailed introduction to the structure of the LSM Tree.

The LSM Tree consists of memory-resident units and disk-resident units. First, the data is written to a buffer in memory, and then written to an immutable file on the disk.

The memory-resident unit is generally called the MemTable. It is a mutable structure that serves as a temporary buffer for incoming data and also serves reads. When the amount of data reaches a threshold, the data is written in a batch to an immutable file on disk.

As we can see, its main functions are to sort the data before it goes to disk and to batch the writes, which improves write efficiency. However, if the database crashes, the data in memory is lost. This is where the commit log mentioned in “07 | Overview: What is a storage engine and why do we need to understand it” comes in: replaying the commit log recovers the in-memory data. The precondition is that every write must be recorded in the commit log before it is applied to memory.

The disk-resident units, also known as data files, are generated when the memory buffer is flushed to disk. These data files are immutable and serve only reads, whereas the MemTable serves both reads and writes.
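
Putting the memtable, the commit log, and the data files together, the write path can be sketched roughly as follows. This is a simplified illustration with made-up names; a real engine adds fsync policies, concurrency control, a sorted in-memory structure such as a skip list, and a proper on-disk file format.

```python
class CommitLog:
    """Stand-in commit log kept in a list; a real one appends to a durable file."""
    def __init__(self):
        self.entries = []
    def append(self, key, value):
        self.entries.append((key, value))
    def truncate(self):
        self.entries.clear()


class LSMWritePath:
    """Sketch of the LSM write path: commit log first, then memtable, flush at a threshold."""

    def __init__(self, flush_threshold_bytes=64):
        self.commit_log = CommitLog()     # must be durable in a real engine
        self.memtable = {}                # mutable in-memory buffer
        self.memtable_bytes = 0
        self.flush_threshold_bytes = flush_threshold_bytes
        self.data_files = []              # immutable, read-only tables "on disk"

    def put(self, key, value):
        # 1. Record the write in the commit log BEFORE applying it to memory,
        #    so a crash can be recovered by replaying the log.
        self.commit_log.append(key, value)
        # 2. Apply the write to the memtable, which also serves reads.
        self.memtable[key] = value
        self.memtable_bytes += len(key) + len(value)
        # 3. When the buffer is large enough, flush it as one sorted, immutable table.
        if self.memtable_bytes >= self.flush_threshold_bytes:
            self.flush()

    def flush(self):
        sorted_items = tuple(sorted(self.memtable.items()))  # one batched, sequential write
        self.data_files.append(sorted_items)                 # stands in for an immutable file
        self.memtable, self.memtable_bytes = {}, 0
        self.commit_log.truncate()        # flushed entries no longer need to be replayed
```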

There are generally two kinds of LSM Tree structures: the double-tree structure and the multi-tree structure. The former is mainly of theoretical interest; no real storage engine currently uses it. So I will only briefly explain the double-tree concept, which will help you understand the multi-tree structure.

In the double-tree structure, the two trees are one tree in the memory-resident unit and one tree in the disk-resident unit; you can picture both as B-tree-like structures. When the in-memory data is flushed to disk, it is merged with part of the data on disk and written under a node of the large on-disk tree. Once the merge succeeds, the pre-merge in-memory data and the old disk data are removed.

We can see that the double-tree approach is relatively straightforward and can serve as a B-tree-like index structure. In practice, however, hardly any storage engine uses it. The main reason is that its merge is synchronous: the merge must be performed as part of each flush, and flushing is a fairly frequent operation. This causes write amplification, hurting write efficiency and consuming a large amount of disk space.

The multi-tree structure was proposed on the basis of the double-tree structure. It does not merge while flushing memory data to disk; instead, the in-memory data is written in its entirety to a separate file. This, however, creates another problem: as flushing continues, the number of files on disk grows rapidly, so reads have to search for a record across many files, and read efficiency drops sharply.

To solve this problem, the multi-tree structure introduces a merge operation (compaction), which runs asynchronously. It selects some of the existing files, reads their contents, merges them into a new file, and then replaces the old files with it. The diagram below shows a typical merge operation in a multi-tree structure. This is also the structure this lecture focuses on.

[Figure 1: A typical merge (compaction) operation in a multi-tree LSM structure]

Finally, let me explain the flushing process in detail.

First, let’s define several roles, as shown in the table below.

[Figure 2: The roles involved in the flushing process]

Data is first written to the current memtable. When its volume reaches a threshold, that memtable switches to the flushing state and stops accepting writes, and another memtable is created to take over write requests. Once the flush completes and the data is on disk, the obsolete memtable is discarded, the commit log is truncated, and the newly written data table is set to the readable state.
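
The state transitions just described can be sketched as follows. The state names and the `engine` object’s methods are purely illustrative; they do not belong to any specific engine.

```python
from enum import Enum

class TableState(Enum):
    CURRENT = "memtable accepting reads and writes"
    FLUSHING = "memtable that is read-only while being written to disk"
    READABLE = "immutable on-disk data table serving reads"
    DISCARDED = "obsolete memtable whose contents now live on disk"

def flush(engine):
    """Walk one memtable through the states above (hypothetical engine interface)."""
    old = engine.memtable
    old.state = TableState.FLUSHING            # stop accepting writes, keep serving reads
    engine.memtable = engine.new_memtable()    # a fresh CURRENT memtable takes over writes

    data_file = engine.write_data_file(old)    # persist the sorted contents to disk
    data_file.state = TableState.READABLE      # the new on-disk table starts serving reads

    engine.commit_log.truncate()               # flushed entries no longer need replay
    old.state = TableState.DISCARDED           # the obsolete memtable can now be dropped
```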

At the beginning of a merge operation, the tables to be merged are set to the merging state; they can still serve reads. After the merge completes, the original tables are invalidated and the new table starts serving reads.

These are the classic structure and some operational details of the LSM tree. Now, let’s begin to discuss how to perform query, update, and delete operations on it.

Query, Update, and Delete Operations #

The query operation itself involves nothing unique to the LSM tree. Since the target data may live in the memtable as well as in several data tables, results from multiple sources have to be combined, and a sorted merge is used for this, because the data inside a memtable or a data table is already sorted. The sorted-merge algorithm is widely used in databases such as Oracle and MySQL, and the database middleware Apache ShardingSphere also uses it to handle ORDER BY statements across multiple data sources. If you are interested, you can dig into it further; I won’t cover it in detail here.
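
For instance, combining the sorted output of a memtable and a data table is a standard k-way sorted merge, which Python’s `heapq.merge` performs directly. The in-memory lists below stand in for what would really be iterators over on-disk files.

```python
import heapq

def merged_scan(*sorted_sources):
    """Merge several already-sorted (key, value) iterators into one sorted stream."""
    # heapq.merge streams its inputs, never loading a whole source into memory.
    return heapq.merge(*sorted_sources, key=lambda kv: kv[0])

memtable_items = [("k1", "a"), ("k4", "d")]
data_table_items = [("k2", "b"), ("k3", "c")]
print(list(merged_scan(memtable_items, data_table_items)))
# [('k1', 'a'), ('k2', 'b'), ('k3', 'c'), ('k4', 'd')]
```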

Another issue during queries is handling different versions of the same piece of data. Merging resolves part of the problem, but a query must still reconcile data that has not yet been merged. As I mentioned earlier, modifications and deletions in an LSM tree are essentially new records appended on top of old ones, so a single logical piece of data may have several records spread across memtables and data tables, and these conflicts must be resolved at query time. Records belonging to the same data share the same key, and each version usually carries a timestamp; the timestamps establish the write order, somewhat like a vector clock, so determining the latest version during a query is easy.
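
A minimal sketch of that conflict resolution: gather all candidate versions of a key and keep the one with the highest timestamp. The record layout here is made up for illustration.

```python
def resolve_latest(records):
    """records: iterable of (key, timestamp, value); keep the newest version of each key."""
    latest = {}
    for key, ts, value in records:
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, value)
    return {key: value for key, (ts, value) in latest.items()}

# Two versions of k1 collected from different tables; the later write (ts=7) wins.
print(resolve_latest([("k1", 3, "old"), ("k1", 7, "new"), ("k2", 5, "x")]))
# {'k1': 'new', 'k2': 'x'}
```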

Update and delete operations essentially insert data and then rely on the conflict-resolution mechanism and merge operations described above to produce the final result. Updating is straightforward: just insert the new value. But what do we insert for a delete?

Generally, a special value called a tombstone is inserted to indicate that the record has been deleted. To delete data within a whole range, Apache Cassandra, for example, uses range tombstones.

For example, suppose there are 10 pieces of data, k0 through k9. If a start-deletion marker is set at k3 (inclusive) and an end-deletion marker at k7 (exclusive), then the four pieces of data from k3 to k6 are deleted. Queries will no longer return k4 to k6, even though no tombstone was ever set on them individually.
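
The example can be expressed as a small filter: a range tombstone hides every key in [start, end), even though no per-key tombstone was ever written for k4 to k6. This is an illustrative sketch, not Cassandra’s actual implementation.

```python
def visible(key, range_tombstones):
    """range_tombstones: list of (start, end) pairs; start is inclusive, end is exclusive."""
    return not any(start <= key < end for start, end in range_tombstones)

keys = [f"k{i}" for i in range(10)]     # k0 .. k9
tombstones = [("k3", "k7")]             # deletes k3, k4, k5, k6
print([k for k in keys if visible(k, tombstones)])
# ['k0', 'k1', 'k2', 'k7', 'k8', 'k9']
```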

Above, we have introduced the basic operations of the LSM tree. Now, let me explain one very special operation: the merge operation.

Merge Operation #

The merge operation maintains the structure of the LSM tree and keeps it working properly. Note that the merge operation discussed here is specifically the compaction used by the multi-tree structure described above. In the multi-tree structure, the number of tables on disk keeps growing as flushes continue, and compaction is the means of reducing that number.

The merge operation selects several of the data files on disk according to certain rules, merges them, and writes a new file to disk; after a successful merge, the old data is deleted. Memory consumption stays fully under control throughout, because each data file is already sorted. Just as in the query path described above, the data in multiple files can be combined with a sorted merge that loads only a small portion of each file at a time, namely the records at the head of each file, and merges them in memory. This keeps the memory footprint of the merge operation small.
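
A compaction pass can therefore be sketched as a streaming merge: read the sorted inputs front to back, keep only the newest version of each key, and emit one new sorted output. The record layout and helper names are made up; real engines also handle tombstone expiry and may split the output into several files.

```python
import heapq

def compact(sorted_inputs, emit):
    """sorted_inputs: iterators of (key, timestamp, value), each sorted by key and
    holding at most one version of a key. Only the head of each input is in memory."""
    merged = heapq.merge(*sorted_inputs, key=lambda r: (r[0], -r[1]))
    last_key = None
    for key, ts, value in merged:
        if key == last_key:
            continue                     # an older version of a key we already emitted
        last_key = key
        emit((key, ts, value))

# Two small "files": the newer version of k2 (timestamp 9) wins.
older = [("k1", 1, "a"), ("k2", 2, "b")]
newer = [("k2", 9, "B"), ("k3", 3, "c")]
out = []
compact([older, newer], out.append)
print(out)  # [('k1', 1, 'a'), ('k2', 9, 'B'), ('k3', 3, 'c')]
```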

During the entire merge process, the old data tables continue to serve reads, which means the old data still exists on disk; enough spare disk space is therefore needed to hold the newly generated tables. Merge operations can run in parallel, but as a rule the data they operate on does not overlap, to avoid contention. A merge can combine several data files into one, or split one data file into several.

The commonly used merge strategies are Size-Tiered Compaction and Leveled Compaction.

Size-Tiered Compaction #

The following figure shows the merge process of this strategy.

[Figure 3: The merge process of Size-Tiered Compaction]

In this strategy, data tables are merged according to their sizes: smaller data tables are gradually merged into larger ones. The first level holds the smallest tables in the system, those just flushed from the memtable, and merging combines the smaller tables of the lower levels into the larger tables of the higher levels. Apache Cassandra has used this merge strategy.
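
The selection logic can be sketched like this: group data files of similar size into buckets and treat a bucket as a merge candidate once it holds enough files. This only illustrates the idea; the real strategy in Cassandra has more tunables.

```python
def size_tiered_candidates(file_sizes, min_batch=4, bucket_ratio=2.0):
    """Group files whose sizes are within bucket_ratio of the bucket's smallest file;
    a bucket with at least min_batch files becomes a merge candidate."""
    buckets = []
    for size in sorted(file_sizes):
        if buckets and size <= buckets[-1][0] * bucket_ratio:
            buckets[-1].append(size)
        else:
            buckets.append([size])
    return [b for b in buckets if len(b) >= min_batch]

# Four similarly sized small files (just flushed) qualify for a merge into one larger file.
print(size_tiered_candidates([5, 6, 7, 9, 100, 110, 400]))
# [[5, 6, 7, 9]]
```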

The advantage of this strategy is that it is simple and easy to implement. Its drawback is space amplification, which gets worse at the higher levels. For example, merging two 5 GB files requires at least 10 GB of additional disk space to complete the operation. You can imagine the capacity pressure this creates, which can easily destabilize the system.

Is there any strategy that can alleviate space amplification? The answer is Leveled Compaction.

Leveled Compaction #

As the name suggests, this strategy divides data tables into multiple levels, numbered L0 to Ln. L0 consists of tables flushed directly from the memtable, and their key ranges may overlap. Tables in L1 and above are, in effect, what you get by splitting the large tables of Size-Tiered Compaction into many small tables with non-overlapping key ranges. Each level has a maximum data threshold, and reaching it triggers a merge. The thresholds grow exponentially from level to level; for example, the RocksDB documentation describes a distribution where L1 is 300 MB, L2 is 3 GB, L3 is 30 GB, and L4 is 300 GB.
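
The exponential thresholds can be captured in one line; the 300 MB base and the factor of 10 simply follow the RocksDB example above.

```python
def level_max_bytes(level, base_bytes=300 * 10**6, multiplier=10):
    """Maximum data volume allowed at a level: L1 = 300 MB, L2 = 3 GB, L3 = 30 GB, ..."""
    return base_bytes * multiplier ** (level - 1)

for lvl in range(1, 5):
    print(f"L{lvl}: {level_max_bytes(lvl) / 10**9:g} GB")
# L1: 0.3 GB   L2: 3 GB   L3: 30 GB   L4: 300 GB
```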

The strategy is shown in the following figure.

[Figure 4: The structure of Leveled Compaction]

The figure above shows, in simplified form, that every small data table in L1 and above has the same capacity and that the per-level threshold grows tenfold: L1 can hold at most 10 data tables, L2 at most 100, and so on.

As data tables keep being written, the volume of data in L1 will exceed its threshold. At that point at least one table in L1 is selected, its data is merged into the L2 tables whose key ranges intersect it, and the corresponding data is removed from L1. Taking the figure as an example again, the key range of one L1 table roughly spans 10 L2 tables, so a single merge touches 11 files. Once this merge finishes, the data volume in L2 may in turn exceed its threshold, triggering a merge from L2 to L3, and so on.
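
A minimal sketch of that file-selection step: pick the tables in the next level whose key ranges intersect the chosen table. The key ranges below are made up for illustration; in the figure’s proportions one L1 table would overlap roughly 10 L2 tables.

```python
def overlapping_tables(table, next_level_tables):
    """Each table is a (min_key, max_key) pair; tables above L0 never overlap each other,
    so a simple interval-intersection test finds the merge participants."""
    lo, hi = table
    return [t for t in next_level_tables if not (t[1] < lo or hi < t[0])]

l1_table = ("k10", "k19")
l2_tables = [("k00", "k09"), ("k10", "k12"), ("k13", "k16"), ("k17", "k21"), ("k22", "k30")]
print(overlapping_tables(l1_table, l2_tables))
# [('k10', 'k12'), ('k13', 'k16'), ('k17', 'k21')]  these plus the L1 table get rewritten
```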

As can be seen, compared with Size-Tiered Compaction, Leveled Compaction does not have to pick up all the data within a level for each merge, and the key ranges of the tables within a level do not overlap. This reduces duplicated keys and greatly alleviates space amplification.

Of course, practical systems combine the two strategies. The classic RocksDB, for example, uses Size-Tiered Compaction when merging from L0 to L1 and Leveled Compaction from L1 onward, because the tables in L0 are bound to contain overlapping keys.

The above covers the classic merge problem in LSM trees. Compaction runs into various dilemmas, such as the space amplification mentioned earlier. Next, I will introduce the RUM hypothesis to analyze this class of problems in more detail.

RUM Hypothesis #

Before introducing this hypothesis, you need to understand several concepts of “amplification”.

  1. Read amplification. It comes from having to retrieve data from multiple files and resolve conflicts during a read. As the query discussion showed, the more places a read has to look, the more the read is slowed down; merge operations effectively alleviate read amplification.
  2. Write amplification. For LSM trees, it comes from continuous merging; Leveled Compaction in particular can trigger cascading merges across several consecutive levels, multiplying the amount of data that gets rewritten.
  3. Space amplification. This is the concept I mentioned when discussing merging: the same key is stored in multiple copies, which arise during compaction. Size-Tiered Compaction in particular suffers badly from it.

So, can we solve these three problems at the same time? According to the RUM hypothesis, the answer is no.

This hypothesis identifies three key overheads that database systems optimize for: Read, Update, and Memory, hence RUM. They map onto the three kinds of amplification above: R corresponds to read amplification, U to write amplification, and M to space amplification (Memory here should be understood broadly as storage, not just RAM).

The hypothesis states that optimizing any two of these costs inevitably increases the third; you cannot have all three at once. The LSM tree sacrifices read performance to maximize write performance and space utilization. I have already explained in detail why its writes are efficient, so I won’t repeat that here.

Some of you may have noticed that merge operations cause space amplification and in theory waste space. However, because LSM tree files are immutable, block compression can be introduced to optimize space usage, and no extra space needs to be reserved for in-place updates (B-trees need extra room for tree-update operations), so the LSM tree can still use space efficiently.

Keep in mind that the RUM model is an oversimplification: important characteristics such as latency and maintainability are not covered. Still, it is a concise tool in our toolbox for quickly analyzing and grasping the characteristics of a storage engine.

Summary #

So far, we have studied a storage engine typical of distributed databases. Its characteristics show why: fast writes are very attractive to a distributed database, and its key-value structure lends itself naturally to sharding, making it well suited for building one. That is why distributed databases such as Apache Cassandra, ClickHouse, and TiDB choose LSM trees or similarly structured storage engines.