07 What is a Storage Engine and Why We Need to Understand It #

After studying the first module, you should already know what a distributed database is and have a solid grasp of its core concepts.

This lecture is an overview of the second module, the storage engine, with the main purpose of explaining what a storage engine is and what role it plays in a distributed database.

One of the primary goals of a database is to manage data reliably and efficiently for people to use. Different applications can share their data through the same database; the emergence of databases led people to abandon the practice of building separate data storage for each individual application. As databases became widely used, their processing capabilities evolved rapidly, culminating in systems as capable as modern distributed databases.

To support a variety of scenarios, general-purpose databases are often built from multiple modules or subsystems. This allows a database project team to combine different sub-modules according to real-world needs and construct a rich variety of database products.

The storage engine is a very important component among these modules. Let’s explain its position and significance in the overall database architecture.

Positioning of the storage engine #

There is no absolute rule for database design. Each database evolved into its current form based on the problems it needed to solve, among other factors. Therefore, there is no widely accepted standard for dividing a database into sub-modules, and the boundaries between some modules are blurry. In particular, when optimizing database performance, modules that were originally designed to be independent may be merged to improve overall performance.

Here, I have summarized a typical architecture and module breakdown for distributed databases. Although it may not represent every distributed database, it can help you understand how the modules fit together. Note that the model I describe is based on the client/server (C/S) model, the architecture pattern of most distributed databases.

  1. Transport layer: receives client requests and handles network protocols. In a distributed database, it also handles communication between nodes.
  2. Query layer: requests arrive here from the transport layer. The query layer parses the protocol (SQL parsing, for example), then validates and analyzes the request, and finally applies access control to decide whether the request should be executed. After parsing, the request is sent to the query optimizer, which generates an execution plan based on predefined rules, the data distribution, and the database's internal statistics. The execution plan is generally a tree structure containing a series of related operations that retrieve the desired data from the database.
  3. Execution layer: runs the execution plan. It generally includes local execution units and remote execution units; different units are called according to the plan, and their results are merged and returned to the transport layer. (A minimal sketch of this flow follows the list.)
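To make the three-layer division concrete, here is a minimal sketch of a request passing through it. Everything in it (the plan shape, the `LocalExecutor` and `RemoteExecutor` classes) is a hypothetical illustration, not the API of any real database.

```python
# Hypothetical three-layer request flow; names are illustrative only.

class LocalExecutor:
    def run(self, op):
        return f"local result of {op}"

class RemoteExecutor:
    def run(self, op):
        return f"remote result of {op}"

def handle_request(raw_request):
    # Transport layer: decode the wire protocol into a query string.
    query = raw_request.decode("utf-8")

    # Query layer: parse and validate the query, then produce an
    # execution plan (a flat list here; real plans are trees).
    plan = [("scan", query), ("aggregate", query)]

    # Execution layer: dispatch each operation to a local or remote
    # unit, then merge the results for the transport layer to return.
    local, remote = LocalExecutor(), RemoteExecutor()
    results = []
    for op, _ in plan:
        unit = local if op == "scan" else remote
        results.append(unit.run(op))
    return results

print(handle_request(b"SELECT avg(age) FROM users"))
```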

You may have noticed that there is only a query layer mentioned here. So how is data written? The answer varies for different databases. Some databases handle it in the transport layer without additional processing since the protocol is simple and the request can be sent directly to the execution layer. Other databases have more complex write operations that are handled by the query layer.

That’s the common way modules are divided in the database field. You may have a question: Where does the storage engine fit in?

The local execution unit in the execution layer is actually the storage engine. It generally includes the following functions (a simplified sketch of how they fit together follows the list):

  1. Transaction manager: schedules transactions and ensures the internal consistency of the database (this is different from the distributed consistency discussed in Module 1);
  2. Lock manager: ensures consistency when shared objects are operated on concurrently, whether by transactions or by modifications to database parameters;
  3. Storage structure: the physical storage layers, describing how data and indexes are organized on disk;
  4. Memory structure: mainly caching and buffer management; data is usually buffered in memory and written to disk in batches;
  5. Commit log: when the database crashes, the commit log can be used to restore the system to a consistent state.
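Here is a simplified sketch of how these components might cooperate on the write path. Every class and method in it is hypothetical, meant only to show the division of labor (the transaction manager is omitted for brevity).

```python
class CommitLog:
    """Append-only log; a real engine would persist each entry."""
    def __init__(self):
        self.entries = []

    def append(self, record):
        self.entries.append(record)

class LockManager:
    """Guards shared objects so concurrent operations stay consistent."""
    def __init__(self):
        self.locks = set()

    def acquire(self, key):
        if key in self.locks:
            raise RuntimeError(f"{key} is already locked")
        self.locks.add(key)

    def release(self, key):
        self.locks.discard(key)

class StorageEngine:
    def __init__(self):
        self.log = CommitLog()      # commit log: replayed after a crash
        self.locks = LockManager()  # lock manager
        self.memtable = {}          # memory structure: buffers writes
        self.disk = {}              # stand-in for the on-disk structure

    def write(self, key, value):
        self.locks.acquire(key)
        try:
            self.log.append((key, value))  # log first, then apply
            self.memtable[key] = value     # buffered, flushed in batches
        finally:
            self.locks.release(key)

    def flush(self):
        self.disk.update(self.memtable)    # batch write to "disk"
        self.memtable.clear()

engine = StorageEngine()
engine.write("k1", "v1")
engine.flush()
print(engine.disk)   # {'k1': 'v1'}
```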

The above are some of the important functions of a storage engine, whose core purpose is to provide data read and write capabilities. Accordingly, a storage engine's design usually describes its write path and its read path.

Now that you understand the role and main structure of a storage engine, let's turn to the many kinds of storage engines that exist. Below, I will introduce several typical ones according to a few key characteristics.

Memory and Disk #

The most important parts of a storage engine are its disk and memory structures. Based on which of the two primarily holds the data, databases can be classified into memory-based and disk-based databases. That the storage engine serves as a basis for categorizing entire databases underscores its importance.

Memory-based storage keeps data primarily in memory, with the clear goal of speeding up reads and writes. A significant category of distributed databases is memory-based, including Redis, NuoDB, MySQL Cluster, and others. The disadvantage is that memory is expensive and limited in capacity. A distributed architecture can effectively expand the capacity of this type of database, which is why in-memory databases are predominantly distributed.

Disk-based storage, on the other hand, is more traditional. The disk holds the main data, while memory serves primarily as a buffer for batching writes. The advantage of disk storage is cost-effectiveness: the price per unit of storage for disk, let alone tape, is far lower than for memory. The trade-off is lower performance than memory-based databases, although with the growing adoption of SSDs in recent years, this gap has narrowed considerably.

The difference between these two kinds of engines also shows up in implementation difficulty. Memory-based databases are relatively simple to build, because allocating and freeing memory is straightforward. Disk-based databases must deal with complex concerns such as data references, file serialization, and fragmentation management, which makes them more challenging to implement.

Looking at the current state of distributed databases, disk-based storage engines still dominate. Beyond cost-effectiveness, guaranteeing data integrity in a memory-based database is expensive, because a power outage typically loses whatever is in memory. Uninterruptible power supplies can improve stability, but they add operational complexity.

In recent years, however, technologies such as NVM (Non-Volatile Memory) have begun to change this situation. This type of storage offers DRAM-like performance while preserving data across power outages, and its read and write model resembles ordinary memory, making applications easier to build. With such hardware support, memory-based databases should see significant development in the future.

Beyond hardware support, memory-based databases can also ensure data integrity through structural design. The most common method combines data backups with a commit log. To minimize the impact on read and write performance, the database backs up data asynchronously. At the same time, every write must first go to the commit log: a write is considered complete only after the commit log entry has been written successfully.

When a database node crashes and recovers, the backup is retrieved, and the differences between the backup and the latest log are calculated. Then, these operations are replayed on the backup to ensure that the database is recovered with the most up-to-date data.
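The following toy sketch shows this log-then-apply write rule and the replay-based recovery just described. The function names and in-memory structures are illustrative assumptions, not a real database's interfaces.

```python
def write(data, log, key, value):
    log.append((key, value))  # commit log written first ...
    data[key] = value         # ... only then does the write count

def recover(backup, log, backup_log_position):
    # Start from the backup, then replay every log entry recorded
    # after the backup was taken.
    data = dict(backup)
    for key, value in log[backup_log_position:]:
        data[key] = value
    return data

data, log = {}, []
write(data, log, "a", 1)
backup, backup_pos = dict(data), len(log)  # asynchronous backup point
write(data, log, "b", 2)                   # write made after the backup
print(recover(backup, log, backup_pos))    # {'a': 1, 'b': 2}
```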

In addition to the choice between memory and disk, storage engines also consider the data organization mode. Now let’s take a look at two common organization modes: row-oriented and column-oriented.

Row-based storage and column-based storage #

Data is generally stored in databases in the form of tables, so all data has the concept of rows and columns. However, this is merely a logical concept; the "row-based" and "column-based" storage we are about to introduce are physical concepts.

Row-based storage stores all columns of each row together to form a data file. This data organization is reasonable and efficient when it is necessary to read the entire row of data. However, if a specific column needs to be read from multiple rows, this mode becomes expensive because unnecessary data will also be read.

In contrast, column-based storage places the same column from different rows together in a data file. Besides the data itself, it must also record which row each value belongs to. Row-based storage needs no such extra information: because the column order within each row is fixed, columns are implicitly associated with their values.
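A tiny sketch of the two physical layouts for the same logical table may help; the structures below are purely illustrative.

```python
rows = [
    {"id": 1, "name": "alice", "age": 30},
    {"id": 2, "name": "bob",   "age": 25},
]

# Row-based: all columns of a row stored together, row after row.
row_file = [(r["id"], r["name"], r["age"]) for r in rows]

# Column-based: the same column from different rows stored together;
# each value is tagged with the row it belongs to.
column_file = {
    col: [(r["id"], r[col]) for r in rows]
    for col in ("name", "age")
}

print(row_file)     # [(1, 'alice', 30), (2, 'bob', 25)]
print(column_file)  # {'name': [(1, 'alice'), (2, 'bob')], 'age': ...}
```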

Column-based storage is very well suited to analytical and aggregation tasks, such as computing trends or averages. These tasks generally load a column's values across all rows and never touch irrelevant columns, so the columnar layout delivers better performance.

We find that OLTP databases tend toward row-based storage, while OLAP databases tend toward column-based storage; it is the physical characteristics of the two modes that drive this tendency. HTAP databases fuse the two storage modes.

Of course, we need to distinguish column-based storage from the wide-column storage found in HBase and BigTable. In wide-column storage, columns are first grouped into column families, which are stored in separate files; the data within each column family is actually organized by rows.

Beyond the distinctions above, the choice between row-based and column-based storage involves other considerations. Modern CPUs offer vector instruction sets that process many values of the same type at once, which matches column-based storage exactly. Moreover, storing values of the same type together lets compression algorithms greatly reduce disk usage.
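As one concrete illustration of the compression point, a column of like values can be run-length encoded. This is a toy codec, not any particular database's implementation.

```python
def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1][1] += 1
        else:
            encoded.append([v, 1])
    return [tuple(pair) for pair in encoded]

# A column of one type compresses far better than interleaved rows.
country_column = ["US", "US", "US", "CN", "CN", "DE"]
print(run_length_encode(country_column))  # [('US', 3), ('CN', 2), ('DE', 1)]
```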

Of course, the most important factor in choosing between these two storage modes is the access pattern. If the data is primarily accessed row by row, such as in transaction scenarios or data management scenarios, row-based storage should be preferred. If frequent queries of all data for aggregation or range scanning are required, then column-based storage is worth considering.

The above are common data organization modes. So how are well-organized data stored on physical devices? Let’s discuss the two common physical components used to store data: data files and index files.

Data files and index files #

As mentioned earlier in the memory-versus-disk trade-off, the disk is in a sense the more fundamental medium, because a database must store data persistently. Therefore, let's start by introducing the two most important types of files on disk: data files and index files.

As the names suggest, data files and index files respectively store raw data and index data used for retrieval.

Over time, however, the distinction between the two has blurred. With the introduction of index-organized tables (IOT) and similar data organization patterns, especially in distributed databases, the trend toward merging data files and index files has become unstoppable.

The most traditional form of data file is the heap-organized table, where data is placed without any particular order, generally in write order. This kind of data file requires additional indexes to locate data. There are also two data file forms with built-in indexing capabilities: the hash-organized table and the index-organized table. The former disperses data into a set of buckets via a hash function, and the data within each bucket is generally sorted by some rule to improve query efficiency. The latter stores the data directly in the index structure; taking a B+ tree as an example, the data lives in the leaf nodes. This reduces the number of disk reads needed to retrieve data and is friendly to range scans.
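The toy sketch below contrasts the three organizations. It is deliberately simplified: real engines work with on-disk pages, and an index-organized table would use a B+ tree rather than a sorted Python list.

```python
import bisect

records = [(42, "x"), (7, "y"), (19, "z")]

# Heap-organized: records kept in write order; lookups need a full
# scan unless an external index exists.
heap = list(records)

# Hash-organized: a hash function assigns each record to a bucket,
# and records inside a bucket are kept sorted for faster lookups.
buckets = [[] for _ in range(4)]
for key, value in records:
    bisect.insort(buckets[hash(key) % 4], (key, value))

# Index-organized: all records globally sorted by key (leaf nodes of
# a B+ tree in practice), so range scans read contiguous entries.
iot = sorted(records)
keys = [k for k, _ in iot]
lo, hi = bisect.bisect_left(keys, 10), bisect.bisect_right(keys, 40)
print(iot[lo:hi])   # [(19, 'z')] -- every record with key in [10, 40]
```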

Index files are generally classified into primary key indexes and secondary indexes. The former is built on the primary key, which may consist of one or more fields; all other indexes are called secondary indexes. A primary key index has a one-to-one relationship with the data, while secondary indexes may have a one-to-many relationship, where multiple index entries point to the same piece of data.

Based on how closely index and data are integrated, we can further divide indexes into clustered and non-clustered indexes. In the former, as in hash-organized and index-organized tables, the placement of the data follows the placement of the index: they are "clustered" together, which improves query efficiency. The latter is most commonly seen in secondary indexes on these two kinds of data files; because the indexed columns are not the primary key, the index is separate from the data, and queries require extra disk reads. For writes, however, clustered indexes may need to perform uniqueness checks, which can make them slower than simply maintaining non-clustered indexes.

One final point: a secondary index must store a "reference" to the actual data. In implementation terms, this reference can be either the data's physical location or its primary key. The former offers higher query efficiency, but writes must update every such index, so write performance is relatively lower. The latter is the opposite: queries must go through the primary key index, making them slightly slower, but write performance is stable. MySQL (InnoDB), for example, uses the latter mode.
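Here is a small sketch of the two reference styles; the table, index dictionaries, and lookup steps are all illustrative assumptions.

```python
table = [  # list position stands in for the row's physical location
    {"id": 10, "name": "alice"},
    {"id": 20, "name": "bob"},
]

primary_index = {row["id"]: pos for pos, row in enumerate(table)}

# Style 1: the secondary index stores the physical position directly.
# Queries take one hop, but if a row moves, every such index must be
# updated.
by_position = {row["name"]: pos for pos, row in enumerate(table)}
print(table[by_position["bob"]])           # one lookup

# Style 2 (the MySQL/InnoDB approach): the secondary index stores the
# primary key, so queries take an extra hop through the primary index,
# but the entry stays valid when the row moves on disk.
by_primary_key = {row["name"]: row["id"] for row in table}
print(table[primary_index[by_primary_key["bob"]]])  # two lookups
```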

Characteristics of Distributed Storage Engines #

The above content covers some core aspects of storage engines. So, how do distributed databases differ from traditional single-node databases in terms of their storage engine architecture? I have summarized a few key points:

In-memory databases tend to be built as distributed systems. The reason is clear: compared to disk, a single machine's memory capacity is quite limited, so a distributed architecture is necessary to meet a business's capacity requirements.

Columnar storage is also naturally tied to distributed databases. If you look into it, you will find that many open-source columnar storage projects are connected to platforms like Hadoop. This is because a very important use case for OLAP analytical databases is analyzing the entire dataset.

Columnar storage can be seen as an optimization for this use case, and realizing it at scale requires a distributed system, because a single machine's processing capacity has limits. To process extremely large volumes of data, the data must be spread across multiple nodes. Columnar storage therefore grew popular alongside the need for optimized analytics in distributed environments.

Wide column storage, in particular, is a storage mode used only in distributed database scenarios.

In terms of data file organization, distributed databases almost never use heap-organized tables, because their layout is too arbitrary to distribute data effectively. When you studied the lecture on data sharding, did you notice the natural correspondence between the names of the other two table organizations and the two sharding algorithms?

Data in hash-organized tables is scattered into buckets by a hash function, and those buckets can be placed on different nodes. In index-organized tables, the leaf nodes are kept in sorted order, which corresponds naturally to range-based sharding. Distributed databases therefore generally build their storage engines on these two modes, and some use the data files directly as indexes. (A toy sketch of this correspondence follows.)
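In the sketch below, hash-organized data shards by hashing the key, while index-organized (sorted) data shards by key range. The node names and range boundaries are made up for illustration.

```python
NODES = ["node-a", "node-b", "node-c"]

def hash_shard(key):
    # Hash-organized data: the bucket, and hence the node, comes
    # straight from a hash of the key.
    return NODES[hash(key) % len(NODES)]

RANGES = [(0, 100, "node-a"), (100, 200, "node-b"), (200, 300, "node-c")]

def range_shard(key):
    # Index-organized data: sorted keys split naturally into
    # contiguous ranges, one per node.
    for low, high, node in RANGES:
        if low <= key < high:
            return node
    raise KeyError(key)

print(hash_shard("user:42"))  # some node, stable within one run
print(range_shard(150))       # node-b
```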

Conclusion #

Alright, I have covered the topic of storage engines. In this lecture, we first introduced the overall architecture of databases and pointed out the position of storage engines. We then compared several sets of concepts within the storage engines and explained the choices and reasons for distributed databases at the engine level.

Of course, this lecture is only an overview. I will provide you with more detailed explanations of other important concepts within storage engines in the upcoming lectures in this module.