10 Sparse Indexes: Why Relational Databases Are Not Recommended for High-Concurrency Writes #

Hello, I am Xu Changlong.

Starting from this chapter, we will learn how to optimize write-heavy, read-light systems. When we talk about high-concurrency writes, we cannot avoid the new generation of distributed HTAP databases, which combine OLAP and OLTP and can provide data analysis and relational queries at the same time.

In fact, the OLAP side of HTAP is not "big data" in the sense we usually mean, that is, offline analysis and computation over several terabytes of logs. It refers more to the last step of data mining: serving the results of that mining to external queries.

Within this scope, well-known real-time statistics and analysis services in the industry include Elasticsearch and ClickHouse. Although their QPS is not high, they can make full use of system resources to filter, aggregate, and query large amounts of data. So why are relational databases like MySQL not suited to similar tasks? Let's analyze this together in this lesson.

B+Tree Index and Data Volume #

We are already familiar with MySQL, and we often use it for business data storage and querying purposes. You may have heard the phrase “don’t exceed 20 million rows in a table.” Why is this?

The core issue lies in how MySQL's indexing works, which conflicts with our requirements. Most of our user-facing services need to respond in real time: they must handle high-concurrency queries and return data to users within a second. That places strict demands on both row size and total data volume.

MySQL achieves this by using indexes to narrow down the range of data to be scanned and then traversing and filtering the data within that range in the table to obtain the required business data.

In fact, it’s not that MySQL can’t store more data; it’s mostly a limitation of query efficiency.

So, what are the factors that limit query efficiency in MySQL? Let's start with the structure of InnoDB's index.


As we all know, MySQL's InnoDB storage engine uses B+Tree indexes. The characteristic of a secondary B+Tree index is that its lowest level stores only the ID of the actual row (its primary key), and the full row content is then retrieved through that ID. Meanwhile, the entries at the lowest level of the index are stored in the order of the indexed fields.

Through this design, we only need to perform 1-3 I/O operations (the tree depth determines the number of I/O operations) to find the sorted data within the specified range. The efficiency of tree-based indexes is mainly influenced by the depth of the tree and the volume of data (the more unique the data, the smaller the filtered range).

Data volume is easy to understand. As long as our indexed fields are unique enough, the amount of data filtered will be controllable.

But what determines the number of levels in the index tree? MySQL organizes indexes in pages, and each page holds at most 16KB of data (innodb_page_size). If each row takes about 1KB, then, excluding the space taken by the page's fixed structures, only about 16 rows fit on one leaf page. When the branches of the tree can no longer accommodate more data, another level has to be added to the index depth.

From this notion of a page, we can deduce that a single-page tree holds about 16 rows, a two-level tree holds roughly 20 thousand rows, and a three-level tree holds roughly 20 million rows. Therefore, for a three-level B+Tree, a primary-key lookup costs about 3 page I/O operations: one for the root index page (usually already cached in memory), one for the intermediate index page, and one for the leaf page, which in the clustered index already contains the row data.

However, the 20 million limit is not absolute. If each row of data is 0.5KB, the tree only grows to a fourth level after around 40 million rows. For secondary indexes, each page can hold about 1170 index entries (an 8-byte bigint primary key plus a 6-byte pointer), so a three-level secondary index can cover roughly a billion index records.
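To make the arithmetic above concrete, here is a quick back-of-the-envelope calculation in Python. It is only a rough sketch: the 16KB page, the 1KB row, and the 14-byte index entry (8-byte bigint key plus 6-byte pointer) are the same assumptions used in the text, and real pages lose some capacity to headers and fill factor.

```python
# Rough B+Tree capacity estimate, using the assumptions above.
PAGE_SIZE = 16 * 1024        # innodb_page_size: 16KB per page
ENTRY_SIZE = 8 + 6           # bigint primary key (8B) + page pointer (6B)
ROW_SIZE = 1024              # assumed average row size of 1KB

fanout = PAGE_SIZE // ENTRY_SIZE        # ~1170 entries per internal page
rows_per_leaf = PAGE_SIZE // ROW_SIZE   # ~16 rows per leaf page

# Clustered index: leaf pages hold the full rows.
print(rows_per_leaf)                    # 16      -> a one-page "tree"
print(fanout * rows_per_leaf)           # ~18,720 -> two levels: about 20 thousand rows
print(fanout ** 2 * rows_per_leaf)      # ~21.9M  -> three levels: the "20 million rows" rule of thumb

# Secondary index: leaf pages hold (key, primary key) entries, not full rows.
print(fanout ** 3)                      # ~1.6 billion -> roughly "a billion index records"
```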

As we can see, once the data volume pushes the tree beyond three levels, each operation needs more I/O and queries become slower. That is why many internet systems regularly archive and reorganize their data to keep their services efficient.


With the above explanation, I believe you now have an intuitive picture of the whole query process. When we run a query, we spend 1-3 I/O operations walking the secondary index and obtain a batch of primary key IDs. Then, using MySQL's MRR (Multi-Range Read) optimization, these IDs are sorted before we go back to the table, pulling the business data from the leaf nodes of the clustered index within the given range, processing it row by row or in batches, and returning the result.
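The flow can be pictured with a few lines of Python. This is purely an illustration, not MySQL internals: secondary_index and clustered_index below are hypothetical in-memory stand-ins for the two B+Trees, and the sort step mimics what MRR does before the table lookups.

```python
import bisect

# Hypothetical stand-ins for the two B+Trees:
# secondary index: sorted list of (indexed_value, primary_key)
# clustered index: dict of primary_key -> full row
secondary_index = [(10, 104), (12, 101), (15, 109), (18, 102), (20, 107)]
clustered_index = {101: "row-101", 102: "row-102", 104: "row-104",
                   107: "row-107", 109: "row-109"}

def range_query(lo, hi):
    # 1. Walk the secondary index to collect the primary keys in the value range.
    left = bisect.bisect_left(secondary_index, (lo, float("-inf")))
    right = bisect.bisect_right(secondary_index, (hi, float("inf")))
    primary_keys = [pk for _, pk in secondary_index[left:right]]

    # 2. Sort the primary keys (the MRR idea) so the table lookups
    #    touch the clustered index pages roughly sequentially.
    primary_keys.sort()

    # 3. Go back to the table: fetch the rows from the clustered index.
    return [clustered_index[pk] for pk in primary_keys]

print(range_query(12, 18))  # ['row-101', 'row-102', 'row-109']
```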

As you can see, the speed of the databases we commonly use comes down to how we use indexes. The index alone is usually not enough: we still have to fetch the actual rows for further processing to get the data the business needs. That is why both field length and data volume directly affect the responsiveness of our external services.

Please note that we cannot put too many indexes on a table, because too many indexes hurt insert performance. In addition, queries must follow the leftmost-prefix principle to progressively narrow the range of data to scan; the engine does not use multiple CPUs to search the index in parallel. These restrictions greatly limit our ability to handle large amounts of data.
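The leftmost-prefix rule follows from how a composite index is ordered. Here is a toy sketch, under the assumption that a composite index on (city, age) can be pictured as a sorted list of tuples:

```python
import bisect

# Picture a composite index on (city, age): entries are sorted by city first, then age.
index = sorted([("beijing", 25), ("beijing", 31), ("shanghai", 22),
                ("shanghai", 28), ("shenzhen", 35)])

# Query on the leftmost column (city = 'shanghai'): the matches are contiguous,
# so a binary search narrows the range cheaply.
lo = bisect.bisect_left(index, ("shanghai", float("-inf")))
hi = bisect.bisect_right(index, ("shanghai", float("inf")))
print(index[lo:hi])   # [('shanghai', 22), ('shanghai', 28)]

# Query on the second column alone (age = 28): matching entries are scattered
# throughout the ordering, so the whole index has to be scanned.
print([e for e in index if e[1] == 28])
```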

Furthermore, continuous high-concurrency insertion can cause problems such as abnormal behavior of the MySQL cluster, slow responses from the primary database, and growing master-slave replication lag. In terms of deployment structure, MySQL only has master-slave mode, so all large-scale writes must go through the single primary. When inserts slow down, clients can only wait for the server to respond, which severely hurts insertion efficiency.

After reading this, I hope you understand why relational databases are not suitable for very large data volumes. In fact, OLAP databases are not automatically suitable for them either: as I mentioned before, many OLAP services also need to respond in real time, so most of the time the data they compute over has already been deeply processed. Even so, there are still many differences in the underlying implementations of OLAP and OLTP.

Let's first analyze the differences in indexing. OLTP commonly uses the B+Tree. As we know, a B+Tree index is one single, complete tree; when the data volume is large, the tree grows deeper, and excessive depth seriously hurts its efficiency. So what kind of index do OLAP services use to handle large amounts of data?

The LSM Tree and Sparse-Index Storage #

Here, I will focus on the LSM index. I first encountered the LSM Tree through RocksDB (and LevelDB). RocksDB gained popularity quickly because it exploits the excellent performance of sequential disk writes and provides write-heavy, read-light KV storage and query services at a small performance cost, which is very different from how relational databases store data.

To understand this better, let's look at how RocksDB implements its sparse index.

As we mentioned earlier, a B+Tree is one large tree, a single aggregated whole: any insert, delete, or update happens within that whole, which produces a large number of random read and write I/O operations.

RocksDB's LSM Tree is different. It consists of many small trees. Newly written data is first held in memory, which allows very high write concurrency. When the data accumulated in memory reaches a threshold, the in-memory data and its index are written out to disk sequentially as a data block.

Each such data block contains a small tree plus the actual data. Newly generated data blocks are saved in Level 0 (there can be several levels, and the number is configurable), and Level 0 can contain multiple similar data block files.
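A minimal sketch of this write path, with nothing RocksDB-specific in it: the memtable, the flush threshold, and the level-0 block list below are simplified stand-ins, and the 4-entry limit exists only to keep the example small.

```python
import json

MEMTABLE_LIMIT = 4            # flush threshold; real engines use a size in MB
memtable = {}                 # in-memory writes, kept as a simple dict
level0 = []                   # flushed data blocks, newest first

def put(key, value):
    memtable[key] = value                     # writes land in memory first
    if len(memtable) >= MEMTABLE_LIMIT:
        flush()

def flush():
    # Sort the in-memory data and write it out sequentially as one block.
    block = sorted(memtable.items())
    level0.insert(0, block)                   # newest block sits at the front
    memtable.clear()

for i in range(10):
    put(f"user:{i}", json.dumps({"seq": i}))

print(len(level0), "blocks in level 0;", len(memtable), "keys still in memory")
```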


When the number and size of data blocks in each level reach a certain threshold, RocksDB merges data from different levels, combines the data and index from multiple data blocks, and pushes them to the next level. This way, the number of data blocks and their size in each level can be maintained within a certain range, and the merged data will be more compact and easier to locate.
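Compaction is essentially a merge of sorted blocks in which the newer value for each key wins. Here is a hedged sketch that continues the toy structures above, rather than RocksDB's real leveled compaction:

```python
def compact(blocks):
    """Merge several sorted (key, value) blocks into one; newer blocks win."""
    merged = {}
    # Iterate from oldest to newest so that newer values overwrite older ones.
    for block in reversed(blocks):
        for key, value in block:
            merged[key] = value
    return sorted(merged.items())

older = [("a", 1), ("b", 2), ("c", 3)]
newer = [("b", 20), ("d", 4)]
print(compact([newer, older]))  # [('a', 1), ('b', 20), ('c', 3), ('d', 4)]
```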

This design allows a key to exist in multiple levels or data blocks, but the most recent and frequently accessed data is always in the top-most level or in memory (levels 0-4, with 0 being the top-most).


When we query a key, RocksDB will first search in memory. If it is not found, it will search through each level from Level 0 to lower levels, in the order of the newest to the oldest data blocks. To minimize the number of I/O operations, each data block has a Bloom Filter as an auxiliary index to assist in determining whether the corresponding key might exist in this data block. If the key is not found in the current data block, the search can quickly move on to the next data block until the key is found. Of course, the worst-case scenario is traversing all data blocks.
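The read path can be sketched the same way. The might_contain check below stands in for the per-block Bloom filter: a real filter answers "maybe present" or "definitely absent", and here a plain set of keys plays that role.

```python
def might_contain(block_keys, key):
    # Stand-in for a Bloom filter: may return false positives, never false negatives.
    return key in block_keys

def get(key, memtable, levels):
    if key in memtable:                       # 1. newest data lives in memory
        return memtable[key]
    for level in levels:                      # 2. then level 0, 1, 2, ... in order
        for block in level:                   #    newest block first within a level
            keys = {k for k, _ in block}
            if not might_contain(keys, key):  # 3. the filter skips most blocks cheaply
                continue
            for k, v in block:                # 4. only now read the block itself
                if k == key:
                    return v
    return None                               # worst case: every block was checked

levels = [[[("a", 1), ("b", 20)]], [[("b", 2), ("c", 3)]]]
print(get("b", {}, levels))  # 20, found in the upper (newer) level first
```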

As we can see, this approach sacrifices the unity of the overall index but gains much higher write performance: reads may need to traverse several small trees, while writes avoid the cost of constantly merging everything into one big tree.

This LSM style of storage is common in OLAP databases, because OLAP workloads are mostly write-heavy and read-light. When OLAP provides data services externally, a cache is often placed in front to help the database withstand the heavier read pressure.

Columnar Storage Database #

At this point, we have to mention another difference between OLAP and OLTP databases. The relational databases we commonly use are row-oriented: each row is stored with its columns laid out in the order defined by the table structure. The databases commonly used for big data mining, by contrast, use columnar storage. The reason is that most of the data in a relational database describes entity attributes and entity relationships, and many queries need every column of a row.


However, real-time data analysis is the opposite. In many cases a single row represents a user or a main entity (an aggregate root), and the columns record information such as whether that user has made a certain purchase, used a certain app, visited a certain place, owns a certain car, ordered a certain dish, or comes from a certain region.

This way of organizing data is convenient for data mining and analysis, but it can also produce tables with hundreds or even thousands of fields. With a row-based storage engine, we would have to read the data row by row, wasting a great deal of I/O.

A columnar storage engine, by contrast, lets us read only the fields we specify and makes full use of sequential disk I/O. This greatly improves performance for column-oriented queries and also allows better data compression. In the field of real-time statistical analysis and computation, columnar storage performs better.
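A toy comparison of the two layouts, purely for illustration: the "table" here is a list of Python dicts, and the point is only how much data each layout has to touch to aggregate one column out of many.

```python
# The same table in two layouts.
rows = [{"user_id": i, "age": 20 + i % 30, "city": "bj", "spend": i * 1.5,
         # ... imagine hundreds more columns here ...
         } for i in range(1000)]

columns = {name: [row[name] for row in rows] for name in rows[0]}

# Row store: to average one column we still walk every full row.
row_avg = sum(row["spend"] for row in rows) / len(rows)

# Column store: we read just the one column, laid out contiguously,
# which means far less I/O, and homogeneous values compress much better.
col_avg = sum(columns["spend"]) / len(columns["spend"])

assert row_avg == col_avg
print(col_avg)
```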

By now, I believe you have noticed that different usage scenarios require different implementations of data storage in order to achieve better performance and cost-effectiveness. As the industry becomes more mature, these needs and characteristics will continue to be explored, summarized, and integrated into our underlying services, gradually reducing our work difficulty and workload.

HTAP #

As explained earlier, OLAP and OLTP databases have their own characteristics and different development directions. In fact, both provide data query services that are expected to be real-time and fast; the difference lies in how they store data and organize their indexes.

In recent years, it has become popular to combine OLAP and OLTP services into one database cluster, so that complementary row-based and column-based engines can serve the same data without interfering with each other.

In 2022, domestic databases such as OceanBase and the cloud vendors' PolarDB began actively supporting HTAP. This lets us store one copy of the data and have different engines kick in depending on the scope of each query, jointly providing data services externally.

It is foreseeable that one day our databases will be able to perform real-time analysis and serve business data quickly at the same time, with multiple sets of storage and index structures at the underlying data layer gradually making this easier to achieve.

Currently, the common way to implement HTAP is to keep the same data in multiple storage formats (row store and column store) within one service cluster, providing different indexes for OLAP and OLTP needs. At query time, users can either specify an engine or let the database's query engine choose one automatically based on the SQL and the data involved.
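As a rough illustration of that idea, here is an entirely hypothetical router: real HTAP engines make this decision inside the optimizer with cost estimates, not keyword checks.

```python
def choose_engine(sql: str) -> str:
    """Toy router: point lookups go to the row engine, wide aggregations go to
    the column engine. Real optimizers use cost models, not keyword matching."""
    analytical = any(kw in sql.upper() for kw in ("GROUP BY", "SUM(", "AVG(", "COUNT("))
    return "column store (OLAP)" if analytical else "row store (OLTP)"

print(choose_engine("SELECT * FROM orders WHERE id = 42"))
print(choose_engine("SELECT city, SUM(amount) FROM orders GROUP BY city"))
```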

Summary #

In this lesson, we discussed the differences in indexing, storage, data volume, and application scenarios between OLAP and OLTP databases.

Compared to relational databases, OLAP databases store far larger volumes of data and support batch writes well: high-concurrency bulk ingestion is common, and their tables have many more fields. The data is mostly stored in columnar format and indexed by column. Through these means, they can serve real-time queries and analysis over the results of large-scale data computation.

Compared to offline computing, this approach is faster and more convenient; its main drawback is that such services need to be deployed across multiple servers, which is costly.

As you can see, the usage scenario determines how the underlying data can be handled most efficiently, and the emergence of HTAP gives us more choices in different scenarios. After all, big data mining requires a huge data management system, and having a lightweight OLAP system gives our business more possibilities.

Thought question #

Finally, please consider: why can columnar storage databases improve OLAP query performance?

Feel free to discuss and communicate with me in the comment section. See you in the next class!