04 Deep Dive into Indexes Part One #

Database indexing is probably familiar to you, and you likely encounter it often in your daily work. For example, when a certain SQL query is slow, you might analyze the cause and suggest a fix such as "add an index on a certain field". But what exactly is an index, and how does it work? Today, let's discuss this topic together.

There is a lot of content about database indexing, so I have divided it into two articles. Indexing is one of the most important concepts in a database system, so I hope you can read it patiently. In the practical articles that follow, I will also frequently refer to the knowledge mentioned in these two articles to deepen your understanding of database indexing.

In simple terms, the purpose of an index is to improve the efficiency of data queries, just like the table of contents in a book. If you have a 500-page book and want to quickly find a specific piece of knowledge without relying on the table of contents, it would probably take you a while. Similarly, for a table in a database, an index is essentially its “table of contents”.

Common Indexing Models #

The purpose of an index is to improve query efficiency, but there are many ways to implement one, which brings us to the concept of the indexing model. Many data structures can be used to improve read and write efficiency, so let me first introduce three common and relatively simple ones: the hash table, the ordered array, and the search tree.

Here, I will mainly analyze the differences between these three models from a usage perspective.

A hash table is a data structure that stores data as key-value pairs: given the key you want to look up, it returns the corresponding value. The idea of hashing is simple: keep the values in an array, use a hash function to convert the key into a position in that array, and place the value at that position.

Inevitably, the hash function will map multiple keys to the same position. One common way to handle these collisions is chaining: each array slot holds a linked list of the entries that hash to it.

For example, suppose you are maintaining a table of ID card information and corresponding names, and you need to find the name corresponding to a certain ID card number. The diagram below shows the illustration of the corresponding hash index:

Figure 1 Hash index illustration

In the diagram, User2 and User4 both hash to the same value N based on their ID card numbers, but this is not a problem because a linked list hangs off that position. To find the name corresponding to ID_card_n2, the steps are: first compute N from ID_card_n2 using the hash function, then traverse the linked list in order until you reach User2.

Please note that the four ID_card_n values in the diagram are not stored in increasing order. The advantage of this approach is that adding a new user is fast: just append it to the end of the list. The disadvantage is that, because nothing is sorted, range queries on a hash index are slow.

You can imagine that if you now want to find all users whose ID card numbers are in the range [ID_card_X, ID_card_Y], you have to scan all of them.

Therefore, the hash table structure is suitable for scenarios with only equality queries, such as Memcached and some other NoSQL engines.
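MySQL itself offers a concrete glimpse of this trade-off: the MEMORY engine supports true hash indexes. Below is a minimal sketch, not from the original text; the table name, column names, and sample values are my own, and the range query is included only to show that the hash index cannot serve it:

mysql> create table id_names(
id_card varchar(18) not null,
name varchar(16) not null,
index using hash (id_card))engine=MEMORY;

-- Equality lookup: the hash index applies.
select name from id_names where id_card = '110101199001010011';
-- Range lookup: the hash index cannot help, so this degenerates to a scan.
select name from id_names where id_card between '110101' and '120101';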

On the other hand, ordered arrays perform very well in both equality and range query scenarios. Taking the example of searching for a name based on an ID card number as mentioned earlier, if we use an ordered array to implement it, the diagram would be as follows:

Figure 2 Ordered array index illustration

Here, let’s assume the ID card numbers are unique, and this array is saved in the order of increasing ID card numbers. Now, if you want to find the name corresponding to ID_card_n2, you can quickly get it using binary search, which has a time complexity of O(log(N)).

It is also obvious that this index structure supports range queries. If you want to find the users whose ID card numbers are in the range [ID_card_X, ID_card_Y], you can first use binary search to find ID_card_X (if ID_card_X does not exist, find the first user greater than ID_card_X), and then traverse to the right until you find the first ID card number greater than ID_card_Y, and then exit the loop.

If we only consider query efficiency, the ordered array is the best data structure. However, it becomes troublesome when you need to update data: inserting a record in the middle forces you to shift every record after it, which is costly.

Therefore, ordered array indexes are only suitable for static storage engines, such as storing population information for a certain city in 2017, which is data that will not be modified.

Binary search trees are also a classic data structure in textbooks. Using the example of searching for a name based on an ID card number as mentioned before, if we use a binary search tree to implement it, the diagram would be as follows:

Figure 3 Binary search tree index illustration

The characteristic of a binary search tree is that each node's left child is less than the node itself, which in turn is less than its right child. So, to find ID_card_n2, you follow the search path in the diagram: UserA -> UserC -> UserF -> User2. The time complexity is O(log(N)). Of course, to keep query complexity at O(log(N)), the tree must stay balanced, and maintaining that balance makes the update complexity O(log(N)) as well.

Trees can be binary or multi-way. In a multi-way tree, each node has multiple children, ordered in increasing value from left to right. The binary tree has the best search efficiency in theory, yet in practice most database storage does not use binary trees. The reason is that an index lives not only in memory: it must also be written to disk.

Imagine a balanced binary tree with 1 million nodes: its height is about 20 (since 2^20 is roughly 1 million). A single query may therefore need to access 20 data blocks. In the era of mechanical hard drives, randomly reading one data block from disk took about 10 ms, so looking up a single row in a 1-million-row table stored as a binary tree could take 20 × 10 ms = 200 ms, which is painfully slow for a query.

In order to minimize disk reads for a query, the query process must access as few data blocks as possible. Therefore, binary trees should not be used, but rather “N-ary” trees. The value of “N” in an “N-ary” tree depends on the size of the data block.

Taking an InnoDB index on an integer field as an example, N is approximately 1200. A tree of height 4 can then store up to 1200 to the power of 3 values, which is about 1.7 billion. Considering that the root of the tree is always in memory, an index on an integer field of a table with 1 billion rows requires at most 3 disk accesses to look up a single value. In fact, the second level of the tree is very likely in memory as well, so the average number of disk accesses is even lower.
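Where does the 1200 come from, and does the arithmetic hold? The estimate below is my own back-of-the-envelope sketch, assuming InnoDB's default 16 KB page and roughly 14 bytes per non-leaf entry (an 8-byte bigint key plus a 6-byte child-page pointer); both statements run in any MySQL client:

select 16*1024/14 as approx_fanout;         -- about 1170, i.e. "roughly 1200"
select pow(1200, 3) as values_at_height_4;  -- 1728000000, about 1.7 billion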

Because N-ary trees perform well for both reads and writes and fit disk access patterns, they are widely used in database engines.

Whether it is the hash table, the ordered array, or the N-ary tree, each of these structures is the product of continual iteration and optimization. As database technology has developed, data structures such as skip lists and LSM trees have also found their way into engine designs, but I won't go into detail here.

What you should take away is that these data structures form the core of database storage. When you encounter a new database, look at its data model first, so that you can analyze, in theory, which scenarios it suits.

Up to this point, I have spent half the article introducing different data structures and their applicable scenarios. You may find this a bit tedious, but I suggest you spend the time to understand it: it is one of the core concepts behind how databases manipulate data, and it comes up constantly when analyzing problems. Once you understand the index models, you will have a clearer perspective when analyzing problems, and you will appreciate the ingenuity of engine design.

Now, let’s move on to relatively practical content.

In MySQL, indexes are implemented at the storage engine level, so there is no unified index standard. The working mechanisms of indexes in different storage engines are different. Even if multiple storage engines support the same type of index, their underlying implementations may be different. Since the InnoDB storage engine is the most widely used in MySQL databases, I will use it as an example to analyze its index model.

Index Model in InnoDB #

In InnoDB, tables are stored in primary-key order in the form of an index; a table stored this way is called an index-organized table. And because InnoDB uses the B+ tree index model (a B+ tree is exactly the kind of N-ary tree discussed above), the data itself is stored in B+ trees.

In InnoDB, each index corresponds to a B+ tree.

Suppose we have a table with a primary key column ID and a field k, and there is an index on k.

The creation statement for this table is:

mysql> create table T(
id int primary key, 
k int not null, 
name varchar(16),
index (k))engine=InnoDB;

For rows R1~R5 in the table, the (ID,k) values are (100,1), (200,2), (300,3), (500,5), and (600,6).
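If you want to follow along, here is a minimal sketch (not from the original text) that populates those five rows; the name values are placeholders I made up:

insert into T values
(100, 1, 'name1'),
(200, 2, 'name2'),
(300, 3, 'name3'),
(500, 5, 'name5'),
(600, 6, 'name6');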

Two tree examples are shown below:


Figure 4 InnoDB Index Organization Structure

From the diagram, it is clear that based on the content of the leaf nodes, indexes can be classified into primary key indexes and non-primary key indexes.

The leaf nodes of the primary key index store the entire row of data; in InnoDB, the primary key index is also known as the clustered index. The leaf nodes of a non-primary-key index store the value of the primary key; in InnoDB, a non-primary-key index is also called a secondary index.

Based on the above index structure explanation, let’s discuss a question: What is the difference between querying based on the primary key index and a regular index?

  • If the statement is select * from T where ID=500, which is a query based on the primary key, only the ID B+ tree needs to be searched;
  • If the statement is select * from T where k=5, which is a query based on a regular index, the k index tree is searched first to obtain the ID value of 500, and then the ID index tree is searched once more. This process of going back to the primary key index is commonly called a "table lookup" (also known as a bookmark lookup).

In other words, querying via a non-primary-key index requires searching one extra index tree. Therefore, you should try to use primary key queries in your applications whenever possible.
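You can observe this difference with EXPLAIN. Below is a sketch against the table T above (hedged: the exact plan output varies with MySQL version and data volume):

-- Primary key lookup: only the clustered index is searched
-- (typically shown as type=const, key=PRIMARY).
explain select * from T where ID=500;

-- Secondary index lookup: the k tree first, then the primary key tree
-- (typically shown as type=ref, key=k; the table lookup does not appear
-- as a separate plan row, it happens for each matching index entry).
explain select * from T where k=5;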

Index Maintenance #

To maintain the orderliness of the index, necessary maintenance must be performed whenever a new value is inserted into the B+ tree. Taking the diagram above as an example, inserting a new row with an ID of 700 only requires appending a new record after R5. But if the newly inserted ID is 400, things get more complicated: the records after it must be logically shifted to make room.

In an even worse scenario, if the data page where R5 is located is already full, according to the B+ tree algorithm, a new data page needs to be allocated, and then some data needs to be moved to the new page. This process is called page splitting. In this case, performance is naturally affected.

In addition to performance, page splitting also affects the utilization of data pages. Data that was originally stored on a single page is now divided into two pages, resulting in approximately 50% lower overall space utilization.

Of course, where there is splitting, there is also merging. When two adjacent pages have low utilization due to data deletion, the data pages are merged. The process of merging can be considered as the reverse process of splitting.

Based on the above index maintenance process explanation, let’s discuss an example:

You may have come across similar descriptions in some table creation specifications, which require the table creation statement to include an auto-increment primary key. However, nothing is absolute, so let’s analyze which scenarios should use an auto-increment primary key and which should not.

An auto-increment primary key refers to a primary key defined on an auto-increment column, which is generally defined in a table creation statement as: NOT NULL PRIMARY KEY AUTO_INCREMENT.

When inserting new records, it is not necessary to specify the value of the ID column. Instead, the system will retrieve the maximum value of ID and increment it by 1 as the value of the next record’s ID.

In other words, the insertion data mode with an auto-increment primary key fits the scenario of incremental insertion that we mentioned earlier. Each time a new record is inserted, it is an append operation that does not involve moving other records or triggering leaf node splitting.
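To make this concrete, here is a minimal sketch (the table and column names are my own illustration, not from the original text):

mysql> create table orders(
id bigint not null primary key auto_increment,
order_no varchar(32) not null,
amount decimal(10,2) not null)engine=InnoDB;

-- No id is specified: InnoDB assigns max(id)+1, so each insert appends
-- at the right edge of the B+ tree and never forces records to move.
insert into orders(order_no, amount) values('A001', 99.90);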

On the other hand, using a business logic field as the primary key often makes it difficult to guarantee ordered insertion, resulting in higher data writing costs.

Besides considering performance, we can also consider storage space. Suppose your table does have a unique field, such as a string-based ID number. In that case, should you use the ID number as the primary key or use an auto-increment field as the primary key?

Since the leaf nodes of every non-primary-key index contain the primary key's value, using the ID number as the primary key means each leaf entry of a secondary index occupies about 20 bytes, whereas an integer primary key occupies only 4 bytes (8 bytes for a bigint). For a rough sense of scale: in a table of one million rows with one secondary index, that is on the order of 20 MB versus 4~8 MB of primary-key values stored in that index alone.

It is obvious that the smaller the primary key length, the smaller the leaf nodes of the regular index, and consequently, the smaller the space occupied by the regular index.

Therefore, from the perspective of performance and storage space, an auto-increment primary key is often a more reasonable choice.

Are there any scenarios where it is suitable to use a business field directly as the primary key? Yes, there are. For example, some business scenarios have the following requirements:

  1. Only one index is needed.
  2. This index must be a unique index.

I’m sure you can see that this is a typical key-value (KV) scenario. Since there are no other indexes, there is no need to consider the size of leaf nodes for other indexes.

At this point, we should apply the principle mentioned earlier, "use primary key queries as much as possible", and make this unique business field the primary key directly, so that each query does not have to search two trees.
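For instance, here is a sketch of such a KV-style table (the names and values are my own illustration, not from the original text):

mysql> create table kv_store(
cache_key varchar(64) not null primary key,
cache_value varchar(1024) not null)engine=InnoDB;

-- Every lookup is a single search of the clustered index: no table lookup.
select cache_value from kv_store where cache_key='session:42';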

Summary #

Today, I analyzed the data structures available in database engines and introduced the B+ tree structure used by InnoDB and why InnoDB chose it. B+ trees can work well with the read and write characteristics of disks, reducing the number of disk accesses for each query.

Since InnoDB is an index-organized table, in general, I would recommend creating an auto-increment primary key to minimize the space occupied by non-primary key indexes. However, there are no absolutes, and I also discussed with you the use cases for using business logic fields as primary keys.

Lastly, I want to leave you with a question. For the InnoDB table T in the example above, if you want to rebuild the index k, you can write the two SQL statements like this:

alter table T drop index k;
alter table T add index(k);

If you want to rebuild the primary key index, you can write them like this:

alter table T drop primary key;
alter table T add primary key(id);

My question is, what is your understanding of these two methods of index rebuilding? If they are not appropriate, why not, and what would be a better approach?

You can write your thoughts and opinions in the comments section, and I will provide my reference answer at the end of the next article. Thank you for listening, and feel free to share this article with more friends to read together.

Previous Question Answer #

The question I left you with at the end of the previous article was: How to avoid the impact of long transactions on business?

This question can be approached from both the application development side and the database side.

First, let’s look at it from the application development side:

  1. Check whether set autocommit=0 is used. You can do this in the testing environment by enabling the MySQL general_log, running a piece of business logic, and confirming through the log. If a framework sets this value, it usually also provides a parameter to control the behavior; your goal is to change it to 1. (The relevant commands are sketched after this list.)
  2. Check if there are unnecessary read-only transactions. Some frameworks tend to wrap every statement in a begin/commit pair. I have seen cases where the business logic did not require this, yet several select statements were still wrapped in a transaction. Such read-only transactions can be removed.
  3. When the business connects to the database, based on the business's own estimates, use the SET MAX_EXECUTION_TIME command to control the maximum execution time of each statement, preventing a single statement from accidentally running too long. (Why "accidentally"? Cases of this kind will come up in future articles.)
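Here is a sketch of the commands mentioned in items 1 and 3 (standard MySQL statements; note that max_execution_time applies only to SELECT and requires MySQL 5.7.8 or later):

-- Item 1: check autocommit and, in a test environment, trace what the
-- framework actually sends via the general log.
show variables like 'autocommit';
set global general_log=ON;

-- Item 3: cap statement execution time, in milliseconds. It can be set
-- per session, or applied to a single query through an optimizer hint.
set max_execution_time=1000;
select /*+ MAX_EXECUTION_TIME(1000) */ * from T where k=5;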

Next, let’s look at it from the database side:

  1. Monitor the information_schema.innodb_trx table, set a duration threshold for long transactions, and alert on or kill any transaction that exceeds it. (A sample query is sketched after this list.)
  2. Percona's pt-kill tool handles this well; I recommend using it.
  3. During business functional testing, require the full general_log output and analyze its behavior, so that problems are discovered in advance.
  4. If you are using MySQL 5.6 or a newer version, set innodb_undo_tablespaces to 2 (or a larger value). If there is indeed a large transaction causing a large rollback segment, it will be easier to clean up after setting this value.
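Here is a sketch of the monitoring query from item 1 and the configuration from item 4 (the 60-second threshold is my own example; innodb_undo_tablespaces cannot be changed at runtime in MySQL 5.6/5.7, so it belongs in the configuration file before the instance's undo tablespaces are initialized):

-- Item 1: flag transactions that have been running for over 60 seconds.
select * from information_schema.innodb_trx
where TIME_TO_SEC(timediff(now(), trx_started)) > 60;

-- Item 4: in my.cnf (illustrative):
-- [mysqld]
-- innodb_undo_tablespaces = 2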