05 Deep Dive into Indexes Part Two #

In the previous article, I introduced you to the data structure model of InnoDB indexes. Today, let’s continue discussing concepts related to MySQL indexes.

Before we start this article, let’s take a look at this question:

In the table T below, if I execute select * from T where k between 3 and 5, how many tree search operations will be performed and how many rows will be scanned?

Here is the initialization statement for this table.

mysql> create table T (
ID int primary key,
k int NOT NULL DEFAULT 0, 
s varchar(16) NOT NULL DEFAULT '',
index k(k))
engine=InnoDB;
 
insert into T values(100,1, 'aa'),(200,2,'bb'),(300,3,'cc'),(500,5,'ee'),(600,6,'ff'),(700,7,'gg');


Figure 1: InnoDB’s index organization structure

Now, let’s take a look at the execution process of this SQL query:

  1. Find the record with k=3 on the k index tree and retrieve ID=300;
  2. Go to the ID index tree and find R3 corresponding to ID=300;
  3. Get the next value on the k index tree, k=5, and retrieve ID=500;
  4. Go back to the ID index tree and find R4 corresponding to ID=500;
  5. Get the next value on the k index tree, k=6, which does not satisfy the condition, and the loop ends.

In this process, returning to the primary key index tree to fetch the row is called “going back to the table” (回表). As you can see, this query reads 3 records from the k index tree (steps 1, 3, and 5) and goes back to the table twice (steps 2 and 4).

In this example, since the data required by the query is only available on the primary key index, the query has to go back to the table. So, is it possible to avoid going back to the table through index optimization?
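The five steps above can be sketched in plain Python. This is a toy model, not InnoDB internals: a sorted list of (k, ID) pairs stands in for index k, a dict keyed by primary key stands in for the primary key tree, and the function name `range_query` is invented for illustration.

```python
import bisect

# Toy stand-ins for the two B+ trees in Figure 1 (not InnoDB internals):
# index k as a sorted list of (k, ID) pairs, the primary key index as a dict.
k_index = [(1, 100), (2, 200), (3, 300), (5, 500), (6, 600), (7, 700)]
primary_index = {100: (100, 1, 'aa'), 200: (200, 2, 'bb'), 300: (300, 3, 'cc'),
                 500: (500, 5, 'ee'), 600: (600, 6, 'ff'), 700: (700, 7, 'gg')}

def range_query(lo, hi):
    """select * from T where k between lo and hi"""
    rows, index_reads, table_lookups = [], 0, 0
    pos = bisect.bisect_left(k_index, (lo, -1))   # one tree search locates k=lo
    while pos < len(k_index):
        k, row_id = k_index[pos]
        index_reads += 1                          # one record read on index k
        if k > hi:                                # e.g. k=6: condition fails, loop ends
            break
        table_lookups += 1                        # go back to the primary key tree
        rows.append(primary_index[row_id])
        pos += 1                                  # "next value" on index k
    return rows, index_reads, table_lookups

rows, index_reads, table_lookups = range_query(3, 5)
print(index_reads, table_lookups)   # 3 index records read, 2 trips back to the table
```

Running it for `between 3 and 5` matches the counts above: three records read on index k, two lookups on the primary key tree.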

Covering Index #

If the executed statement is select ID from T where k between 3 and 5, in this case, only the value of ID is needed, and the value of ID is already on the k index tree. Therefore, the query result can be directly provided without the need to return to the primary key index tree. In other words, in this query, the index k has “covered” our query requirement, which is called a covering index.

Since a covering index can reduce the number of tree searches and significantly improve query performance, using a covering index is a commonly used performance optimization technique.

It should be noted that internally, using a covering index on index k actually reads three records, R3 to R5 (corresponding to the index entries on index k). However, for MySQL’s Server layer, it means getting two records from the engine, so MySQL considers the number of scanned rows to be 2.

Note: Regarding the issue of how to view the number of scanned rows, I will discuss it in detail with you in the 16th article “How to Correctly Display Random Results?”.
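A minimal sketch of the covering-index case, using the same kind of sorted (k, ID) list as a stand-in for index k; the function name `covering_range_query` is invented. Because ID is stored in every index entry, the loop never touches the primary key tree.

```python
import bisect

# Toy stand-in for index k: sorted (k, ID) pairs mirroring table T above.
k_index = [(1, 100), (2, 200), (3, 300), (5, 500), (6, 600), (7, 700)]

def covering_range_query(lo, hi):
    """select ID from T where k between lo and hi"""
    ids = []
    pos = bisect.bisect_left(k_index, (lo, -1))
    while pos < len(k_index) and k_index[pos][0] <= hi:
        ids.append(k_index[pos][1])   # ID is in the index entry itself:
        pos += 1                      # no trip back to the primary key tree
    return ids

print(covering_range_query(3, 5))    # [300, 500]
```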

Based on the explanation of covering indexes above, let’s discuss a question: Is it necessary to create a composite index on the ID card number and name in a citizen information table?

Let’s assume the definition of this citizen table is as follows:

CREATE TABLE `tuser` (
  `id` int(11) NOT NULL,
  `id_card` varchar(32) DEFAULT NULL,
  `name` varchar(32) DEFAULT NULL,
  `age` int(11) DEFAULT NULL,
  `ismale` tinyint(1) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `id_card` (`id_card`),
  KEY `name_age` (`name`,`age`)
) ENGINE=InnoDB;

We know that the ID card number is the unique identifier of a citizen. In other words, if there is a need to query citizen information based on the ID card number, we only need to create an index on the ID card number field. Is it a waste of space to create a composite index on (ID card number, name)?

If there is a high-frequency request that needs to query a citizen’s name based on their ID card number, then this composite index is meaningful. It can be used for this high-frequency request as a covering index, eliminating the need to return to the table to retrieve the full row, thus reducing the execution time of the statement.

Of course, maintaining index fields always comes with a cost. Therefore, when creating redundant indexes to support covering indexes, it requires careful consideration. This is exactly the work of a business DBA, or a business data architect.
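The covering effect can also be observed in runnable form with SQLite (not InnoDB, but its planner likewise recognizes covering indexes and reports them in the query plan). The table mirrors the tuser definition above; the index name `id_card_name` and the sample parameter value are invented for illustration.

```python
import sqlite3

# SQLite stand-in for the tuser example; the index name id_card_name and the
# query parameter are invented for illustration.
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE tuser (id INTEGER PRIMARY KEY, id_card TEXT, name TEXT, age INTEGER)")
conn.execute("CREATE INDEX id_card_name ON tuser (id_card, name)")

# The high-frequency query: name by ID card number. Both columns live in the
# (id_card, name) index, so the plan never touches the table's row storage.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM tuser WHERE id_card = ?", ('110101X',)
).fetchall()
print(plan[0][-1])   # the plan line mentions a COVERING INDEX
```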

Leftmost Prefix Principle #

By now, you must have a question: if we design an index for each type of query, won’t there be too many indexes? What if I want to search for a citizen’s home address based on their ID card number? Although this query requirement may not occur frequently in the business logic, we can’t perform a full table scan for it, right? On the other hand, creating an index on (ID card number, address) for a single infrequent request seems like a waste. What should we do?

Here, let me tell you the conclusion first. In the B+ tree index structure, we can utilize the “leftmost prefix” of the index to locate records.

To illustrate this concept intuitively, let’s analyze the (name, age) composite index.


Figure 2: Illustration of the (name, age) index

As you can see, the index entries are sorted according to the order of fields appearing in the index definition.

When your logical requirement is to find all people with the name “张三” (Zhang San), you can quickly locate ID4 and then traverse forward to retrieve all the required results.

If you want to find all people whose name begins with “张” (Zhang), and your SQL condition is where name like '张%', you can also use this index: it finds the first record that satisfies the condition, ID3, and then traverses forward until the condition is no longer met.

As you can see, as long as a query satisfies the leftmost prefix of an index, not necessarily the index’s full definition, the index can be used to accelerate retrieval. This leftmost prefix can be the leftmost N fields of a composite index, or the leftmost M characters of a string index. Based on this, let’s discuss a question: how should we arrange the order of fields in a composite index?

The evaluation criterion here is the reusability of the index. Because the leftmost prefix is supported, when a composite index (a,b) already exists, there is generally no need to create a separate index on column a. Therefore, the first principle is that if by adjusting the order, one index can be maintained less, then this order is often the one to be considered first.

So, for the problem posed at the beginning of this section, we should create the composite index (ID card number, name) for the high-frequency request, and then rely on its leftmost prefix to also support the “query address by ID card number” requirement, with no separate index on the ID card number alone.

Now, what if there are queries on columns a and b together, as well as queries on each column individually? If a query’s conditions contain only column b, the composite index (a, b) cannot serve it. In that case you have to maintain another index, which means you need both the (a, b) and (b) indexes.

In this case, we also need to consider the space principle. For example, in the citizen table above, the name field is larger than the age field, so I suggest creating a composite index (name, age) plus a single-column index (age), rather than (age, name) plus (name), which would occupy more space.
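The “locate, then traverse” behavior of a leftmost-prefix search can be sketched with a sorted list as a stand-in for the (name, age) index. The sample entries and the function name `prefix_scan` are invented, and Python’s code-point ordering stands in for MySQL’s collation.

```python
import bisect

# Hypothetical sample entries of the (name, age) composite index, stored as
# (name, age, ID) tuples; sorting them mimics the index order in Figure 2.
entries = sorted([('李四', 20, 1), ('张三', 10, 4), ('张三', 30, 5),
                  ('张六', 30, 3), ('王五', 15, 2)])

def prefix_scan(name_prefix):
    """Locate the first entry whose name starts with the prefix, then
    traverse forward while the prefix still matches."""
    pos = bisect.bisect_left(entries, (name_prefix,))
    matched = []
    while pos < len(entries) and entries[pos][0].startswith(name_prefix):
        matched.append(entries[pos])
        pos += 1
    return matched

print(prefix_scan('张三'))   # exact-name lookup on the first index field
print(prefix_scan('张'))     # like '张%': every name beginning with 张
```

Both calls use the same index: one matches the whole first field, the other only a prefix of it.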

Index Condition Pushdown #

In the previous section, we mentioned that a query satisfying the leftmost prefix can use the index to locate records. At this point, you may wonder: what happens to the parts of the condition that the leftmost prefix does not cover?

Let’s use the composite index (name, age) of the citizen table as an example. Suppose there is a requirement: retrieve all boys in the table whose name starts with “Zhang” and age is 10. The SQL statement is written as follows:

mysql> select * from tuser where name like '张%' and age=10 and ismale=1;

Given the leftmost prefix rule, this statement can only use “张” when searching the index tree, locating the first record that satisfies the condition, ID3. That’s not bad: it is already much better than a full table scan.

What’s next?

Of course, we need to check if other conditions are met.

Before MySQL 5.6, we could only start from ID3 and iterate through each record one by one. We would need to retrieve the data row from the primary key index and compare the field values.

With the introduction of the index condition pushdown optimization in MySQL 5.6, the index traversal process can directly filter out records that do not satisfy the conditions by checking the fields contained in the index, reducing the number of data retrievals.

Figure 3 and Figure 4 are flowcharts of these two processes.


Figure 3: Execution process without index condition pushdown


Figure 4: Execution process with index condition pushdown

In Figures 3 and 4, each dashed arrow represents a data retrieval.

In Figure 3, I deliberately removed the age values in the (name, age) index. In this process, InnoDB does not check the age values but sequentially retrieves records that have “Zhang” as the first character of the name and then retrieves the associated data rows. Therefore, it requires 4 data retrievals.

The difference between Figures 3 and 4 is that in Figure 4, InnoDB checks whether the age is equal to 10 within the (name, age) index and directly skips the records that do not satisfy this condition. In our example, only ID4 and ID5 need to be retrieved and checked, resulting in only 2 data retrievals.
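The two flows in Figures 3 and 4 can be contrasted in a toy model (not InnoDB code). The sample entries are hypothetical, chosen so that four names start with “张” and only the rows with ID 4 and 5 have age = 10, matching the counts in the text.

```python
# Hypothetical (name, age, ID) entries of the composite index, plus a fake
# primary-key store; these are invented data, not InnoDB internals.
index_entries = [('张三', 10, 4), ('张三', 10, 5), ('张五', 35, 6), ('张六', 30, 3)]
primary_rows = {
    3: {'name': '张六', 'age': 30, 'ismale': 1},
    4: {'name': '张三', 'age': 10, 'ismale': 1},
    5: {'name': '张三', 'age': 10, 'ismale': 1},
    6: {'name': '张五', 'age': 35, 'ismale': 1},
}

def scan(use_icp):
    """select * where name like '张%' and age=10 and ismale=1, counting how
    many times we go back to the primary key index ("data retrievals")."""
    retrievals, results = 0, []
    for name, age, row_id in index_entries:
        if not name.startswith('张'):
            break                     # past the '张%' range on the index
        if use_icp and age != 10:
            continue                  # ICP: filtered inside the index, no retrieval
        retrievals += 1               # go back to the primary key index for the row
        row = primary_rows[row_id]
        if row['age'] == 10 and row['ismale'] == 1:
            results.append(row_id)
    return results, retrievals

print(scan(use_icp=False))   # ([4, 5], 4): 4 retrievals, as in Figure 3
print(scan(use_icp=True))    # ([4, 5], 2): only 2, as in Figure 4
```

Both variants return the same rows; index condition pushdown only changes how many trips back to the table are needed.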

Summary #

In today’s article, we continued discussing database index concepts, including covering indexes, the leftmost prefix principle, and index condition pushdown. As you can see, accessing as few resources as possible while still satisfying the statement’s requirements is an important principle of database design, one we should follow when using databases and especially when designing table structures.

Now, I have a question for you.

In fact, a primary key index can also be composed of multiple fields. When DBA Xiaolu joined a new company, he found a table in the database he was responsible for maintaining, with a table structure similar to this:

CREATE TABLE `geek` (
  `a` int(11) NOT NULL,
  `b` int(11) NOT NULL,
  `c` int(11) NOT NULL,
  `d` int(11) NOT NULL,
  PRIMARY KEY (`a`,`b`),
  KEY `c` (`c`),
  KEY `ca` (`c`,`a`),
  KEY `cb` (`c`,`b`)
) ENGINE=InnoDB;

His colleagues told him that, due to historical reasons, the table needs to have a and b as the composite primary key, and Xiaolu understood that.

However, Xiaolu, who has studied the content of this chapter, was puzzled again. Since the primary key already contains fields a and b, a secondary index on c alone already includes all three fields (InnoDB appends the primary key columns to every secondary index). Why create the “ca” and “cb” indexes as well?

His colleague informed him that it was because there are two types of statements in their business logic:

select * from geek where c=N order by a limit 1;
select * from geek where c=N order by b limit 1;

My question for you is, is his colleague’s explanation correct? Are both of these indexes necessary for these two query patterns? Why?

Please write down your thoughts and opinions in the comments, and I will discuss this question with you at the end of the next article. Thank you for listening, and feel free to share this article with more friends to read together.

Previous question #

The previous question was whether it is reasonable to rebuild the index k using two ALTER statements, and whether it is reasonable to rebuild the primary key index using two ALTER statements.

In the comments, some students asked why it is necessary to rebuild the index. As we mentioned in the article, indexes may have empty spaces in data pages due to deletions or page splits. Rebuilding the index will create a new index and insert the data in order, maximizing the utilization of pages and making the index more compact and space-efficient.

The “reference answer” to this question is as follows:

Rebuilding the index k is reasonable and can achieve the goal of saving space. However, the process of rebuilding the primary key is not reasonable. Whether deleting or creating a primary key, the entire table needs to be rebuilt. Therefore, if these two statements are executed in sequence, the first statement becomes meaningless. Instead of these two statements, you can use the following statement: ALTER TABLE T ENGINE=InnoDB. I will analyze the execution process of this statement in the 12th article of this column, “Why is the table file size unchanged when half of the table data is deleted?”