
34 MySQL Index Tuning: The Inefficacy and Optimization of Indexes #

Hello, I am Liu Chao.

I don’t know whether you, as a developer, have had the same experience as me: I often get “criticized” by DBAs, most often when I apply to create a new index or when one of my queries shows up in the slow SQL log.

I remember a past iteration of a business module that involved a new query feature: it needed to filter orders by product (SKU) and order status, and sort them by order time. Since an index on sku already existed, after finishing the development I submitted a request to create an index on status, on the grounds that the SQL query needed to use these two indexes:

select * from order where status = 1 and sku = 10001 order by create_time asc

However, the DBA quickly rejected the request and advised building a combined index on sku, status, and create_time instead, with the order of the query conditions changed to sku=10001 and status=1. At the time I understood why the combined index was needed, but I couldn’t see why the create_time column had to be included in it.
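
In SQL terms, the DBA’s suggestion amounts to something like the following (the index name idx_sku_status_time is just an assumed example):

alter table `order` add index idx_sku_status_time (sku, status, create_time);
select * from `order` where sku = 10001 and status = 1 order by create_time asc;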

From the execution plan, we can see that the index is being used, so why does the DBA still require adding the create_time column to the combined index? We mentioned this question in [Lesson 32], and I believe you already know the answer. Through this story, we can see the importance of index knowledge in everyday development, but it is also easy for us to overlook it. So today, let’s talk in detail about indexes.

MySQL Index Storage Structure #

Indexing is one of the most important ways to optimize database queries, and it is implemented at the storage engine level in MySQL, so the index types supported by each storage engine may vary. We can see which index types are supported by different storage engines through the following table:

img

B+Tree and Hash indexes are two commonly used index data storage structures. B+Tree indexes are implemented using B+ trees and are stored in an ordered manner, so they have advantages in sorting and range queries. If you are not familiar with B+Tree indexes, you can learn about their data structure principles through this link.

Hash indexes are relatively simple, and only the Memory storage engine supports them. A Hash index is suited to key-value lookups: regardless of the table size, the complexity of finding a row is O(1), so for exact-match queries a Hash index outperforms the other index types.

When a table is created, whether with the InnoDB or MyISAM storage engine, a primary key index is created by default, and that primary key index is a B+Tree index by default. However, although both storage engines support B+Tree indexes, they differ in how the index data is actually stored.

The primary key index created by InnoDB by default is a clustered index, and other indexes are auxiliary indexes, also known as secondary indexes or non-clustered indexes. Next, we will use a simple example to illustrate the specific implementation of these two types of indexes in storing data.

First, create a table of merchandise as follows:

CREATE TABLE `merchandise`  (
  `id` int(11) NOT NULL,
  `serial_no` varchar(20)  DEFAULT NULL,
  `name` varchar(255) DEFAULT NULL,
  `unit_price` decimal(10, 2) DEFAULT NULL,
  PRIMARY KEY (`id`) USING BTREE
) CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Dynamic;

Then add the following lines of data:

img
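
If you want to reproduce the example locally, the rows can be inserted with statements along these lines (the concrete values here are only illustrative, not taken from the figure):

INSERT INTO `merchandise` (`id`, `serial_no`, `name`, `unit_price`) VALUES
  (1, 'S0001', 'Mechanical Keyboard', 99.00),
  (2, 'S0002', 'Wireless Mouse', 49.00),
  (7, 'S0007', '27-inch Monitor', 1299.00);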

If we are using the MyISAM storage engine, its indexes are all non-clustered: each leaf node in an index records only the physical address of the corresponding row, that is, a row pointer, as shown in the following figure:

img

If we are using the InnoDB storage engine, since InnoDB uses a clustered index, the leaf nodes of the clustered index store the primary key value, the transaction ID and rollback pointer used for transactions and MVCC, and all of the remaining column values, as shown in the following figure:

img

Based on the diagrams above, if we need to query merchandise by product code, we have to make the product code column (serial_no) an indexed column. The index created in this case is an auxiliary index, and its structure is similar to that of the indexes in the MyISAM storage engine, except that its leaf nodes store the primary key value, which serves as the pointer to the row. The advantage of this design is that when rows are moved or a page split occurs, the secondary index does not need to be updated.
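
For the merchandise table above, that auxiliary index could be created like this (the index name idx_serial_no is an assumed example):

ALTER TABLE `merchandise` ADD INDEX `idx_serial_no` (`serial_no`);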

If we query merchandise using the primary key index, it will find the corresponding leaf node based on the B+ tree index and directly obtain the row data:

select * from merchandise where id=7;

If we query merchandise by product code, that is, through the auxiliary index, MySQL first searches the auxiliary index’s B+ tree for the given serial_no, reaches the corresponding leaf node, and obtains the primary key value; it then searches the clustered index’s B+ tree with that primary key to reach the leaf node holding the entire row. This extra trip back to the clustered index is called a “lookup”.
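
For example, a query of the following shape goes through exactly this two-step process, first the idx_serial_no B+ tree and then the clustered index (the serial number literal is illustrative):

-- Finds the primary key via the secondary index, then fetches the full row from the clustered index.
select * from merchandise where serial_no = 'S0007';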

After understanding the implementation principles of indexes, let’s now discuss in detail what tuning methods there are for creating and using indexes.

1. Optimizing Queries with Covering Indexes #

Suppose we only need to query the name and price of a product. What can we do to avoid lookups? We can create a composite index consisting of the product code, name, and price. If all the columns the query needs are present in the index, the query no longer has to go back to the primary key index, thus avoiding lookups.

When records are obtained from the secondary index without the need to query the clustered index, it is called a covering index in MySQL. The benefits of using a covering index are obvious: we don’t need to retrieve all the information contained in the entire row, which can reduce a significant amount of I/O operations.
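
A minimal sketch of such a covering index on the merchandise table, assuming idx_serial_name_price as the index name: since every column the query touches is in the index, EXPLAIN reports “Using index” in the Extra column and no lookup is performed.

ALTER TABLE `merchandise` ADD INDEX `idx_serial_name_price` (`serial_no`, `name`, `unit_price`);
EXPLAIN select serial_no, name, unit_price from merchandise where serial_no = 'S0007';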

InnoDB often uses a covering index to optimize query performance, not only when querying specific fields but also when counting rows. For example, when we discussed SELECT COUNT(*) in [Lesson 32], if there is no secondary index, the query counts rows through the clustered index; if a secondary index exists, the query counts rows through that secondary index instead, which reduces I/O operations.

By using EXPLAIN, we can see that the InnoDB storage engine uses the idx_order index column to count the number of rows, as shown in the following figure:

img
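
For reference, the plan in the figure comes from a statement of roughly this shape, where idx_order is assumed to be a secondary index on the order table from [Lesson 32]:

explain select count(*) from `order`;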

2. Optimizing Queries with Auto-Increment Fields as Primary Keys #

As discussed earlier, in an InnoDB primary key index the row data is stored in the leaf nodes of the B+ tree, and the rows within a leaf node are kept in primary key order. Therefore, whenever a new record is inserted, the database places it into the appropriate leaf node according to its primary key value.

If we use an auto-increment primary key, each new data record will be added to the current index node in sequence without having to move existing data. When the page is full, a new page will be automatically allocated. Because no data needs to be moved, this method of inserting data is very efficient.

On the other hand, if we use a non-auto-increment primary key, the primary key value of each inserted row is effectively random, so a new row may have to be inserted into the middle of an existing data page. That requires moving other data to make room for the new row, and may even involve copying data from one page to another; we commonly call this a page split. Page splits also leave behind a lot of fragmentation, making the index structure less compact and hurting query performance.

Therefore, when using the InnoDB storage engine, unless there are specific business requirements, it is recommended to use an auto-increment field as the primary key.
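
As a minimal sketch of this recommendation (the table and column names here are illustrative), a surrogate auto-increment primary key is declared like this:

CREATE TABLE `operation_log`  (
  `id` bigint NOT NULL AUTO_INCREMENT,
  `order_no` varchar(20) DEFAULT NULL,
  `created_at` datetime DEFAULT NULL,
  PRIMARY KEY (`id`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8;

New rows always receive the next id value, so they are appended to the rightmost leaf page instead of being squeezed into the middle of an existing one.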

3. Optimizing Prefix Index #

As the name implies, a prefix index uses the first few characters of a string in a field to create an index. Why do we need to use prefixes to build an index?

We know that index data is stored on disk and is read and written in units of pages; by default an InnoDB page is 16KB. Say each index entry we create is 2KB in size: one page can then hold 8 entries, and with 8000 rows we would need 1000 pages to store the index. Using this index to query data may therefore require traversing a large number of pages, which clearly reduces query efficiency.

Reducing the size of the index field allows for more index entries to be stored in one page, effectively improving the query speed of the index. When indexing large string fields, using a prefix index can help reduce the size of the index entries.
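
For example, on the merchandise table we could index only the first characters of the name column; a sensible habit is to check how selective the prefix is before picking its length (both the prefix length of 10 and the index name are assumptions):

-- The closer this ratio is to 1, the better a 10-character prefix distinguishes the rows.
select count(distinct left(name, 10)) / count(*) as prefix_selectivity from merchandise;
ALTER TABLE `merchandise` ADD INDEX `idx_name_prefix` (`name`(10));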

However, there are limitations to prefix indexes. For example, order by cannot use a prefix index, and a prefix index cannot be used as a covering index.

4. Preventing Index Failure #

Once we are used to creating indexes to optimize the performance of SQL queries, does that mean everything is smooth sailing? Of course not. Sometimes a query appears to use an index, yet the optimizer does not actually use it.

For columns indexed with a Hash index, range queries cannot use the index. In other words, a Hash index implemented by the Memory engine only takes effect for query conditions using the “=” operator. If we switch the order table to the Memory storage engine and analyze a query with the condition id<10, we can see that no index is used:

img
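
The experiment in the figure can be reproduced roughly as follows; switching the real order table to the Memory engine is done here only for the demonstration and assumes all of its columns are supported by that engine:

ALTER TABLE `order` ENGINE = MEMORY;
-- The Memory engine builds Hash indexes by default, so only "=" conditions can use them.
EXPLAIN select * from `order` where id < 10;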

If the pattern in a LIKE query starts with “%”, the query cannot use the index nodes to locate data:

img
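
For instance, assuming order_no is an indexed column on the order table (the literal values are made up), the first query below cannot use the index, while the second, with the wildcard only at the end, can:

EXPLAIN select * from `order` where order_no like '%0001';
EXPLAIN select * from `order` where order_no like 'HO91%';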

When using a composite index, a query must include the leftmost column of the index in order to use it. For example, if we create a composite index idx_user_order_status (order_no, status, user_id) on the order table, queries on order_no, order_no+status, order_no+status+user_id, and order_no+user_id can use the index. However, queries on status alone or on status+user_id cannot. This is what we usually call the leftmost matching principle.

img

img
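
In SQL terms, the scenario above looks roughly like this (the literal values are illustrative): the first two queries include the leftmost column order_no and can use idx_user_order_status, while the last one skips it and cannot.

ALTER TABLE `order` ADD INDEX `idx_user_order_status` (`order_no`, `status`, `user_id`);
EXPLAIN select * from `order` where order_no = 'HO9150001';
EXPLAIN select * from `order` where order_no = 'HO9150001' and status = 1;
EXPLAIN select * from `order` where status = 1 and user_id = 10001;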

If the query conditions contain an “or” operator and any of the columns involved has no index, then none of the indexes will be used.

img
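
For example, if order_no is indexed but a column such as remark is not (remark is an assumed unindexed column here), a query like the following will not use the order_no index:

EXPLAIN select * from `order` where order_no = 'HO9150001' or remark = 'gift';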

So, do you see it now? As a developer, if you are not familiar with MySQL, especially the basics of MySQL indexes, you will often be criticized by DBAs and be left questioning your life choices.

Summary #

In most cases we use the default InnoDB as the table storage engine. With InnoDB, the indexes we create are B+ tree structures by default. A primary key index is a clustered index, while non-primary-key indexes are secondary indexes. When querying by primary key we can obtain the row data directly; when querying through a secondary index we need a lookup, fetching the data via the primary key index.

If we only need to retrieve one or a few columns, we can use a covering index to avoid lookups. A covering index only needs to read the index itself, and because the index is stored in order, it greatly reduces disk I/O for range and sorting queries.

In addition to understanding the specific implementation and characteristics of indexes, we also need to be aware of cases where the indexes may become ineffective. If you find these rules too many and difficult to remember, it is important to develop the habit of regularly checking the SQL execution plan.

Discussion Questions #

Let’s assume we have an order_detail table, which includes fields such as the primary key id, the parent order number order_id, and the product sku. The table has a primary key index and an index on order_id.

Now, we have an SQL query to retrieve order details with order numbers ranging from 5000 to 10000. What index will be selected for this query? And how can we force the use of the desired index?

select * from order_detail where order_id between 5000 and 10000;

Answer:

The optimizer will normally choose the order_id index, since it narrows the scan to the rows in the requested range. However, because the query selects all columns, every matching row requires a lookup back to the clustered index; if the optimizer estimates that the range covers a large share of the table, it may decide that a full scan on the primary key is cheaper and skip the order_id index altogether. That is exactly the case where we need to force the index.

To force the use of a specific index, we can add an index hint to the SQL query. In this case we can use the INDEX hint and name the order_id index explicitly (here the index is assumed to be named order_id_index). The modified query would look like this:

select /*+ INDEX(order_detail order_id_index) */ * from order_detail where order_id between 5000 and 10000;

By naming the desired index in the hint, we can control which index the optimizer uses for the query.
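
Note that index-level optimizer hints such as /*+ INDEX(...) */ are only available in newer MySQL releases; the longer-standing alternative that works across versions is FORCE INDEX, with order_id_index again being the assumed index name:

select * from order_detail force index (order_id_index) where order_id between 5000 and 10000;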