10 Combined Indexes Improve Performance by 10 Times

10 Combined Indexes - Improve Performance by 10 Times #

In the previous two lessons, I taught you about the data structure of indexes and the organization of indexed tables. I believe you have mastered how to create indexes in a MySQL database and some basic usage techniques.

Of course, the examples I showed in the previous two lessons were based on indexing and sorting on a single column, which is relatively simple. In real business scenarios, we often encounter more complex situations, such as querying on multiple columns. In these cases, users may be required to create composite indexes composed of multiple columns, such as a composite index created by columns a and b. However, creating an index on (a, b) or (b, a) will yield completely different results.

In this lesson, we will learn about the creation and usage of composite indexes that are more closely related to practical business scenarios. I hope that after completing this lesson, you will be able to use composite indexes effectively in your own business and further improve the performance of your system.

Composite Index #

A composite index is a B+ tree index composed of multiple columns. This is completely consistent with the principles of B+ tree indexes that we introduced earlier, except that before we sorted the index by a single column, and now we are sorting it by multiple columns.

A composite index can be a primary key index or a secondary index. The following diagram shows a secondary composite index:

Drawing 0.png

B+ Tree structure of a composite index

From the diagram above, we can see that a composite index is still a B+ tree index, the only difference is that the sorting key value has changed from one to multiple. However, you must be aware that a composite index, such as (a, b) and (b, a), will yield completely different sorting results. As the number of indexed fields increases, the design becomes more prone to problems, for example:

Drawing 2.png

For the composite index (a, b), because it sorts columns a and b, it can optimize the following two queries:

SELECT * FROM table WHERE a = ?

SELECT * FROM table WHERE a = ? AND b = ?

In the above SQL queries, the order of querying columns a and b in the WHERE clause is irrelevant. Even if you write b = ? AND a = ? first, the composite index (a, b) can still be used.

However, the following SQL query cannot use the composite index (a, b) because the sorting order of (a, b) cannot deduce the sorting order of (b, a):

SELECT * FROM table WHERE b = ?

In addition, because the index (a, b) is already sorted, the following SQL query can still use the composite index (a, b) to improve query efficiency:

SELECT * FROM table WHERE a = ? ORDER BY b DESC

For the same reason, the index (a, b) sorting cannot deduce the sorting of (b, a), so the following SQL query cannot use the composite index (a, b):

SELECT * FROM table WHERE b = ? ORDER BY a DESC

So far, I have taught you the basic content of composite indexes. Next, let’s take a look at how to design composite indexes correctly in practical business scenarios.

Practical Design of Business Indexes #

Avoid Extra Sorting #

In real business scenarios, you may encounter situations where you need to query based on a certain column and then display the results in reverse order by time.

For example, in the microblogging business, the user’s microblog displays the microblogs subscribed by the user, sorted in reverse order by time. Similarly, in an e-commerce business, the user’s order details page displays the user’s order data based on their ID, sorted in reverse order by purchase time.

Drawing 3.png

The above diagram shows the Taobao order details page from section 05, where the orders are displayed in reverse order by time.

Next, let’s use a set of test tables defined by TPC-H to demonstrate some examples related to indexes (to download the TPC-H defined database tables, please follow the official WeChat account “InsideMySQL” and reply with “tpch” to get the download link for the tables).

TPC-H is a specification definition of a set of test tables used to simulate decision support applications. It simulates a business similar to e-commerce. Let’s take a look at the design of one of the core business tables, “orders” defined by TPC-H:

CREATE TABLE `orders` (

  `O_ORDERKEY` int NOT NULL,

  `O_CUSTKEY` int NOT NULL,

  `O_ORDERSTATUS` char(1) NOT NULL,

  `O_TOTALPRICE` decimal(15,2) NOT NULL,

  `O_ORDERDATE` date NOT NULL,

  `O_ORDERPRIORITY` char(15) NOT NULL,

  `O_CLERK` char(15) NOT NULL,

  `O_SHIPPRIORITY` int NOT NULL,

  `O_COMMENT` varchar(79) NOT NULL,

  PRIMARY KEY (`O_ORDERKEY`),

  KEY `ORDERS_FK1` (`O_CUSTKEY`),

  CONSTRAINT `orders_ibfk_1` FOREIGN KEY (`O_CUSTKEY`) REFERENCES `customer` (`C_CUSTKEY`)

) ENGINE=InnoDB DEFAULT

In this table:

  • The field o_orderkey is an INT type primary key.
  • The field o_custkey is a related field that references the customer table.
  • The fields o_orderdate, o_orderstatus, o_totalprice, o_orderpriority are used to describe the details of the order, representing the order date, the current status of the order, the total price of the order, and the priority of the order, respectively.

With the above order table, when a user wants to view their order information and needs to query it in reverse order based on the order date, the following SQL can be used:

SELECT * FROM orders

WHERE o_custkey = 147601 ORDER BY o_orderdate DESC

However, because the index design of the table structure above only sorts the column O_CUSTKEY, after retrieving data for user 147601, an additional sort is required to obtain the result. This can be verified using the EXPLAIN command:

EXPLAIN SELECT * FROM orders

WHERE o_custkey = 147601 ORDER BY o_orderdate DESC

*************************** 1. row ***************************

           id: 1

  select_type: SIMPLE

        table: orders

   partitions: NULL

         type: ref

possible_keys: ORDERS_FK1

        key: ORDERS_FK1

    key_len: 4

        ref: const

       rows: 19

   filtered: 100.00

      Extra: Using filesort

1 row in set, 1 warning (0.00 sec)

In the above EXPLAIN output, it can be seen that the SQL statement can indeed use the index ORDERS_FK1, but the Extra column shows “Using filesort”, indicating that an additional sorting is required to obtain the final result.

In version 8.0 of MySQL, with the additional option FORMAT=tree in the EXPLAIN command, the observation becomes more explicit:

EXPLAIN FORMAT=tree

SELECT * FROM orders

WHERE o_custkey = 147601 ORDER BY o_orderdate DESC

*************************** 1. row *************************** EXPLAIN: -> Sort: orders.O_ORDERDATE DESC (cost=18.98 rows=19) -> Index lookup on orders using ORDERS_FK1 (O_CUSTKEY=147601)

It can be seen that the execution plan for the above SQL shows an Index lookup for the index query, followed by a Sort for sorting, and finally the result is obtained.

Since an index has been created on the o_custky column, the above SQL statement will not perform particularly slowly. However, in the case of massive concurrent business access, the need for sorting for each SQL execution will have a significant impact on the performance of the business, such as increased CPU load and decreased QPS.

To solve this problem, the best method is to sort the results when retrieving them, so that no additional sorting is required.

For this reason, we create a new composite index idx_custkey_orderdate on the orders table, indexing the fields (o_custkey, o_orderdate):

ALTER TABLE orders ADD INDEX idx_custkey_orderdate(o_custkey,o_orderdate);

Now, when performing the previous SQL statement to display the user’s order information based on time, the execution plan is as follows:

EXPLAIN FORMAT=tree

SELECT * FROM orders

WHERE o_custkey = 147601 ORDER BY o_orderdate

*************************** 1. row *************************** EXPLAIN: -> Index lookup on orders using idx_custkey_orderdate (O_CUSTKEY=147601) (cost=6.65 rows=19)

It can be seen that the optimizer is now using the newly created index idx_custkey_orderdate, and there is no longer a second sorting process.

Avoiding table lookups increases performance by 10 times #

In Lesson 09, I have already explained the concept of a table lookup: that is, the SQL needs to obtain the primary key value through the secondary index query, and then search the primary index based on the primary key value, finally locating the complete data.

However, because the leaf nodes of the secondary composite index contain both the index key value and the primary key value, if the queried fields are in the leaf nodes of the secondary index, the result can be directly returned without a table lookup. This optimization technique, which avoids table lookups through composite indexes, is also known as covering index.

For example, consider the following SQL statement:

EXPLAIN

SELECT o_custkey,o_orderdate,o_totalprice

FROM orders WHERE o_custkey = 147601\G

*************************** 1. row *************************** id: 1 select_type: SIMPLE table: orders partitions: NULL type: ref possible_keys: idx_custkey_orderdate,ORDERS_FK1 key: idx_custkey_orderdate key_len: 4 ref: const rows: 19 filtered: 100.00 Extra: NULL

The execution plan shows that the above SQL will use the previously created composite index idx_custkey_orderdate. However, because the leaf nodes of the composite index only contain (o_custkey, o_orderdate, _orderid) and not the value of the o_totalprice field, a table lookup is required to find the corresponding o_totalprice.

By using the additional option FORMAT=tree in EXPLAIN, we can examine the cost of the execution plan for the above SQL:

EXPLAIN FORMAT=tree

SELECT o_custkey,o_orderdate,o_totalprice

FROM orders WHERE o_custkey = 147601\G

*************************** 1. row *************************** EXPLAIN: -> Index lookup on orders using idx_custkey_orderdate (O_CUSTKEY=147601) (cost=6.65 rows=19)

cost=6.65 represents the current cost of this SQL. Don’t worry about the specific unit of cost, just understand that the smaller the cost, the lower the overhead and the faster the execution speed.

To avoid table lookups, you can use the covering index technique and create a composite index (o_custkey, o_orderdate, o_totalprice), such as:

ALTER TABLE orders ADD INDEX idx_custkey_orderdate_totalprice(o_custkey,o_orderdate,o_totalprice);

And then observe the execution plan again using the EXPLAIN command:

```sql
EXPLAIN 

SELECT o_custkey,o_orderdate,o_totalprice 

FROM orders WHERE o_custkey = 147601\G

*************************** 1. row *************************** id: 1 select_type: SIMPLE table: orders partitions: NULL type: ref possible_keys: idx_custkey_orderdate,ORDERS_FK1,idx_custkey_orderdate_totalprice key: idx_custkey_orderdate_totalprice key_len: 4 ref: const rows: 19 filtered: 100.00 Extra: Using index

As can be seen, the optimizer selected the newly created composite index idx_custkey_orderdate_totalprice, and the Extra column is not NULL but shows “Using index”, indicating that the optimizer used index covering technique.

By observing the cost of executing the SQL again, a significant decrease in cost can be seen, from 6.65 to 2.94:

EXPLAIN FORMAT=tree 

SELECT o_custkey,o_orderdate,o_totalprice 

FROM orders WHERE o_custkey = 147601\G

*************************** 1. row *************************** EXPLAIN: -> Index lookup on orders using idx_custkey_orderdate_totalprice (O_CUSTKEY=147601) (cost=2.94 rows=19)

Let’s see the result of this SQL query:

SELECT o_custkey,o_orderdate,o_totalprice 

FROM orders 

WHERE o_custkey = 147601;

+———–+————-+————–+ | o_custkey | o_orderdate | o_totalprice | +———–+————-+————–+ | 147601 | 1992-05-11 | 109262.70 | | 147601 | 1992-05-20 | 4419.68 | | 147601 | 1993-01-14 | 208550.55 | | 147601 | 1993-07-12 | 309815.22 | | 147601 | 1993-10-15 | 60391.27 | | 147601 | 1994-04-25 | 145497.64 | | 147601 | 1994-08-11 | 130362.83 | | 147601 | 1994-11-11 | 85054.05 | | 147601 | 1994-12-05 | 223393.31 | | 147601 | 1995-03-28 | 220137.39 | | 147601 | 1995-10-05 | 126002.46 | | 147601 | 1996-01-02 | 191792.06 | | 147601 | 1996-02-02 | 180388.11 | | 147601 | 1996-04-13 | 18960.24 | | 147601 | 1996-10-09 | 294150.71 | | 147601 | 1997-01-22 | 19440.08 | | 147601 | 1997-02-18 | 75159.87 | | 147601 | 1997-10-01 | 214565.88 | | 147601 | 1998-02-16 | 131378.46 | +———–+————-+————–+ 19 rows in set (0.00 sec)

It can be seen that the execution returned 19 rows in total. This means that before using index covering technique, this SQL would require a total of 19 table lookups, each time retrieving data from the secondary index and then obtaining the field o_totalprice through the primary key.

After using index covering technique, no table lookups are required, reducing the overhead of 19 lookups.

If you want to see the power of index covering technique, you can execute the following SQL:

SELECT o_custkey,SUM(o_totalprice) 

FROM orders GROUP BY o_custkey;

This SQL represents returning the total amount of each customer’s orders, and the business side can use this result to label customers, select high-value customers, VIP customers, etc.

First, let’s make the created composite index idx_custkey_orderdate_totalprice invisible, and then check the original execution plan:

ALTER TABLE orders 
ALTER INDEX idx_custkey_orderdate_totalprice INVISIBLE;

EXPLAIN SELECT o_custkey,SUM(o_totalprice) 

FROM orders GROUP BY o_custkey

*************************** 1. row ***************************

           id: 1

  select_type: SIMPLE

        table: orders

   partitions: NULL

         type: index

possible_keys:

idx_custkey_orderdate,ORDERS_FK1

          key: ORDERS_FK1

      key_len: 4

          ref: NULL

         rows: 5778755

     filtered: 100.00

        Extra: NULL

EXPLAIN FORMAT=tree 

SELECT o_custkey,SUM(o_totalprice) 

FROM orders GROUP BY o_custkey\G

*************************** 1. row ***************************

EXPLAIN: -> Group aggregate: sum(orders.O_TOTALPRICE)

    -> Index scan on orders using ORDERS_FK1  (cost=590131.50 rows=5778755)

As we can see, this SQL optimization selects the index ORDERS_FK1. However, since this index does not include the field o_totalprice, a table lookup is required, indicating that approximately 5,778,755 table lookups are estimated according to the rows parameter.

Moreover,judging from the FORMAT=tree statement, we can see that the execution cost of this SQL statement is 590,131.5, which is significantly higher compared to the previous individual data table lookup.

Therefore, executing this GROUP BY SQL statement would take around 12.35 seconds.

SELECT o_custkey,SUM(o_totalprice) 

FROM orders GROUP BY o_custkey;

...

399987 rows in set (12.35 sec)

Now let’s compare the execution plan after enabling the index coverage technique:

ALTER TABLE orders 

ALTER INDEX idx_custkey_orderdate_totalprice VISIBLE;

EXPLAIN SELECT o_custkey,SUM(o_totalprice) 

FROM orders GROUP BY o_custkey\G

*************************** 1. row ***************************

           id: 1

  select_type: SIMPLE

        table: orders

   partitions: NULL

         type: index

possible_keys:

idx_custkey_orderdate,ORDERS_FK1,idx_custkey_orderdate_totalprice

          key: idx_custkey_orderdate_totalprice

      key_len: 14

          ref: NULL

         rows: 5778755

     filtered: 100.00

        Extra: Using index

1 row in set, 1 warning (0.00 sec)

This time, the execution plan uses the composite index idx_custkey_orderdate_totalprice and is indicated as using index coverage technique through the “Using index” message.

SELECT o_custkey,SUM(o_totalprice) 

FROM orders GROUP BY o_custkey;

...

399987 rows in set (1.04 sec)

Executing the above SQL statement again, we can see that the execution time has decreased from 12.35 seconds to 1.04 seconds, resulting in a more than 10-fold improvement in SQL performance.

This is the power of index coverage technique, and this is only based on a total of 6 million records in the orders table. The more records there are in the orders table, the more table lookups are required, and the more significant the performance improvement through index coverage technique.

Summary #

In this lecture, I have introduced composite indexes based on previous lectures on indexes.

Composite indexes are also B+ trees, but their columns consist of multiple parts. Composite indexes can be both primary key indexes and secondary indexes. Through today’s learning, we can summarize the three advantages of composite indexes:

  1. Cover multiple query conditions, such as (a, b) index that can cover queries like a = ? or a = ? and b = ?;
  2. Avoid additional sorting in SQL, improving SQL performance, such as queries with conditions like WHERE a = ? ORDER BY b;
  3. Utilize the feature of composite indexes that include multiple columns to achieve index coverage technique, improving SQL query performance. With proper use of index coverage technique, a performance improvement of more than 10 times is achievable.