09 Indexing Organizes Everything

09 Indexing - Organizes Everything #

In the previous lesson, I have introduced you to the basic concept of B+ tree index and how to manage it in MySQL. In order to further understand the specific use of B+ tree index in MySQL, in this lesson I would like to talk to you about the index structure of MySQL InnoDB storage engine.

InnoDB storage engine is the most widely used engine in MySQL database. In massive concurrent OLTP (Online Transaction Processing) businesses, InnoDB is a must. It has a very significant feature in data storage: Index Organized Table.

Now, let me introduce you to the most fundamental concept: Index Organized Table. I hope that after you have studied today’s content, you will be able to understand how MySQL stores data and index objects.

Index Organized Table #

There are two ways to store data: Heap Tables and Index Organized Tables.

In Heap Tables, the data is stored in no particular order, and the sorting of data is completely dependent on indexes (heap table storage structure was the default support for Oracle, Microsoft SQL Server, and PostgreSQL in the early days).

From the diagram, you can see that in the organization structure of heap tables, data and indexes are stored separately. The index is the sorted data, and the data in the heap table is unordered. The leaf nodes of the index store the addresses of the data in the heap table. When the data in the heap table changes and its position changes, all the addresses in the indexes need to be updated. This severely impacts performance, especially for OLTP businesses.

On the other hand, in Index Organized Tables, the data is stored in the index in the order of the primary key, which is also called a clustered index. In Index Organized Tables, data is index, and index is data.

MySQL InnoDB storage engine uses this data organization method; Oracle and Microsoft SQL Server also introduced support for index organized tables in later versions.

However, PostgreSQL database only supports heap table storage, which is not suitable for the access characteristics of OLTP. Although PostgreSQL has made some optimizations to heap tables in later versions, it is essentially trading space for time, and its support for massive concurrent OLTP businesses still has limitations.

Looking back at the User table in Lesson 08, it is organized as an index organized table:

The primary key of the table User is id, so the data in the table is stored in the order of id. The leaf nodes store the complete records of the table. You can see that the data in the table is stored in the index, which means the table is the index, and the index is the table.

After understanding the primary key index storage method of MySQL InnoDB, let’s continue to learn about secondary indexes.

Secondary Index #

Data in the InnoDB storage engine is stored in the order of the primary key index. In addition to the primary key index, all other indexes are called secondary indexes, or non-clustered indexes.

A secondary index is also a B+ tree index, but the leaf nodes store the index key values and primary key values, which is different from the primary key index. For the table User created in Lesson 08, let’s assume that an index idx_name is also created on the column name. This index is a secondary index:

CREATE TABLE User (

    id BIGINT AUTO_INCREMENT,

    name VARCHAR(128) NOT NULL,

    sex CHAR(6) NOT NULL,

    registerDate DATETIME NOT NULL,

    ...

    PRIMARY KEY (id), -- primary key index

    KEY idx_name (name) -- secondary index

)

If the user queries the table based on the column name, such as the following SQL:

SELECT * FROM User WHERE name = 'David',

Using the secondary index idx_name can only locate the primary key values. Additional queries need to be performed using the primary key index in order to obtain the final result. This operation of “querying the secondary index and then querying the primary key index again” is called “index lookup”, as shown in the following diagram:

The design of the secondary index in an index organized table has a significant advantage: if a record is modified, other indexes do not need to be maintained unless the primary key of the record is modified.

When comparing it with the index implementation of heap tables, you will find that in scenarios with a large number of changes, the index organized table has a noticeable performance advantage, as most of the time there is no need to maintain other secondary indexes.

I emphasized earlier that “in index organized tables, data is index, and index is data”. In order to understand the secondary index more easily, you can consider it as another table. For example, the index idx_name can be understood as a table shown below:

CREATE TABLE idx_name (

    name VARCHAR(128) NOT NULL,

    id BIGINT NOT NULL,

    PRIMARY KEY (name, id)

)

The SQL query based on the name can be understood as split into two steps:

SELECT id FROM idx_name WHERE name = ?

SELECT * FROM User WHERE id = _id; -- index lookup

When inserting data, you can consider it as a transaction operation on the primary key index table and the secondary index table. Either both succeed or both fail:

START TRANSACTION;

INSERT INTO User VALUES (...) -- primary key index

INSERT INTO idx_name VALUES (...) -- secondary index
COMMIT;

Of course, for indexes, unique constraints can also be added. Indexes with unique constraints are called unique indexes, which are also referred to as secondary indexes.

For the table User, the column name should have a unique constraint because it is usually required for user registration to have a unique nickname. Therefore, the definition of the User table is updated as follows:

CREATE TABLE User (
    id BIGINT AUTO_INCREMENT,
    name VARCHAR(128) NOT NULL,
    sex CHAR(6) NOT NULL,
    registerDate DATETIME NOT NULL,
    ...
    PRIMARY KEY(id), -- Primary key index
    UNIQUE KEY idx_name(name) -- Secondary index
)

So how should we understand unique indexes for tables? In fact, we can think of constraints as a table or an index. Therefore, the unique index idx_name should be understood as:

CREATE TABLE idx_name (
    name VARCHAR(128) NOT NULL,
    id BIGINT NOT NULL,
    PRIMARY KEY(name,id)
) -- Secondary index

CREATE TABLE check_idx_name (
    name VARCHAR(128),
    PRIMARY KEY(name)
) -- Unique constraint

Now you should understand, right? In an index-organized table, everything is an index, and indexes are data, and data is an index.

Finally, to deepen your understanding of index-organized tables, let’s review the implementation of a heap table.

All indexes in a heap table are secondary indexes, even the primary key index, which means it does not have a clustered index, and each index query requires a lookup to the table. Meanwhile, all records in a heap table are stored in the data file in an unordered manner. For high-concurrency OLTP (Online Transaction Processing) business on the Internet, the implementation of heap tables is indeed “outdated”.

The above is the content about secondary indexes.

Some students may ask: It seems that there is not much difference between secondary indexes and primary key indexes. In order to further understand the overhead of secondary indexes, let’s study the performance evaluation of secondary indexes together.

After learning this part, I hope you can understand why secondary indexes are usually slower than primary key indexes.

Performance Evaluation of Secondary Indexes #

When designing the primary key, you can choose a comparison order, such as incrementing integers, incrementing UUIDs, etc., so the sorting efficiency and insertion performance of primary key indexes are relatively high. However, secondary indexes are different. They may have sequential insertions or completely random insertions. How exactly? Let’s take a look at a table User that is closer to the business:

CREATE TABLE User (
    id  BINARY(16) NOT NULL,
    name VARCHAR(255) NOT NULL,
    sex CHAR(1) NOT NULL,
    password VARCHAR(1024) NOT NULL,
    money BIG INT NOT NULL DEFAULT 0,
    register_date DATETIME(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6),
    last_modify_date DATETIME(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6) ON UPDATE CURRENT_TIMESTAMP(6),
    uuid CHAR(36) AS (BIN_TO_UUID(id)),
    CHECK (sex = 'M' OR sex = 'F'),
    CHECK (IS_UUID(uuid)),
    PRIMARY KEY(id),
    UNIQUE KEY idx_name(name),
    KEY idx_register_date(register_date),
    KEY idx_last_modify_date(last_modify_date)
);

As you can see, the table User has three secondary indexes: idx_name, idx_register_date, and idx_last_modify_date.

Usually, it is impossible for the business to require the registration nickname of users to be sequential, so the insertion of the idx_name index is random, which incurs relatively high performance overhead. In addition, the user nickname is usually updatable, but for performance considerations, the business can restrict the number of nickname updates per day or even per year, such as allowing one update per day or three updates per year. The registration time of users is in sequential order, so the performance cost of the index idx_register_date is relatively small. Additionally, once the registration time is inserted, it will not be updated and is only used to identify a registration time.

Regarding idx_last_modify_date, I emphasized in Lecture 03 that in the table structure design of a real business, you must create a column last_modify_date for each core business table to indicate the modification time of each record.

The insertion of the index idx_last_modify_date is similar to idx_register_date in that it is in sequential order. However, the difference is that the index idx_last_modify_date will have frequent update operations. For example, user consumption leads to balance modification and the money field update, which will result in the update of the secondary index.

Since each secondary index includes the primary key value, queries are performed by using the primary key value to perform a callback. Therefore, when designing the table structure, it is recommended to make the primary key value as compact as possible in order to improve the performance of the secondary index. I recommended the design of a 16-byte sequential UUID column in Lecture 05, which is the best practice for performance and storage.

In addition, in actual core business scenarios, developers are likely to design primary keys with business attributes. However, please bear in mind the following two design principles:

Sequential comparison order is friendly to clustered indexes’ performance.
Compactness is friendly to the performance and storage of secondary indexes.

Function Indexes #

So far, our indexes have been created on columns. Starting from MySQL 5.7, MySQL began to support the creation of function indexes (i.e., the index key is a function expression). Function indexes have two main purposes:

Optimize the performance of business SQL queries.
Work with virtual columns (generated columns).

Let’s first look at the first benefit, optimizing the performance of business SQL queries.

We know that not every developer can have a deep understanding of index principles. Sometimes, their table structure design and SQL statement writing may contain “errors.” For example, regarding the User table mentioned above, if some developers want to query users who registered in January 2021, they may mistakenly write the SQL as shown below:

SELECT * FROM User 
WHERE DATE_FORMAT(register_date,'%Y-%m') = '2021-01'

Perhaps the developers think that since an index has been created on register_date, the index can be used for all SQL queries. However, the essence of an index is sorting. The idx_register_date index only sorts the data based on register_date, and it does not sort based on DATE_FORMAT(register_date). Therefore, the above SQL cannot utilize the secondary index idx_register_date.

The database specification requires that functions should be written on the right side of the equation in the query condition, not on the left side. This is the reason.

By using the EXPLAIN command to view the execution plan of the above SQL, we can intuitively see that the index idx_register_date is not being used:

EXPLAIN SELECT * FROM User
WHERE DATE_FORMAT(register_date,'%Y-%m') = '2021-01'

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: User
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 1
     filtered: 100.00
        Extra: Using where

The correct SQL syntax for the above requirement should be as follows, with the change in the second line where the function DATE_FORMAT is combined into a range query:

EXPLAIN SELECT * FROM User
WHERE register_date > '2021-01-01' 
AND register_date < '2021-02-01'

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: User
   partitions: NULL
         type: range
possible_keys: idx_register_date
          key: idx_register_date
      key_len: 8
          ref: NULL
         rows: 1

    filtered: 100.00

    Extra: Using index condition

If the online business is not written correctly in SQL, it may cause the database to have many slow query SQL, resulting in slow business or even avalanche scenarios. To solve this problem as soon as possible, you can use function indexes to create an index on DATE_FORMAT(register_date), which allows you to quickly locate the sorted data:

ALTER TABLE User

ADD INDEX

idx_func_register_date((DATE_FORMAT(register_date,’%Y-%m’)));

Next, use the EXPLAIN command to view the execution plan, and you will find that SQL can use the newly created index idx_func_register_date:

EXPLAIN SELECT * FROM User

WHERE DATE_FORMAT(register_date,’%Y-%m’) = ‘2021-01’

*************************** 1. row ***************************

  id: 1

select_type: SIMPLE

    table: User

partitions: NULL

     type: ref

possible_keys: idx_func_register_date

      key: idx_func_register_date

  key_len: 31

      ref: const

     rows: 1

 filtered: 100.00

    Extra: NULL

The function index created above can solve the urgent problem of online business performance, but it is strongly recommended that business developers optimize the SQL in the next version; otherwise, it will result in two indexes on the same data. Indexing requires sorting, and excessive sorting will affect performance.

The second major use of function indexes is in conjunction with virtual columns.

In the previous section on JSON, we have created the table UserLogin:

CREATE TABLE UserLogin (

userId BIGINT,

loginInfo JSON,

cellphone VARCHAR(255) AS (loginInfo->>"$.cellphone"),

PRIMARY KEY(userId),

UNIQUE KEY idx_cellphone(cellphone)

);

The column cellphone is a virtual column, which is calculated by a function expression behind it. This column itself does not occupy any storage space, and the index idx_cellphone is actually a function index. The advantage of doing this is that you can directly use this virtual column in SQL without writing lengthy functions:

– Without virtual column

SELECT * FROM UserLogin

WHERE loginInfo-»"$.cellphone" = ‘13918888888’

– Using virtual column

SELECT * FROM UserLogin

WHERE cellphone = ‘13918888888’

For crawler-based businesses, we first crawl a lot of data from the Internet, some of which are the data we care about, and some are irrelevant. Using virtual column technology, we can display the part of the data that we want, and then create an index on the virtual column to enable fast access and search of the crawled data.

Summary #

In this lesson, we provided a more in-depth introduction to the index section from the previous session. You should now understand that MySQL InnoDB storage engine is an index-organized table, and the differences between index-organized tables and heap tables. To summarize:

The primary key in an index-organized table is a clustered index, and the leaf nodes of the index store the entire row of the table.
Except for the primary key index, all other indexes are secondary indexes, and the leaf nodes of the index store (index key, primary key value).
Since secondary indexes do not store complete records, an additional retrieval is required through the primary key value to locate the complete data.
Compared to heap tables, index-organized tables have better performance in highly concurrent OLTP business scenarios.
The performance overhead on secondary indexes varies for different types of data.
Sometimes, performance issues in SQL can be quickly resolved by using function indexes.
Virtual columns do not occupy actual storage space, and creating indexes on virtual columns is essentially creating function indexes.