11 if Indexing Fails, Understand the Working Principle of Cbo

11 If Indexing Fails, Understand the Working Principle of CBO #

In the previous three lessons, we learned about the principles of B+ tree indexes, the implementation of index-organized tables, and how to use composite indexes. I believe you have gained a certain understanding of using B+ tree indexes.

In actual work, I often encounter some students asking questions like this: MySQL did not select the index as expected. For example, an index was created but a full table scan was performed. This must be a bug in the MySQL database or an index error.

Of course not! This is mainly because there is an issue with the data in the index.

Why do I say that? To understand this problem, we need to understand how the optimizer in the MySQL database works, and then we can understand why the optimizer did not select the index you expected.

Next, let’s understand how MySQL selects indexes.

How does MySQL select indexes? #

In the previous table “orders”, three related indexes have been created for the “o_custkey” field. So the situation of the “orders” table is as follows:

 CREATE TABLE `orders` (

  `O_ORDERKEY` int NOT NULL,

  `O_CUSTKEY` int NOT NULL,

  `O_ORDERSTATUS` char(1) NOT NULL,

  `O_TOTALPRICE` decimal(15,2) NOT NULL,

  `O_ORDERDATE` date NOT NULL,

  `O_ORDERPRIORITY` char(15) NOT NULL,

  `O_CLERK` char(15) NOT NULL,

  `O_SHIPPRIORITY` int NOT NULL,

  `O_COMMENT` varchar(79) NOT NULL,

  PRIMARY KEY (`O_ORDERKEY`),

  KEY `idx_custkey_orderdate` (`O_CUSTKEY`,`O_ORDERDATE`),

  KEY `ORDERS_FK1` (`O_CUSTKEY`),

  KEY `idx_custkey_orderdate_totalprice` (`O_CUSTKEY`,`O_ORDERDATE`,`O_TOTALPRICE`),

  CONSTRAINT `orders_ibfk_1` FOREIGN KEY (`O_CUSTKEY`) REFERENCES `customer` (`C_CUSTKEY`)

) ENGINE=InnoDB

When querying the “o_custkey” field, in theory, any of the three related indexes can be used: ORDERS_FK1, idx_custkey_orderdate, and idx_custkey_orderdate_totalprice. How does the MySQL optimizer choose from these three indexes?

In a relational database, a B+ tree index is just a storage structure. How to use it depends on the optimizer of the database, which determines the specific index to be used, also known as the execution plan.

In MySQL, the cost of a SQL statement can be calculated as follows:

Cost  = Server Cost + Engine Cost

      = CPU Cost + IO Cost

Among them, CPU Cost represents the computational cost, such as the comparison of index key values, record value comparison, result set sorting, etc., all of which are done in the Server layer.

IO Cost represents the IO cost in the Engine layer. In MySQL 8.0, it can distinguish whether the data of a table is in memory and calculate the IO cost of reading from memory and the IO cost of reading from disk separately.

The tables “server_cost” and “engine_cost” in the “mysql” database record the calculation of various costs, such as:

MySQL Execution Process

As shown in the above figure, the MySQL database consists of the Server layer and the Engine layer:

The Server layer contains the SQL parser, SQL optimizer, and SQL executor, which are responsible for the specific execution process of SQL statements.
The Engine layer is responsible for storing the actual data, such as the most commonly used InnoDB storage engine, as well as the TempTable engine used to store temporary result sets in memory.

The SQL optimizer analyzes all possible execution plans and selects the one with the lowest cost to execute. This type of optimizer is called a Cost-based Optimizer (CBO).

In MySQL, the calculation of the cost of a SQL statement is as follows:

Cost  = Server Cost + Engine Cost

      = CPU Cost + IO Cost

The CPU Cost represents the computational overhead, such as the comparison of index key values, record comparison, and sorting of result sets. All of these operations are performed in the Server layer.

The IO Cost represents the IO overhead in the Engine layer. MySQL 8.0 can calculate the IO cost of reading from memory and the IO cost of reading from disk by distinguishing whether the data of a table is in memory.

The table “server_cost” records the cost of various operations performed by the Server layer optimizer, including all CPU costs. The specific meanings are as follows:

disk_temptable_create_cost: The cost of creating a disk-based temporary table, defaulting to 20.
disk_temptable_row_cost: The cost of each record in the disk-based temporary table, defaulting to 0.5.
key_compare_cost: The cost of comparing index key values, defaulting to 0.05, which is the smallest cost.
memory_temptable_create_cost: The cost of creating an in-memory temporary table, defaulting to 1.
memory_temptable_row_cost: The cost of each record in the in-memory temporary table, defaulting to 0.1.
row_evaluate_cost: The cost of comparing records, defaulting to 0.1.

**As you can see, **the MySQL optimizer believes that if a SQL statement needs to create a disk-based temporary table, the cost at this time is the highest, which is 20 times the cost of an in-memory temporary table. The overhead of comparing index key values and records is actually very low, but if the number of records to be compared is very large, the cost will become very high.

The table “engine_cost” records the cost of various operations performed by the storage engine layer, which includes all IO costs. The specific meanings are as follows:

io_block_read_cost: The cost of reading a page from disk, defaulting to 1.
memory_block_read_cost: The cost of reading a page from memory, defaulting to 0.25.

**In other words, **the MySQL optimizer believes that the cost of reading from disk is 4 times the cost of reading from memory.

However, all the costs mentioned above can be modified. For example, if the database is using traditional HDD disks with poor performance, and the random read performance is 50 times slower than reading from memory, you can modify the cost as follows:

INSERT INTO 

engine_cost(engine_name,device_type,cost_name,cost_value,last_update,comment) 

VALUES ('InnoDB',0,'io_block_read_cost',12.5,CURRENT_TIMESTAMP,'Using HDD for InnoDB');

FLUSH OPTIMIZER_COSTS;

Let’s take a look at the GROUP BY SQL statement from Lesson 10. We will use the command EXPLAIN FORMAT=json to view the values of various costs to help you further understand the working principle of optimization.

EXPLAIN FORMAT=json 

SELECT o_custkey,SUM(o_totalprice) 

FROM orders GROUP BY o_custkey

*************************** 1. row ***************************
EXPLAIN: {

“query_block”: {

"select_id": 1,

"cost_info": {

  "query_cost": "626899.50" # Total Cost

},

"grouping_operation": {

  "using_filesort": false,

  "table": {

    "table_name": "orders",

    "access_type": "index",

    "possible_keys": [

      "idx_custkey_orderdate",

      "ORDERS_FK1",

      "idx_custkey_orderdate_totalprice"

    ],

    "key": "idx_custkey_orderdate_totalprice",

    "used_key_parts": [

      "O_CUSTKEY",

      "O_ORDERDATE",

      "O_TOTALPRICE"

    ],

    "key_length": "14",

    "rows_examined_per_scan": 5778755,

    "rows_produced_per_join": 5778755,

    "filtered": "100.00",

    "using_index": true,

    "cost_info": {

      "read_cost": "49024.00", # IO Cost(Engine Cost)

      "eval_cost": "577875.50", # CPU Cost(Server Cost)

      "prefix_cost": "626899.50", # Total Cost

      "data_read_per_join": "2G" # Total read record byte size

    },

    "used_columns": [

      "O_ORDERKEY",

      "O_CUSTKEY",

      "O_TOTALPRICE"

    ]

  }

}

} }

Starting from line 33, where:

read_cost represents the cost of reading from the InnoDB storage engine;
eval_cost represents the CPU cost at the server level;
prefix_cost represents the total cost of this SQL;
data_read_per_join represents the total byte size of records read.

Knowing that MySQL index selection is based on the execution cost of the SQL, we can now analyze some index error cases.

Analysis of MySQL Index Error Cases #

Case 1: Failure to use the created index #

We often hear feedback from students that the MySQL optimizer is inaccurate, unstable, and constantly changing.

However, let me tell you that the MySQL optimizer always selects the most optimal execution plan based on cost. Even for the same SQL statement, if the range is different, the optimizer’s choice may vary.

Consider the following two SQL statements:

SELECT * FROM orders

WHERE o_orderdate > '1994-01-01' and o_orderdate < '1994-12-31';

SELECT * FROM orders 

WHERE o_orderdate > '1994-02-01' and o_orderdate < '1994-12-31';

Both of these SQL statements query using the indexed field o_orderdate. However, the execution plan for the first SQL statement does not use the idx_orderdate index, but instead uses the following execution plan:

EXPLAIN SELECT * FROM orders 

WHERE o_orderdate > '1994-01-01' 

AND o_orderdate < '1994-12-31'\G

*************************** 1. row ***************************
         
           id: 1
           
  select_type: SIMPLE

           table: orders
    
       partitions: NULL
    
             type: ref
    
    possible_keys: idx_orderstatus
    
              key: idx_orderstatus
    
          key_len: 2
    
              ref: const
    
             rows: 1
    
         filtered: 100.00
    
            Extra: Using index condition
    

可以看到，优化器选择了索引 idx_orderstatus 来执行，因为我们只需要查询订单状态为支付中的订单，而这部分数据相对较少，通过索引查询更加高效。

虽然订单状态是一个低选择性的字段，但是根据数据分布情况和实际需求，仍然可以对其创建索引来提升查询效率。优化器会根据实际情况选择最优的执行计划。

table: orders

partitions: NULL

type: ALL

possible_keys: NULL

key: NULL

key_len: NULL

ref: NULL

rows: 5799601

filtered: 50.00

Extra: Using where


Because the field `o_orderstatus` only has three values, which are 'O', 'P', and 'F', MySQL does not know the distribution of these three values and assumes that they are evenly distributed. However, in reality, these three values are severely skewed:

SELECT o_orderstatus,count(1) FROM orders GROUP BY o_orderstatus;

+—————+———-+ | o_orderstatus | count(1) | +—————+———-+ | F | 2923619 | | O | 2923597 | | P | 152784 | +—————+———-+


Therefore, the optimizer will consider that orders with a status of P account for 1/3 of the data, and it will be more efficient to perform a full table scan instead of using a secondary index to avoid the overhead of index lookups.

However, due to data skew, there are very few orders with a status of P, and it will be more efficient to use the index `idx_orderstatus` for query optimization. In this case, we can utilize the histogram feature in MySQL 8.0 to create a histogram that provides the optimizer with the distribution of the data, allowing it to make better choices for execution plans. The command to create a histogram is as follows:

ANALYZE TABLE orders UPDATE HISTOGRAM ON o_orderstatus;


After creating the histogram, MySQL collects the value distribution of the field `o_orderstatus`, which can be queried using the following command:

SELECT v value, CONCAT(round((c - LAG(c, 1, 0) over()) * 100,1), ‘%’) ratio FROM information_schema.column_statistics, JSON_TABLE(histogram->’$.buckets’,’$[*]’ COLUMNS(v VARCHAR(60) PATH ‘$[0]’, c double PATH ‘$[1]’)) hist WHERE column_name = ‘o_orderstatus’;

+——-+——-+ | value | ratio | +——-+——-+ | F | 49% | | O | 48.5% | | P | 2.5% | +——-+——-+


Now MySQL knows that orders with a status of P only account for 2.5% of the data, so when querying orders with a status of P, it will use the index `idx_orderstatus`, as shown in the following example:

EXPLAIN SELECT * FROM orders WHERE o_orderstatus = ‘P’\G

*************************** 1. row *************************** id: 1 select_type: SIMPLE table: orders partitions: NULL type: ref possible_keys: idx_orderstatus key: idx_orderstatus key_len: 4 ref: const rows: 306212 filtered: 100.00 Extra: Using index condition


### Summary

In this lesson, we learned that MySQL optimizer is a cost-based optimizer (CBO), which evaluates the cost of executing each index and selects the optimal execution plan. To summarize:

- MySQL optimizer is a cost-based optimizer (CBO).
- MySQL selects the execution plan with the lowest cost, and you can use the EXPLAIN command to view the cost of each SQL statement.
- Generally, indexes should be created for high selectivity fields and field combinations, while low selectivity fields such as gender should not have indexes.
- If a field has low selectivity but the data is skewed, it may be beneficial to create an index to retrieve a small portion of the data.
- If there is data skew, histograms can be created to provide the optimizer with information about the data distribution in an index, enabling further calibration of the execution plan.