35 How to Optimize Join Statements #

In the previous article, I introduced two algorithms for join statements, Index Nested-Loop Join (NLJ) and Block Nested-Loop Join (BNL).

We found that the NLJ algorithm is actually quite effective: it is more convenient than splitting the query into multiple statements at the application layer and then concatenating the results, and it performs no worse.

However, for large-table joins, the BNL algorithm performs much worse: the number of comparisons equals the product of the row counts of the two tables participating in the join, which consumes a lot of CPU resources.

Of course, there is still room for optimization in both algorithms. Today, let’s talk about this topic.

To facilitate analysis, I will create two tables, t1 and t2, to discuss the problem at hand.

create table t1(id int primary key, a int, b int, index(a));
create table t2 like t1;
drop procedure if exists idata;
delimiter ;;
create procedure idata()
begin
  declare i int;

  -- t1: 1,000 rows, with a = 1001 - id, so a is in reverse order of id
  set i=1;
  while(i<=1000)do
    insert into t1 values(i, 1001-i, i);
    set i=i+1;
  end while;

  -- t2: 1,000,000 rows, with a = b = id
  set i=1;
  while(i<=1000000)do
    insert into t2 values(i, i, i);
    set i=i+1;
  end while;
end;;
delimiter ;
call idata();
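
A small aside: with autocommit on (the MySQL default), each of the million inserts above commits individually, which can make idata() quite slow. If you are rebuilding the data yourself, a common speed-up is to wrap the call in a single transaction:

-- Optional speed-up: run the whole data load as one transaction
-- instead of one implicit commit per insert (assumes autocommit=1)
begin;
call idata();
commit;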

To facilitate quantification later, I inserted 1000 rows of data into table t1, with the value of a being 1001 minus the id. In other words, the field a in table t1 is in reverse order. At the same time, I inserted 1 million rows of data into table t2.

Multi-Range Read Optimization #

Before introducing optimization solutions for the join statement, I need to introduce a concept called Multi-Range Read (MRR) optimization. The main purpose of this optimization is to use sequential disk reads as much as possible.

In the fourth article, when I introduced InnoDB's index structure, I mentioned the concept of the "fetch" (going back to the primary key index for the full row). Let's first review it: a fetch is the process in which InnoDB, after finding the primary key id values via the secondary index a, retrieves the complete row from the primary key index for each individual id.

Then, someone asked in the comments section whether the fetch process retrieves data one row at a time or in batches.

Let’s take a look at this issue. Let’s assume that I execute the following statement:

select * from t1 where a>=1 and a<=100;

The primary key index is a B+ tree, and on this tree, each primary key id can only be used to find one row of data. Therefore, the fetch process certainly retrieves data one row at a time as it searches the primary key index. The basic process is demonstrated in Figure 1.

img

Figure 1 Basic fetch process

If the values of a are queried in increasing order, the values of id become random, resulting in random access, which is relatively poor in performance. Although we cannot change the “fetch” mechanism of retrieving data row by row, we can still speed up the process by adjusting the query order.

Since most data is inserted in increasing order of the primary key, we can assume that if we query in ascending order of the primary key, the disk read is closer to sequential read, thus improving read performance.

This is the design idea behind MRR optimization. At this point, the execution process of the statement is as follows:

  1. Based on the index a, locate the records that satisfy the condition and put the id values into read_rnd_buffer.
  2. Sort the id values in read_rnd_buffer in ascending order.
  3. Use the sorted id array to retrieve records from the primary key id index one by one and return them as the result.

Here, the size of read_rnd_buffer is controlled by the read_rnd_buffer_size parameter. If read_rnd_buffer fills up during step 1, steps 2 and 3 run immediately, read_rnd_buffer is cleared, and the scan then continues from the next record satisfying the condition on index a. Also note that if you want MRR optimization to be used consistently, you need to set optimizer_switch="mrr_cost_based=off". (According to the official documentation, the optimizer's current cost-based strategy tends to avoid MRR; setting mrr_cost_based to off forces MRR to be used.)
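
Putting this together, here is a minimal sketch of how you might force and observe MRR for our example query (the buffer size below is illustrative):

-- Force MRR regardless of the optimizer's cost estimate
set optimizer_switch="mrr_cost_based=off";
-- Optionally enlarge the buffer that collects and sorts the ids (default 256KB)
set read_rnd_buffer_size = 4 * 1024 * 1024;
explain select * from t1 where a>=1 and a<=100;
-- The Extra column should now include "Using MRR"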

The following two figures show the execution process and the explain result after applying MRR optimization.

img

Figure 2 MRR execution process

img

Figure 3 MRR execution process explain result

From the explain result in Figure 3, we can see that the Extra field includes Using MRR, which indicates that MRR optimization was used. Moreover, since we sorted by id in read_rnd_buffer, the result set is returned in ascending order of primary key id, which is the reverse of the result-set order in Figure 1.

So, let’s summarize.

The core of MRR's performance gain is that the query on index a is a range query (that is, a multi-value query), so it can collect enough primary key ids; only by sorting them first and then fetching from the primary key index can the advantage of "sequential" reads be realized.

Batched Key Access #

Now that we understand how MRR improves performance, we can look at the Batched Key Access (BKA) algorithm, introduced in MySQL 5.6. The BKA algorithm is in fact an optimization of the NLJ algorithm.

Let’s take a look at the flowchart of the NLJ algorithm used in the previous article:

img

Figure 4 Index Nested-Loop Join process diagram

The logic of the NLJ algorithm is to take values of a from the driving table t1 one row at a time and join each against the driven table t2. In other words, for table t2, only one value is matched at a time, so the advantage of MRR cannot be exploited.

So how can we pass multiple values to table t2 all at once? The method is to retrieve a portion of the data from table t1 and temporarily store it in memory. And this temporary memory is none other than the join_buffer.

From the previous article, we know the role of the join_buffer in the BNL algorithm, which is to temporarily store the data of the driving table. However, in the NLJ algorithm, join_buffer is not used. Now we can reuse join_buffer in the BKA algorithm.

Figure 5 shows the flow of the BKA algorithm, which is NLJ with this optimization applied.

img

Figure 5 Batched Key Access process

In the figure, the data I put into the join_buffer is P1~P100, indicating that only the fields needed by the query are stored. If the join buffer cannot hold all of P1~P100, these 100 rows are split into multiple chunks, each going through the process shown in the figure.

So, how do we enable this BKA algorithm?

If you want to use the BKA optimization algorithm, you need to set before executing the SQL statement:

set optimizer_switch='mrr=on,mrr_cost_based=off,batched_key_access=on';

The first two flags enable MRR; this is because the BKA optimization depends on MRR.
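
To confirm that BKA actually takes effect, you can check the explain output. Here is a sketch using our tables, with the join on the indexed field a:

set optimizer_switch='mrr=on,mrr_cost_based=off,batched_key_access=on';
explain select * from t1 straight_join t2 on (t1.a=t2.a);
-- When BKA is chosen, the Extra column for the driven table t2
-- typically shows "Using join buffer (Batched Key Access)"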

Performance Issues of BNL Algorithm #

After talking about the optimization of the NLJ algorithm, let’s take a look at the optimization of the BNL algorithm.

At the end of the previous article, the question I left for you was that when using the Block Nested-Loop Join (BNL) algorithm, multiple scans may be performed on the driven table. If this driven table is a large cold data table, besides causing high IO pressure, what other impacts will it have on the system?

In [Article 33], when we talked about InnoDB's LRU algorithm, we mentioned that InnoDB optimizes the LRU algorithm of the Buffer Pool: when a data page is read from disk for the first time, it is placed in the old area, and it is moved to the head of the LRU list only if it is accessed again more than 1 second later. This way, pages read by one-off scans have little impact on the Buffer Pool hit rate.
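
As a side note, both the old-area ratio and the 1-second window are configurable. On a default installation they look like this:

show variables like 'innodb_old_blocks%';
-- innodb_old_blocks_pct: 37     (the old area is about 3/8 of the Buffer Pool)
-- innodb_old_blocks_time: 1000  (ms a page must survive before it can be promoted)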

However, if a join statement using the BNL algorithm scans a cold table multiple times, and the execution time of this statement exceeds 1 second, when scanning the cold table again, the data pages of the cold table will be moved to the head of the LRU linked list.

This situation corresponds to the case where the amount of cold table data is less than 3/8 of the entire Buffer Pool and can be fully placed in the old area.

If this cold table is large, another situation will occur: the data pages that are normally accessed have no chance to enter the young area.

Due to this optimization mechanism, a normally accessed data page must be accessed again more than 1 second after it was first read in order to enter the young area. But because our join statement keeps reading from disk and evicting memory pages in a loop, pages that enter the old area are likely to be evicted within 1 second. This prevents the young area of this MySQL instance's Buffer Pool from being updated sensibly during this period.

That is to say, both of these situations affect the normal operation of the Buffer Pool. Although a large-table join also impacts IO, that impact ends once the statement finishes executing; the impact on the Buffer Pool, by contrast, persists, and the memory hit rate can only be restored gradually by subsequent query requests.

To reduce this impact, you can consider increasing the value of join_buffer_size to reduce the number of scans on the driven table.
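
For example (the size below is illustrative; pick a value large enough to hold the needed columns of the driving table):

-- Enlarge the session join buffer so the driving table fits in fewer chunks,
-- which in turn reduces the number of full scans of the driven table
set session join_buffer_size = 16 * 1024 * 1024;  -- 16MB; the default is 256KB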

In other words, the impact of the BNL algorithm on the system mainly includes three aspects:

  1. It may scan the driven table multiple times, occupying disk IO resources.
  2. Checking the join condition requires executing M*N comparisons (M and N are the number of rows in the two tables), which can consume a lot of CPU resources for large tables.
  3. It may cause the eviction of hot data from the Buffer Pool, affecting the memory hit rate.

Before executing such a statement, we should first confirm whether the BNL algorithm will be used, both through analysis and by examining the explain result. If the optimizer will indeed use BNL, the query needs optimizing; the common practice is to add an index on the driven table's join field, converting BNL to BKA.

Next, let’s take a closer look at how to perform this optimization.

Conversion from BNL to BKA #

In some cases, we can directly create an index on the driven table, which allows us to convert to the BKA algorithm directly.
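
For our tables, a minimal sketch of that direct fix looks like this (b being the join field in the query discussed below):

-- Index the join field on the driven table so the optimizer can switch to BKA
alter table t2 add index(b);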

However, sometimes you may encounter situations where creating an index on the driven table is not suitable. For example, consider the following query:

select * from t1 join t2 on (t1.b=t2.b) where t2.b>=1 and t2.b<=2000;

At the beginning of the article, we inserted 1 million rows of data into table t2. However, after filtering with the WHERE condition, only 2000 rows of data need to participate in the join. If this statement is also a low-frequency SQL query, creating an index on field b of table t2 would be a waste.

However, if we use the BNL algorithm for the join, the execution process of this statement is as follows:

  1. Retrieve all fields from table t1 and store them in the join_buffer. The table has only 1,000 rows, and the default join_buffer_size is 256KB, so they fit entirely.
  2. Scan table t2 and compare each row with the data in the join_buffer:
    • If t1.b is not equal to t2.b, skip;
    • If t1.b is equal to t2.b, further check the other conditions, i.e., whether t2.b falls within the range [1,2000]. If it does, include it in the result set; otherwise, skip.

As I mentioned in the previous article, for each row of table t2, we need to traverse all rows in the join_buffer to check if the join condition is satisfied. Therefore, the number of equality condition checks is 1000 * 1 million = 1 billion, which is a significant workload.
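
If you want to reproduce this yourself, the explain statement below corresponds to Figure 6; with BNL chosen, the Extra column should show "Using join buffer (Block Nested Loop)":

explain select * from t1 join t2 on (t1.b=t2.b) where t2.b>=1 and t2.b<=2000;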

img

Figure 6 explain result

img

Figure 7 statement execution time

As you can see from the explain result, the Extra field indicates that the BNL algorithm is used. In my test environment, this statement takes 1 minute and 11 seconds to execute.

Creating an index on field b of table t2 would waste resources, but not creating an index means that the equality condition of this statement needs to be checked 1 billion times, which is also wasteful. So, is there a way to achieve the best of both worlds?

At this time, we can consider using a temporary table. The general idea of using a temporary table is as follows:

  1. Put the rows of table t2 that satisfy the condition into a temporary table temp_t.
  2. To let the join use the BKA algorithm, add an index on field b of the temporary table temp_t.
  3. Join table t1 with temp_t.

The corresponding SQL statement is as follows:

create temporary table temp_t(id int primary key, a int, b int, index(b)) engine=innodb;
insert into temp_t select * from t2 where b>=1 and b<=2000;
select * from t1 join temp_t on (t1.b=temp_t.b);

Figure 8 shows the execution effect of this sequence of statements.

img

Figure 8 Execution effect of using the temporary table

The total execution time of these three statements is under 1 second, compared with the previous 1 minute and 11 seconds, a significant performance improvement. Now, let's look at the cost of this process:

  1. During the execution of the insert statement to construct the temp_t table and insert data, a full table scan was performed on table t2, scanning 1 million rows.
  2. In the subsequent join statement, table t1 was scanned (1,000 rows), and during the join 1,000 indexed lookups were performed on temp_t. Compared with the 1 billion condition checks required by the unoptimized join, this is a dramatic saving.

Overall, whether we add an index to the original table or use a temporary table with an index, the idea is the same: let the join statement use an index on the driven table to trigger the BKA algorithm, improving query performance.

Extension - Hash Join #

At this point, you may have noticed that performing 1 billion comparisons seems rather wasteful. If the join_buffer maintained a hash table instead of an unordered array, those 1 billion checks would shrink to 1 million hash lookups, and the whole statement would run much faster, right?

Indeed, that is the case.

This is also one of the reasons MySQL's optimizer and executor have been criticized: they lack support for hash join, and MySQL's official roadmap did not prioritize this optimization for a long time.

In fact, we can implement this optimization ourselves on the application side. The implementation process is roughly as follows:

  1. Execute select * from t1; to obtain all 1,000 rows of table t1 and store them in a hash structure on the application side, such as a set in C++ or an array in PHP.
  2. Execute select * from t2 where b>=1 and b<=2000; to obtain the 2,000 rows of table t2 that satisfy the condition.
  3. Fetch these 2,000 rows into the application one at a time and probe the hash structure with each row. Rows that match are added to the result set.

Theoretically, this process should be faster than the temporary table solution. If you are interested, you can verify it yourself.

Summary #

Today, I shared with you the optimization methods of Index Nested-Loop Join (NLJ) and Block Nested-Loop Join (BNL).

In these optimization methods:

  1. BKA optimization is built into MySQL, and I recommend using it by default.
  2. The BNL algorithm is inefficient, so it is recommended to convert it to the BKA algorithm whenever possible. The optimization direction is to add an index to the joining fields of the driven table.
  3. The temporary-table-based improvement works well for join statements that can filter the data down to a small set in advance.
  4. Currently, MySQL does not support hash join, but you can simulate it on the application side, and theoretically, the effect should be better than the temporary table solution.

Lastly, I’ll leave you with a question to ponder.

In both of the join statement articles we’ve discussed, we only dealt with joining two tables. Now, let’s consider a three-table join requirement, assuming that the table structures of these three tables are as follows:

CREATE TABLE `t1` (
 `id` int(11) NOT NULL,
 `a` int(11) DEFAULT NULL,
 `b` int(11) DEFAULT NULL,
 `c` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

create table t2 like t1;
create table t3 like t2;
insert into ... -- Initialize the data of the three tables

The requirement for the statement is to achieve the following join logic:

select * from t1 join t2 on(t1.a=t2.a) join t3 on (t2.b=t3.b) where t1.c>=X and t2.c>=Y and t3.c>=Z;

Now, in order to achieve the fastest execution speed, if you were asked to design the indexes on tables t1, t2, and t3 to support this join statement, what indexes would you add?

At the same time, if I asked you to rewrite this statement using straight_join and to fix the join order yourself, given the indexes you created, what factors would you consider?

You can write your solution and analysis in the comments, and I will discuss this question with you at the end of the next article. Thank you for listening, and feel free to share this article with more friends to read together.