34 Can We Use Join at All

34 Can We Use join at All #

In practical production, there are generally two types of issues related to the use of join statements:

  1. Our DBA does not allow the use of join, what problems are there with using join?
  2. If there are two tables of different sizes to be joined, which table should be used as the driving table?

In this article, I will first explain how join statements are executed, and then answer these two questions.

To facilitate quantitative analysis, I will create two tables t1 and t2 to illustrate with you.

CREATE TABLE `t2` (
  `id` int(11) NOT NULL,
  `a` int(11) DEFAULT NULL,
  `b` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `a` (`a`)
) ENGINE=InnoDB;

drop procedure idata;
delimiter ;;
create procedure idata()
begin
  declare i int;
  set i=1;
  while(i<=1000)do
    insert into t2 values(i, i, i);
    set i=i+1;
  end while;
end;;
delimiter ;
call idata();

create table t1 like t2;
insert into t1 (select * from t2 where id<=100)

As you can see, both tables have a primary key index id and an index a, and there is no index on the field b. The stored procedure idata() inserts 1000 rows of data into table t2, and 100 rows of data into table t1.

Index Nested-Loop Join #

Let’s take a look at this statement:

select * from t1 straight_join t2 on (t1.a=t2.a);

If we use the join statement directly, the MySQL optimizer may choose table t1 or t2 as the driving table, which will affect the analysis of the execution process of the SQL statement. To facilitate the analysis of performance issues during the execution process, I will use straight_join to specify a fixed join method for MySQL. In this statement, t1 is the driving table and t2 is the driven table.

Now, let’s take a look at the explain result of this query.

![img](../images/4b9cb0e0b83618e01c9bfde44a0ea990.png)

Figure 1. Explain result of join using indexed fields

As you can see, in this statement, the driven table t2 has an index on the field a, and the join process utilizes this index. Therefore, the execution process of this statement is as follows:

  1. Read one row of data R from table t1.
  2. Take the value of the field a from the data row R and search for it in table t2.
  3. Retrieve the rows in table t2 that satisfy the condition and combine them with R to form a row, which becomes part of the result set.
  4. Repeat steps 1 to 3 until the end of table t1.

This process involves traversing table t1 first, and then searching for records in table t2 based on the values of a obtained from each row of data from table t1. In terms of form, this process is similar to nested queries when we write programs, and it can also utilize the index of the driven table. Therefore, we call this process “Index Nested-Loop Join”, abbreviated as NLJ.

The corresponding flowchart is shown below:

![img](../images/d83ad1cbd6118603be795b26d38f8df6.jpg)

Figure 2. Execution process of the Index Nested-Loop Join algorithm

In this process:

  1. A full table scan is performed on the driving table t1, which requires scanning 100 rows.
  2. For each row R, a tree search process is performed to search for table t2 based on the field a. Since the data we constructed is one-to-one, the search process only scans one row each time, totaling 100 rows.
  3. So, the total number of scanned rows in the entire execution process is 200.

Now that we understand this process, let’s try to answer the two questions posed at the beginning of the article.

Let’s start with the first question: Can we use join?

If we don’t use join, then we can only use single-table queries. Let’s see how we can achieve the same result with a single-table query.

  1. Execute select * from t1 to fetch all the data from table t1, which consists of 100 rows.
  2. Iterate through these 100 rows:
    • Retrieve the value of field a from each row R, denoted as $R.a.
    • Execute select * from t2 where a=$R.a.
    • Combine the returned results with R to form a row in the result set.

As we can see, in this querying process, 200 rows are still scanned, but a total of 101 queries are executed, which includes 100 additional interactions compared to using join. Additionally, the client needs to concatenate SQL statements and results by itself.

Clearly, it is better to use join in this case.

Let’s now address the second question: How do we choose the driving table?

In the execution process of this join statement, the driving table performs a full table scan, while the driven table performs tree-based searches.

Let’s assume the number of rows in the driven table is M. Each fetch of a row from the driven table requires searching both index a and the primary key index. The approximate complexity of searching in a tree is logarithmic as a base 2 of M, denoted as log2M. Therefore, the time complexity of fetching one row from the driven table is 2*log2M.

Let’s assume the number of rows in the driving table is N. The execution process involves scanning N rows from the driving table and performing one match for each row on the driven table.

Hence, the overall execution process has an approximate complexity of N + N2log2M.

Clearly, the impact of N on the number of scanned rows is greater, so we should choose the smaller table as the driving table.

If you don’t find the impact that obvious, you can understand it this way: If N is increased by a factor of 1000, the number of scanned rows will also increase by a factor of 1000, while if M is increased by a factor of 1000, the number of scanned rows will increase by less than a factor of 10.

To summarize the above analysis, we can draw the following conclusions:

  1. Using join statements provides better performance compared to forcefully breaking it into multiple single-table SQL statements.
  2. When using join statements, the smaller table should be chosen as the driving table.

However, please note that these conclusions are based on the premise that the driven table can use indexes.

Next, let’s consider the situation where the driven table cannot utilize indexes.

Simple Nested-Loop Join #

Now, let’s modify the SQL statement as follows:

select * from t1 straight_join t2 on (t1.a=t2.b);

Since the field b in table t2 does not have an index, if we continue to use the algorithm depicted in Figure 2, each time we try to match a row in t2, a full table scan must be performed.

You can imagine this problem for now. If we continue to use the algorithm depicted in Figure 2, can we still obtain the correct result? If we only consider the result, this algorithm is correct, and it also has a name: “Simple Nested-Loop Join.”

However, with this approach, the SQL query would need to scan table t2 for up to 100 times, resulting in a total scan of 100*1000=100,000 rows.

This is just for two small tables. If t1 and t2 both have 100,000 rows (which is still considered small), a total of 10 billion rows would need to be scanned. This algorithm seems too “heavy.”

Of course, MySQL does not use this Simple Nested-Loop Join algorithm. Instead, it uses another algorithm called “Block Nested-Loop Join,” or BNL for short.

Block Nested-Loop Join #

In this case, if the driven table does not have a usable index, the algorithm’s process is as follows:

  1. Read the data from table t1 into the thread-local memory join_buffer. Since the statement is select *, the entire table t1 is loaded into memory.
  2. Scan table t2 and compare each row from table t2 with the data in the join_buffer. If a row satisfies the join condition, it is included in the result set.

A flowchart of this process is illustrated in the following figure:

img

Figure 3: Process flow of the Block Nested-Loop Join algorithm In accordance, the explain result of this SQL statement is as follows:

img

Figure 4: explain result of join without using indexed fields

From the results, we can see that both tables t1 and t2 are scanned in full, resulting in a total of 1100 rows being scanned. Since join_buffer is organized as an unordered array, for each row in table t2, 100 comparisons need to be made, resulting in a total of 100,000 comparisons in memory.

As previously mentioned, if Simple Nested-Loop Join algorithm is used for the query, the number of scanned rows would still be 100,000. Therefore, in terms of time complexity, these two algorithms are the same. However, the Block Nested-Loop Join algorithm’s 100,000 comparisons are done in memory, making it much faster and more efficient.

Next, let’s see which table should be chosen as the driving table in this situation.

Assuming the number of rows in the smaller table is N and the number of rows in the larger table is M, in this algorithm:

  1. Both tables are fully scanned, resulting in a total of M+N scanned rows.
  2. The number of comparisons in memory is M*N.

As we can see, switching M and N in these two calculations doesn’t make a difference, so the execution time is the same regardless of whether the larger or smaller table is chosen as the driving table.

Now, you might immediately ask: “In this example, table t1 has only 100 rows. What if table t1 is a large table that cannot fit into the join_buffer?”

The size of join_buffer is determined by the join_buffer_size parameter, with a default value of 256k. If the join_buffer cannot accommodate all the data in table t1, the strategy is simple: divide and conquer. Let’s change the join_buffer_size to 1200 and execute the following query:

select * from t1 straight_join t2 on (t1.a=t2.b);

The execution process would then be as follows:

  1. Scan table t1 and sequentially read the data rows into join_buffer. After the 88th row, the join_buffer is full, and the process continues to step 2.
  2. Scan table t2 and compare each row with the data in join_buffer. Those that meet the join condition are returned as part of the result set.
  3. Clear the join_buffer.
  4. Continue scanning table t1 and sequentially read the remaining 12 rows of data into join_buffer, and then continue with step 2.

The execution flowchart would look like this:

img

Figure 5: Block Nested-Loop Join - Two Segments

Steps 4 and 5 in the diagram represent clearing and reusing the join_buffer.

This process demonstrates the origin of the term “Block” in the algorithm name, as it involves “joining in blocks”.

As we can see, in this case, table t1 is divided into two sections and stored in join_buffer, resulting in table t2 being scanned twice. Although the data was divided into two sections and stored in join_buffer, the number of equality condition comparisons remains the same: (88+12)*1000=100,000.

Now, let’s revisit the question of which table should be chosen as the driving table in this situation.

Assuming the number of data rows in the driving table is N and the algorithm requires K segments to complete, and M represents the number of data rows in the driven table.

Note that K is not a constant; as N increases, K also increases. Therefore, we can represent K as λ*N, where λ ranges between 0 and 1.

During the execution of this algorithm:

  1. The number of scanned rows is N+λNM.
  2. The number of in-memory comparisons is N*M.

Clearly, the number of in-memory comparisons is not affected by the choice of the driving table. Considering the number of scanned rows, when the sizes of M and N are fixed, choosing a smaller N will result in a smaller value for the entire equation.

Therefore, the conclusion is to choose the smaller table as the driving table.

Of course, you may notice that in the equation N+λNM, λ is the key factor affecting the number of scanned rows, and smaller values for λ are better.

Earlier, we mentioned that as N increases, the number of segments K also increases. So, when N is fixed, what parameter affects the size of K (i.e., the size of λ)? The answer is join_buffer_size. The larger the join_buffer_size, the more rows can be stored in one go, resulting in fewer segments and fewer full table scans for the driven table.

This is why you may come across suggestions to increase the join_buffer_size if your join statement is slow.

Now that we understand the two join algorithms used by MySQL, let’s attempt to answer the two questions raised at the beginning of the article.

The first question: Can a join statement be used?

  1. If the Index Nested-Loop Join algorithm can be used, meaning that an index on the driven table can be utilized, then there shouldn’t be a problem.
  2. If the Block Nested-Loop Join algorithm is used, the number of scanned rows may be excessive. Especially for join operations on large tables, it may require multiple full table scans of the driven table and consume a significant amount of system resources. Therefore, it is best to avoid using this type of join whenever possible.

So, when you judge whether to use a join statement, you need to look at the “Block Nested Loop” phrase appearing in the Extra field in the explain result. #

The second question is: Should you choose the larger table as the driving table or the smaller table as the driving table when using a join?

  1. If it is the Index Nested-Loop Join algorithm, you should choose the smaller table as the driving table.
  2. If it is the Block Nested-Loop Join algorithm:
    • When the join_buffer_size is large enough, it doesn’t matter.
    • When the join_buffer_size is not large enough (which is more common), you should choose the smaller table as the driving table.

Therefore, the conclusion to this question is that you should always use the smaller table as the driving table.

Of course, let me explain what is meant by the “small table”.

In the previous example, there were no conditions added to the statement. If I add the condition t2.id <= 50 to the WHERE clause of the statement and take a look at these two statements:

select * from t1 straight_join t2 on (t1.b=t2.b) where t2.id <= 50;
select * from t2 straight_join t1 on (t1.b=t2.b) where t2.id <= 50;

Note that in order to prevent both joined tables from using indexes, the join fields use the non-indexed field b.

However, if using the second statement, the join_buffer only needs to hold the first 50 rows of t2, which is obviously better. So here, the “first 50 rows of t2” is the relatively smaller table, also known as the “small table”.

Let’s look at another example:

select t1.b, t2.* from t1 straight_join t2 on (t1.b = t2.b) where t2.id <= 100;
select t1.b, t2.* from t2 straight_join t1 on (t1.b = t2.b) where t2.id <= 100;

In this example, both table t1 and t2 only have 100 rows participating in the join. However, the data being put into the join_buffer for these two statements is different:

  • Table t1 only needs to retrieve the b column, so if t1 is put into the join_buffer, only the values of b need to be stored.
  • Table t2 needs to retrieve all columns, so if table t2 is put into the join_buffer, it needs to store the values of three columns: id, a, and b.

Here, we should choose table t1 as the driving table. In this example, the table “t1 that only needs one column to participate in the join” is the relatively smaller table.

So, to be more precise, when determining which table to use as the driving table, each table should be filtered according to their respective conditions. After the filtering is done, calculate the total data volume of each field participating in the join. The table with the smaller data volume is the “small table” and should be used as the driving table.

Conclusion #

Today, I introduced to you the two possible algorithms for executing join statements in MySQL, which are determined by whether the index of the driven table can be used. Whether the index of the driven table can be used has a big impact on the performance of join statements.

By analyzing the execution processes of the Index Nested-Loop Join and Block Nested-Loop Join algorithms, we also obtained the answers to the two questions at the beginning of the article:

  1. If the index of the driven table can be used, join statements still have their advantages.
  2. If the index of the driven table cannot be used and only the Block Nested-Loop Join algorithm can be used, it is best to avoid using such statements.
  3. When using join, the smaller table should be chosen as the driving table.

Finally, it’s time for the question from the previous article.

In the previous article, I mentioned that, when using the Block Nested-Loop Join algorithm, it may require multiple full table scans of the driven table if the join_buffer is not large enough.

My question is, if the driven table is a large table and a cold data table, besides potentially causing high IO pressure during the query process, do you think there are any other significant impacts on this MySQL service? (This question needs to be considered in conjunction with the knowledge points in the previous article)

Feel free to write your conclusions and analysis in the comments. I will discuss this question with you at the end of the next article. Thank you for listening, and feel free to share this article with more friends to read together.

Previous Question Time #

The question I left at the end of the previous article was: What serious impact does it have on the server if the client cannot receive data due to excessive pressure?

The core of this question is the creation of “long transactions”.

As for the impact of long transactions, we need to consider the knowledge points we mentioned earlier about locks and MVCC.

  • If previous statements have updates, it means they are holding row locks, which can cause other statements’ updates to be blocked.
  • As for read transactions, it has an issue as well: it causes the undo log to not be recycled, leading to the inflation of rollback segment space.