37 When Do We Use Internal Temporary Tables

37 When Do We Use Internal Temporary Tables #

In the [articles 16] and [34], I introduced you to the sort buffer, memory temporary tables, and join buffer. These three data structures are used to store intermediate data during the execution of SQL statements to assist in their execution. Among them, we use the sort buffer for sorting and the join buffer for join statements.

Then you might wonder, when will MySQL use internal temporary tables?

In this article, I will give you two examples of when internal temporary tables are used to show you how they work. Then we will analyze when internal temporary tables are used.

Union Execution Process #

To facilitate quantitative analysis, I will use the table t1 as an example.

    create table t1(id int primary key, a int, b int, index(a));
    delimiter ;;
    create procedure idata()
    begin
      declare i int;
     
      set i=1;
      while(i<=1000)do
        insert into t1 values(i, i, i);
        set i=i+1;
      end while;
    end;;
    delimiter ;
    call idata();

Then, we execute the following statement:

    (select 1000 as f) union (select id from t1 order by id desc limit 2);

This statement uses a union, which means taking the union of the results of these two subqueries. The union means combining these two sets, keeping only one row for duplicate rows.

The figure below shows the explain result of this statement.

img

Figure 1 Union statement explain result

You can see:

  • The key=PRIMARY in the second row indicates that the second subquery uses the id index.
  • The Extra field in the third row indicates that a temporary table is used when taking the union of the result sets of the subqueries.

The execution process of this statement is as follows:

  1. Create a memory temporary table with only one integer field f, and f is the primary key field.
  2. Execute the first subquery, get the value 1000, and store it in the temporary table.
  3. Execute the second subquery: * Get the first row with id=1000 and try to insert it into the temporary table. But since the value 1000 already exists in the temporary table, violating the uniqueness constraint, the insertion fails, and then continue execution; * Get the second row with id=999, and insert it into the temporary table successfully.
  4. Take out the data from the temporary table row by row, return the result, and delete the temporary table. The result contains two rows, 1000 and 999.

The flowchart of this process is shown below:

img

Figure 2 Union execution process

As you can see, the memory temporary table here plays the role of temporary storage for data, and the calculation process also uses the uniqueness constraint of the temporary table’s primary key id to achieve the semantics of union.

By the way, if you change the union in the above statement to union all, there is no “distinct” semantics. When executed in this way, the subqueries are executed one by one, and the results obtained are directly part of the result set sent to the client. Therefore, there is no need for a temporary table.

img

Figure 3 Explain result of union all

As you can see, the Extra field in the second row shows Using index, indicating that only a covering index is used and no temporary table is used.

Group By Execution Process #

Another common example of using temporary tables is group by, let’s take a look at this statement:

    select id%10 as m, count(*) as c from t1 group by m;

The logic of this statement is to group and count the data in the table t1 by id%10, and output them in the order of the m result. Its explain result is as follows:

img

Figure 4 Explain result of group by

In the Extra field, we can see three pieces of information:

  • Using index, indicating that this statement uses a covering index, selects index a, and does not need to return to the table;
  • Using temporary, indicating that a temporary table is used;
  • Using filesort, indicating that sorting is required.

The execution process of this statement is as follows:

  1. Create a memory temporary table with two fields m and c, and the primary key is m.
  2. Scan the index a of table t1, retrieve the id values ​​on the leaf nodes one by one, and calculate the result of id%10, denoted as x; * If there is no row in the temporary table with the primary key x, insert a record (x,1); * If there is a row with the primary key x in the table, add 1 to the c value of the row with x;
  3. After traversing, sort according to the field m, and return the result set to the client.

The execution flowchart of this process is as follows:

img

Figure 5 group by execution process

In the last step of the figure, sorting the memory temporary table, we have already introduced it in [article 17], and I will paste the figure here for your review.

img

Figure 6 Memory temporary table sorting process

Among them, the sorting process of the temporary table is the process within the dashed box in Figure 6.

Next, let’s take a look at the execution result of this statement:

img

Figure 7 group by execution result

If you do not need to sort the result, you can add order by null at the end of the SQL statement, which means changing it to:

    select id%10 as m, count(*) as c from t1 group by m order by null;

This way, the sorting stage is skipped, and the data is directly retrieved from the temporary table and returned. The returned result is shown in Figure 8.

img

Result of group + order by null with in-memory temporary table #

Since the id values in table t1 start from 1, the first row in the result set is id=1. The row with m=0 is only inserted when scanning reaches id=10, so the last row in the result set is m=0.

In this example, since the temporary table only has 10 rows and fits into memory, the entire process only uses an in-memory temporary table. However, the size of the in-memory temporary table is limited, and the tmp_table_size parameter controls this memory size, with a default of 16M.

If I execute the following statement sequence:

set tmp_table_size=1024;
select id%100 as m, count(*) as c from t1 group by m order by null limit 10;

I limit the size of the in-memory temporary table to a maximum of 1024 bytes and modify the statement to id % 100, so that the result contains 100 rows of data. However, the size of the in-memory temporary table is not enough to hold these 100 rows of data. In other words, during execution, the memory allocated for the in-memory temporary table reaches its limit (1024 bytes).

At this point, the in-memory temporary table is converted to a disk temporary table, and the default storage engine for a disk temporary table is InnoDB. The resulting output is shown in figure 9.

img

Figure 9: Result of group + order by null with disk temporary table.

If the table t1 has a large amount of data, it is likely that the disk temporary table required for this query will occupy a significant amount of disk space.

Optimization methods for group by - using indexes #

As can be seen, whether using an in-memory temporary table or a disk temporary table, the group by logic requires constructing a table with a unique index, which incurs a high execution cost. If the table has a large amount of data, the execution of the group by statement above will be slow. So, what optimization methods do we have?

To optimize the group by statement, let’s first consider why it needs a temporary table.

The semantic logic of group by is to count the occurrences of different values. However, since the results of id%100 for each row are unordered, we need a temporary table to record and count the results.

Now, if we can guarantee that the data that appears during the scan is sorted, would things be simpler?

Assuming we have a data structure similar to figure 10, let’s see how group by can be done.

img

Figure 10: Optimizing group by algorithm - sorted input.

As can be seen, if we can ensure that the input data is sorted, then when calculating group by, we only need to scan from left to right and accumulate in order. The process is as follows:

  • When the first occurrence of 1 is encountered, we already know that there are X occurrences of 0, so the first row in the result set is (0,X).
  • When the first occurrence of 2 is encountered, we already know that there are Y occurrences of 1, so the second row in the result set is (1,Y).

If we execute according to this logic, when we finish scanning the entire input, we can obtain the result of group by without the need for a temporary table or additional sorting.

You might have already thought of InnoDB indexes, which can satisfy the condition of having sorted input.

Starting with MySQL 5.7, the generated column mechanism is supported, which is used to implement associated updates to column data. You can use the following method to create a column z and then create an index on column z (if you’re using MySQL 5.6 or earlier versions, you can also create ordinary columns and indexes to solve this problem).

alter table t1 add column z int generated always as(id % 100), add index(z);

In this way, the data on the index z becomes sorted, similar to figure 10. The above group by statement can be changed to:

select z, count(*) as c from t1 group by z;

The EXPLAIN result of the optimized group by statement is shown in the following figure:

img

Figure 11: EXPLAIN result of optimized group by. From the Extra field, we can see that the execution of this statement no longer requires a temporary table or sorting.

Optimization Method for Group By - Direct Sorting #

So, if we can complete the group by logic by adding an index, that would be great. But if we encounter a scenario where creating an index is not suitable, we still have to sort it honestly. So, how can we optimize the group by in this case?

If we know that the amount of data to be placed on the temporary table in a group by statement is particularly large, it seems a bit silly to “put it first on the temporary table in memory, insert some data, and then switch to using a disk-based temporary table”.

So, we would think, does MySQL have a way for us to directly use a disk-based temporary table?

The answer is yes.

By adding the SQL_BIG_RESULT hint to the group by statement, we can tell the optimizer: the amount of data involved in this statement is very large, please use a disk-based temporary table directly.

When the MySQL optimizer sees this, it knows that a disk-based temporary table is stored as a B+ tree, and the storage efficiency is not as high as an array. So, since you told me that the data is large, considering disk space, it is better to use an array directly.

Therefore, the execution flow of the following statement:

select SQL_BIG_RESULT id%100 as m, count(*) as c from t1 group by m;

is as follows:

  1. Initialize the sort_buffer and determine to put an integer field, m, into it.
  2. Scan the index a of table t1, and sequentially retrieve the id values. Store the values of id%100 in the sort_buffer.
  3. After scanning is complete, sort the field m in the sort_buffer (if the sort_buffer does not have enough memory, a disk-based temporary file will be used for sorting).
  4. After sorting is complete, an ordered array is obtained.

Based on the ordered array, we can obtain the different values in the array and the occurrence count of each value. You have already learned about this logic from Figure 10 earlier.

The following two figures are the execution flowchart and the result obtained from executing the explain command.

img

Figure 12: Execution flowchart with SQL_BIG_RESULT

img

Figure 13: Result of executing explain with SQL_BIG_RESULT

From the Extra field, we can see that this statement no longer uses a temporary table for execution, but directly uses the sorting algorithm.

Based on the analysis of the execution process of union, union all, and group by statements above, let’s answer the question raised at the beginning of the article: When will MySQL use internal temporary tables?

  1. If the execution process of the statement can read data and directly obtain the results without requiring additional memory, then no additional memory is needed to store intermediate results.
  2. The join_buffer is an unordered array, the sort_buffer is an ordered array, and the temporary table is a two-dimensional table structure.
  3. If the execution logic requires the use of the two-dimensional table characteristics, a temporary table will be considered first. For example, in our example, the union requires a unique index constraint, and the group by also requires another field to store cumulative counts.

Summary #

Through this article, I have focused on discussing several implementation algorithms for group by, and we can summarize some guiding principles for their usage:

  1. If there is no sorting requirement for the group by statement’s result, add “order by null” to the statement.
  2. Try to use the table’s index during the group by process. Confirm this by checking that the explain result does not show “Using temporary” or “Using filesort”.
  3. If the amount of data needed for group by is not large, try to use only the memory-based temporary table. You can also avoid using the disk-based temporary table by appropriately increasing the tmp_table_size parameter.
  4. If the data volume is too large, use the SQL_BIG_RESULT hint to inform the optimizer to directly use the sorting algorithm to obtain the group by result.

Finally, I’ll leave you with a thought-provoking question.

In figures 8 and 9 in the article, both have “order by null”. Why is the value 0 in the last row of the result set in figure 8, while in figure 9, it is in the first row?

You can write your analysis in the comments, and we will discuss this question in the next article. Thank you for reading, and feel free to share this article with more friends.