13 Subqueries Use Subquery Features With Ease

13 Subqueries - Use Subquery Features With Ease #

Today I want to talk to you about “subqueries”.

In the previous lecture, I mentioned a complex SQL scenario, which involved joining multiple tables and how to design indexes to improve the performance of JOINs.

In addition to multiple table joins, developers also use subquery statements extensively. However, because previous versions of the MySQL database had limited optimization for subqueries, we used to avoid using subqueries in online transaction processing (OLTP) business scenarios as much as possible.

However, in MySQL 8.0, the optimization for subqueries has been significantly improved. So from now on, you can confidently use subqueries in MySQL.

Why do developers like to write subqueries so much? #

After working for many years, I have found that many developers prefer to write subqueries instead of traditional JOIN statements. Let me give you a simple example. If I asked a developer to “find the number of customers who haven’t placed any orders in 1993”, most of them would use a subquery to write this requirement, like this:

SELECT

    COUNT(c_custkey) cnt

FROM

    customer

WHERE

    c_custkey NOT IN (

        SELECT

            o_custkey

        FROM

            orders

        WHERE

            o_orderdate >= '1993-01-01'

            AND o_orderdate < '1994-01-01'

	);

From this example, you can see that the logic of the subquery is very clear: it queries which users are not in the orders table.

However, the above query is a typical LEFT JOIN problem (i.e., a problem where the customer table exists but the orders table does not). So, if we were to write this problem using a LEFT JOIN, the SQL would look like this:

SELECT

    COUNT(c_custkey) cnt

FROM

    customer

        LEFT JOIN

    orders ON

            customer.c_custkey = orders.o_custkey

            AND o_orderdate >= '1993-01-01'

            AND o_orderdate < '1994-01-01'

WHERE

    o_custkey IS NULL;

You can see that although a LEFT JOIN can also accomplish the above requirement, it is not easy to understand because LEFT JOIN is an algebraic relationship, whereas a subquery is more inclined to be understood from a human perspective.

Therefore, most people tend to write subqueries, even DBAs who deal with databases every day.

However, from the perspective of the optimizer, LEFT JOIN is easier to understand and can perform traditional JOINs between two tables, while a subquery requires the optimizer to intelligently transform it into the optimal JOIN connection.

Let’s take a look at the execution plans for the above two SQL statements in MySQL 8.0:

Drawing 0.png

As you can see, whether it is a subquery or a LEFT JOIN, both are ultimately transformed into Nested Loop Joins, so the execution times of the two SQL statements are the same.

In MySQL 8.0, the optimizer automatically optimizes IN subqueries and transforms them into the best JOIN execution plan. This significantly improves performance.

Which is better in terms of performance: subquery IN or EXISTS? #

Besides “why developers like to write subqueries,” another frequently asked question about subqueries is, “which is better in terms of performance: IN or EXISTS?” To answer this question, let’s look at an example.

For the NOT IN subquery mentioned earlier, you can rewrite it as a NOT EXISTS subquery. The rewritten SQL looks like this:

SELECT

    COUNT(c_custkey) cnt

FROM

    customer

WHERE

    NOT EXISTS (

        SELECT

            1

        FROM

            orders

        WHERE

            o_orderdate >= '1993-01-01'

            AND o_orderdate < '1994-01-01'

            AND c_custkey = o_custkey

    );

You should note that you should not blindly trust some articles on the internet that claim IN has better performance or EXISTS has better performance. You only need to focus on the SQL execution plan. If the execution plans of both are the same, there is no difference in performance.

Going back to the NOT EXISTS mentioned above, its execution plan is shown in the following diagram:

Drawing 0.png You can see that the execution plans for IN and NOT IN subqueries are exactly the same, so their performance is also the same. After discussing the execution plan for subqueries, let’s now look at a type of SQL that requires optimization of subqueries: dependent subqueries.

Optimization of Dependent Subqueries #

Before MySQL 8.0, MySQL did not fully optimize subqueries. So in the execution plan of subqueries, you will see the hint DEPENDENT SUBQUERY, which means it is a dependent subquery that relies on the correlation of external tables.

If you see this hint, you should be cautious, because the execution speed of DEPENDENT SUBQUERY can be very slow. In most cases, you need to manually convert it into a join between two tables.

Let’s take the following SQL as an example:

SELECT
    *
FROM
    orders
WHERE
    (o_clerk , o_orderdate) IN (
        SELECT
            o_clerk, MAX(o_orderdate)
        FROM
            orders
        GROUP BY o_clerk);

The subquery in the above SQL calculates “the last order date for each employee”, and the outermost SQL statement returns the relevant information of orders.

In the latest version of MySQL 8.0, the execution plan of this SQL is as follows:

By using the command EXPLAIN FORMAT=tree to output the execution plan, you can see that there is a note “Select #2 (subquery in condition; run only once)” on the 3rd row. This means that the subquery is only executed once, and then the final result is saved.

The 6th row of the execution plan, “Index lookup on”, indicates the JOIN connection between the table orders and the table obtained from the subquery result, and finally returns the result.

Therefore, the current execution plan involves scanning the table orders twice, with about 5587618 records scanned each time:

The first scan is for the internal subquery operation, which calculates the last transaction time for each employee.
The second scan of the table orders is used to query and return the order information of each employee, that is, the information of the last transaction.

Finally, directly use the command EXPLAIN to view the execution plan, as shown in the following figure:

Drawing 6.png

MySQL 8.0 execution process

If it is an older version of the MySQL database, the execution plan will be a dependent subquery, as shown below:

Drawing 8.png

Execution process in older versions of MySQL

Comparing with MySQL 8.0, the only difference is the value in the select_type on the second row, which is either SUBQUERY or DEPENDENT SUBQUERY.

Then, use the command EXPLAIN FORMAT=tree to view a more detailed execution plan:

Drawing 10.png

You can find that the 3rd row of the execution technique output is “Select #2 (subquery in condition; dependent)”, which is different from the previous execution plan that indicated that the subquery is only executed once. In addition, you can also find from the 1st row that this SQL has become an exists subquery, with each row being associated with the subquery.

Therefore, the above execution plan actually means that it first queries the order information of each employee, and then performs dependent subquery judgment on each record. In other words, it first performs an external table scan, and then does the dependent subquery judgment. So, the subquery was executed 5587618 times, not just once!!!

Therefore, the execution plan of the two is compared as follows in terms of the number of scans:

图片1.png

For the optimization of dependent subqueries, the goal is to avoid situations where the subquery needs to be scanned multiple times due to its dependence on external tables. Therefore, you can use the method of derived tables to connect the external table and the derived table of the subquery to reduce the scan of the subquery table and improve the performance of the SQL query.

So for the above SQL, you can rewrite it as follows:

SELECT * FROM orders o1,
(
    SELECT
        o_clerk, MAX(o_orderdate)
    FROM
        orders
    GROUP BY o_clerk
) o2
WHERE
    o1.o_clerk = o2.o_clerk
    AND o1.o_orderdate = o2.orderdate;

As you can see, we have rewritten the subquery as a derived table o2, and then associated table o2 with the external table orders. The association condition is: o1.o_clerk = o2.o_clerk AND o1.o_orderdate = o2.orderdate. After the above rewrite, the derived table o2 scans the table orders once and returns about 5587618 records. The derived table o1 scans the table orders once and returns about 1792612 records. This is very similar to the execution plan of version 8.0, and its execution plan is as follows:

Drawing 13.png

Finally, let’s look at the execution time of the above SQL:

Drawing 15.png

As you can see, after rewriting the SQL, the derived table’s execution speed is almost the same as that of the independent subquery. Therefore, if you see the execution plan for dependent subqueries, remember to optimize it by rewriting the SQL.

Conclusion #

In this lesson, we learned about the advantages of MySQL subqueries, the optimization of subqueries in the new version of MySQL 8.0, and how to optimize subqueries in older versions of MySQL. After learning today’s content, I hope you will no longer be confused when writing subqueries and use them well in various scenarios.

In summary:

Subqueries are easier for humans to understand compared to JOIN, so they have a wider audience and are used more frequently.
The current version of MySQL 8.0 can “without hesitation” use subqueries, as the optimization of subqueries is already quite complete.
For older versions of MySQL, please review the execution plan of all subqueries, and if you see the hint DEPENDENT SUBQUERY, be sure to optimize it in a timely manner, otherwise, it will have a significant impact on performance.
The optimization of DEPENDENT SUBQUERY is generally to rewrite it as a derived table for table joining. The optimization of table joining is the content we discussed in Lesson 12.