12 Join Can or Can’t We Write Join

12 JOIN - Can or Can’t We Write JOIN #

In the previous lessons, I have taught you about the principles of indexes and optimizers, and I believe you can now design and optimize SQL statements for single tables. However, besides single-table SQL statements, there are two more complex types of SQL statements: multi-table JOIN and subquery statements, which require creating indexes on multiple tables, which is relatively more difficult.

Many developers instinctively believe that JOIN will lower the performance efficiency of SQL, so they split a multi-table SQL into separate queries on single tables. However, this actually affects the execution efficiency of SQL. The reason for this is that developers do not understand the implementation process of JOIN.

Next, let’s focus on the working principle of JOIN and then understand the algorithms and application scenarios of JOIN, so that you can confidently use JOIN.

JOIN Connection Algorithm #

MySQL 8.0 supports two JOIN algorithms for associating tables:

Nested Loop Join;
Hash Join.

It is generally believed that in OLTP businesses, since the data volume for queries is small and the statements are relatively simple, indexes are mostly used to associate data between tables. In this case, the optimizer usually uses the Nested Loop Join algorithm. For OLAP businesses, the data volume for queries is large, and there are a large number of associated tables, so the Hash Join algorithm is used, which is more efficient with direct full table scans.

Note that here we only discuss the JOIN connection algorithms in the latest MySQL 8.0 version, and it is also recommended that you use MySQL 8.0 as a priority in the production environment.

Next, let’s analyze these two algorithms: Nested Loop Join and Hash Join.

Nested Loop Join #

The association between tables in Nested Loop Join is performed using indexes. Assuming tables R and S are being joined, the pseudo code for the algorithm is as follows:

for each row r in R with matching condition:

    lookup index idx_s on S where index_key = r

    if (found)

      send to client

In the above algorithm, table R is called the driving table. The data filtered from table R by the WHERE condition is searched one by one in the index corresponding to table S. If the data volume of the driving table R is not large, the above algorithm is very efficient.

Next, let’s look at which table is the driving table for the following three types of JOIN:

SELECT ... FROM R LEFT JOIN S ON R.x = S.x WEHRE ...

SELECT ... FROM R RIGHT JOIN S ON R.x = S.x WEHRE ...

SELECT ... FROM R INNER JOIN S ON R.x = S.x WEHRE ...

For the Left Join mentioned above, the driving table is the left table R; for the Right Join, the driving table is the right table S. It is determined by the JOIN type that the data of the left table or right table must be queried. However, for an INNER JOIN, the driving table could be table R or table S.

In this scenario, the fewer data needs to be queried, the table becomes the driving table. Let’s look at the following example:

SELECT ... FROM R INNER JOIN S 

ON R.x = S.x 

WHERE R.y = ? AND S.z = ?

The above SQL statement performs an INNER JOIN on table R and table S, with the column x being the association column, and the WHERE filter conditions filtering the column y in table R and the column z in table S, respectively. In this case, there are two choices:

Optimizers generally believe that the efficiency of querying through indexes is the same, so the Nested Loop Join algorithm mainly requires minimizing the number of driving tables.

Therefore, if the data filtered by WHERE R.y = ? is small, then this SQL statement will first use the index on column y in table R to filter the data, and then use the index on column x in table S for association, and finally filter out the last data through WHERE S.z = ?.

To better understand the selection of the driving table by the optimizer, let’s look at the following SQL:

SELECT COUNT(1) 

FROM orders

INNER JOIN lineitem

  ON orders.o_orderkey = lineitem.l_orderkey 

    WHERE orders.o_orderdate >= '1994-02-01' 

      AND  orders.o_orderdate < '1994-03-01'

You are familiar with the orders table mentioned above, which is similar to the order table in e-commerce. In our example database, it contains a total of 6 million records.

The lineitem table is the order details table. For example, an order can contain information about three items, including their prices, quantities, and suppliers. The number of records is approximately 24 million. The above SQL statement represents querying the total quantity of products purchased in February 1994. The execution plan obtained by using the EXPLAIN command is as follows:

EXPLAIN: -> Aggregate: count(1)
 -> Nested loop inner join (cost=115366.81 rows=549152)
     -> Filter: ((orders.O_ORDERDATE >= DATE'1994-02-01') and (orders.O_ORDERDATE < DATE'1994-03-01')) (cost=26837.49 rows=133612)
         -> Index range scan on orders using idx_orderdate (cost=26837.49 rows=133612)
     -> Index lookup on lineitem using PRIMARY (l_orderkey=orders.o_orderkey) (cost=0.25 rows=4)

The steps of the execution plan above are as follows. The table “orders” is the driving table, and its selection process is shown below:

Index range scan on orders using idx_orderdate: Filtering out the order data for February 1994 using the index idx_orderdate, with an estimated record count of over 130,000.
Index lookup on lineitem using PRIMARY: Using the results of the first step as the driving table, look up the value of o_orderkey in lineitem’s primary key index for each row of the driving table.
Nested loop inner join: Perform the JOIN operation and match the output results.
Aggregate: count(1): Calculate the final quantity of products.

However, if the following SQL is executed, the execution plan will change:

EXPLAIN FORMAT=tree

SELECT COUNT(1) 

FROM orders

INNER JOIN lineitem

  ON orders.o_orderkey = lineitem.l_orderkey 

    WHERE orders.o_orderdate >= '1994-02-01' 

      AND  orders.o_orderdate < '1994-03-01'

      AND lineitem.l_partkey = 620758

EXPLAIN: -> Aggregate: count(1)

-> Nested loop inner join (cost=17.37 rows=2)

    -> Index lookup on lineitem using lineitem_fk2 (L_PARTKEY=620758) (cost=4.07 rows=38)

    -> Filter: ((orders.O_ORDERDATE >= DATE'1994-02-01') and (orders.O_ORDERDATE < DATE'1994-03-01')) (cost=0.25 rows=0)

        -> Single-row index lookup on orders using PRIMARY (o_orderkey=lineitem.l_orderkey) (cost=0.25 rows=1)

The above SQL only adds a condition “lineitem.l_partkey = 620758”, which queries the purchase quantity of the product with item number 620758 in February 1994.

If you carefully examine the execution plan, you will find that there are approximately only 38 records found by filtering the condition l_partkey = 620758. Therefore, in this case, the optimizer selects the table “lineitem” as the driving table.

Hash Join #

The second JOIN algorithm in MySQL is Hash Join, which is used when there is no index for the join condition between two tables.

Some people may ask, why not create an index if there is no join? Perhaps it can be done, but:

If some columns have low selectivity indexes, creating an index will require sorting the data during data import, which affects import performance.
Secondary indexes may have a backtracking problem. If the amount of filtered data is relatively large, directly scanning the entire table will be faster.

For OLAP business queries, Hash Join is an essential feature. Starting from MySQL 8.0, Hash Join algorithm is supported, which strengthens support for OLAP business.

Therefore, if your query data size is not particularly large and the response time requirement for queries is in the minute range, you can use a single MySQL 8.0 instance to complete big data queries.

The pseudocode for the Hash Join algorithm is as follows:

foreach row r in R with matching condition:
    create hash table ht on r
foreach row s in S with matching condition:
    search s in hash table ht:
        if (found)
            send to client

Hash Join will scan the two associated tables:

First, a hash table is created during the process of scanning the driver table.
Then, when scanning the second table, each associated record is searched in the hash table, and if found, the record is returned.

The selection of the driver table and the Nested Loop Join algorithm for Hash Join is similar, where the smaller table is chosen as the driver table. If the driver table is large and the created hash table exceeds the size of the memory, MySQL automatically spills the result to disk.

To demonstrate Hash Join, let’s take a look at the following SQL query:

SELECT 
    s_acctbal,
    s_name,
    n_name,
    p_partkey,
    p_mfgr,
    s_address,
    s_phone,
    s_comment
FROM
    part,
    supplier,
    partsupp,
    nation,
    region
WHERE
    p_partkey = ps_partkey
        AND s_suppkey = ps_suppkey
        AND p_size = 15
        AND p_type LIKE '%BRASS'
        AND s_nationkey = n_nationkey
        AND n_regionkey = r_regionkey
        AND r_name = 'EUROPE';

The above SQL query is used to retrieve information for European suppliers of products with a type of ‘%BRASS’ and a size of 15.

Since the “part” table does not contain region information, we need to obtain the supplier information from the associated table “partsupp”, then get the region information from the supplier metadata table, and finally join with the outer table “region” to get the desired results.

The final execution plan is shown in the following image:

From the image above, we can see that the initial join is between the “supplier” and “nation” tables, followed by the join with the “partsupp” table, and then with the “part” table. Both left and right joins are performed using the Nested Loop Join algorithm. At this point, the result set contains approximately 79,330 records.

Finally, the join with the “region” table is performed, which filters the result to 5 records. At this stage, we have two options:

Create a region-based index on the 73,390 records and use the index for querying within the inner table.
Create a hash table for the “region” table and perform a probe on the 73,390 records.

Option 1 is the approach taken by the optimizer when Hash Join is not supported in MySQL 8.0. The drawback is that if the associated data is very large, creating an index will take time, and there may also be a need for lookups, so the optimizer will likely choose to directly scan the inner table.

Option 2 is to create a hash index only for the approximately 5 records in the “region” table, which takes almost negligible time, and then directly scan the inner table without any lookup issues. It is evident that MySQL 8.0 will choose Hash Join.

After understanding the optimizer’s choice, let’s take a look at the final result of the EXPLAIN FORMAT=tree command for the execution plan:

-> Nested loop inner join  (cost=101423.45 rows=79)
    -> Nested loop inner join  (cost=92510.52 rows=394)
        -> Nested loop inner join  (cost=83597.60 rows=394)
            -> Inner hash join (no condition)  (cost=81341.56 rows=98)

-> Filter: ((part.P_SIZE = 15) and (part.P_TYPE like '%BRASS'))  (cost=81340.81 rows=8814)

    -> Table scan on part  (cost=81340.81 rows=793305)

-> Hash

    -> Filter: (region.R_NAME = 'EUROPE')  (cost=0.75 rows=1)

        -> Table scan on region  (cost=0.75 rows=5)

-> Index lookup on partsupp using PRIMARY (ps_partkey=part.p_partkey)  (cost=0.25 rows=4)

-> Single-row index lookup on supplier using PRIMARY (s_suppkey=partsupp.PS_SUPPKEY)  (cost=0.25 rows=1)

-> Filter: (nation.N_REGIONKEY = region.r_regionkey)  (cost=0.25 rows=0)

-> Single-row index lookup on nation using PRIMARY (n_nationkey=supplier.S_NATIONKEY)  (cost=0.25 rows=1)

The above is the implementation principle and application of JOIN in MySQL database.

Because many developers have confusion when writing JOIN, I will now take you deep into the JOIN problem in OLTP business.

Can JOIN be used in OLTP business? #

OLTP business is highly concurrent and requires very responsive results, returning results in milliseconds. Examples of OLTP businesses include Taobao’s e-commerce business, Alipay’s payment business, Meituan’s food delivery business, etc.

If a JOIN in the OLTP business has a WHERE filtering condition and is filtered based on primary keys or indexes, the driving table will have only one or a few records, so the cost of performing the JOIN is very small.

For example, in Taobao’s e-commerce business, when a user wants to view their own order information, it essentially executes SQL statements similar to the following:

SELECT o_custkey, o_orderdate, o_totalprice, p_name FROM orders,lineitem, part

WHERE o_orderkey = l_orderkey

  AND l_partkey = p_partkey

  AND o_custkey = ?

ORDER BY o_orderdate DESC

LIMIT 30;

I have found that many developers think that the JOIN cost of the above SQL statement is very high, so they believe that it would be better to split it into 3 simple SQL statements, like:

SELECT * FROM orders 

WHERE o_custkey = ? 

ORDER BY o_orderdate DESC;

SELECT * FROM lineitem

WHERE l_orderkey = ?;

SELECT * FROM part

WHERE p_part = ?

Actually, you don’t need to split the statements manually, because the process of splitting them is the optimizer’s execution result, and the optimizer is more reliable and faster. The method of splitting into three SQL statements itself increases the time cost of network interaction.

So feel free to use JOIN, you should trust that the database optimizer is smarter than you, it is more professional. The execution plan of the above SQL statement is as follows:

EXPLAIN: -> Limit: 30 row(s)  (cost=27.76 rows=30)

    -> Nested loop inner join  (cost=27.76 rows=44)

        -> Nested loop inner join  (cost=12.45 rows=44)

            -> Index lookup on orders using idx_custkey_orderdate (O_CUSTKEY=1; iterate backwards)  (cost=3.85 rows=11)

            -> Index lookup on lineitem using PRIMARY (l_orderkey=orders.o_orderkey)  (cost=0.42 rows=4)

        -> Single-row index lookup on part using PRIMARY (p_partkey=lineitem.L_PARTKEY)  (cost=0.25 rows=1)

Since the number of records in the driving table is fixed at 30, no matter how large the data volume of the tables orders, lineitem, and part is, even if it is billions of records, the speed of executing the above SQL statement remains almost the same because they are all associated based on primary keys.

Therefore, it is completely safe to use JOIN in OLTP business, but make sure that all the JOIN indexes have been added. DBAs must do SQL Review before the business goes live to ensure that the expected indexes have been created.

Summary #

MySQL database supports two algorithms for JOIN operation: Nested Loop Join and Hash Join. The former is usually used in OLTP business, and the latter is used in OLAP business. JOIN can be used in OLTP business, and the optimizer will automatically choose the best execution plan. However, when using JOIN, make sure that the execution plan of the SQL statement uses the correct indexes and index covering. Therefore, index design is particularly important, which is also one of the important tasks for DBAs in architecture design.