18 Why Do These SQL Statements With Similar Logic Perform Differently

18 Why Do These SQL Statements with Similar Logic Perform Differently #

In MySQL, there are many SQL statements that appear to have the same logic but have a huge difference in performance. If these statements are used improperly, they can unintentionally increase the load on the entire database.

Today, I have selected three such cases to share with you. I hope that when you encounter similar problems, you can generalize from one example and quickly solve the problem.

Case 1: Function operations on conditional fields #

Let’s say you are currently maintaining a trading system, and the trading log table tradelog contains fields such as trade ID (tradeid), trader ID (operator), and trade time (t_modified). For the sake of simplicity, let’s ignore other fields for now. The CREATE TABLE statement for this table is as follows:

mysql> CREATE TABLE `tradelog` (
  `id` int(11) NOT NULL,
  `tradeid` varchar(32) DEFAULT NULL,
  `operator` int(11) DEFAULT NULL,
  `t_modified` datetime DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `tradeid` (`tradeid`),
  KEY `t_modified` (`t_modified`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

Suppose that you have already recorded all the data from the beginning of 2016 to the end of 2018. The operations department has a requirement to count the total number of trading records that occurred in the month of July across all years. This logic doesn’t seem to be complicated, and your SQL statement might look like this:

mysql> select count(*) from tradelog where month(t_modified)=7;

Since there is an index on the t_modified field, you confidently execute this statement in the production database. However, you find that it takes a long time to return the results.

If you ask your DBA colleague why this is the case, they will probably tell you that if a function calculation is applied to a field, the index cannot be used. This is a rule in MySQL.

Now that you have learned about the index structure of InnoDB, you can ask why. Why is it possible to use the index when the condition is where t_modified='2018-7-1', but not when it is changed to where month(t_modified)=7?

Below is a schematic diagram of this t_modified index. The number above the box represents the value corresponding to the month() function.

img

Figure 1: Schematic diagram of the t_modified index

If your SQL statement uses the condition where t_modified='2018-7-1', the engine will quickly locate the result needed for t_modified='2018-7-1' along the green arrow shown above.

In fact, this fast location capability provided by the B+ tree is derived from the ordering of sibling nodes on the same level.

However, when calculating the month() function, you will see that when the value 7 is passed, the engine does not know how to proceed on the first level of the tree.

That is to say, performing a function operation on an indexed field might break the ordering of index values, so the optimizer decides to abandon the tree search function.

It is important to note that the optimizer is not giving up on using this index.

In this example, by giving up the tree search function, the optimizer can choose to traverse the primary key index or the t_modified index. After comparing the sizes of the indexes, the optimizer finds that the t_modified index is smaller, making it faster to traverse this index than the primary key index. Therefore, the optimizer ultimately chooses the t_modified index.

Next, let’s use the explain command to see the execution result of this SQL statement:

img

Figure 2: explain result

The value of key is “t_modified”, indicating that the t_modified index is used. I inserted 100,000 rows of data in the test table, and rows=100335, which means that this statement scanned all the values of the index. The Extra field’s “Using index” indicates the use of a covering index.

In other words, due to the month() function operation on the t_modified field, a full index scan occurred. In order to utilize the fast location capability of the index, we need to modify the SQL statement to perform a range query based on the field itself. The following is the modified SQL statement, which allows the optimizer to use the fast location capability of the t_modified index as we expect:

mysql> select count(*) from tradelog where
    -> (t_modified >= '2016-7-1' and t_modified<'2016-8-1') or
    -> (t_modified >= '2017-7-1' and t_modified<'2017-8-1') or 
    -> (t_modified >= '2018-7-1' and t_modified<'2018-8-1');

Of course, if your system has been online for a longer time or if you insert data for subsequent years, you will need to fill in the other year ranges.

So far, I have explained to you that performing a function operation on an indexed field prevents MySQL from using the fast location capability of the index, and only a full index scan can be used.

However, the optimizer does have a “lazy” behavior in this regard. Even for functions that do not change the ordering, it will not consider using the index. For example, for the SQL statement select * from tradelog where id + 1 = 10000, the addition operation does not change the ordering, but the MySQL optimizer still cannot quickly locate row 9999 using the id index. Therefore, when you write SQL statements, you need to manually rewrite them as where id = 10000 - 1 to achieve the desired result.

Case 2: Implicit type conversion #

Now let me tell you about another example that often traps programmers. Let’s take a look at this SQL statement together:

mysql> select * from tradelog where tradeid=110717;

The field tradeid, which is the trade number, already has an index. However, the result of the explain statement shows that this query needs to perform a full table scan. You may have also noticed that the field type of tradeid is varchar(32), and the input parameter is an integer. Therefore, type conversion is required.

Now, we have two questions:

  1. What are the rules for data type conversion?
  2. Why does a full index scan occur when there is a data type conversion?

Let’s start with the first question. You might say that there are so many types in the database and I can’t remember all the rules for data type conversion. What should I do?

Here is a simple method: let’s see the result of select "10" > 9:

  1. If the rule is “convert the string to a number”, then it should perform a numeric comparison, and the result should be 1.
  2. If the rule is “convert the number to a string”, then it should perform a string comparison, and the result should be 0.

The result of the verification is shown in Figure 3.

img

Figure 3: Illustration of string and number conversion in MySQL

From the figure, we can see that select "10" > 9 returns 1. Therefore, we can confirm the conversion rule in MySQL: when comparing strings and numbers in MySQL, the strings are converted to numbers.

Now, if we look at this full table scan query:

mysql> select * from tradelog where tradeid=110717;

We can understand that for the optimizer, this query is equivalent to:

mysql> select * from tradelog where CAST(tradid AS signed int) = 110717;

In other words, this query triggers the rule we mentioned earlier: when performing a function operation on an indexed field, the optimizer will abandon the tree search function.

Now, I will leave you with a question: if the type of id is int, will executing the following query cause a full table scan?

select * from tradelog where id="83126";

You can analyze it yourself first, and then verify and confirm it in the database.

Next, let’s take a look at a slightly more complex example.

Case 3: Implicit Character Encoding Conversion #

Suppose there is another table called trade_detail in the system, which is used to record transaction operation details. To facilitate quantitative analysis and reproduction, I will insert some data into two tables, tradelog and trade_detail.

mysql> CREATE TABLE `trade_detail` (
  `id` int(11) NOT NULL,
  `tradeid` varchar(32) DEFAULT NULL,
  `trade_step` int(11) DEFAULT NULL, /* operation step */
  `step_info` varchar(32) DEFAULT NULL, /* step information */
  PRIMARY KEY (`id`),
  KEY `tradeid` (`tradeid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

insert into tradelog values(1, 'aaaaaaaa', 1000, now());
insert into tradelog values(2, 'aaaaaaab', 1000, now());
insert into tradelog values(3, 'aaaaaaac', 1000, now());

insert into trade_detail values(1, 'aaaaaaaa', 1, 'add');
insert into trade_detail values(2, 'aaaaaaaa', 2, 'update');
insert into trade_detail values(3, 'aaaaaaaa', 3, 'commit');
insert into trade_detail values(4, 'aaaaaaab', 1, 'add');
    insert into trade_detail values(5, 'aaaaaaab', 2, 'update');
    insert into trade_detail values(6, 'aaaaaaab', 3, 'update again');
    insert into trade_detail values(7, 'aaaaaaab', 4, 'commit');
    insert into trade_detail values(8, 'aaaaaaac', 1, 'add');
    insert into trade_detail values(9, 'aaaaaaac', 2, 'update');
    insert into trade_detail values(10, 'aaaaaaac', 3, 'update again');
    insert into trade_detail values(11, 'aaaaaaac', 4, 'commit');

Now, if you want to query all the operation steps of the trade with id=2, you can write the SQL statement as follows:

mysql> select d.* from tradelog l, trade_detail d where d.tradeid=l.tradeid and l.id=2; /* Statement Q1 */

Figure 4: Explain result of Statement Q1

Let’s take a look at the result together:

  1. The first row indicates that the optimizer will first find the row with id=2 in the tradelog table. This step uses the primary key index, and rows=1 indicates that only one row is scanned.
  2. The second row shows key=NULL, which means that the tradeid index on the trade_detail table is not used, and a full table scan is performed.

In this execution plan, the trade_log table is used to get the tradeid value, and then the matching rows are searched in the trade_detail table. In this case, tradelog is referred to as the driving table, trade_detail is referred to as the driven table, and tradeid is referred to as the join field.

Next, let’s take a look at the execution process represented by this explain result:

img

Figure 5: Execution process of Statement Q1

In the figure:

  1. Step 1 is to find the L2 row in the tradelog table based on the id.
  2. Step 2 is to retrieve the tradeid field value from L2.
  3. Step 3 is to search for the rows that match the conditions in the trade_detail table based on the tradeid value. The key=NULL in the second row of the explain result indicates that this process uses a traversal of the primary key index, checking one by one if the tradeid value matches.

At this point, you may notice that Step 3 does not meet our expectations. Because the trade_detail table has an index on the tradeid field, we originally hoped that we could quickly locate the equal rows by using the tradeid index. However, this is not the case here.

If you ask a DBA colleague, they may tell you that the reason is because the character sets of these two tables are different, with one being utf8 and the other utf8mb4, so the index on the join field is not used for the join query. This answer is also what you usually get when you search for this problem.

But you should ask again, why is the index not used when the character sets are different?

We say that the problem lies in Step 3 of the execution process. If we convert this step into an SQL statement, it would be:

mysql> select * from trade_detail where tradeid=$L2.tradeid.value; 

Here, $L2.tradeid.value has the utf8mb4 character set.

Referring to the previous two examples, you may realize that utf8mb4 is a superset of utf8. So when these two types of strings are compared, MySQL internally converts the utf8 string to utf8mb4 before the comparison.

This setting is easy to understand, as utf8mb4 is a superset of utf8. In programming languages, when performing automatic type conversion, data conversion is also performed in the direction of “increasing data length” to avoid data errors caused by truncation during the conversion.

Therefore, when executing the above statement, the values in the indexed columns on the driven table need to be converted one by one to utf8mb4 and compared with L2.

In other words, in reality, this statement is equivalent to the following format:

select * from trade_detail  where CONVERT(traideid USING utf8mb4)=$L2.tradeid.value; 

Here, the CONVERT() function is used to convert the input string to the utf8mb4 character set.

This once again triggers the principle we mentioned earlier: if a function operation is performed on an indexed column, the optimizer will abandon the tree search function.

At this point, you finally realize that the reason why the index is not used is not only due to the different character sets, but also because a function operation is performed on the indexed column on the driven table during the join process.

As a comparison, let me provide you with another requirement: “Find out who performed the operation corresponding to id=4 in the trade_detail table.” Let’s take a look at the SQL statement and its execution plan.

mysql> select l.operator from tradelog l , trade_detail d where d.tradeid=l.tradeid and d.id=4;

img Figure 6 explains the result.

In this statement, the trade_detail table becomes the driving table. However, the second line of the explain result shows that this query operation uses the index (tradeid) in the driven table tradelog, and the number of scanned rows is 1.

This is also a join operation between two tradeid fields. Why can the index of the driven table be used this time? Let’s analyze it.

Suppose the row with id=4 in the driving table trade_detail is denoted as R4. Then, during the join operation (Step 3 in Figure 5), the following SQL statement is executed on the driven table tradelog:

select operator from tradelog  where tradeid = $R4.tradeid.value; 

At this time, the character set of $R4.tradeid.value is utf8. According to the character set conversion rules, it needs to be converted to utf8mb4. Therefore, this process is rewritten as:

select operator from tradelog  where tradeid = CONVERT($R4.tradeid.value USING utf8mb4); 

You see, the CONVERT function is applied to the input parameter, so the index tradeid of the driven table can be used.

Once we understand the principle, we can use it to guide our operations. If we want to optimize the statement:

select d.* from tradelog l, trade_detail d where d.tradeid = l.tradeid and l.id = 2;

there are two approaches:

  • The more common optimization method is to change the character set of the tradeid field in the trade_detail table to utf8mb4, so that there is no character set conversion issue.

    alter table trade_detail modify tradeid varchar(32) CHARACTER SET utf8mb4 default null;

  • If we can modify the character set of the field, that would be even better. But if the data volume is large or there are temporary business constraints, we can only modify the SQL statement.

    mysql> select d.* from tradelog l, trade_detail d where d.tradeid = CONVERT(l.tradeid USING utf8) and l.id = 2;

img

Figure 7 shows the explain result after optimizing the SQL statement.

Here, I manually convert l.tradeid to utf8, which avoids character encoding conversion on the driven table. From the explain result, we can see that the index is used correctly this time.

Summary #

In today’s article, I gave you three examples, which were essentially talking about the same thing, that is: performing function operations on indexed fields may disrupt the order of the index values, so the optimizer chooses not to use the tree search function.

The second example is implicit type conversion, and the third example is implicit character encoding conversion. Like the first example, all of them require performing function operations on indexed fields, which leads to a full index scan.

The optimizer in MySQL does seem to be a bit “lazy”. Even if simply rewriting where id+1=1000 as where id=1000-1 can enable the use of index for fast lookup, it does not actively rewrite the statement.

Therefore, it is a good practice to explain newly added SQL statements every time your business code is updated.

Finally, it’s time for a question.

The question I leave you with today is: Have you encountered any other performance issues similar to the ones we discussed today? What do you think the reasons were, and how did you solve them?

Please write about your experiences and analysis in the comments section. I will select interesting comments to share and analyze with everyone in the next article. Thank you for listening, and feel free to share this article with more friends to read together.

Previous Question #

In the last article, the question I left you with was: In the last solution we mentioned, which involved using three limit Y,1 to obtain the desired data, do you think there is any further optimization method?

Here, I provide one approach. Take the maximum number in Y1, Y2, and Y3, denoted as M, and take the minimum number as N. Then, execute the following SQL statement:

mysql> select * from t limit N, M-N+1;

Also, add the whole table’s total row count, denoted as C. With this solution, the total number of scanned rows only needs C+M+1 rows.