32 MySQL Tuning How to write high-performance SQL statements #

Hello, I’m Liu Chao.

Starting today, I will guide you through the process of optimizing MySQL performance. MySQL is one of the most frequently used databases in internet companies, not only because it is open source and free, but also because of its outstanding performance, stable service, and active community, which together form its core competitiveness.

As we know, the interaction between application services and databases is mainly achieved through SQL statements. During the initial development stage, we focus more on using SQL to implement business functions. However, once the system goes live and the data in the production environment grows rapidly, many of the SQL statements that were written before start to expose performance problems.

At this stage, we should try to avoid writing slow SQL statements in the first place. Note, however, that there are countless causes of slow SQL. Beyond the common patterns that can be avoided outright, blindly avoiding the rest is not a solution; we also need to learn how to analyze and locate the root causes, and to accumulate common SQL optimization methods for future use.

So today, let’s focus on several common factors that can cause slow SQL statements. Starting from these points, we will find the best methods to open the door to high-performance SQL statements.

Common Causes of Slow SQL Statements #

1. Missing or Ineffective Indexes #

If a column without an index is used as a query condition in a table with millions of data, the query will often be time-consuming. This type of query is undoubtedly a slow SQL query. Therefore, for large amounts of data, we need to establish appropriate indexes to optimize queries.

Although we routinely create indexes, in certain scenarios an index may still fail to be used. Such index failure is another major cause of slow queries. I will discuss how to optimize for this issue in Lecture 34.

2. Lock Wait #

The storage engines commonly used are InnoDB and MyISAM. InnoDB supports row-level locks and table-level locks, while MyISAM only supports table-level locks.

If database operations are based on table-level locks, imagine this: if an order table needs to lock the entire table during an update, then a large number of other database operations (including queries) will be in a waiting state. This will seriously impact the system’s concurrency performance.

At this point, the row-level locks supported by the InnoDB storage engine are better suited to high-concurrency scenarios. However, when using InnoDB we need to watch out for row locks being escalated to table locks, which is especially likely during bulk update operations.

MySQL believes that using a large number of row locks on a table will cause a decrease in transaction execution efficiency, resulting in long waiting times for other transactions and more lock conflict issues, leading to a significant decrease in performance. Therefore, MySQL will upgrade row locks to table locks. Additionally, row locks are based on index locking. If the index becomes ineffective during an update operation, the row locks will also be upgraded to table locks.
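As a rough illustration, consider the `demo`.`order` table used later in this lecture, assuming a non-unique index on order_no (the literal values here are hypothetical); an update whose condition cannot use that index must scan every row, and the locks it takes effectively behave like a table lock:

-- order_no is indexed: InnoDB locks only the rows that match
update `demo`.`order` set status = 2 where order_no = 'order_123';

-- a leading-wildcard LIKE cannot use the index, so InnoDB scans the whole
-- table and locks every row it examines, behaving much like a table lock
update `demo`.`order` set status = 2 where order_no like '%123';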

Therefore, database operations based on table-level locks will cause SQL blocking and affect the execution speed. In scenarios where the number of update operations (insert/update/delete) is greater than or equal to the number of read operations, MySQL does not recommend using the MyISAM storage engine.

In addition to lock escalation, row-level locks, while offering finer granularity and better concurrency than table-level locks, bring new problems of their own, such as deadlocks. Therefore, when using row-level locks we need to be careful to avoid deadlocks. I will discuss deadlocks further in Lecture 35.

3. Improper SQL Statements #

Using improper SQL statements is also one of the most common causes of slow SQL. For example, habitually using SELECT *, performing pagination queries on large data tables, and sorting by non-indexed fields.

Steps to Optimize SQL Statements #

Usually, when we execute an SQL statement, we want to know which tables are queried, whether indexes are used, where the data is obtained from, and how many rows are traversed. We can use the EXPLAIN command to view this execution information. This information is commonly referred to as the execution plan.

1. Analyze SQL Execution Plan with EXPLAIN #

Assuming we want to check whether the current SQL statement uses an index, we can use the EXPLAIN command to output the corresponding execution plan, as shown below:

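For example, a minimal sketch using the `demo`.`order` table that appears later in this lecture (the literal order number is hypothetical):

explain select * from `demo`.`order` where order_no = 'order_123';

EXPLAIN outputs one row per table involved in the query.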

Let’s go through each field of the EXPLAIN output; you can also pick up quite a few scattered knowledge points along the way.

  • id: Each row of the execution plan has an id. If the statement contains subqueries or UNIONs, the plan will have multiple rows with different ids.
  • select_type: Indicates the type of SELECT query. Common types include SIMPLE (normal query without JOIN or subquery), PRIMARY (main query), UNION (queries after UNION), SUBQUERY (subquery), and so on.
  • table: The table being queried in the current execution plan. If an alias is given to the table, the alias information will be displayed.
  • partitions: The partition table being accessed.
  • type: Indicates how rows are retrieved from the table. The access type is an important metric in SQL optimization, and the possible values are ordered from best to worst: system > const > eq_ref > ref > range > index > ALL. (Example queries illustrating these types follow the field list below.)

  • system/const: There is at most one row in the table matching the condition, and it can be located with a single index lookup, typically an equality comparison against a primary key or unique index, so MySQL can treat the looked-up value as a constant. system is a special case of const in which the table itself contains only one row.

  • eq_ref: Unique index scan, commonly used in join queries where primary key and unique index are used as join conditions.

  • ref: Non-unique index scan; also seen when a unique (composite) index is matched only on its leftmost prefix.

  • range: Index range scan, for example, <, >, between, etc.

  • index: Full index scan, meaning the entire index tree is traversed.

  • ALL: Represents full table scan, which requires traversing the entire table to find the corresponding rows.
  • possible_keys: Possible indexes that could be used.
  • key: The actual index used.
  • key_len: The length of the index being used.
  • ref: The columns or constants compared against the index named in key.
  • rows: The number of rows MySQL estimates it must examine to find the matching records.
  • filtered: The estimated percentage of examined rows that will match the query conditions.
  • Extra: Additional information.
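
To make these access types more concrete, here is a rough sketch of query shapes that typically map to each type, assuming the `demo`.`order` table used later in this lecture with primary key id, a non-unique index on order_no, and no index on status; the literal values are hypothetical, and actual plans depend on data volume and optimizer statistics:

-- const: equality lookup on the primary key
explain select * from `demo`.`order` where id = 1;

-- ref: equality lookup on the non-unique order_no index
explain select * from `demo`.`order` where order_no = 'order_123';

-- range: range condition on an indexed column
explain select * from `demo`.`order` where id between 1 and 100;

-- index: the entire order_no index tree is traversed (the query is covered by it)
explain select order_no from `demo`.`order`;

-- ALL: full table scan, since status has no index in this sketch
explain select * from `demo`.`order` where status = 1;

-- eq_ref typically appears in joins where another table is looked up by its
-- primary key or a unique index, one row per row of the driving table.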

2. Analyze SQL Performance with Show Profile #

The EXPLAIN analysis above only looks at SQL execution from the outside. If we want to dive into the MySQL kernel and analyze execution from the perspective of thread states and time, we can turn to Show Profile.

In addition to analyzing the thread state and time, Show Profile also supports further selection of types such as ALL, CPU, MEMORY, BLOCK IO, CONTEXT SWITCHES, etc., to query the time consumed by SQL statements on different system resources. Here are the explanations of related commands:

SHOW PROFILE [type [, type] ... ]
[FOR QUERY n]
[LIMIT row_count [OFFSET offset]]
 
"type" parameters:
| ALL: Show all expense information.
| BLOCK IO: Block input/output count.
| CONTEXT SWITCHES: Context switch expense information.
| CPU: CPU expense information.
| IPC: Message receiving and sending expense information.
| MEMORY: Memory-related expense information (currently useless).
| PAGE FAULTS: Page fault expense information.
| SOURCE: List the function names corresponding to the operation and their call locations (line numbers) in the source code.
| SWAPS: Swap count expense information.

Note that Show Profile is only supported in MySQL 5.0.37 and later. If you are unsure, you can check whether the feature is available by querying select @@have_profiling. The profiling switch is disabled by default (and in recent MySQL versions Show Profile has been deprecated in favor of the Performance Schema), but you can enable it at the session level using the SET statement:

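For example:

set profiling = 1;   -- enable profiling for the current session
select @@profiling;  -- returns 1 when profiling is enabled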

Show Profiles displays only the SQL statements most recently sent to the server. By default it keeps the latest 15 executed statements; you can raise this limit by setting profiling_history_size, up to a maximum of 100.

After obtaining the Query_ID, we can use SHOW PROFILE FOR QUERY with the corresponding Query_ID to view how much time the SQL statement spent in each thread state:

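For example, assuming the statement of interest appears with Query_ID 1 in the Show Profiles list:

show profiles;                            -- list recent statements with their Query_IDs
show profile for query 1;                 -- time spent in each thread state
show profile cpu, block io for query 1;   -- narrow the output to specific resource types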

Based on the analysis above, we can see that the statement SELECT COUNT(*) FROM `order`; spent the longest time in the Sending data state. In this state, the MySQL thread is reading rows and returning them to the client, which involves a large number of disk I/O operations.

Common SQL Optimization Techniques #

For many common SQL queries, applying the right optimization methods and techniques can improve performance dramatically compared to the conventional implementation, sometimes by an order of magnitude.

1. Optimizing Pagination Queries #

Typically, we use LIMIT combined with an appropriate ORDER BY clause to implement pagination queries. However, when there are no index conditions to support such queries, a large number of file sorts are required, resulting in poor performance. Even when there is an associated index, the initial performance of pagination queries may be satisfactory, but as we progress further in the pagination, the performance starts to degrade.

This is because when we use LIMIT, the offset M becomes larger the further we go in the pagination, resulting in the retrieval of more rows by the database. For example, in a query like LIMIT 10000, 10, the database needs to query 10010 records and finally return only 10 records. This means that 10000 records are being queried but not used.

Let’s simulate a pagination query on a table with 100,000 entries:

select * from `demo`.`order` order by order_no limit 10000, 20;

From the EXPLAIN analysis, we can see that the query uses an index, scans 10020 rows, but still takes a relatively long time of 0.018 seconds.

  • Optimizing Pagination Queries with a Subquery

The problem with the pagination query above is that MySQL must locate all 10,020 rows even though only 20 of them are returned. Can we instead first find the boundary id of the 20 rows we need, and then return those 20 rows starting from that offset? We can, by using a subquery that performs an index-covered scan:

select * from `demo`.`order` where id > (select id from `demo`.`order` order by order_no limit 10000, 1) limit 20;

From the EXPLAIN analysis, we can see that the subquery traverses an index range similar to the previous query, while the main query scans more rows. However, the execution time drops significantly to just 0.004 seconds, because only the 20 needed rows are returned, a noticeable improvement in efficiency.

2. Optimizing SELECT COUNT(*) #

COUNT() is an aggregate function commonly used to count rows or specific column values (excluding NULL values). The most commonly used variants are COUNT(*) and COUNT(1), which behave essentially the same; in the presence of a primary key, both use the primary key column to count the number of rows.

However, the execution principles behind COUNT() differ in the MyISAM and InnoDB storage engines. Usually, in the absence of any query conditions, COUNT(*) queries in MyISAM perform significantly faster than in InnoDB.

This is because the MyISAM storage engine maintains the total number of rows in a table, so when performing a COUNT(*) query, it doesn’t need to scan the entire table and can directly return the saved value. On the other hand, InnoDB needs to scan the table to calculate the exact number of rows. When a WHERE condition is added, both MyISAM and InnoDB need to scan the table for row counts.

It is not advisable to frequently perform SELECT COUNT(*) operations on a large table. So how do we optimize COUNT() for large tables?

  • Using Approximate Values

Sometimes a business scenario does not require an exact COUNT value, in which case an estimate is good enough, and we can use EXPLAIN to obtain one. It is important to note that EXPLAIN does not actually execute the query; the rows value it reports is an estimated approximation, as in the sketch below.
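
A minimal sketch, again using the `demo`.`order` table:

explain select count(*) from `demo`.`order`;
-- read the estimate from the rows column of the output; the query itself is not executed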

  • Adding Summary Statistics

If an exact COUNT value is required, we can create an additional summary statistics table or cache field to store the required count value. This approach incurs a certain cost during insertion and deletion, but can greatly enhance the performance of COUNT().
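
A minimal sketch with hypothetical table and column names, where the counter row is updated in the same transaction as the order write:

create table order_stats (id tinyint primary key, total bigint not null);
insert into order_stats values (1, 0);

-- run inside the same transaction as the order insert:
update order_stats set total = total + 1 where id = 1;
-- run inside the same transaction as the order delete:
update order_stats set total = total - 1 where id = 1;

-- reading the count becomes a constant-time primary key lookup:
select total from order_stats where id = 1;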

3. Optimizing SELECT * #

I have seen many colleagues develop a habit of using SELECT * in SQL statements, even when querying only one or two columns. However, this practice can lead to performance degradation in certain environments.

MySQL commonly uses the MyISAM and InnoDB storage engines. When you create a primary key in InnoDB, a primary key index is created along with it, and this index is the clustered index: the data is organized as a B+ tree whose leaf nodes store the row data itself.

On the other hand, the primary key index that MyISAM creates by default, as well as secondary indexes in InnoDB, are non-clustered indexes. They also use a B+ tree structure, but the leaf nodes of an InnoDB secondary index store primary key values rather than row data (MyISAM index leaves store pointers to the rows instead).

Let’s assume we have an orders table created using the InnoDB storage engine, with a composite index consisting of the order_no and status columns. Now, we need to query the status of an order based on its order number. If we use a query like SELECT * FROM order WHERE order_no='xxx', it first queries the composite index, retrieves the primary key ID, and then uses the primary key ID to fetch all the column values from the primary key index.

But if we use SELECT order_no, status FROM order WHERE order_no='xxx', it only queries the composite index and fetches the corresponding order_no and status values from it. If you are not familiar with these indexes, please pay attention to Lesson 34, which will explain database indexes in detail.
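
A quick sketch of the difference, assuming the composite index on (order_no, status) described above; with EXPLAIN, the covered query typically shows Using index in the Extra column:

-- looks up the composite index, then fetches the remaining columns
-- through a second lookup on the primary key index:
select * from `order` where order_no = 'xxx';

-- fully covered by the composite index; no primary key lookup needed:
select order_no, status from `order` where order_no = 'xxx';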

Summary #

In development, we should strive to write high-performance SQL statements. However, some slow SQL statements are inevitable, whether due to oversights or to differences between the production and development environments. In response, we can enable the slow query log, which records SQL statements that exceed a configured maximum execution time. First, use the following commands to check whether slow SQL logging is enabled and what the maximum execution time threshold is:

show variables like 'slow_query%';
show variables like 'long_query_time';

If it is not enabled, we can enable it through the following settings:

set global slow_query_log='ON';  -- enable the slow SQL log
set global slow_query_log_file='/var/lib/mysql/test-slow.log';  -- log file location
set global long_query_time=1;  -- maximum execution time, in seconds

In addition, many database connection pool middlewares provide slow SQL analysis as well. In short, we should avoid low-performance SQL operations in our programs: besides mastering common SQL optimization techniques, we should make full use of SQL tooling for performance analysis and monitoring.

Discussion Questions #

Suppose there is an order table named order, mainly containing columns such as the primary key order_no, the order status status, and the submission time create_time. Separate indexes have been created on the status column and the create_time column. The current implementation, which fetches order numbers in descending order of creation time for orders with status 1, is as follows:

select order_no from order where status = 1 order by create_time desc

Do you know what the problem is in the above code? And how should we optimize it?


Answer #

The problem with the given code is that it’s missing backticks around the table name “order”. In SQL, “order” is a reserved keyword, so it should be enclosed in backticks to be used as a table name.

To improve query performance, we can add a composite index on (status, create_time). With this index, the rows for status = 1 are already stored in create_time order within the index, so MySQL can filter and sort using the index alone and avoid a filesort. And because InnoDB secondary index leaves store the primary key, the index also covers order_no. The query itself stays the same:

select order_no from `order` where status = 1 order by create_time desc
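
A sketch of creating the composite index (the index name is hypothetical):

alter table `order` add index idx_status_create_time (status, create_time);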

With the composite index in place, also consider paginating with LIMIT if there are a large number of matching rows, to avoid retrieving all the data at once.