22 What Methods Does MySQL Have to “Drink Poison to Quench Thirst” and Improve Performance #

I don’t know if you have encountered this situation in practice: during peak business periods, the production MySQL server comes under so much pressure that it cannot respond normally, and you have to take emergency measures to temporarily improve performance.

When I used to provide support for business operations, I occasionally encountered this situation. The development manager would say, no matter what solution you use, just make sure the business can keep running.

That said, if a lossless solution existed, there would be no need to wait until this moment to apply it. Today, let’s talk about these temporary solutions, focusing on the risks they may pose.

Short Connection Storm #

In a normal short connection model, the client connects to the database, executes a few SQL statements, and then disconnects. If short connections are used, during peak periods of business activity, the number of connections may suddenly skyrocket.

As I mentioned in my first article “Infrastructure: How Does a SQL Query Execute?”, establishing a connection in MySQL is costly. In addition to the normal three-way handshake for network connections, it also involves performing login permission checks and obtaining data read/write permissions for the connection.

When the database is under relatively low load, these additional costs are not significant.

However, the short connection model has one risk: if the database processing slows down, the number of connections may increase dramatically. The max_connections parameter is used to control the maximum number of connections that can exist simultaneously in a MySQL instance. If this value is exceeded, the system will reject the subsequent connection requests and report an error saying “Too many connections”. From a business perspective, the rejected connection requests mean that the database is unavailable.
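During an incident, a quick first check is to compare the cap against current usage. A minimal sketch using standard status variables:

```sql
-- How many connections are allowed, and how many exist right now?
show variables like 'max_connections';
show global status like 'Threads_connected';
```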

When the machine load is high, the time it takes to process existing requests becomes longer, and each connection is kept open for a longer period of time. At this time, if new connections are established, it may exceed the limit of max_connections.

In such a case, a natural idea is to increase the value of max_connections. However, this is risky. The purpose of designing the max_connections parameter is to protect MySQL. If we increase it too much and allow more connections to come in, the system load may further increase, and a large amount of resources will be consumed in permission verification and other logic. As a result, the already established threads will not be able to obtain CPU resources to execute the business’s SQL requests.

So what else can be done in this situation? Let me offer two more methods, but be aware that both are lossy.

First method: Deal with the threads that are occupying connections but not working.

The max_connections count does not care whether a thread is actually working; any connection, active or idle, occupies one slot. For connections that do not need to be kept, we can actively kick them off with the kill connection command. This has the same effect as setting wait_timeout in advance: the wait_timeout parameter tells MySQL to disconnect any thread that has been idle for wait_timeout seconds.
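As a minimal sketch (the 10-minute value is purely illustrative):

```sql
-- Disconnect any thread that stays idle for more than 10 minutes;
-- global changes apply to connections created after this point
set global wait_timeout = 600;
-- Interactive clients (such as the mysql command-line tool) use this variable instead
set global interactive_timeout = 600;
```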

However, it is worth noting that killing threads that appear as “sleep” in the show processlist result may have side effects. Let’s take a look at the following example.

img

Figure 1: Two states of “sleep” threads

In the example above, if we disconnect session A, MySQL can only roll its transaction back, because session A has not yet committed. Disconnecting session B, on the other hand, has little impact. Therefore, when ranking by impact, you should first disconnect connections that are idle outside a transaction, like session B.

But how can we determine which connections are idle outside transactions? We can execute show processlist in session C after time T + 30 seconds to see the following result.

img

Figure 2: Two states of “sleep” threads shown in the show processlist result

In the figure, both sessions with id=4 and id=5 are in the Sleep state. To see the specific transaction status, you can query the innodb_trx table in the information_schema database.

img

Figure 3: Querying transaction status from information_schema.innodb_trx
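The query behind Figure 3 is presumably along these lines (a sketch):

```sql
-- List currently active transactions together with the server thread running each one
select trx_mysql_thread_id, trx_id, trx_started
from information_schema.innodb_trx;
```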

In this result, trx_mysql_thread_id=4 indicates that the thread with id=4 is still inside a transaction. Therefore, if there are too many connections, you can first disconnect connections that have been idle outside a transaction for too long; if that is still not enough, consider disconnecting connections that are idle inside a transaction.
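To find candidates automatically, you can join the process list against innodb_trx: Sleep threads with no entry in innodb_trx are idle outside a transaction. A sketch, making no assumptions about your schema:

```sql
-- Sleep threads with no active transaction, longest-idle first
select p.id, p.user, p.time as idle_seconds
from information_schema.processlist p
left join information_schema.innodb_trx t
       on p.id = t.trx_mysql_thread_id
where p.command = 'Sleep'
  and t.trx_mysql_thread_id is null
order by p.time desc;
```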

The server-side command for disconnecting is “kill connection + id”. A client in the Sleep state does not immediately notice that the server has actively disconnected it; only when it issues its next request does it receive the error “ERROR 2013 (HY000): Lost connection to MySQL server during query”.

Actively disconnecting clients from the server side can be harmful, especially if an application does not reconnect after receiving this error but instead retries the query with the same invalid handle. From the application’s perspective, it then looks as though “MySQL never recovered”.

You might think this is a joke, but in reality, I have encountered this situation at least 10 times.

Therefore, if you are a DBA supporting the business, do not assume that all application code will be handled correctly. Even for a simple disconnection operation, make sure to notify the business development team.

Second method: Reduce the overhead of the connection process.

Some business code requests a large number of database connections in a short period of time and holds them in reserve. If you have confirmed that the database is being overwhelmed by the connection behavior itself, one possible approach is to skip the permission verification stage.

The method to bypass permission verification is to restart the database and start it with the --skip-grant-tables parameter. This way, MySQL will bypass all permission verification stages, including the connection process and statement execution process.

However, this method fits the term “drinking poison to quench thirst” and carries a high risk. It is a solution that I particularly do not recommend. Especially if your database is accessible on the public network, you should never do this.

In MySQL 8.0, if you enable --skip-grant-tables, MySQL enables --skip-networking by default, meaning the database can only be connected to by local clients. You can see how seriously MySQL takes the security risk of skip-grant-tables.

In addition to the performance problems that can be caused by a sudden increase in short-lived connections, in reality, we encounter more performance issues caused by query or update statements. Among them, there are two typical types of query problems: slow queries and a sudden increase in QPS (Queries Per Second). As for the performance issues caused by update statements, I will explain them in my next article.

Performance Problems with Slow Queries #

In MySQL, there are generally three possible causes for performance problems caused by slow queries:

  1. Poorly designed indexes;
  2. Poorly written SQL statements;
  3. Incorrect index selection by MySQL.

Next, let’s analyze these three possibilities and their corresponding solutions in more detail.

The first possible cause of slow queries is poorly designed indexes.

In this scenario, the best approach is usually to create indexes urgently. Starting from MySQL version 5.6, creating indexes is supported with Online DDL. For cases where the database is overwhelmed during peak periods, it is most efficient to directly execute the “alter table” statement.

Ideally, this can be done on a standby database first. Assuming your service consists of one master and one standby, with the master being A and the standby being B, the general process for this solution is as follows:

  1. Execute the “set sql_log_bin=off” statement on standby B, that is, write no binlog, and then run the “alter table” statement to add the index (see the sketch after this list).
  2. Perform a master-standby switch.
  3. Now B is the master and A is the standby. On A, execute “set sql_log_bin=off” and then run the “alter table” statement to add the index.
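As a sketch of step 1 (the table name t and the index definition are illustrative only):

```sql
-- On standby B: suppress binlog for this session, then add the index;
-- set sql_log_bin is session-scoped, so only this connection's DDL goes unlogged
set sql_log_bin = off;
alter table t add index idx_c (c);
```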

This is an “old-fashioned” DDL solution. Normally, when making changes, you should consider more reliable solutions such as “gh-ost”. However, in urgent situations, the above solution is the most efficient.

The second possible cause of slow queries is poorly written SQL statements. For example, we might make the mistakes discussed in the 18th article, “Why do these SQL statements have the same logic but significantly different performance?”, causing a statement not to use an index.

In this case, we can handle it by rewriting the SQL statement. MySQL 5.7 provides the query_rewrite feature, which can rewrite statements matching one pattern into another form.

For example, if the statement is mistakenly written as select * from t where id + 1 = 10000, you can add a statement rewrite rule using the following method:

```sql
mysql> insert into query_rewrite.rewrite_rules(pattern, replacement, pattern_database) values ("select * from t where id + 1 = ?", "select * from t where id = ? - 1", "db1");

call query_rewrite.flush_rewrite_rules();
```

Here, the call query_rewrite.flush_rewrite_rules() stored procedure is used to make the newly inserted rule take effect, which is what we call “query rewriting”. You can use the method in Figure 4 to confirm whether the rewrite rule is effective.

img

Figure 4: Effect of query rewriting
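Another way to confirm that the rule fires: the Rewriter plugin raises a Note each time it rewrites a statement, which show warnings displays. A sketch using the statement above:

```sql
select * from t where id + 1 = 10000;  -- actually executed as "... where id = 10000 - 1"
show warnings;  -- Note: Query ... rewritten to ... by a query rewrite plugin
```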

The third possible cause of slow queries is the situation described in the 10th article, “Why does MySQL sometimes choose the wrong index?”: MySQL chooses the wrong index.

In this case, the emergency solution is to add force index to the statement.

Similarly, using the query_rewrite feature to add force index to the original statement can also solve this problem.
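For example, a sketch that echoes the range query from the 10th article (the table, columns, and index name are illustrative):

```sql
-- Force the optimizer onto index a for statements matching this pattern
insert into query_rewrite.rewrite_rules(pattern, replacement, pattern_database)
values ("select * from t where a between ? and ?",
        "select * from t force index(a) where a between ? and ?",
        "db1");
call query_rewrite.flush_rewrite_rules();
```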

Among the three possible situations causing performance problems due to slow queries, the first two are the most common, namely: poorly designed indexes and poorly written statements. And these two situations can be completely avoided. For example, through the following process, we can pre-detect problems.

  1. Before going live, turn on the slow query log in the test environment and set long_query_time to 0, so that every statement is recorded in the slow query log (see the sketch after this list).
  2. Insert simulated production data into the test table and perform regression testing.
  3. Observe the output of each type of statement in the slow query log, paying particular attention to whether the Rows_examined field is consistent with expectations. (We have used the Rows_examined method multiple times in previous articles, and I believe you have tried it yourself. If you still don’t understand, feel free to leave me a message and we can discuss it together).
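Step 1 boils down to two settings, sketched below (run in the test environment only):

```sql
-- Record every statement in the slow query log
set global slow_query_log = on;
set global long_query_time = 0;  -- applies to sessions created after the change
```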

Don’t be stingy with the “extra” time spent before going live because this will save you a lot of time in troubleshooting.

If there are not many new SQL statements, you can manually run them. However, for new projects or modifications to the table structure design of existing projects, full regression testing is necessary. At this time, you need a tool to check the return results of all SQL statements. For example, you can use the open-source tool pt-query-digest (https://www.percona.com/doc/percona-toolkit/3.0/pt-query-digest.html).

QPS Surge Problem #

Sometimes, due to sudden business peaks or application bugs, the QPS of a certain statement suddenly surges, which may also cause excessive pressure on MySQL and affect the service.

I have encountered a class of cases where it was caused by a bug in a new feature. Of course, the best situation is for the business to disable this feature, and the service will naturally recover.

If the feature has to be disabled from the database side, different methods apply depending on the circumstances. Let me expand on this.

  1. One case is a bug in a completely new business feature. If your database operations are reasonably standardized, meaning the connection whitelist is maintained one entry at a time, and you can be sure that the business side will disable this feature, just not immediately, then you can simply remove the feature’s entries from the whitelist on the database side.
  2. If the new feature uses a separate database user, you can drop this user with the administrator account and kill its existing connections. Connections related to the feature will then fail, and the QPS it generates drops to zero.
  3. If the new feature is deployed together with the main application, the only option left is to restrict it at the statement level. In this case, you can use the query_rewrite feature mentioned above to rewrite the highest-pressure SQL statement directly into “select 1”, which returns immediately (sketched below).
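A sketch of option 3, with a hypothetical pattern standing in for the highest-pressure statement:

```sql
-- Emergency stopgap: make the runaway statement return immediately
insert into query_rewrite.rewrite_rules(pattern, replacement, pattern_database)
values ("select * from t where c = ?", "select 1", "db1");
call query_rewrite.flush_rewrite_rules();
```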

Of course, this operation carries a high risk and requires great caution on your part. There may be two potential side effects:

  1. If this SQL statement template is used in other functionalities as well, it may cause unintended consequences;
  2. Many business processes are not solely reliant on this one statement to complete their logic. Therefore, if only the result of this statement (e.g., SELECT 1) is returned, it could lead to the failure of subsequent business logic.

Therefore, Solution 3 is a stopgap measure; like skipping permission verification, it should be the lowest-priority option among all the alternatives.

At the same time, you will notice that Solutions 1 and 2 both depend on a standardized operations system: virtualization, whitelist mechanisms, and separate business accounts. More preparation usually means a more stable system.

Conclusion #

In today’s article, I discussed some emergency measures for handling performance issues during peak business periods.

These measures range from bluntly rejecting and killing connections to rewriting statements to route around problems; some are high-risk temporary fixes, others relatively safe contingency plans.

In actual development, we should try to avoid inefficient methods, such as excessive use of short-lived connections. Additionally, if you are involved in business development, you should be aware that connection disruptions are commonplace. Your code should have mechanisms for correct reconnection and retrying.

Although DBAs can temporarily resolve issues through statement rewriting, this itself is a high-risk operation. Conducting proper SQL auditing can reduce the need for such actions.

In fact, you can see that the solutions mentioned in this article mainly focus on the server layer. In the next article, I will continue to discuss handling methods related to InnoDB.

Finally, it’s time for our reflection question.

Today, I would like to ask if you have encountered any scenarios where temporary solutions were needed to salvage a situation during peak business periods. How did you handle it?

Feel free to share your experiences and expertise in the comments section. I will select interesting comments to share and analyze with everyone in the next article. Thank you for listening, and feel free to share this article with more friends to read together.

Previous Question Time #

In the previous two articles, I left you with the following question: in the execution sequence shown in the diagram below, why is session B’s insert statement blocked?

img

Let’s use the locking rules from the previous article to analyze which locks session A’s select statement acquires:

  1. Since the query is ORDER BY c DESC, the first row to locate on index c is the rightmost row with c=20, so a gap lock on (20,25) and a next-key lock on (15,20] are added.
  2. Scanning leftward on index c, the scan only stops at c=10, so next-key locks are added all the way to (5,10]; this is exactly what blocks session B’s insert statement.
  3. During the scan, the rows with c=20, c=15, and c=10 all exist; since the statement is select *, row locks are also added on the primary key index for the rows that satisfy the query condition.

Therefore, the locking scope for the select statement from session A is as follows:

  1. On index c, (5, 25);
  2. On the primary key index, row locks on id=15 and id=20.
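For reference, judging from the analysis above, session A’s statement in the figure presumably has this shape (a reconstruction, not a verbatim copy of the figure):

```sql
begin;
select * from t where c >= 15 and c <= 20 order by c desc lock in share mode;
```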

Let me reiterate: you may have noticed that I always point out which index a lock is added on. That is because locks are added on indexes; this is one of InnoDB’s fundamental premises, and you need to keep it in mind when analyzing problems.