32 Why Are There Still Non Killable Sentences

32 Why are There Still Non-Killable Sentences #

In MySQL, there are two kill commands: kill query + thread id, which terminates the statement being executed in the thread; and kill connection + thread id, where the “connection” can be omitted, which cuts the connection of the thread. Of course, if the thread has a statement that is executing, the executing statement must be stopped first.

I don’t know if you have encountered this situation when using MySQL: you use the kill command but fail to disconnect the connection. When you execute the show processlist command again, you see that the Command column of this statement shows “Killed”.

You must be wondering what “Killed” means. Shouldn’t the thread disappear from the show processlist result?

Today, let’s discuss this issue.

In fact, in most cases, the kill query/connection command is effective. For example, if a query takes too long to execute and you want to give up the query, you can use the kill query command to terminate the query statement.

Another situation is when a statement is in a lock waiting state, using the kill command directly is also effective. Let’s look at this example together:

img

Figure 1 Example of successful kill query

As you can see, after session C executes kill query, session B is interrupted almost at the same time. This is the result we expect.

What does the thread do after receiving the kill command? #

However, here, you have to stop and think: does session B just terminate the thread without doing anything and exit directly? Clearly, this is not the case.

In [Article 6], I mentioned that when performing INSERT, DELETE, UPDATE, and SELECT operations on a table, an MDL read lock will be added to the table. Therefore, although session B is in the blocked state, it still holds an MDL read lock. If the thread is directly terminated when it is killed, there will be no chance to release this MDL read lock.

From this perspective, killing does not mean stopping immediately. Instead, it tells the executing thread that the statement no longer needs to be executed, and it can start executing the “stop execution” logic.

In fact, this is similar to the Linux kill command. kill -N pid does not directly stop the process; instead, it sends a signal to the process, and then the process handles this signal and enters the termination logic. However, for the MySQL kill command, there is no need to pass the signal parameter and only the “stop” command is available.

In terms of implementation, when a user executes kill query thread_id_B, the thread in MySQL that handles the kill command does two things:

  1. Changes the running status of session B to THD::KILL_QUERY (assigns the variable killed with THD::KILL_QUERY);
  2. Sends a signal to the execution thread of session B.

Why send a signal?

Because, as in the example above, when session B is in a lock waiting state, if only the thread status of session B is set to THD::KILL_QUERY, the thread B does not know about this status change and will continue to wait. The purpose of sending a signal is to make session B exit the waiting state and handle this THD::KILL_QUERY status.

The above analysis implies the following three things:

  1. During the execution of a statement, there are multiple “checkpoints” where the thread status is checked. If the thread status is THD::KILL_QUERY, the statement enters the termination logic;
  2. If it is in a waiting state, it must be a wait that can be awakened, otherwise, it will not reach the “checkpoint” at all;
  3. There is a process from the start of entering the termination logic to the completion of the termination logic.

With this information, you now know that it’s not just “stop on command”.

Next, let’s look at another example of an unkillable statement, which is the example we mentioned in the previous [Article 29], where the innodb_thread_concurrency is not sufficient.

First, execute set global innodb_thread_concurrency=2 to set the upper limit of InnoDB’s concurrent threads to 2; then, execute the following sequence: img

Figure 2 Invalid example of kill query

Observations:

  1. Session C is blocked during execution;
  2. However, executing the kill query C command in session D has no effect;
  3. It is not until session E executes the kill connection command that the connection of session C is terminated, with the prompt “Lost connection to MySQL server during query”;
  4. At this point, if you execute show processlist in session E, you will see the following figure.

img

Figure 3 Effect after kill connection

At this time, the Command column of thread id=12 displays “Killed”. This means that although the client has disconnected, the statement is still being executed on the server.

Why does the statement not exit like the update statement in the first example when executing the kill query command?

In the implementation, when waiting for a row lock, the pthread_cond_timedwait function is used, and this wait state can be woken up. However, in this example, the wait logic of thread 12 is as follows: check every 10 milliseconds whether it can enter InnoDB execution, if not, call the nanosleep function to enter sleep state.

In other words, although the state of thread 12 has been set to KILL_QUERY, during this loop process of waiting to enter InnoDB, it does not check the thread state at all, so it will not enter the termination logic stage at all.

When executing the kill connection command in session E, it is done as follows:

  1. Set the state of thread 12 to KILL_CONNECTION;
  2. Close the network connection of thread 12. Because of this operation, you will see that session C receives the prompt of disconnection.

So why does the Command column display “killed” when executing show processlist? Actually, this is because when executing show processlist, there is a special logic:

If the state of a thread is KILL_CONNECTION, then display "Killed" in the Command column.

Therefore, even if the client exits, the state of this thread is still waiting. When will this thread exit?

The answer is that it will only exit when the condition for entering InnoDB is satisfied, and then the query of session C continues to execute. At that point, it is possible to determine that the thread state has become KILL_QUERY or KILL_CONNECTION, and then enter the termination logic stage.

Now, let’s summarize.

This example belongs to the first type of situation where kill is ineffective, which means that the thread has not reached the logic of checking the thread state. Similar to this situation, there is also the situation where IO pressure is too high, and the read and write IO functions cannot return, causing the thread state to not be determined in a timely manner.

The other type of situation is that the termination logic takes a long time. In this case, the show processlist result also displays Command=Killed. It needs to wait until the termination logic is completed before the statement can be truly completed. Common scenarios for this type of situation include the following:

  1. Kill during the execution of a large transaction. At this time, the rollback operation needs to reclaim all new data versions generated during the execution of the transaction, which takes a long time.
  2. Rollback of a large query. If a relatively large temporary file is generated during the query process, and the file system is under heavy pressure at this time, deleting the temporary file may require waiting for IO resources, resulting in a long processing time.
  3. During the execution of a DDL command, if it is killed at the final stage, the temporary files generated in the intermediate process need to be deleted, which may also take a long time due to IO resource constraints.

Someone asked me before, if I directly use the Ctrl+C command in the client, can I directly terminate the thread? The answer is no.

There is a misunderstanding here. In the client operation, you can only manipulate the client’s thread. The client and the server can only interact through the network and cannot directly manipulate the server thread.

Because MySQL is a stop-wait protocol, sending additional commands to this connection while the thread is still executing a statement that has not returned is useless. In fact, when executing Ctrl+C, the MySQL client starts another connection and sends a kill query command.

So, don’t think that everything will be fine after executing Ctrl+C on the client. Killing a thread involves many backend operations.

Two other misunderstandings about client usage #

In actual use, I often encounter some misunderstandings about client usage. Let’s take a look at the two most common ones.

The first misunderstanding is that if there are many tables in the database, the connection will be slow.

Some live databases may contain many tables (the most I have seen is 60,000 tables). In this case, you will find that each time a client connects, it will get stuck at the following interface.

img

Figure 4 Connection Waiting

And if the db1 database has fewer tables, the connection will be faster and you can quickly enter the command input state. Therefore, some people think that the number of tables affects the connection performance.

As you know from the [first article], what each client needs to do when establishing a connection with the server is TCP handshake, user validation, and permission acquisition. However, these operations are obviously unrelated to the number of tables in the database.

However, as the text prompt in the image says, when connecting with the default parameters, the MySQL client provides a local database and table name completion feature. To achieve this feature, the client needs to do more operations after successfully connecting:

  1. Execute show databases;
  2. Switch to the db1 database and execute show tables;
  3. Use the results of these two commands to build a local hash table.

Among these operations, the most time-consuming one is building the hash table locally in the third step. Therefore, when there are a large number of tables in a database, this step will take a relatively long time.

In other words, the slow connection process we perceive is not actually due to slow connection or slow server, but slow client.

As the prompt in the image also says, if you add -A to the connection command, you can turn off this automatic completion feature, and the client can return quickly.

The effect of automatic completion is that when you enter the prefix of a database or table name, you can use the Tab key to automatically complete the table name or display suggestions.

In actual use, if you don’t use the automatic completion feature much, I suggest you always add -A by default when using it.

Actually, the prompt doesn’t mention it, but in addition to adding -A, adding the –quick (or -q) parameter can also skip this stage. However, this –quick is a parameter that is more easily misunderstood and is also a common misconception about the client.

When you see this parameter, do you think it should be a parameter that speeds up the server? But in fact, setting this parameter may actually decrease the server’s performance. Why do I say that?

After the MySQL client sends a request and receives the server’s response, there are two ways to handle the results:

  1. One is local caching, which means allocating some memory locally to store the results. If you are using the API for development, this corresponds to the mysql_store_result method.
  2. The other is non-caching, reading and processing one by one. If you are using the API for development, this corresponds to the mysql_use_result method.

The MySQL client defaults to the first method, and if the –quick parameter is added, the second non-caching method will be used. When using the non-caching method, if the local processing is slow, it will cause the server to be blocked from sending the result, thus making the server slower. As for the specific behavior of the server, I will explain it in more detail in the next article.

So you might ask, if that’s the case, why is this parameter called “quick”? This is because using this parameter can achieve the following three effects:

  • First, as mentioned earlier, it skips the table name auto-completion feature.
  • Second, mysql_store_result needs to allocate local memory to cache the query result. If the query result is too large, it may consume a lot of local memory and may affect the performance of the client’s local machine.
  • Third, it does not record the execution commands in the local command history file.

So you see, the meaning of the --quick parameter is to make the client faster.

Summary #

In today’s article, I first introduced to you the situation where some statements and connections in MySQL cannot be “killed”.

These situations where connections cannot be “killed” are actually because the client sending the kill command does not forcefully stop the execution of the target thread, but only sets a status and wakes up the corresponding thread. The thread being killed needs to reach the “checkpoint” that checks the status before entering the termination logic phase. And, the termination logic itself also takes time.

Therefore, if you find a thread in the “Killed” state, the only thing you can do is to let this “Killed” state end as soon as possible by affecting the system environment.

For example, if it is the issue of InnoDB concurrency in the first example, you can temporarily increase the value of innodb_thread_concurrency or stop other threads to give way to this thread.

And if the rollback logic is executed slowly due to limited IO resources, you can speed it up by reducing system pressure.

After doing these operations, there is actually nothing else you can do but wait for the process to complete.

Finally, let me leave you with a question to think about.

If you encounter a transaction that has been killed and is stuck in the rollback state, do you think you should force restart the MySQL process or let it complete on its own? Why?

You can write your conclusions and analysis in the comments section, and I will discuss this question with you at the end of the next article. Thank you for listening, and feel free to share this article with more friends to read together.

Question from the previous article #

In the previous article, I left you with a question, hoping that you could share your experience in handling accidentally deleted data.

@苍茫 mentioned an example that I think is worth sharing with everyone. The operations team directly copies and executes text, causing SQL statements to be truncated, resulting in database execution error.

Copying and executing text from a web browser is a very non-standard operation. In addition to the SQL statement truncation issue mentioned in this example, there may also be garbled characters.

In general, if the script is developed and executed by different people, the developer needs to put the script on Git and send the Git address, as well as the MD5 of the file, to the operations team.

This means that the operations team needs to confirm the MD5 of the file they are about to execute and make sure it is the same as the one provided by the developer before proceeding with the execution.

In addition, I would like to give special praise to @苍茫’s approach to reproducing and troubleshooting issues.

@linhui0705 mentioned the “four-script” method, which I highly recommend. These four scripts are: backup script, execution script, verification script, and rollback script. If you can stick to this approach, even if problems occur, they can be quickly recovered from, and it will greatly reduce the probability of failures.

However, the biggest enemy of this approach is the thinking that “this is a small operation, it doesn’t need to be so strict”.

@Knight²º¹⁸ provided a method to protect files, which I haven’t used before, but it is indeed a good approach.

To ensure data security and service stability, it is better to design and discuss preventive solutions than to handle failures and conduct post-mortem analysis. The atmosphere in the meeting rooms for solution design discussion and post-mortem analysis is completely different. Those who have experienced it will understand.