31 What to Do After Accidentally Deleting Data #

Today, I want to discuss a heavy topic with you: accidentally deleting data.

In the previous articles, we discussed MySQL's high-availability architecture. Of course, a traditional high-availability architecture cannot prevent accidental data deletion, because a "drop table" command executed on the master is transmitted through the binlog to all slaves and cascaded slaves, so the entire cluster ends up executing it.

Although in most of the accidental-deletion incidents we hear about, the blame falls on the operations or DBA team, in reality anyone with data operation permissions can accidentally delete data.

Today, let’s talk about what we can do before and after accidentally deleting data, in order to reduce the risk of data deletion and the losses caused by it.

In order to find a more efficient solution to recover from accidental data deletion, we need to classify the accidental data deletion related to MySQL:

  1. Accidentally deleting data rows using the “delete” statement;
  2. Accidentally deleting data tables using the “drop table” or “truncate table” statement;
  3. Accidentally deleting databases using the “drop database” statement;
  4. Accidentally deleting the entire MySQL instance using the “rm” command.

Accidental Deletion of Rows #

In the [24th article], we mentioned that if data rows are accidentally deleted using the “delete” statement, they can be recovered using the Flashback tool.

The principle of recovering data with Flashback is to modify the content of the binlog and replay it on the original database. The precondition for using this method is to ensure that the binlog_format is set to “row” and binlog_row_image is set to “FULL”.

When recovering data, the following steps are taken for each transaction:

  1. For “insert” statements, the corresponding binlog event type is “Write_rows event”, and it needs to be changed to “Delete_rows event”;
  2. Similarly, for “delete” statements, the “Delete_rows event” needs to be changed to “Write_rows event”;
  3. For “update” statements, the binlog records the values of the data row before and after the modification, so swapping the positions of these two rows will suffice.
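As an illustration of these rules, suppose a row was accidentally deleted from a table t(id, c); the table name and values here are assumptions for illustration only. Because the row-format binlog carries the full row image, Flashback can turn the deletion back into an insert:

```sql
-- Accidental statement; with binlog_format=row and binlog_row_image=FULL,
-- the binlog records a Delete_rows event containing the complete row (5, 'abc'):
DELETE FROM t WHERE id = 5;

-- Flashback rewrites that event into a Write_rows event,
-- which is equivalent to replaying:
INSERT INTO t (id, c) VALUES (5, 'abc');
```

This is also why the FULL row image is a precondition: without the complete before-image in the binlog, the inverse insert could not be reconstructed.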

What if multiple accidents occur? For example, the following three transactions:

(A)delete ...
(B)insert ...
(C)update ...

Now, if we want to restore the database to the state before these three transactions were performed, after analyzing the binlog with the Flashback tool, the command to be executed on the master database is:

(reverse C)update ...
(reverse B)delete ...
(reverse A)insert ...

In other words, if multiple transactions are involved in the accidental data deletion, the order of the transactions needs to be reversed before execution.

It should be noted that I do not recommend executing these operations directly on the master database.

A safer approach to data recovery is to restore a backup or find a slave database as a temporary database, execute these operations on the temporary database, and then restore the confirmed data from the temporary database to the master database.

Why do we need to do this?

This is because on a master that is serving online traffic, data changes are often interrelated. If the problem is discovered a little late, the business logic may already have made further modifications on top of the accidentally changed rows. Restoring just those rows without verification could then cause secondary damage to the data.

Of course, it is not enough to only discuss how to handle accidental deletion after the fact; preventing it is more important. I have two suggestions:

  1. Set the sql_safe_updates parameter to on. This way, if we forget to write a where condition in a delete or update statement, or if the where condition does not include indexed fields, the execution of this statement will fail with an error.
  2. SQL auditing must be performed before deploying the code.
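For the first suggestion, here is a quick sketch of the behavior (the table t and its columns are hypothetical):

```sql
SET sql_safe_updates = ON;

-- Fails with ERROR 1175: there is no WHERE clause at all.
DELETE FROM t;

-- Also fails: the WHERE clause does not use an indexed column
-- (assuming column c has no index).
DELETE FROM t WHERE c = 10;

-- Succeeds: the WHERE clause uses the indexed column id.
DELETE FROM t WHERE id = 5;
```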

You may ask, what if I really need to delete all the data from a small table?

If you are sure that the delete operation is correct, you can add a where condition to the delete statement, such as where id>=0.

However, deleting the entire table is a slow process, as it requires generating rollback logs, writing redo logs, and writing binlogs. Therefore, from a performance perspective, you should consider using the truncate table or drop table command.
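To make the contrast concrete (table t is again hypothetical):

```sql
-- Deletes row by row: generates undo logs, redo logs, and row-based binlogs.
-- Slow for a full table, but recoverable with Flashback:
DELETE FROM t WHERE id >= 0;

-- Much faster, but the binlog records only the statement itself,
-- so Flashback cannot recover the data afterwards:
TRUNCATE TABLE t;
```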

Data deleted using the delete command can be recovered using Flashback. However, data deleted using the truncate/drop table or drop database commands cannot be recovered using Flashback. Why is that?

That’s because, even if we configure binlog_format=row, when executing these three commands, the recorded binlog is still in the statement format. The binlog only contains a truncate/drop statement, and it is not possible to recover the data from this information.

So, what should we do if we accidentally delete data using these commands?

Accidental Deletion of Database / Table #

In this case, to recover the data, we need to use a combination of full backups and incremental logs. This solution requires regular full backups and real-time backup of binlogs.

Assuming that both these conditions are met, here is the data recovery process if someone accidentally deletes a database at 12 noon:

  1. Take the most recent full backup. Suppose the database is backed up once a day, and the last backup was taken at midnight (0:00) that same day.
  2. Use the backup to restore a temporary instance of the database.
  3. Retrieve the logs from the backup after 12 midnight.
  4. Apply all these logs, except for the statement that caused the accidental data deletion, to the temporary instance.

The diagram below illustrates this process:


Figure 1 Data Recovery Process - mysqlbinlog Method

Regarding this process, I need to explain a few points to you:

  1. To speed up data recovery, if there are multiple databases on this temporary instance, you can use the mysqlbinlog command with the --database parameter to specify the database where the deleted table is located. This avoids the need to apply logs from other databases during data recovery.
  2. When applying the logs, you need to skip the binlog of the erroneous operation made at 12 noon:
       • If the original instance does not use GTID mode, you can only use the --stop-position parameter to apply the logs up to just before the erroneous operation, and then use the --start-position parameter to continue applying the logs from just after it.
       • If the instance uses GTID mode, it is much easier. Suppose the GTID of the erroneous command is gtid1. You only need to execute set gtid_next=gtid1;begin;commit; to add this GTID to the GTID set of the temporary instance. Then, when the binlogs are applied in order, the erroneous statement is automatically skipped.
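A sketch of the position-based approach might look as follows; the database name, binlog file name, positions, and connection parameters are all assumptions for illustration:

```shell
# Apply logs for database db1 only, up to just before the erroneous operation:
mysqlbinlog --database=db1 --stop-position=2738 master.000013 \
  | mysql -h temp_host -P 3306 -u root -p

# Then continue from just after the erroneous operation:
mysqlbinlog --database=db1 --start-position=2973 master.000013 \
  | mysql -h temp_host -P 3306 -u root -p
```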

However, even with these measures, data recovery using the mysqlbinlog method is still not fast enough for two main reasons:

  1. If a table is accidentally deleted, it would be best to only recover that table, i.e., only replay the operations for that table. However, the mysqlbinlog tool does not allow specifying the logs for only one table.
  2. Applying the logs using mysqlbinlog is a single-threaded process. The parallel replication methods we mentioned in [Article 26] cannot be used here.

One way to speed up the process is to set up the temporary instance as a slave of the online backup server after restoring it:

  1. Before starting the slave, use the change replication filter replicate_do_table = (db_name.tbl_name) command to make the temporary instance synchronize only the table affected by the accidental operation.
  2. This approach also allows the use of parallel replication techniques to accelerate the entire data recovery process.
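On the temporary instance, this might look as follows (db1.t is an assumed table name; the CHANGE REPLICATION FILTER syntax is available from MySQL 5.7 and requires the replication threads to be stopped):

```sql
STOP SLAVE;
-- Only replicate the table that was accidentally dropped:
CHANGE REPLICATION FILTER REPLICATE_DO_TABLE = (db1.t);
START SLAVE;
```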

The schematic diagram of this process is shown below.


Figure 2 Data recovery process - master-slave method

As can be seen from the diagram, there is a dashed line from the binlog backup system to the online backup library. It means that if, with the passage of time, the binlogs needed by the temporary instance have already been purged from the backup library, we can retrieve them from the binlog backup system and put them back into the backup library.

Suppose we find that the binlogs needed by the temporary instance start from master.000005, but the smallest binlog file listed by "show binary logs" on the backup library is master.000007. This means two binlog files have already been purged, and we need to retrieve them from the binlog backup system.

The steps to put back the previously deleted binlog files into the backup library are as follows:

  1. Download the two files master.000005 and master.000006 from the backup system and put them into the log directory of the backup library.
  2. Open the master.index file in the log directory and add two lines at the beginning of the file. The contents of the two lines are “./master.000005” and “./master.000006” respectively.
  3. Restart the backup library to make it recognize these two log files again.
  4. Now the backup library has all the binlogs required by the temporary library, and we can establish the master-slave relationship and proceed with normal synchronization.
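These steps might look like the following sketch; all paths here are assumptions, so adjust them to your environment:

```shell
# Log directory of the backup library (assumed path):
LOGDIR=/var/lib/mysql

# 1. Copy the two restored binlog files into the log directory:
cp /backup/binlog/master.000005 "$LOGDIR"/
cp /backup/binlog/master.000006 "$LOGDIR"/

# 2. Prepend the two file names to master.index, keeping the existing entries:
printf './master.000005\n./master.000006\n' \
  | cat - "$LOGDIR"/master.index > "$LOGDIR"/master.index.tmp
mv "$LOGDIR"/master.index.tmp "$LOGDIR"/master.index

# 3. Restart the backup library so it re-reads master.index,
#    then set up the master-slave relationship as usual.
```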

Whether we apply the binlog files parsed by the mysqlbinlog tool to the temporary library, or connect the temporary library to the backup library as a slave, the two methods share the same core idea: after an accidental deletion of a database or table, recover the data by restoring a full backup and then applying binlogs.

In other words, both methods require the backup system to take regular full backups and to back up binlogs in real time, so that every binlog has been saved before it is purged locally.

However, a system cannot back up unlimited logs, so you need to set a retention period for the logs based on cost and disk space resources. If your DBA team tells you that they can guarantee to restore a certain instance to any point in time within the past half month, this means that the backup system needs to retain logs for at least half a month.

In addition, I suggest that regardless of which method you use, you should automate this data recovery function and practice it regularly. Why do I say that?

There are two main reasons for this:

  1. Although "no one wants such a thing to happen", if an accidental deletion does occur, being able to recover the data quickly and minimize the loss should at least save you from having to pack your bags.
  2. Whereas if you perform manual operations in a panic and make a mistake, causing secondary damage to the business, that would be unacceptable.

Delayed Replication Backup #

Although we can speed up the process of recovering data by using parallel replication, this solution still has the problem of “uncontrollable recovery time”.

If the backup of an instance is particularly large, or if a long time has passed between the accidental operation and the last full backup (for example, an instance that is backed up once a week, where the accident happens on the sixth day after the backup), then six days' worth of logs must be applied, and the recovery time has to be measured in days.

So, how can we shorten the recovery time required for data?

If there are very critical businesses that do not allow for a long recovery time, we can consider setting up a delayed replication backup library. This feature was introduced in MySQL 5.6.

The problem with the general master-slave replication structure is that if a table on the master is accidentally deleted, this command will quickly be sent to all slave servers, leading to all data tables on the slave servers being deleted as well.

A delayed replication backup library is a special type of backup library that can be configured to have a delay of N seconds with respect to the master library through the “CHANGE MASTER TO MASTER_DELAY = N” command.

For example, if you set N to 3600, it means that if data is accidentally deleted on the master library and the accidental operation command is discovered within 1 hour, this command has not been executed on this delayed replication backup library. At this time, you can execute “stop slave” on this backup library, and using the method mentioned earlier, skip the accidental operation command and recover the required data.
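Setting up such a one-hour delayed backup library might look like this, executed on the backup library itself (the delay value is just an example):

```sql
STOP SLAVE;
-- Keep this library 3600 seconds (1 hour) behind the master:
CHANGE MASTER TO MASTER_DELAY = 3600;
START SLAVE;
```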

In this way, you can always obtain a temporary instance that can recover data in at most 1 hour, thereby shortening the overall data recovery time.

Methods to Prevent Accidental Deletion of Databases/Tables #

Although accidents can happen, there are still ways to avoid them. So here, I will also provide you with some suggestions to reduce the risk of accidental deletion.

The first recommendation is to separate accounts. The purpose of doing this is to minimize the chances of making mistakes in commands. For example:

  • We only grant DML permissions to business development colleagues and do not grant truncate/drop permissions. If business developers have DDL needs, they can also be supported through the development management system.
  • Even members of the DBA team are required to use read-only accounts for day-to-day work, and to switch to accounts with write permissions only when necessary.

The second recommendation is to establish operational guidelines. The purpose of this is to avoid writing the wrong table name to delete. For example:

  • Before deleting a table, you must first rename the table. Then, observe for a period of time to ensure that there is no impact on the business before deleting the table.
  • When renaming a table, it is required to add a fixed suffix to the table name (for example, add _to_be_deleted), and the action of deleting the table must be executed through the management system. Additionally, the management system can only delete tables with the fixed suffix.
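Following these guidelines, dropping a table becomes a two-step process (db1.t is an assumed table name):

```sql
-- Step 1: take the table out of use and observe for a while:
RENAME TABLE db1.t TO db1.t_to_be_deleted;

-- Step 2: only after confirming there is no business impact, drop it
-- through the management system, which accepts only table names
-- ending in the fixed suffix:
DROP TABLE db1.t_to_be_deleted;
```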

rm Deleting Data #

In fact, for a MySQL cluster with a high-availability mechanism, accidental deletion with rm is the least worrying case. As long as the entire cluster is not deleted maliciously and only the data of one node is lost, the HA system will kick in and elect a new master, keeping the whole cluster running normally.

At this point, what you need to do is to restore the data on this node and connect it to the entire cluster.

Of course, nowadays it is not only DBAs who have automated systems; SAs (system administrators) have them too, so an operation that takes a batch of machines offline could shut down every node of your MySQL cluster at once.

To cope with this situation, my only suggestion is to store your backups across data centers, or better still, across cities.

Summary #

Today, I discussed several possibilities of accidental data deletion and methods for handling them.

However, I want to emphasize that prevention is much more significant than dealing with the consequences.

In addition, in MySQL cluster solutions, backups are often used to restore instances. Therefore, it is also necessary to periodically check the effectiveness of backups.

If you are a business development colleague, you can use the show grants command to view the account permissions. If the permissions are too high, you can suggest that DBAs allocate a lower-level account to you. You can also evaluate the importance of the business and discuss with DBAs the backup cycle and whether it is necessary to create a delayed replication slave.
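Checking your own permissions is a one-liner:

```sql
-- Lists the privileges of the account you are currently using:
SHOW GRANTS FOR CURRENT_USER();
```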

The reliability of data and services is not only the responsibility of the operations team, but the result of the joint effort of all stages.

The after-class topic for today is to recall the accidental data deletion events you have personally experienced. What methods did you use to recover the data? What experience did you gain from this process?

You can write about your experiences and insights in the comments section. I will select interesting comments to discuss with you at the end of the next article. Thank you for listening, and feel free to share this article with more friends to read together.

Previous Issue Question #

The question I left for you in the previous article was about the definition of gaps in empty tables.

An empty table only has one gap. For example, by executing the following on an empty table:

begin;
select * from t where id>1 for update;

The range locked by this query is the next-key lock (-∞, supremum].
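A sketch of such a verification sequence, assuming the series' usual table t(id, c, d), which is empty here:

```sql
-- Session A
begin;
select * from t where id > 1 for update;

-- Session B: blocked, because session A's next-key lock (-∞, supremum]
-- covers the whole (empty) table, so no insert can proceed:
insert into t values (1, 1, 1);
```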

For verification, you can use the operation sequence shown in Figure 3 below.


Figure 3: Reproduction of next-key lock on an empty table
