41 How to Copy a Table the Fastest Way #

In my previous article, I left you with a question about how to copy data between two tables. If you can control the number of rows scanned on the source table and keep the locking range small, you can simply use an insert ... select statement to do it.

Of course, to avoid adding read locks on the source table, a more reliable solution is to write the data to an external text file and then load it into the target table. There are two commonly used ways to do this, and I will explain both in detail.

To facilitate the explanation, I will first create a table db1.t and insert 1000 rows of data, while also creating a table db2.t with the same structure.

create database db1;
use db1;

create table t(id int primary key, a int, b int, index(a)) engine=innodb;
delimiter ;;
  create procedure idata()
  begin
    declare i int;
    set i=1;
    while(i<=1000)do
      insert into t values(i,i,i);
      set i=i+1;
    end while;
  end;;
delimiter ;
call idata();

create database db2;
create table db2.t like db1.t;

Now, let’s say we want to export the data rows where a > 900 from db1.t and insert them into db2.t.
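As a baseline, if it is acceptable to scan and lock rows of db1.t during the copy, a single statement is enough. A minimal sketch using the tables just created:

insert into db2.t select * from db1.t where a>900;

The rest of this article covers the file-based alternatives, which avoid holding locks on db1.t for the duration of the copy.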

Using mysqldump #

One method is to use the mysqldump command to export the data as a set of INSERT statements. You can use the following command:

mysqldump -h$host -P$port -u$user --add-locks=0 --no-create-info --single-transaction  --set-gtid-purged=OFF db1 t --where="a>900" --result-file=/client_tmp/t.sql

This command has the following main parameters:

  1. --single-transaction means the data is exported without locking table db1.t; instead, the dump runs inside a transaction started with START TRANSACTION WITH CONSISTENT SNAPSHOT.
  2. --add-locks=0 means the output file will not contain “LOCK TABLES t WRITE;”.
  3. --no-create-info means the table structure is not exported.
  4. --set-gtid-purged=off means GTID-related information is not written to the output.
  5. --result-file specifies the path of the output file; the client prefix in the path indicates that the file is generated on the client machine.

The t.sql file generated by this mysqldump command contains the INSERT statements shown in the following image:

Figure 1: INSERT statements in the t.sql file generated by mysqldump.

As you can see, each INSERT statement contains multiple value pairs, which is designed to improve the insert performance.

If you want each INSERT statement in the generated file to insert only one row of data, you can add the --skip-extended-insert parameter when executing the mysqldump command.
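To make the difference concrete, here is roughly what the generated statements look like in each mode (values abbreviated; exact formatting may vary by mysqldump version):

-- default extended-insert format: one statement carries many rows
INSERT INTO `t` VALUES (901,901,901),(902,902,902),...,(1000,1000,1000);

-- with --skip-extended-insert: one statement per row
INSERT INTO `t` VALUES (901,901,901);
INSERT INTO `t` VALUES (902,902,902);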

Next, you can use the following command to execute these INSERT statements in the db2 database:

mysql -h127.0.0.1 -P13000 -uroot db2 -e "source /client_tmp/t.sql"

It is important to note that source is not an SQL statement but a client command. The mysql client executes it as follows:

  1. Open the file and read the SQL statements one at a time, using the semicolon as the default delimiter.
  2. Send each SQL statement to the server for execution.

In other words, what the server actually executes is the INSERT statements, not the “source t.sql” statement. Therefore, both the slow query log and the binlog record the INSERT statements that are actually executed.
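Incidentally, redirecting the file into the client has the same effect, since the statements are again read and sent one by one. An equivalent invocation, following the same connection parameters used above:

mysql -h127.0.0.1 -P13000 -uroot db2 < /client_tmp/t.sql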

Exporting to CSV file #

Another method is to directly export the result as a .csv file. MySQL provides the following syntax to export query results to the local directory of the server:

select * from db1.t where a>900 into outfile '/server_tmp/t.csv';

When using this statement, please note the following:

  1. This statement saves the result on the server. If the client and the MySQL server are not the same machine, no t.csv file will appear in the client’s temporary directory.
  2. into outfile specifies where the output file goes (/server_tmp/ here). This location must satisfy the restriction imposed by the secure_file_priv parameter, whose possible values are (you can check the current value with the query shown after this list):
    • If it is set to empty, there is no restriction on where the file may be generated, which is insecure.
    • If it is set to a path string, the generated file can only be placed in that directory or its subdirectories.
    • If it is set to NULL, select ... into outfile is prohibited on this MySQL instance.
  3. This command will not overwrite an existing file, so you need to make sure /server_tmp/t.csv does not exist beforehand; otherwise the statement fails with an error because a file with the same name already exists.
  4. In the generated text file, one line corresponds to one data row in principle. If a field value contains a newline character, the newline also appears in the output; however, special characters such as newlines and tabs are prefixed with the escape character “\” so they can be distinguished from the field and line delimiters.
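A quick way to see which of the three secure_file_priv cases applies on your instance:

show variables like 'secure_file_priv';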

After obtaining the exported .csv file, you can use the following load data command to import the data into the target table db2.t:

load data infile '/server_tmp/t.csv' into table db2.t;

The execution process of this statement is as follows:

  1. Open the file /server_tmp/t.csv and read the data, using the tab character (\t) as the field delimiter and the newline character (\n) as the record delimiter (these defaults can be written explicitly, as shown after this list).
  2. Start a transaction.
  3. Check whether the number of fields in each row matches that of table db2.t:
    • If it does not, an error is reported and the transaction is rolled back.
    • If it does, construct the row and call the InnoDB engine interface to write it into the table.
  4. Repeat step 3 until the whole /server_tmp/t.csv file has been read, then commit the transaction.
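The delimiters described in step 1 are MySQL’s defaults; they can also be spelled out explicitly, which is what you would adjust if the file used different separators. A sketch equivalent to the statement above:

load data infile '/server_tmp/t.csv' into table db2.t
  fields terminated by '\t' lines terminated by '\n';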

You may be wondering: if binlog_format=statement, how is the LOAD DATA statement recorded in the binlog replayed on the backup?

Since the /server_tmp/t.csv file exists only on the master server, if only the statement itself were written to the binlog, the backup would not have this file when replaying it, and master-slave synchronization would stop.

Therefore, the complete process for executing this statement is as follows:

  1. After the master executes the statement, the content of the /server_tmp/t.csv file is written directly into the binlog file.
  2. The statement LOAD DATA LOCAL INFILE '/tmp/SQL_LOAD_MB-1-0' INTO TABLE db2.t is then written into the binlog file.
  3. The binlog file is transferred to the backup.
  4. When the apply thread on the backup executes this transaction log:
    a. It first reads the content of the t.csv file from the binlog and writes it to the local temporary directory as /tmp/SQL_LOAD_MB-1-0.
    b. It then executes the LOAD DATA statement, inserting into the backup’s db2.t table the same data as on the master.

The execution process is shown in the following diagram:


Figure 2: Synchronization process of LOAD DATA.

Note that the LOAD DATA statement executed on the backup includes an additional “local” keyword. Its meaning is “load the contents of the file /tmp/SQL_LOAD_MB-1-0 on the machine where the command is executed into the target table db2.t”.

In other words, the LOAD DATA command has two usages:

  1. Without “local”, it reads a server-side file, which must be located in the directory (or a subdirectory) specified by secure_file_priv.
  2. With “local”, it reads a client-side file, as long as the mysql client has permission to access it. In this case, the mysql client first transfers the local file to the server and then executes the LOAD DATA process described above (both forms are sketched below).
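Side by side, the two forms look like this (a sketch; the local form additionally requires the local_infile option to be enabled on the server):

-- server-side file, restricted by secure_file_priv:
load data infile '/server_tmp/t.csv' into table db2.t;

-- client-side file, uploaded to the server by the mysql client first:
load data local infile '/client_tmp/t.csv' into table db2.t;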

It is also important to note that the SELECT ... INTO OUTFILE method does not generate a table structure file. Therefore, when exporting data, a separate command is needed to obtain the table structure definition. mysqldump provides a --tab parameter that can export both the table structure definition file and the CSV data file. The usage of this command is as follows:

mysqldump -h$host -P$port -u$user --single-transaction --set-gtid-purged=OFF db1 t --where="a>900" --tab=$secure_file_priv

This command will create a t.sql file in the directory defined by $secure_file_priv, which saves the table creation statement, and simultaneously create a t.txt file to save the CSV data.
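To import on the target side, one possible sequence (a sketch; it assumes the commands run on a host that can read the dump files) is to replay the table definition first and then load the data file:

mysql -h$host -P$port -u$user db2 < $secure_file_priv/t.sql
mysql -h$host -P$port -u$user -e "load data infile '$secure_file_priv/t.txt' into table db2.t"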

Physical Copying Method #

The mysqldump method and exporting CSV files mentioned earlier are both logical data export methods, meaning that the data is read from table db1.t, converted to a text format, and then written into the target table db2.t.

You may wonder if there is a physical data export method. For example, can we directly copy the .frm and .ibd files of table db1.t to the db2 directory?

The answer is no.

This is because an InnoDB table not only consists of these two physical files, but it also needs to be registered in the data dictionary. If you simply copy these two files, the system will not recognize and accept them because there is no table db2.t in the data dictionary.

However, starting from MySQL 5.6, a method called transportable tablespaces was introduced, which allows tables to be copied physically by exporting and importing their tablespaces.

Suppose our goal is to create a new table r under the db1 database that is identical to table t. The specific steps to execute are as follows:

  1. Execute CREATE TABLE r LIKE t to create an empty table with the same structure.

  2. Execute ALTER TABLE r DISCARD TABLESPACE to delete the r.ibd file.

  3. Execute FLUSH TABLE t FOR EXPORT. This will generate a t.cfg file in the db1 directory.

  4. In the db1 data directory, execute the commands cp t.cfg r.cfg; cp t.ibd r.ibd (note that the copied files must be readable and writable by the MySQL process).

  5. Execute unlock tables, and the t.cfg file will be deleted.

  6. Execute alter table r import tablespace to use the r.ibd file as the new tablespace for table r. Since the data in this file is the same as in t.ibd, table r now contains the same data as table t (the full sequence is summarized below).
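Put together, the whole sequence looks like this (a sketch of the steps above; the cp commands run in a shell inside the db1 data directory, everything else in a mysql session connected to db1):

create table r like t;
alter table r discard tablespace;   -- removes r.ibd
flush table t for export;           -- writes t.cfg; t becomes read-only
-- in the shell: cp t.cfg r.cfg; cp t.ibd r.ibd
unlock tables;                      -- deletes t.cfg and releases the lock
alter table r import tablespace;    -- adopts r.ibd as r's tablespace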

With these steps, the process of copying table data is completed. The execution diagram for this process is shown below:


Figure 3: Physical copying of a table.

Regarding this process of copying tables, here are a few points to note:

  1. After the flush table command in step 3 is executed, the whole db1.t table is in a read-only state until unlock tables releases the read lock.
  2. When import tablespace is executed, the tablespace ID in r.ibd is modified to match the data dictionary. Since the tablespace ID is stored in every data page, if the file is large (say, in the TB range), every data page has to be updated, so the import statement takes some time to run. Even so, compared with the logical import methods, the import statement is still fast.

Summary #

In this article, I introduced three methods for importing data from one table to another.

Let’s compare the advantages and disadvantages of these three methods.

  1. The physical copying method is the fastest, especially for copying large tables. If a table is accidentally deleted, the quickest way to recover the data is to restore a temporary instance from a backup taken before the deletion, and then copy the table from the temporary instance into the production database. However, this method has some limitations:
    • It can only copy an entire table, not a subset of the data.
    • The files have to be copied on the server, so it cannot be used when you cannot log in to the database host.
    • Both the source and target tables must use the InnoDB engine, since the method copies physical files.
  2. Using mysqldump to generate a file of INSERT statements lets you export only part of the data by adding filter conditions in the --where parameter. One limitation is that WHERE conditions involving joins cannot be used.
  3. The select ... into outfile method is the most flexible, as it supports arbitrary SQL. However, it can only export the data of one table at a time, and the table structure has to be backed up separately.

The last two methods are logical backup methods and can be used across different database engines.

Lastly, I’ll leave you with a question.

Earlier in this article, when discussing binlog_format=statement, we saw that the load data command recorded in the binlog carries a “local” parameter. Since this command is sent to the backup for execution, why does it need the “local” parameter when executed there? What problems would occur if the load data command in the binlog did not have “local”?

Feel free to share your analysis in the comments, and I will discuss this question with you in the next article. Thank you for reading, and feel free to share this article with more friends to read together.

Previous Question Discussion #

I have answered the question I left you with in the last article during the discussion in today’s article.

There were a few excellent comments in the comment section of the previous article, which I’d like to share with you.

@huolang raised a question: if session A has acquired a write lock on the record with c=5, why can session B and session C still acquire a read lock on c=5?

This is because a next-key lock is taken in two steps: first the gap lock, then the record lock. Session B and session C succeed in acquiring the gap lock but are blocked when they try to acquire the record lock. If you have any questions about this process, you can review the relevant content in [Article 30].

@一大只 conducted an experiment to verify the effect of inserting into a table with a gap lock after a primary key conflict. This experiment was more intuitive than the example I provided in the previous article, demonstrating his excellent understanding and analysis of this knowledge. Well done!

@roaming confirmed that in MySQL 8.0, it is now possible to use temporary tables to handle insert ... select statements that write back into the original table.