14 Partitioning Tables in Which Scenarios I Don't Recommend Partitioning

In the previous lessons, we have learned about various aspects of index design, including the principles of indexing, index covering, and optimization of complex SQL queries. In this final lesson of Module 2, I would like to talk about the design of partitioned tables, which are used to physically partition data.

Partitioned tables involve both table structure design and index design, as well as a philosophical question for a database: should we use partitioned tables?

Next, we will learn about partitioned tables, including their usage, considerations, misconceptions, and their design in a business context.

Usage of Partitioned Tables #

Simply put, a partitioned table is a logical large table created by combining several tables with identical physical table structures using a certain algorithm. This algorithm is referred to as the “partition function,” and MySQL currently supports five types of partition functions: RANGE, LIST, HASH, KEY, and COLUMNS.

Regardless of which partition function is chosen, you need to specify the columns that will serve as the input conditions for the partition algorithm. These columns are known as “partition columns.” Additionally, in MySQL partitioned tables, the primary key must also be a part of the partition columns. Otherwise, the creation of the partitioned table will fail. For example:

CREATE TABLE t (
    a INT,
    b INT,
    c DATETIME(6),
    d VARCHAR(32),
    e INT,
    PRIMARY KEY (a,b)
)
partition by range columns(c) (
    PARTITION p0000 VALUES LESS THAN ('2019-01-01'),
    PARTITION p2019 VALUES LESS THAN ('2020-01-01'),
    PARTITION p2020 VALUES LESS THAN ('2021-01-01'),
    PARTITION p9999 VALUES LESS THAN (MAXVALUE)
);

ERROR 1503 (HY000): A PRIMARY KEY must include all columns in the table's partitioning function (prefixed columns are not considered).

In the above example, table t is created with a composite primary key consisting of columns a and b. The intention is to split the data based on the column c (datetime column) and store different time data in different partitions.

However, from the error message, we can see that the primary key of the partitioned table must include the columns used in the partition function. Therefore, to create a partitioned table that splits data based on column c, the primary key must include column c. The following table creation statement will be successful:

CREATE TABLE t (
    a INT,
    b INT,
    c DATETIME,
    d VARCHAR(32),
    e INT,
    PRIMARY KEY (a,b,c),
    KEY idx_e (e)
)
partition by range columns(c) (
    PARTITION p0000 VALUES LESS THAN ('2019-01-01'),
    PARTITION p2019 VALUES LESS THAN ('2020-01-01'),
    PARTITION p2020 VALUES LESS THAN ('2021-01-01'),
    PARTITION p9999 VALUES LESS THAN (MAXVALUE)
);

After creating the table, you will find four corresponding ibd files for each partition in the physical storage. In other words, the data is stored in different files based on the time column c:

t#p#p0000.ibd  t#p#p2019.ibd  t#p#p2020.ibd  t#p#p9999.ibd

So, what you need to understand is that a partitioned table in MySQL divides a large table into multiple smaller tables, each with its own index. From a logical perspective, it is still considered as one table, but physically it is stored in different files.

Furthermore, when it comes to implementing unique indexes, it may be different from what you originally thought. Let’s continue to explore.

Considerations for Partitioned Tables: Unique Indexes #

In MySQL databases, the indexes of partitioned tables are local, not global. This means that each partition file has its own independent index. Therefore, unique indexes on partitioned tables must include the partitioning column information; otherwise, an error will occur during creation. For example:

ALTER TABLE t ADD UNIQUE KEY idx_d(d);

ERROR 1503 (HY000): A UNIQUE INDEX must include all columns in the table's partitioning function (prefixed columns are not considered).

As you can see from the error message, a unique index must include all columns in the partition function. The following creation statement will be successful:

ALTER TABLE t ADD UNIQUE KEY idx_d(d,c);

However, because the unique index includes the partitioning column, the uniqueness of the index is now limited to the current partition only, rather than being global. As a result, both of the following records can be inserted into table t:

INSERT INTO t VALUES
(1,1,'2021-01-01','aaa',1),
(1,1,'2020-01-01','aaa',1);

SELECT * FROM t;
+---+---+---------------------+------+------+
| a | b | c                   | d    | e    |
+---+---+---------------------+------+------+
| 1 | 1 | 2020-01-01 00:00:00 |aaa   |    1 |
| 1 | 1 | 2021-01-01 00:00:00 |aaa   |    1 |
+---+---+---------------------+------+------+

As you can see, the values in column d are both ‘aaa’, but they can still be inserted. This means that column d is not unique. Therefore, the uniqueness is only achieved within the current partition, not globally.

So, how can we achieve global uniqueness in a unique index? Similar to the table structure design we discussed earlier, a global unique string can be used for the unique index (such as a UUID-like implementation), which will prevent the issue of local uniqueness.

Misconception of Partitioned Tables: Performance Improvement #

Many people mistakenly believe that partitioned tables can significantly improve the performance of MySQL databases because they divide a large table into several smaller tables. This is a misconception! If you expect partitioned tables to improve performance, then I advise against using partitions because it won’t achieve that goal.

Partitioned tables are not used to improve the performance of MySQL databases; they are designed to facilitate data management.

Let’s revisit the relationship between the height of a B+ tree and the amount of data storage mentioned in Lesson 08:

From the table, we can see that a B+ tree with a height of 4 can store billions of data records, and a single query only requires 4 I/O operations, which is very fast.

However, after using partitioning, the effect is different. For example, for the table t mentioned earlier, if we split it into one table per year based on time, although the height of the B+ tree decreases from 4 to 3, the improvement is minimal.

Furthermore, partitioned tables introduce new performance issues, such as queries on non-partition columns. Even if an index has been created on the partition column, queries on non-partition columns still need to be performed on each partition separately. Next, let’s take a look at this SQL statement and its execution plan:

SELECT * FROM t WHERE d = 'aaa'

******** 1. row ********

   id: 1

select_type: SIMPLE

  table: t

partitions: p0000,p2019,p2020,p9999

   type: ALL

possible_keys: NULL

     key: NULL

 key_len: NULL

     ref: NULL

    rows: 2

filtered: 50.00

   Extra: Using where

From the execution plan, we can see that the above SQL statement needs to access 4 partitions. Assuming each partition requires 3 I/O operations, this SQL statement will require a total of 12 I/O operations. However, if a regular table is used, regardless of the number of records, it will only require 4 I/O operations.

Therefore, when designing a partitioned table, it is necessary to understand that your query conditions are all based on the partitioning field, otherwise it will scan all the data or indexes of all partitions. So, the design of partitioned tables does not solve performance problems, but rather focuses more on data migration and backup issues.

To help you better understand the use of partitioned tables, let’s continue with the design of a real business partitioned table.

Business-oriented Design of Partitioned Tables #

Using the Orders table in an e-commerce platform as an example, in massive Internet businesses like Taobao, the data volume of the Orders table can be enormous. For example, if 50 million orders are generated in a day, the Orders table will have nearly 18 billion records in a year.

Therefore, for the Orders table, usually only the data for the most recent year or even a shorter period of time is stored in the database, while historical order data is moved to an archive. Unless there are refund orders that span over a year, most orders become irrelevant from a business perspective once they are completed.

So, if you want to easily manage the data in the Orders table, you can create partitioned tables for the Orders table based on years, like this:

CREATE TABLE `orders` (

  `o_orderkey` int NOT NULL,

  `O_CUSTKEY` int NOT NULL,

  `O_ORDERSTATUS` char(1) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL,

  `O_TOTALPRICE` decimal(15,2) NOT NULL,

  `O_ORDERDATE` date NOT NULL,

  `O_ORDERPRIORITY` char(15) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL,

  `O_CLERK` char(15) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL,

  `O_SHIPPRIORITY` int NOT NULL,

  `O_COMMENT` varchar(79) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL,

  PRIMARY KEY (`o_orderkey`,`O_ORDERDATE`),

  KEY `orders_fk1` (`O_CUSTKEY`),

  KEY `idx_orderdate` (`O_ORDERDATE`)

)

PARTITION BY RANGE  COLUMNS(o_orderdate)

(

  PARTITION p0000 VALUES LESS THAN ('1992-01-01') ENGINE = InnoDB,

  PARTITION p1992 VALUES LESS THAN ('1993-01-01') ENGINE = InnoDB,

  PARTITION p1993 VALUES LESS THAN ('1994-01-01') ENGINE = InnoDB,

  PARTITION p1994 VALUES LESS THAN ('1995-01-01') ENGINE = InnoDB,

  PARTITION p1995 VALUES LESS THAN ('1996-01-01') ENGINE = InnoDB,

  PARTITION p1996 VALUES LESS THAN ('1997-01-01') ENGINE = InnoDB,

  PARTITION p1997 VALUES LESS THAN ('1998-01-01') ENGINE = InnoDB,

  PARTITION p1998 VALUES LESS THAN ('1999-01-01') ENGINE = InnoDB,

  PARTITION p9999 VALUES LESS THAN (MAXVALUE)

)

As you can see, in this case, the primary key of the Orders table is modified to (o_orderkey,O_ORDERDATE), and the data is partitioned and stored based on years. If you want to delete data that is older than 1 year, for example, deleting the data for 1998, you need to use the following SQL statement:

DELETE FROM Orders
WHERE o_orderdate >= '1998-01-01'
  AND o_orderdate < '1999-01-01'

However, executing this SQL statement can be quite slow, generate a large amount of binary logs, and cause master-slave replication delays in a production system. With the use of partitioned tables, managing the data becomes much easier, and you can simply use the command to truncate a partition:

ALTER TABLE orders_par
TRUNCATE PARTITION p1998

The above SQL statement executes very quickly because it actually deletes and rebuilds the partition file. Moreover, it only generates a single DDL log, without causing any replication delay issues.

# at 425

#210328 12:10:12 server id 8888  end_log_pos 549        Query   thread_id=9     exec_time=0     error_code=0    Xid = 10

SET TIMESTAMP=1619583012/*!*/;

/*!80013 SET @@session.sql_require_primary_key=0*//*!*/;
    
ALTER TABLE orders TRUNCATE PARTITION p1998

/*!*/;

Summary #

In this lesson, we learned about the principles and usage of partitioned tables, and finally demonstrated how to use partitioned tables using the online Orders table in an e-commerce system. Of course, real e-commerce businesses utilize distributed architectures and employ table partitioning techniques, which we will cover in detail later.

I would like to emphasize today’s key points:

Current MySQL partitioned tables support the partitioning algorithms of RANGE, LIST, HASH, KEY, and COLUMNS.
The creation of a partitioned table requires the primary key to include the partitioning column.
In a partitioned table, a unique index is only unique within the current partition file, not globally unique.
It is recommended to use a globally unique implementation, such as UUID, for unique indexes in partitioned tables.
Partitioned tables do not solve performance problems. If non-partitioned columns are used for querying, the performance may actually be worse.
It is recommended to use partitioned tables for data management, fast performance, and smaller logs.

I would like to remind you once again that partitioned tables are not a means to improve performance, but rather a way to facilitate data management.