05 Table Structure Design Forget the Norms of Normalization

05 Table Structure Design - Forget the Norms of Normalization #

In the previous lessons, I introduced you to the selection of numeric types, strings, date types, and unstructured JSON types for each field in a record.

However, when designing a table, there are some basic principles that we need to follow, such as the “normalization guidelines” that you often hear about. But the normalization guidelines are too theoretical, and in real business scenarios, you don’t need to strictly adhere to the requirements of the three normal forms. Moreover, sometimes for performance considerations, you can even design denormalized structures, such as in the data warehouse field. In this lesson, I will take you to understand these contents. I hope that after completing this lesson, you can look at the table structure design of a MySQL database from a higher level.

Forgetting Normalization Guidelines #

I believe that when you studied “Introduction to Database Systems” in university, you must have learned about the design specifications of relational databases, such as the first normal form, second normal form, third normal form, and BCNF. They are important exam points in “Introduction to Database Systems”.

Normalization design is a very important theory. It is a process of deriving normalization through mathematical set concepts. In theory, it requires the table structure design to at least meet the requirements of the three normal forms.

Since it is a completely data deduction process, the normalization theory is very tedious. But as long as you remember a few key points, you can grasp the essence of it:

  • The first normal form requires that all attributes are indivisible basic data items;
  • The second normal form solves partial dependencies;
  • The third normal form solves transitive dependencies.

Although I have summarized the essence of normalization design, in order to truly understand normalization design, you need to abandon the purely theoretical normalization guidelines and design table structures that meet the requirements of the normalization guidelines from a business perspective.

Practical Table Structure Design in Engineering #

Real business scenarios involve engineering implementation. Designing table structures to meet the following points is already sufficient:

  • Each table must have a primary key (primary key design methods include auto-increment primary key design, UUID primary key design, and business-customized generated primary key);
  • Eliminate the possibility of redundant data.

I want to emphasize again that you don’t need to pursue the so-called database normalization guidelines too much, and sometimes we will even design denormalized structures.

Auto-increment Primary Key Design #

The primary key is used to uniquely identify a row of data, so if a table has a primary key, it already satisfies the requirements of the first normal form. In Lesson 01 on integer types, I mentioned that BIGINT with auto-increment type can be used as the primary key. In addition, because integers have the property of auto-increment, database insertion is also sequential, which has good performance.

However, you need to pay attention that using BIGINT with auto-increment type as the primary key design is only suitable for non-core business tables, such as alarm tables, log tables, etc. For true core business tables, do not use auto-increment keys as primary keys , mainly for the following 6 reasons:

  • Auto-increment has backtracking problems;
  • Auto-increment values are generated on the server side, resulting in concurrent performance issues;
  • Auto-increment values as primary keys can only guarantee uniqueness within the current instance, but not globally unique ;
  • Exposing data values can easily cause security issues. For example, knowing the address http://www.example.com/User/10/, it is easy to guess that User has values 11, 12, and so on, which can easily lead to data leakage;
  • Performance issues caused by MGR (MySQL Group Replication);
  • Distributed architecture design issues.

I have mentioned the backtracking problem of auto-increment in Lesson 01. If you want to use auto-increment as the primary key for core business tables, the MySQL database version should be upgraded to version 8.0 as much as possible.

Also, because auto-increment values are generated on the server side, they need an auto-increment AI lock to protect them. If there are a large number of insertion requests at this time, performance bottlenecks caused by auto-increment may occur. For example, in the MySQL database, the parameter innodb_autoinc_lock_mode is used to control the time the auto-increment lock is held. Suppose there is an SQL statement that inserts 3 records with auto-increment values at the same time:

INSERT INTO ... VALUES (NULL,...),(NULL,...),(NULL,...);

The impact of the innodb_autoinc_lock_mode parameter is as shown in the following table:

图片1.png

From the table, you can see that when one SQL statement inserts 3 records, if the innodb_autoinc_lock_mode parameter is set to 1, the auto-increment lock is released only after this SQL statement is executed.

If the innodb_autoinc_lock_mode parameter is set to 2, the auto-increment lock needs to be held 3 times, acquiring the auto-increment lock for each inserted record.

  • The advantage of this design is: The current insertion does not affect the insertion of other auto-increment primary keys, and the maximum concurrent insertion performance of auto-increment can be achieved.
  • The disadvantage is: Multiple records inserted by one SQL statement are not consecutive. For example, the result may be 1, 3, 5, which is monotonically increasing but not continuous.

Therefore, if you want to achieve the maximum performance of auto-increment values, set the innodb_autoinc_lock_mode parameter to 2.

Although we can adjust the innodb_autoinc_lock_mode parameter to achieve the maximum performance of auto-increment, due to the aforementioned 5 problems, in practical battles of internet-scale concurrency architecture, I recommend using UUID as the primary key or business-customized generated primary key.

UUID Primary Key Design #

UUID (Universally Unique Identifier) represents a globally unique identifier. Obviously, due to its global uniqueness, you can use it as the primary key of a database.

MySQL database follows the Version 1 specification defined in DRFC 4122 to generate a 36-byte character through the UUID function. For example:

mysql> SELECT UUID();

+--------------------------------------+

| UUID()                               |

+--------------------------------------+

| e0ea12d4-6473-11eb-943c-00155dbaa39d |
According to Version 1 specification, the UUID in MySQL is composed of the following parts:

UUID = time_low (4 bytes) - time_mid + version (4 bytes) - clock_seq - MAC address

In the first 8 bytes, 60 bits are used to store the time and 4 bits are used for the UUID version. The time is counted in 100ns from October 15, 1582 00:00:00.00 until now.

In the 60-bit time storage, it is divided into:
- time_low, occupying 12 bits
- time_mid, occupying 2 bytes (16 bits)
- time_high, occupying 4 bytes (32 bits)

It is important to note that when storing time, the UUID is stored in reverse order based on the time bits, i.e., the lower time_low is stored at the front and the higher time_high is stored at the end. This means that the first 4 bytes of the UUID will "randomly" change with the passage of time and are not monotonically increasing. Non-sequential values during insertion can cause scattered IO, resulting in performance bottlenecks. This is also the biggest drawback of UUID compared to auto-increment values.

To solve this problem, MySQL 8.0 introduced the function UUID_TO_BIN, which can convert the UUID string in the following ways:
- By passing parameters, the time_high is placed at the front to solve the problem of unordered insertion of UUIDs.
- Removes the unnecessary "-" characters to reduce storage space.
- Converts the string into a binary value, reducing the space from the previous 36 bytes to 16 bytes.

Now let's convert the previous UUID string "e0ea12d4-6473-11eb-943c-00155dbaa39d" using the UUID_TO_BIN function and obtain the binary value:

SELECT UUID_TO_BIN('e0ea12d4-6473-11eb-943c-00155dbaa39d', TRUE) as UUID_BIN;

+------------------------------------+
| UUID_BIN                           |
+------------------------------------+
| 0x11EB6473E0EA12D4943C00155DBAA39D |
+------------------------------------+
1 row in set (0.00 sec)

In addition, MySQL 8.0 also provides the BIN_TO_UUID function, which can reverse a binary value back to a UUID string.

However, before MySQL 8.0, there were no UUID_TO_BIN/BIN_TO_UUID functions. But you can still solve it by using user-defined functions (UDF). For example, create the following functions:

CREATE FUNCTION MY_UUID_TO_BIN(_uuid BINARY(36))
    RETURNS BINARY(16)
    LANGUAGE SQL DETERMINISTIC CONTAINS SQL SQL SECURITY INVOKER
    RETURN UNHEX(CONCAT(
        SUBSTR(_uuid, 15, 4),
        SUBSTR(_uuid, 10, 4),
        SUBSTR(_uuid, 1, 8),
        SUBSTR(_uuid, 20, 4),
        SUBSTR(_uuid, 25)
    ));

CREATE FUNCTION MY_BIN_TO_UUID(_bin BINARY(16))
    RETURNS CHAR(36)
    LANGUAGE SQL DETERMINISTIC  CONTAINS SQL  SQL SECURITY INVOKER
    RETURN LCASE(CONCAT_WS('-', 
        HEX(SUBSTR(_bin, 5, 4)),
        HEX(SUBSTR(_bin, 3, 2)),
        HEX(SUBSTR(_bin, 1, 2)),
        ...continued...
HEX(SUBSTR(_bin,  9, 2)),

HEX(SUBSTR(_bin, 11)) ));

Therefore, for the table User created in section 04, its primary key can be modified to BINARY(16) to store the sorted 16-byte UUID value. The modified table structure is as follows:

CREATE TABLE User (

    id  BINARY(16) NOT NULL,

    name VARCHAR(255) NOT NULL,

    sex CHAR(1) NOT NULL,

    password VARCHAR(1024) NOT NULL,

    money INT NOT NULL DEFAULT 0,

    register_date DATETIME(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6),

    last_modify_date DATETIME(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6) ON UPDATE CURRENT_TIMESTAMP(6),

    uuid CHAR(36) AS (BIN_TO_UUID(id)),

    CONSTRAINT chk_sex CHECK (sex = 'M' OR sex = 'F'),

    PRIMARY KEY(id)

);

Now, you can insert data into the User table in the client using the following SQL command:

INSERT INTO User VALUES (UUID_TO_BIN(UUID(),TRUE),...);

Certainly, many students are also concerned about the performance and storage space consumption of UUIDs. Here I have also performed related tests on insertion performance, and the results are shown in the following table:

![图片3.png](../images/CioPOWCbQCOAIdiZAACr8OD6suQ547.png)

As you can see, the performance of sorting UUIDs provided by MySQL 8.0 is the best, even better than auto-incremented IDs. Furthermore, since the result of UUID_TO_BIN is 16 bytes, which is only 8 bytes more than auto-incremented IDs, the storage space consumption is only increased by 3GB compared to auto-incremented IDs.

Moreover, because UUID ensures global uniqueness, the benefits of using UUIDs are much greater than using auto-incremented IDs. You may be used to using auto-incremented IDs as primary keys, but in the scenario of massive concurrent Internet business, it is recommended to use globally unique values like UUIDs as primary keys.

For example, I particularly recommend the user table structure design in the gaming industry, where UUIDs are used as primary keys instead of auto-incremented IDs. In the event of a server merge, because UUIDs are globally unique, the user-related data can be directly merged, while auto-incremented IDs require additional program integration of data with the same server ID, which is a massive and error-prone task.

#### Custom Generated Business Keys

Of course, although UUIDs are good, in a distributed database scenario, primary keys need to include some additional information to ensure the query efficiency of secondary indexes (this part will be introduced in the later distributed chapter). **For now, just remember that using only UUID as a primary key is not enough for a distributed database architecture.** Therefore, for the core business table in a distributed architecture, I recommend designs similar to the following, for example:

PK = time field + random code (optional) + business information 1 + business information 2 ...

In the e-commerce business, the order table is one of its most critical tables. You can open the Taobao app and query your order number, and you will find similar order information:

![Drawing 3.png](../images/CioPOWCTkyaAfjdyAAWAHcsp4TA536.png)

The image above shows my own Taobao order information (the order number of the first order is 1550672064762308113).

The order number is obviously the primary key of the order table, but if you think the order number is an auto-incremented integer, you are completely wrong. Because if you observe carefully, you will find that the last 6 digits of all order numbers in the image are the same, i.e., **308113**:

1550672064762308113

1481195847180308113

1431156171142308113

1431146631521308113

Therefore, I believe that the last 6 digits of the Taobao order number are related to the user ID, and the first 14 digits are related to the time field, which ensures the incremental nature of the insertion and retains the business-related information for subsequent distributed queries.

#### Eliminating Redundancy

Eliminating redundancy is also a requirement of normalization, which solves partial dependency and transitive dependency, essentially minimizing redundant data.

Therefore, when designing table structures, data should only be stored in one place, and other tables should use primary keys to associate and store. For example, if the order table needs to store user information, only the user ID needs to be saved:

CREATE TABLE Orders (

  order_id VARCHAR(20),

  user_id  BINARY(16),

  order_date datetime,

  last_modify_date datetime

  ...

  PRIMARY KEY(order_id),

  KEY(user_id,order_date)

  KEY(order_date),

  KEY(last_modify_date)

)

Of course, whether it is auto-incremented primary key design, UUID primary key design, custom generated business keys, or elimination of redundancy, they all essentially follow the normalization guidelines. However, in some other business scenarios, denormalized designs also exist.

#### Denormalized Design

Denormalized designs are usually used in OLAP data analysis scenarios, but with the popularity of JSON data types, denormalized designs can also be used in online MySQL business.

In section 04, I discussed the UserTag table, which uses the JSON data type for denormalized design. If we use a normalized design, the UserTag table should be designed as follows:

CREATE TABLE UserTag (

    userId BIGINT NOT NULL,

    userTag INT NOT NULL,

    PRIMARY KEY(userId,userTag)

);

SELECT * FROM UserTag;

+--------+---------+

| userId | userTag |

+--------+---------+

|      1 |   2     |

|      1 |   6     |

|      1 |   8     |

|      1 |  10     |

|      2 |   3     |

|      2 |  10     |

|      2 |  12     |

+--------+---------+

Upon comparison, you can see that the normalized design is not more efficient than using the JSON data type. Using the JSON data type, the userID only needs to be stored once, reducing data redundancy to some extent:

+--------+---------------+

| userId | userTags      |

+--------+---------------+

|      1 | [2, 6, 8, 10] |

|      2 | [3, 10, 12]   |

+--------+---------------+

### Conclusion

In summary, normalization is a database theory-based guideline for table structure design, and it is not necessary to strictly follow the requirements of the three normal forms in practical engineering practice. **In MySQL massive concurrent engineering practice, table structure design should follow the following guidelines:**

- Every table must have a primary key;
- Auto-incremented primary keys are only recommended for non-core business tables and should be avoided if possible;
- For core business tables, UUIDs or custom generated business keys are recommended as primary keys;
- Data should be stored in only one place, and redundant data should be avoided by using primary key associations for querying;
- In some scenarios, denormalized designs using the JSON data type can enhance storage efficiency.