02 String Types the Essential Collation

02 String Types - The Essential COLLATION #

Today I want to talk to you about the sorting rules of string types.

In the previous lecture, we learned how to correctly use numeric types in table structure design. In addition to numeric types, string types are also commonly used in table structure design to describe specific information.

MySQL database has string types like CHAR, VARCHAR, BINARY, BLOB, TEXT, ENUM, and SET. Different types have completely different performances in terms of business design and database performance. The most commonly used ones are CHAR and VARCHAR. Today, I will take you to delve into the application of string types CHAR and VARCHAR. I hope that after studying this lecture, you can truly use the string types of MySQL and design a more beautiful business table structure.

Definition of CHAR and VARCHAR #

CHAR(N) is used to store fixed-length characters, where N ranges from 0 to 255. Please remember that N represents characters, not bytes. VARCHAR(N) is used to store variable-length characters, where N ranges from 0 to 65536, and N represents characters.

In cases where the number of characters exceeds 65536, you can consider using larger character types like TEXT or BLOB. Both have a maximum storage length of 4G, but the difference is that BLOB does not have a character set property and is purely used for binary storage.

Unlike traditional relational databases such as Oracle and Microsoft SQL Server, the VARCHAR character type of MySQL database can store a maximum of 65536 characters, so in most scenarios under MySQL database, using the VARCHAR type is sufficient.

Character Set #

In table structure design, in addition to defining columns as CHAR and VARCHAR for storing characters, you also need to define the corresponding character set for the characters, because each character has different binary values ​​under different character set encodings. Common character sets include GBK and UTF8, and it is usually recommended to set the default character set to UTF8.

With the rapid development of mobile internet, it is recommended to set the default character set of MySQL to UTF8MB4. Otherwise, some emoji characters cannot be stored in the UTF8 character set, such as the emoji laughing face, which corresponds to the character code 0xF09F988E:

If you forcibly insert emoji characters on a column with a character set of UTF8, MySQL will throw the following error message:

mysql> SHOW CREATE TABLE emoji_test\G

*************************** 1. row ***************************
       
       Table: emoji_test

Create Table: CREATE TABLE `emoji_test` (

  `a` varchar(100) CHARACTER SET utf8,

  PRIMARY KEY (`a`)

) ENGINE=InnoDB DEFAULT CHARSET=utf8

1 row in set (0.01 sec)

mysql> INSERT INTO emoji_test VALUES (0xF09F988E);

ERROR 1366 (HY000): Incorrect string value: '\xF0\x9F\x98\x8E' for column 'a' at row 1

Including the MySQL 8.0 version, the character set is set to UTF8MB4 by default. Before version 8.0, the default character set was Latin1. Because of the different default character sets in different versions, you need to explicitly configure the relevant parameters in the configuration file:

[mysqld]

character-set-server = utf8mb4

...

In addition, different character sets have different maximum bytes for CHAR(N) and VARCHAR(N). For example, in the GBK character set, one character can store a maximum of 2 bytes, and in the UTF8MB4 character set, one character can store a maximum of 4 bytes. Therefore, from the perspective of underlying storage kernel, both CHAR and VARCHAR are implemented with variable-length storage in multi-byte character sets!

1.png

From the example above, we can see that CHAR(1) can store both 1 byte of ‘a’ and 4 bytes of the emoji laughing face. Therefore, CHAR is essentially variable-length.

Given that the default character set is now recommended to be set to UTF8MB4, in table structure design, CHAR can be replaced by VARCHAR, and the underlying storage implementation is exactly the same.

Sorting Rules #

Sorting rules (Collation) are rules for comparing and sorting strings. Each character set has a default sorting rule, which you can view using the SHOW CHARSET command:

mysql> SHOW CHARSET LIKE 'utf8%';

+---------+---------------+--------------------+--------+

| Charset | Description   | Default collation  | Maxlen |

+---------+---------------+--------------------+--------+

| utf8    | UTF-8 Unicode | utf8_general_ci    |      3 |

| utf8mb4 | UTF-8 Unicode | utf8mb4_0900_ai_ci |      4 |

+---------+---------------+--------------------+--------+

2 rows in set (0.01 sec)

mysql> SHOW COLLATION LIKE 'utf8mb4%';

+----------------------------+---------+-----+---------+----------+---------+---------------+

| Collation                  | Charset | Id  | Default | Compiled | Sortlen | Pad_attribute |

+----------------------------+---------+-----+---------+----------+---------+---------------+

| utf8mb4_0900_ai_ci         | utf8mb4 | 255 | Yes     | Yes      |       0 | NO PAD        |

| utf8mb4_0900_as_ci         | utf8mb4 | 305 |         | Yes      |       0 | NO PAD        |

    | utf8mb4_0900_as_cs         | utf8mb4 | 278 |         | Yes      |       0 | NO PAD        |
    
    | utf8mb4_0900_bin           | utf8mb4 | 309 |         | Yes      |       1 | NO PAD        |
    
    | utf8mb4_bin                | utf8mb4 |  46 |         | Yes      |       1 | PAD SPACE     |
    
    ......
    

The sorting rule ending with _ci means case-insensitive. _cs means case-sensitive, and _bin means binary comparison based on storing character binary. It should be noted that by default, MySQL compares strings in a case-insensitive sorting order:

    mysql> SELECT 'a' = 'A';
    
    +-----------+
    
    | 'a' = 'A' |
    
    +-----------+
    
    |         1 |
    
    +-----------+
    
    1 row in set (0.00 sec)
    
    mysql> SELECT CAST('a' as char) COLLATE utf8mb4_0900_as_cs = CAST('A' as CHAR) COLLATE utf8mb4_0900_as_cs as result;
    
    +--------+
    
    | result |
    
    +--------+
    
    |      0 |
    
    +--------+
    
    1 row in set (0.00 sec)

**Remember, for the majority of business data tables, there is no need to set the sorting rule as case-sensitive! Only when you truly understand that your business requires it.**

### Correctly modify character set

Of course, it is believed that many businesses did not consider the influence of character sets on business data storage during design, so character set conversion is required later. However, many students will find that they still cannot insert UTF8MB4 characters, such as emojis, after executing the following operation:

    ALTER TABLE emoji_test CHARSET utf8mb4;

Actually, the above modification only changes the character set of the table to UTF8MB4. The next time a new column is added, if the character set is not explicitly specified, the character set of the new column will be changed to UTF8MB4. **However, for existing columns, the default character set is not changed**. You can confirm this by using the SHOW CREATE TABLE command:

    mysql> SHOW CREATE TABLE emoji_test\G

    *************************** 1. row ***************************
    
           Table: emoji_test
    
    Create Table: CREATE TABLE `emoji_test` (
    
      `a` varchar(100) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
    
      PRIMARY KEY (`a`)
    
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
    
    1 row in set (0.00 sec)

As you can see, the character set of column 'a' is still UTF8, not UTF8MB4. Therefore, the correct command to modify the character set of the column should use ALTER TABLE...CONVERT TO... in order to change the character set of column 'a' from UTF8 to UTF8MB4:

    mysql> ALTER TABLE emoji_test CONVERT TO CHARSET utf8mb4;

    Query OK, 0 rows affected (0.94 sec)
    
    Records: 0  Duplicates: 0  Warnings: 0
    
    mysql> SHOW CREATE TABLE emoji_test\G

    *************************** 1. row ***************************
    
           Table: emoji_test
    
    Create Table: CREATE TABLE `emoji_test` (
    
      `a` varchar(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL,
    
      PRIMARY KEY (`a`)
    
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
    
    1 row in set (0.00 sec)

At this point, we have learned the basics of string-related knowledge. Next, let's move on to the practical design of the string structure in MySQL tables. I hope you can use string types effectively in your designs.

### Practical business table structure design
#### User Gender Design

When designing table structures, you will encounter fields with fixed option values. For example, the gender field (Sex) has only two options: male or female. Another example is the status field (State), which has limited valid values such as running, stopped, restarted, etc.

After observation, I found that most developers like to use INT as the data type to store the gender field, for example:

```sql
CREATE TABLE `User` (
  `id` bigint NOT NULL AUTO_INCREMENT,
  `sex` tinyint DEFAULT NULL,
  ......
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

In this case, the tinyint column sex represents the gender of the user, but this design has some obvious problems:

  • Unclear representation: When storing the data, does 0 represent female or does 1 represent female? Each business may have different unwritten rules.
  • Dirty data: Because it is a tinyint, users can insert values like 2, 3, or 4, resulting in the possibility of invalid data in the table. Cleaning up this data later can be very costly.

Before MySQL 8.0, you can use the ENUM string enumeration type, which only allows the insertion of defined values. If the SQL_MODE parameter is set to strict mode, inserting undefined data will result in an error:

mysql> SHOW CREATE TABLE User\G
*************************** 1. row ***************************
       Table: User
Create Table: CREATE TABLE `User` (
  `id` bigint NOT NULL AUTO_INCREMENT,
  `sex` enum('M','F') COLLATE utf8mb4_general_ci DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB
1 row in set (0.00 sec)

mysql> SET sql_mode = 'STRICT_TRANS_TABLES';
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> INSERT INTO User VALUES (NULL,'F');
Query OK, 1 row affected (0.08 sec)

mysql> INSERT INTO User VALUES (NULL,'A');
ERROR 1265 (01000): Data truncated for column 'sex' at row 1

However, since the ENUM type is not a SQL standard data type but a string type unique to MySQL, the error message thrown is not very intuitive. This implementation has some limitations, mainly because the versions of MySQL prior to 8.0 did not provide constraint functionality. Starting from MySQL version 8.0.16, the database natively provides the CHECK constraint functionality, which easily allows the design of limited-state column types:

mysql> SHOW CREATE TABLE User\G
*************************** 1. row ***************************
       Table: User
Create Table: CREATE TABLE `User` (
  `id` bigint NOT NULL AUTO_INCREMENT,
  `sex` char(1) COLLATE utf8mb4_general_ci DEFAULT NULL,
  PRIMARY KEY (`id`),
  CONSTRAINT `user_chk_1` CHECK (((`sex` = _utf8mb4'M') or (`sex` = _utf8mb4'F')))
) ENGINE=InnoDB
1 row in set (0.00 sec)

mysql> INSERT INTO User VALUES (NULL,'M');
Query OK, 1 row affected (0.07 sec)

mysql> INSERT INTO User VALUES (NULL,'Z');
ERROR 3819 (HY000): Check constraint 'user_chk_1' is violated.

In the above code, you can see that the constraint definition user_chk_1 on line 8 defines the value range for the sex column, which can only be M or F. Additionally, when inserting the invalid data Z on line 15, MySQL explicitly throws a violation of the constraint error.

Account Password Storage Design #

Remember, when designing database table structures, never store passwords directly in the database table. Once a malicious user gains access to the system, there is a high risk of user data leakage. For example, in the financial industry, from a compliance perspective, all user privacy fields need to be encrypted. The business itself may not even know the actual stored information (such as login passwords, mobile numbers, credit card information, etc.) in order to protect user privacy. Many developers use the MD5 function to encrypt and store sensitive data, and there is nothing wrong with that because the MD5 algorithm is not reversible. However, the encrypted value of MD5 is fixed. For example, the password “12345678” corresponds to the fixed MD5 value “25d55ad283aa400af464c76d713c07ad”.

As a result, it is possible to perform a brute force attack on MD5 and calculate the MD5 values of all possible strings. If it is not feasible to enumerate all possible string combinations, common passwords such as “111111” and “12345678” can be calculated. I have included a website in this document that can be used to decrypt MD5 encrypted strings online.

Therefore, when designing password storage, it is necessary to use a salt. Each company has a different salt value, so the calculated values are also different. If the salt value is “psalt”, then the password “12345678” in the database would be:

password = MD5('psalt12345678')

This type of password storage design is a fixed salt value encryption algorithm, which has three main problems:

  • If the salt value is leaked by a (former) employee, there is still a possibility of brute force attack by external hackers.
  • For the same password, the password storage values are the same. Once a user’s password is leaked, the passwords of other users with the same password will also be leaked.
  • The fixed use of the MD5 encryption algorithm. Once the MD5 algorithm is cracked, the impact will be significant.

Therefore, a truly good password storage design should include: dynamic salt + non-fixed encryption algorithm.

I highly recommend designing passwords in this way. The format of the password stored in the column can be as follows:

$salt$cryption_algorithm$value

Where:

  • $salt: represents the dynamic salt, which is generated by the business each time a user registers and is stored in the database. If you want to be more sophisticated, you can merge the dynamic salt value with the user’s registration date to create a more dynamic salt value.
  • $cryption_algorithm: represents the encryption algorithm used, such as v1 for MD5 encryption algorithm, v2 for AES256 encryption algorithm, v3 for AES512 encryption algorithm, etc.
  • $value: represents the encrypted string.

In this case, the structure design of the User table would look like this:

CREATE TABLE User (
    id BIGINT NOT NULL AUTO_INCREMENT,
    name VARCHAR(255) NOT NULL,
    sex CHAR(1) NOT NULL,
    password VARCHAR(1024) NOT NULL,
    regDate DATETIME NOT NULL,
    CHECK (sex = 'M' OR sex = 'F'),
    PRIMARY KEY(id)
);

SELECT * FROM User\G

*************************** 1. row ***************************
  id: 1
name: David
sex: M
password: $fgfaef$v1$2198687f6db06c9d1b31a030ba1ef074
regDate: 2020-09-07 15:30:00
*************************** 2. row ***************************
  id: 2
name: Amy
sex: F
password: $zpelf$v2$0x860E4E3B2AA4005D8EE9B7653409C4B133AF77AEF53B815D31426EC6EF78D882
regDate: 2020-09-07 17:28:00

In the example above, both users David and Amy have the password “12345678”, but due to the use of dynamic salt and dynamic encryption algorithm, the contents stored are completely different.

Even if a malicious user obtains the current password encryption algorithm, with the encryption algorithm version represented by $cryption_algorithm, the stored passwords can be upgraded, further enhancing protection against malicious data attacks.

Summary #

Strings are one of the most widely used data types, but they are also the part that is most prone to mistakes during design. Making changes to it after the business is up and running can be extremely costly. I hope you will carefully read the content of this lecture multiple times, so that you can make ample preparations at the beginning of table structure design. I will summarize the key points of this section:

  • Although CHAR and VARCHAR are used to store fixed-length and variable-length characters respectively, for variable-length character sets (such as GBK, UTF8MB4), their essence is the same, both are variable-length. When designing, VARCHAR can completely replace CHAR.
  • It is recommended to set the default character set of MySQL to UTF8MB4, which can be used to store extended characters such as emojis.
  • Sorting rules are important for character comparison and sorting, but in most cases, case-insensitive sorting rules are not necessary.
  • To modify the character set of an existing column in a table, use the command ALTER TABLE … CONVERT TO ….
  • For columns with limited values such as user gender and running status, MySQL 8.0.16 version directly uses the CHECK constraint mechanism. In the previous versions, the ENUM string type plus the strict mode of SQL_MODE can be used.
  • Business privacy information such as passwords, phones, and credit cards need to be encrypted. Remember that simple MD5 algorithm can be cracked, so it is not secure. It is recommended to use a dynamic salt and dynamic encryption algorithm for storing sensitive data.