11 How to Add an Index to a String Field

Nowadays, almost all systems support email login. Today, we are going to discuss how to build a proper index on fields like email.

Assume that you are maintaining a system that supports email login, and the user table is defined as follows:

mysql> create table SUser(
ID bigint unsigned primary key,
email varchar(64), 
... 
)engine=innodb;

Since email login is used, there will definitely be statements like this in the business code:

mysql> select f1, f2 from SUser where email='xxx';

From the explanations of indexes in the 4th and 5th articles, we know that if there is no index on the email field, this statement will perform a full table scan.

Meanwhile, MySQL supports prefix indexes, which means that you can define a part of the string as an index. By default, if you do not specify the prefix length in the index creation statement, the index will include the entire string.

For example, these two statements create indexes on the email field:

mysql> alter table SUser add index index1(email);
or
mysql> alter table SUser add index index2(email(6));

The index1 index created by the first statement contains the entire string of each record, while the index2 index created by the second statement stores only the first 6 bytes of each value. (Strictly speaking, MySQL counts the prefix length of a varchar column in characters; for single-byte characters such as those in a typical email address, that is the same as 6 bytes.)

So, what are the differences in data structure and storage between these two definitions? Figures 1 and 2 show schematic diagrams of the two indexes.

Figure 1 Email index structure

Figure 2 Email(6) index structure

From the diagrams, you can see that since the index structure of email(6) only takes the first 6 bytes of each email field (i.e., ‘zhangs’), it occupies less space. This is the advantage of using prefix indexes.

However, this also results in additional record scanning.

Next, let’s look at the following statement and see how it is executed under these two index definitions:

select id,name,email from SUser where email='zhangssxyz@xxx.com';

If index1 is used (i.e., the index structure of the entire string of email), the execution order is as follows:

1. Find the record that satisfies the index value ‘zhangssxyz@xxx.com’ in the index1 index tree and obtain the primary key value ID2.
2. Look up the row whose primary key value is ID2 in the primary key index, verify that its email value is correct, and add the row to the result set.
3. Move to the next record after the position just found in the index1 index tree, find that it no longer satisfies the condition email = ‘zhangssxyz@xxx.com’, and end the loop.

In this process, data is only retrieved once from the primary key index, so the system considers that only one row is scanned.

If index2 is used (i.e., the index structure of email(6)), the execution order is as follows:

1. Find the record that satisfies the index value ‘zhangs’ in the index2 index tree; the first one found is ID1.
2. Look up the row whose primary key value is ID1 in the primary key index, determine that its email value is not ‘zhangssxyz@xxx.com’, and discard the row.
3. Move to the next record after the position just found in index2, find that it is still ‘zhangs’, take ID2, read the whole row from the primary key index, verify that the value matches, and add the row to the result set.
4. Repeat the previous step until the value read from index2 is no longer ‘zhangs’, then end the loop.

In this process, data needs to be retrieved from the primary key index four times, which means that four rows are scanned.

From this comparison, you can easily see that using a prefix index may increase the number of data retrieval operations for a query statement.

However, for this query statement, if you define index2 as email(7) instead of email(6), which means using the first 7 bytes of the email field to build the index, there is only one record that satisfies the prefix ‘zhangss’, so ID2 can be directly found, and the loop ends with only one row scanned.

In other words, with prefix indexes and properly defined lengths, you can save space without significantly increasing query costs.
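The comparison above can be reproduced with a small Python model. This is only a sketch of the lookup logic, not how InnoDB is implemented: the rows, the email values, and the helper names `build_index` and `lookup` are all made up for illustration, and the secondary index is modeled as a sorted list of (key, primary key) pairs.

```python
from bisect import bisect_left

# Hypothetical rows: primary key -> email. These values are invented for
# illustration; they are not the data from the figures.
ROWS = {
    1: "zhangsan@xxx.com",
    2: "zhangssxyz@xxx.com",
    3: "zhangsy1998@aaa.com",
    4: "zhangsz@xxx.com",
    5: "lisi@xxx.com",
}


def build_index(rows, prefix_len=None):
    """Return a sorted list of (key, pk); key is the whole email or its prefix."""
    return sorted((email[:prefix_len], pk) for pk, email in rows.items())


def lookup(rows, index, target, prefix_len=None):
    """Equality lookup through a secondary index.

    Returns (matching primary keys, number of primary key lookups).
    """
    key = target[:prefix_len]
    pos = bisect_left(index, (key, 0))  # first entry whose key >= search key
    matches, pk_lookups = [], 0
    while pos < len(index) and index[pos][0] == key:
        pk = index[pos][1]
        pk_lookups += 1            # go back to the primary key for the row
        if rows[pk] == target:     # verify the full email value
            matches.append(pk)
        pos += 1
    return matches, pk_lookups


target = "zhangssxyz@xxx.com"
full_result = lookup(ROWS, build_index(ROWS), target)
prefix6_result = lookup(ROWS, build_index(ROWS, 6), target, 6)
prefix7_result = lookup(ROWS, build_index(ROWS, 7), target, 7)
```

With this sample data, the full index and the 7-character prefix each need one primary key lookup, while the 6-character prefix needs four, because four of the made-up emails share the prefix ‘zhangs’.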

This raises a question: when creating a prefix index on a string column, how do you determine the prefix length?

What matters when building an index is selectivity: the higher the selectivity, the fewer duplicate key values there are, and the better the index filters rows. We can therefore choose the prefix length by counting how many distinct values each candidate prefix produces.

First, you can use the following statement to calculate the number of distinct values on the column:

mysql> select count(distinct email) as L from SUser;

Then, try different prefix lengths one by one and compare the resulting counts. For example, to evaluate prefixes of 4 to 7 bytes, you can use this statement:

mysql> select 
  count(distinct left(email,4)) as L4,
  count(distinct left(email,5)) as L5,
  count(distinct left(email,6)) as L6,
  count(distinct left(email,7)) as L7
from SUser;

A prefix index will usually lose some selectivity, so decide on an acceptable loss ratio in advance, such as 5%. Then, among the returned values L4 to L7, find those that are not less than L × 95%. Assuming that both L6 and L7 satisfy this condition, you can choose the shorter prefix length, 6.
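The selection rule can also be written down as a short Python sketch. It assumes the column values have already been loaded into a list; the function name `choose_prefix_length`, its parameters, and the sample emails are hypothetical and only mirror the count(distinct left(email, n)) comparison described above.

```python
def choose_prefix_length(values, candidates, max_loss=0.05):
    """Return the shortest candidate length whose distinct-prefix count is
    at least (1 - max_loss) of the full distinct count, or None."""
    total = len(set(values))                      # plays the role of L
    for n in sorted(candidates):
        distinct = len({v[:n] for v in values})   # plays the role of Ln
        if distinct >= total * (1 - max_loss):
            return n
    return None


# Made-up sample data, for illustration only.
EMAILS = [
    "zhangsan@xxx.com",
    "zhangssxyz@xxx.com",
    "zhangsy1998@aaa.com",
    "zhangsz@xxx.com",
    "lisi@xxx.com",
]

best = choose_prefix_length(EMAILS, range(4, 8))
```

On this tiny sample, prefixes of 4 to 6 characters collapse the four ‘zhangs...’ addresses into one value, so the function only accepts a length of 7. On a real table the counts should come from the SQL statement above rather than from loading every row into memory.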

The Impact of Prefix Indexes on Covering Indexes

Previously, we mentioned that using a prefix index may increase the number of scanned rows, which affects performance. In fact, the impact of prefix indexes is not limited to this. Let’s take a look at another scenario.

First, consider the following SQL statement:

select id, email from SUser where email='zhangssxyz@xxx.com';

Compare it with the SQL statement in the previous example:

select id, name, email from SUser where email='zhangssxyz@xxx.com';

The first statement only needs to return the id and email fields.

Therefore, if index1 (the index on the entire email string) is used, the query can take advantage of the covering index: the result is returned directly from index1, with no lookup back to the primary key index. With index2 (the email(6) prefix index), however, InnoDB has to go back to the primary key index to read the full value of the email field.

Even if the definition of index2 is modified to the prefix index email(18), although index2 now contains all the information, InnoDB still needs to return to the ID index to check because the system is not certain whether the prefix index definition has truncated the complete information.

In other words, a prefix index cannot benefit from the covering-index optimization. This is another factor to weigh when deciding whether to use a prefix index.
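One way to see why a prefix index entry cannot be trusted on its own is the following Python sketch. The two rows and the helper `index_entries` are hypothetical; the second address is chosen so that its first 18 characters equal the first address.

```python
def index_entries(rows, prefix_len):
    """Model a prefix index: each entry keeps only the first prefix_len characters."""
    return {pk: email[:prefix_len] for pk, email in rows.items()}


# Hypothetical rows: the second address extends the first one.
ROWS = {
    1: "zhangssxyz@xxx.com",      # exactly 18 characters
    2: "zhangssxyz@xxx.com.cn",   # same first 18 characters, but longer
}

entries = index_entries(ROWS, 18)

# Both rows produce the same index entry, so the entry alone cannot tell
# which row really holds 'zhangssxyz@xxx.com'.
same_entry = entries[1] == entries[2]
```

Because a truncated entry and a complete one look identical inside the index, the full value has to be read from the primary key index before a row can be returned.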

Other Approaches

For fields like email, a prefix index may work well. But what can we do when the prefix is not selective enough?

For example, in China, an ID card number consists of 18 digits, and the first 6 digits are the address code. So for people from the same county, the first 6 digits of their ID card numbers are usually the same.

Suppose you maintain the database of a city’s citizen information system. If you create a prefix index of length 6 on the ID card number, the selectivity of this index will be very low.

Following the method described earlier, you might need a prefix index longer than 12 to reach the required selectivity.

However, the longer the selected index, the more disk space it occupies, and the fewer index values can fit in the same data page, resulting in reduced search efficiency.

So, if we know that the business only ever runs equality queries on the ID card number, are there other approaches that occupy less space while achieving the same query efficiency?

The answer is yes.

The first approach is using reverse storage. If you store the ID card number in reverse order, you can write your query as follows:

mysql> select field_list from t where id_card = reverse('input_id_card_string');

Since the last 6 digits of the ID card number do not have repeated logic like the address code, these last 6 digits likely provide sufficient distinguishability. Of course, in practice, don’t forget to use the count(distinct) method to perform verification.
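The effect of reversing can be checked with a few lines of Python. This is a sketch on invented ID card numbers that all share the same address code; the helper names are hypothetical, and on a real table the same check should be done with count(distinct) as described above.

```python
def reverse(s):
    """Same idea as MySQL's reverse(): return the string back to front."""
    return s[::-1]


def distinct_prefixes(values, n):
    """Number of distinct n-character prefixes among the values."""
    return len({v[:n] for v in values})


# Invented 18-digit numbers with the same first 6 digits (address code).
ID_CARDS = [
    "110101199003071234",
    "110101198512250021",
    "110101199107143316",
    "110101200001010478",
]

plain = distinct_prefixes(ID_CARDS, 6)
reversed_stored = distinct_prefixes([reverse(v) for v in ID_CARDS], 6)
```

With this sample, a 6-character prefix of the original values yields a single distinct value, while the same prefix length on the reversed values distinguishes all four rows.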

The second approach is using a hash field. You can create an additional integer field in the table to store the checksum of the ID card number and create an index on this field:

mysql> alter table t add id_card_crc int unsigned, add index(id_card_crc);

Then, every time a new record is inserted, calculate the checksum using the crc32() function and store it in this new field. Since checksum conflicts may exist, meaning two different ID card numbers may produce the same result using the crc32() function, your query statement’s WHERE clause needs to check if the values of id_card and id_card_crc are exactly the same:

mysql> select field_list from t where id_card_crc=crc32('input_id_card_string') and id_card='input_id_card_string';

In this way, the length of the index becomes 4 bytes, which is much smaller than before.
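The lookup logic of the hash field approach can be sketched in Python as follows. The class `HashedTable` and its methods are hypothetical, and `zlib.crc32` is used on the assumption that it computes the same standard CRC-32 as MySQL's crc32(). The important part is the second comparison on the full id_card value, which filters out checksum collisions.

```python
import zlib
from collections import defaultdict


def crc32(s):
    """Standard CRC-32 of the UTF-8 bytes, as an unsigned 32-bit integer."""
    return zlib.crc32(s.encode("utf-8"))


class HashedTable:
    """Toy table with an index on a hash of id_card (illustration only)."""

    def __init__(self, hash_fn=crc32):
        self.hash_fn = hash_fn
        self.rows = []                       # list of (id_card, name)
        self.hash_index = defaultdict(list)  # hash value -> row positions

    def insert(self, id_card, name):
        self.rows.append((id_card, name))
        self.hash_index[self.hash_fn(id_card)].append(len(self.rows) - 1)

    def find(self, id_card):
        # Equivalent of: where id_card_crc = crc32(x) and id_card = x
        candidates = self.hash_index.get(self.hash_fn(id_card), [])
        return [self.rows[i] for i in candidates if self.rows[i][0] == id_card]


table = HashedTable()
table.insert("110101199003071234", "A")
table.insert("110101198512250021", "B")
found = table.find("110101198512250021")

# Force every value to collide to show that the id_card check still works.
colliding = HashedTable(hash_fn=lambda s: 0)
colliding.insert("110101199003071234", "A")
colliding.insert("110101198512250021", "B")
found_despite_collision = colliding.find("110101199003071234")
```

The values returned by `crc32` here always fit in 4 bytes, which matches the `int unsigned` column used for id_card_crc.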

Next, let’s take a look at the similarities and differences between using reverse storage and using a hash field.

First, the common point is that neither of them supports range queries. The index created on the reverse storage field is sorted in reverse string order, so it is no longer possible to use the index to query all citizens with ID numbers between ID_X and ID_Y. Similarly, the hash field method only supports exact match queries.

The differences between them are mainly reflected in the following three aspects:

1. In terms of additional storage space, reverse storage does not consume extra space on the primary key index, while the hash field method requires an additional field. That said, a 4-byte prefix is probably too short for the reverse storage method, and with a longer prefix its extra index space roughly cancels out the cost of the additional hash field.
2. In terms of CPU consumption, the reverse storage method requires an additional call to the reverse function for each write and read, while the hash field method requires an additional call to the crc32() function. If we only consider the computational complexity of these two functions, the reverse function consumes fewer CPU resources.
3. In terms of query efficiency, the hash field method is more stable. Although the values calculated by crc32() may collide, the probability is very small, so the average number of scanned rows per query can be considered close to 1. The reverse storage method, on the other hand, still uses a prefix index, which means that it may scan additional rows.

Conclusion

In today’s article, I talked about how to create indexes on string fields. To review, these are the approaches you can use:

1. Create a complete index directly, which may occupy more space.
2. Create a prefix index to save space, at the cost of possibly scanning more rows and not being able to use covering indexes.
3. Store the string in reverse and then create a prefix index, to work around string prefixes that are not selective enough.
4. Create an index on a hash field, which gives stable query performance but has additional storage and calculation costs. Like the third approach, it does not support range scans.

In practical applications, you should choose which method to use based on the characteristics of the business fields.

Well, it’s time for the final question.

Suppose you maintain a student information database for a school. Student login names follow the unified format “studentID@gmail.com”, where the student ID is a 15-digit number: the first three digits are the city code, the fourth to sixth digits are the school code, the seventh to tenth digits are the enrollment year, and the last five digits are a sequence number.

When students log in, they need to enter their login names and passwords for verification before they can continue using the system. Considering only the behavior of login validation, how would you design the index for the login name?

Please write your analysis and design in the comments section, and I will discuss this question with you in the next article. Thank you for listening, and feel free to share this article with more friends.

Answer to the Previous Question

In the previous article, some students mentioned that they couldn’t reproduce the first example. Please check if the isolation level is RR (Repeatable Read) and if the table t is using the InnoDB engine. I have created a video of the reproduction process for your reference.

In the last article, I left you with a question: why does the explain result become incorrect after this operation sequence? Here, I will analyze the reasons for you.

The delete statement deletes all the data, and then call idata() inserts 100,000 rows, which appears to overwrite the original 100,000 rows.

However, session A opens a transaction but does not commit, so the 100,000 rows of data previously inserted cannot be deleted. In this way, each row of the previous data has two versions: the old version is the data before the delete, and the new version is the data marked as deleted.

Therefore, the data on the index a actually has two copies.

Then you might say: wait, the data on the primary key cannot be deleted either. So for the statement that does not use force index, why is the number of scanned rows shown by explain still around 100,000? (The implication is that if this estimate had doubled as well, the optimizer might have considered index a the better choice.)

Yes, you are right. However, this is the primary key, and the primary key is estimated based on the number of rows in the table. The optimizer directly uses the value from the show table status command.

I will explain how this value is calculated in detail in a future article.