09 Choosing Between Ordinary Indexes and Unique Indexes #

Before we begin today’s main content, I would like to express my gratitude to several students who left high-quality comments in the comment section.

A student with the username @某、人 provided a summary of the key points of the article and raised a question about transaction visibility, specifically the impact of transactions that are started but not yet committed on data visibility. @夏日雨 also mentioned this issue, and I replied to it in the pinned comment. I will further explain it at the end of today’s article. Two other students, @Justin and @倪大人, raised two good questions.

For questions that stimulate further thinking, I will write “good question” in my reply to make it easier for you to search. You can also take a look at their comments.

I really appreciate that you have read the articles carefully and left so many high-quality comments. Knowing that the articles have brought you new understanding is a great encouragement to me. At the same time, it gives other students who read the comment section carefully a chance to discover knowledge gaps they have but may not yet be aware of. This raises the quality of the whole column. Thank you again.

Okay, now let’s get back to our main content for today.

In previous articles in the basic section, I introduced the basic concepts of indexes, and I believe you already understand the difference between unique indexes and regular indexes. Today, let’s continue to talk about when to choose a regular index and when to choose a unique index in different business scenarios.

Let’s say you are maintaining a citizen system where each person has a unique ID card number, and the business code ensures that duplicate ID card numbers are not written. If the citizen system needs to search for names based on ID card numbers, you would execute a SQL statement like this:

select name from CUser where id_card = 'xxxxxxxyyyyyyzzzzz';

Therefore, you would definitely consider creating an index on the id_card field.

Since the ID card number field is quite long, I do not recommend using it as the primary key. So now you have two choices: either create a unique index on the id_card field or create a regular index. If the business code already ensures that duplicate ID card numbers are not written, both of these choices are logically correct.
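
For reference, the two options could be written as DDL roughly like this (a minimal sketch; the index names are illustrative and not from any real system):

alter table CUser add unique index uk_id_card(id_card);  -- option 1: unique index
alter table CUser add index idx_id_card(id_card);        -- option 2: ordinary (non-unique) index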

Now, my question to you is, from a performance perspective, would you choose a unique index or a regular index? What is the basis for your choice?

For simplicity, let’s use the example from the 4th article (“A Deep Dive into Indexes Part 1”) to illustrate. Let’s assume that the values on the field k are all unique.

Figure 1: InnoDB’s index organization structure

Next, we will analyze the performance impact of these two types of indexes on query statements and update statements.

Query Process #

Let’s assume that the query statement is select id from T where k=5. The process of searching for this query statement in the index tree starts with traversing the B+ tree from the root, searching layer by layer until reaching the leaf node, which is the data page in the bottom right corner of the diagram. We can consider that within the data page, the records are located using binary search.

  • For a regular index, after finding the first record (5,500) that satisfies the condition k=5, it needs to continue searching for the next record until the first record that does not satisfy the condition k=5 is encountered.
  • For a unique index, since the index defines uniqueness, after finding the first record that satisfies the condition k=5, the search stops.

So, how much of a performance difference does this make? The answer is: very little.

As you know, InnoDB reads and writes data in units of data pages. This means that when you need to read a record, it is not read from the disk individually, but rather the entire page that contains the record is loaded into memory. In InnoDB, the default size of each data page is 16KB.

Because the engine reads and writes data by page, when the record with k=5 is found, the data page it belongs to is already in memory. Therefore, for a regular index, the additional “find and check the next record” operation requires only one pointer move and one comparison.

Of course, if the record with k=5 happens to be the last record on this data page, fetching the next record requires reading the next data page, which adds a slightly more complex operation.

However, as we calculated before, for an integer field, one data page can hold nearly a thousand keys. Therefore, the probability of encountering this situation is very low. So, when calculating the average performance difference, we can still consider this operation cost to be negligible for today’s CPUs.
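
As a rough sanity check of that number (assuming an 8-byte bigint key plus about 6 bytes of pointer overhead per index entry, as in the earlier estimate), you can even let MySQL do the arithmetic:

select 16*1024 / (8 + 6) as entries_per_page;  -- roughly 1170 entries fit in a 16KB page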

Update Process #

In order to explain the impact of ordinary indexes and unique indexes on the performance of update statements, I need to introduce the concept of the change buffer.

When a data page needs to be updated, if the page is already in memory, it is updated directly. If the page is not in memory, then as long as data consistency is not affected, InnoDB caches the update operation in the change buffer, so the data page does not have to be read from disk. The next time a query needs to access this data page, the page is read into memory and the operations recorded for it in the change buffer are applied. This guarantees the logical correctness of the data.

It should be noted that although it is called the change buffer, it is actually persistent data. In other words, the change buffer has a copy in memory and is also written to disk.

The process of applying the operations in the change buffer to the original data page to get the latest results is called merge. In addition to accessing the data page triggering the merge, there are background threads that periodically perform the merge. The merge operation is also performed during the normal shutdown of the database.

Obviously, if update operations can first be recorded in the change buffer, reducing disk reads, the execution speed of the statements improves significantly. Moreover, reading a data page into memory takes up space in the buffer pool, so this approach also avoids occupying memory and improves memory utilization.

So, under what conditions can the change buffer be used?

For a unique index, all update operations need to first check whether this operation violates uniqueness constraints. For example, to insert the record (4,400), it needs to check whether there is already a record with k=4 in the table, and this can only be determined by reading the data page into memory. If the data page has already been read into memory, it is faster to update the memory directly and there is no need to use the change buffer.

Therefore, updates to a unique index cannot use the change buffer; in fact, only ordinary indexes can use it.

The change buffer uses the memory in the buffer pool, so it cannot be increased indefinitely. The size of the change buffer can be dynamically set using the innodb_change_buffer_max_size parameter. When this parameter is set to 50, it means that the change buffer can occupy at most 50% of the buffer pool.
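
If you want to check or adjust this setting on your own instance, the parameter can be read and changed dynamically, for example:

mysql> show variables like 'innodb_change_buffer_max_size';
mysql> set global innodb_change_buffer_max_size = 50;  -- allow the change buffer to use up to 50% of the buffer pool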

Now that you understand the mechanism of the change buffer, let’s look at InnoDB’s processing flow when inserting a new record (4,400) into this table.

The first case is when the target page to be updated is already in memory. In this case, the processing flow of InnoDB is as follows:

  • For a unique index, find the position between 3 and 5, check for conflicts, insert the value, and the statement execution ends.
  • For an ordinary index, find the position between 3 and 5, insert the value, and the statement execution ends.

In this case, the difference in performance of update statements between ordinary indexes and unique indexes is only due to the check, which consumes minimal CPU time.

However, this is not our main focus.

The second case is when the target page to be updated is not in memory. In this case, the processing flow of InnoDB is as follows:

  • For a unique index, the data page needs to be read into memory, a conflict check is performed, the value is inserted, and the statement execution ends.
  • For an ordinary index, the update is recorded in the change buffer, and the statement execution ends.

Reading data from disk into memory involves random I/O access, which is one of the most costly operations in the database. The change buffer significantly improves update performance by reducing random disk access.

I once encountered a case where a DBA colleague told me that the memory hit rate of a business’s database had suddenly dropped from 99% to 75%, the whole system stalled, and all update statements were blocked. After investigating, the cause turned out to be that this business had a large volume of insert operations, and he had changed one of the ordinary indexes to a unique index the day before.

Usage Scenario of the Change Buffer #

From the above analysis, you already understand the acceleration effect of using the change buffer in the update process, and you also understand that the change buffer is only applicable to scenarios with ordinary indexes, not to unique indexes. So now there is a question: does using the change buffer have an acceleration effect on all scenarios of ordinary indexes?

Since the merge is the moment when actual data updates are performed, and the main purpose of the change buffer is to cache the changes to records, the more changes recorded in the change buffer before merging on a data page (that is, the more times a page needs to be updated), the greater the benefit.

Therefore, for write-heavy, read-light businesses, the probability that a page is accessed immediately after being written is low, and this is where the change buffer works best. This pattern is common in systems such as billing and logging. Conversely, suppose a business’s update pattern is to query a record immediately after writing it. Then, even though the update can first be recorded in the change buffer, the data page is accessed right afterwards, which immediately triggers a merge. The number of random I/O operations is not reduced; instead, the cost of maintaining the change buffer is added. For this kind of business pattern, the change buffer actually has side effects.
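
If you want to see whether the change buffer is actually being exercised on your instance, one way (the output format varies by MySQL version) is to look at the InnoDB status report:

mysql> show engine innodb status\G
-- the "INSERT BUFFER AND ADAPTIVE HASH INDEX" section reports the change buffer
-- size and the number of merged insert / delete-mark / delete operations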

Index Selection and Practices #

Returning to the question raised at the beginning of the article: how should we choose between a regular index and a unique index? In fact, there is no difference in query capability between these two types of indexes; the main consideration is their impact on update performance. Therefore, I recommend that you choose regular indexes whenever possible.

If every update is immediately followed by a query on the same record, then you should disable the change buffer. In all other cases, the change buffer can improve update performance.
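
If you do need to turn the change buffer off for such a workload, InnoDB provides the innodb_change_buffering parameter for this; the following is a sketch (the default value is all):

mysql> set global innodb_change_buffering = 'none';  -- stop buffering changes to secondary indexes
mysql> set global innodb_change_buffering = 'all';   -- restore the default behavior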

In practical usage, you will find that the combination of regular indexes and the change buffer is particularly effective in optimizing updates for large tables.

Especially when using mechanical hard drives, the effect of the change buffer mechanism is very significant. So, when you have a database similar to “historical data” and you need to use mechanical hard drives due to cost considerations, you should pay special attention to the indexes in these tables and try to use regular indexes. Additionally, you should increase the size of the change buffer as much as possible to ensure fast data writes for this “historical data” table.

Change Buffer and Redo Log #

Now that you understand the principle of the change buffer, you might associate it with the redo log and WAL (Write-Ahead Logging) that I mentioned in a previous article.

In the comments of the previous article, I noticed that some students confused the redo log with the change buffer. Indeed, since WAL’s core performance gain also comes from minimizing random reads and writes, the two concepts are easy to mix up. So, here I put them into the same workflow to help you tell them apart.

Note: Here, you can review the relevant content in the second article [How is an SQL UPDATE Statement Executed?] if you want.

Now, let’s execute this INSERT statement on a table:

mysql> insert into t(id,k) values(id1,k1),(id2,k2);

Assuming the current state of the k index tree is as follows, where the data page containing k1 is in memory (InnoDB buffer pool) and the data page containing k2 is not in memory. The diagram below shows the update status with the change buffer.

Figure 2. Update Process with Change Buffer

Analyzing this INSERT statement, you will find that it involves four parts: memory, redo log (ib_log_fileX), data tablespace (t.ibd), and system tablespace (ibdata1).

This INSERT statement performs the following operations (according to the numbers in the diagram):

  1. Page 1 is in memory, so it is updated directly in memory.
  2. Page 2 is not in memory, so the information “I want to insert a row into Page 2” is recorded in the change buffer in memory.
  3. The above two actions are recorded in the redo log (numbers 3 and 4 in the diagram).

After completing the above operations, the transaction can be completed. Therefore, you will see that the cost of executing this INSERT statement is very low, as it only writes to two locations in memory, and then writes to disk once (combining two operations into one disk write), and the writes are sequential.

At the same time, the two dashed arrows in the diagram represent background operations that do not affect the response time of the update.

Then, how should we handle read requests that occur after these updates?

For example, let’s execute the statement select * from t where k in (k1, k2). Here, I have drawn the flowchart for these two reading requests.

If the read statements occur shortly after the update statement and the data is still in memory, then these two read operations are independent of the system tablespace (ibdata1) and the redo log (ib_log_fileX). Therefore, I did not include these two parts in the diagram.

Figure 3. Read Process with Change Buffer

From the diagram:

  1. When reading Page 1, the result is returned directly from memory. Some students asked in the comments of the previous article whether, with WAL, a read always has to go to disk, or whether the data must first be updated from the redo log before it can be returned. Actually, neither is necessary. Looking at the state in Figure 3, although the data on disk is still the old version, the result returned directly from memory is correct.
  2. When reading Page 2, the page needs to be read from disk into memory, and then the operations recorded for it in the change buffer are applied to generate the correct version before the result is returned.

As can be seen, the data page is only loaded into memory when Page 2 needs to be read.

Therefore, if we simply compare the benefits of these two mechanisms in improving update performance, the redo log mainly saves random writes to disk (converted to sequential writes), while the change buffer mainly saves random reads from disk.

Summary #

Today, I started with the choice between regular and unique indexes, shared the process of querying and updating data, explained the mechanism of the change buffer and its use cases, and finally discussed index selection in practice.

Since the unique index cannot benefit from the optimization mechanism of the change buffer, if performance is prioritized and the business can accept it, I recommend giving priority to non-unique indexes.

Finally, it’s time for a reflection question.

From Figure 2, you can see that the change buffer is initially written to memory. So, if the machine loses power and restarts at that moment, will the change buffer be lost? Losing the change buffer would be no small matter, because there would then be no merge process when the data is later read back from disk; in other words, the data would be lost. Can this situation occur?

You can share your thoughts and opinions in the comments, and I will discuss this question with you in the next article. Thank you for reading, and feel free to share this article with more friends to read together.

Addendum: There has been a lot of discussion in the comments about “whether to use unique indexes,” mostly centered on the dilemma that “the business may not be able to guarantee uniqueness.” Here, I would like to clarify again:

  • First, the correctness of the business comes first. The premise of this article is to discuss performance issues when the “business code has already ensured that duplicate data will not be written.” If the business cannot guarantee this or insists on using the database to enforce constraints, then there is no choice and you must create unique indexes. In this case, the significance of this article is to provide you with additional troubleshooting ideas when encountering slow data insertion and low memory hit rates.
  • Second, in some scenarios with “archival databases,” you can consider using regular indexes. For example, the online data only needs to be retained for six months, and then the historical data is stored in the archival database. At this time, the archival data is already guaranteed not to have unique key conflicts. To improve archival efficiency, you can consider changing the unique indexes in the table to regular indexes.
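
For the archival scenario in the second point, the conversion could look like the following sketch (the table and index names here are hypothetical):

alter table t_archive drop index uk_id_card, add index idx_id_card(id_card);  -- replace the unique index with an ordinary one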

Previous Question Time #

The question from the previous article was: How to create a scenario where “data cannot be modified.” Many students have already provided the correct answers in the comments, but here, I will describe it again.

[Figure: operation sequence]

In this way, session A will see the effect shown in my screenshot.

Actually, there is another scenario that no one has mentioned in the comments yet.

[Figure: another operation sequence, in which session B’ starts its transaction before session A]

If you run this sequence of operations, what session A sees will also reproduce the effect shown in my screenshot. The transaction started by session B’ is earlier than session A’s; this is actually a small Easter egg I left when describing the visibility rules of transaction versions in the previous article, because the rules also include a judgment on “active transactions,” which I had planned to supplement here.

When I tried to describe the complete rules, I found that the explanation in the 8th article [Isolation or non-isolation for transactions?] introduced too many concepts, making the analysis very complex.

Therefore, I rewrote the 8th article so that it will be more convenient for us to manually determine visibility. [When you see this, I suggest you reopen the 8th article and study it carefully again. If you have any questions during the learning process, please feel free to leave me a comment.]

The reason we say the update by session B’ is not visible to session A is that, at the moment session A’s view array was created, session B’ was still active, so it falls into the category of “uncommitted version, not visible.”