24 Distributed Database Index Design the Best Design Practice for Secondary Index and Global Index

In the previous two lectures, we learned about the architecture of a distributed database in MySQL. I believe you now have a clear understanding of the overall architecture of a distributed database and how data is sharded.

Combined with the “Table Structure Design” in the first module, you should be able to complete the table structure design in a distributed database architecture.

However, index design must also be adjusted in a distributed database architecture; otherwise, the linear scalability that the architecture offers cannot be fully exploited. Therefore, in this lecture, we will learn how to design indexes correctly in a distributed database architecture.

Primary Key Selection #

For the primary key, it is necessary to ensure uniqueness across all shards, and it is essentially a globally unique index. If we use the commonly used auto-increment as the primary key, we will encounter a significant problem.

This is because an auto-increment value is not obtained before insertion: the row is inserted with a NULL value for the column, the database generates the value, and the application then retrieves it through the function last_insert_id(). Since each shard generates its values independently, implementing the primary key with auto-increment on each shard can produce the same auto-increment value on different shards.

For example, for the order table “orders” in an e-commerce system, its table structure is as follows (the sharding key is “o_custkey”, and the primary key of the table is “o_orderkey”):

CREATE TABLE `orders` (
  `O_ORDERKEY` int NOT NULL auto_increment,
  `O_CUSTKEY` int NOT NULL,
  `O_ORDERSTATUS` char(1) NOT NULL,
  `O_TOTALPRICE` decimal(15,2) NOT NULL,
  `O_ORDERDATE` date NOT NULL,
  `O_ORDERPRIORITY` char(15) NOT NULL,
  `O_CLERK` char(15) NOT NULL,
  `O_SHIPPRIORITY` int NOT NULL,
  `O_COMMENT` varchar(79) NOT NULL,
  PRIMARY KEY (`O_ORDERKEY`),
  KEY (`O_CUSTKEY`)
  ......
) ENGINE=InnoDB

If we design “o_orderkey” as an auto-increment as shown above, it is likely that records with the same “o_orderkey” of 1 will appear in different shards, as shown in the diagram below:

Therefore, in a distributed database architecture, it is best not to use auto-increment as the primary key. This is also emphasized in the “Table Structure Design” in the first module: auto-increment has poor performance, low security, and is not suitable for distributed architecture.
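The collision described above can be reproduced in miniature. The following is only a sketch: it uses two in-memory SQLite databases from Python's standard library as stand-ins for two MySQL shards (an assumption made so that the example runs anywhere), and the helper names `make_shard` and `insert_order` are hypothetical.

```python
import sqlite3


def make_shard() -> sqlite3.Connection:
    """Create one in-memory database that stands in for a single shard."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        " o_orderkey INTEGER PRIMARY KEY AUTOINCREMENT,"
        " o_custkey INTEGER NOT NULL)"
    )
    return conn


def insert_order(conn: sqlite3.Connection, custkey: int) -> int:
    """Insert a row without a key and read back the generated value."""
    cur = conn.execute("INSERT INTO orders (o_custkey) VALUES (?)", (custkey,))
    return cur.lastrowid  # plays the role of last_insert_id()


# Each shard keeps its own counter, so both start counting from 1.
shards = [make_shard(), make_shard()]
key_on_shard_0 = insert_order(shards[0], custkey=10)
key_on_shard_1 = insert_order(shards[1], custkey=11)
print(key_on_shard_0, key_on_shard_1)  # 1 1
```

Both inserts succeed locally, yet the two orders now share the same primary key value, which is exactly the situation a globally unique key is meant to prevent.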

Now that we have explained the problems with auto-increment primary keys, how should the primary key be designed? It is still best to use a globally unique key as the primary key, such as the ordered UUID generated by MySQL, a globally unique key generated by the business (for example, by an ID-issuing service), or an open-source unique ID generation algorithm such as Snowflake (which, however, is exposed to the clock rollback problem).
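To make the Snowflake option and its clock problem concrete, here is a minimal sketch of a Snowflake-style generator. It assumes the commonly described layout of 41 timestamp bits, 10 worker bits, and 12 sequence bits; the custom epoch, the class name `SnowflakeGenerator`, and the injectable `clock` parameter are illustrative choices, not part of any particular library. When the clock moves backwards, this sketch simply refuses to issue IDs; production implementations handle this case in different ways.

```python
import threading
import time

EPOCH_MS = 1_577_836_800_000  # 2020-01-01 00:00:00 UTC; arbitrary custom epoch (assumption)
WORKER_BITS = 10
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class SnowflakeGenerator:
    """Hypothetical Snowflake-style ID generator: timestamp | worker | sequence."""

    def __init__(self, worker_id: int, clock=None):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        # The clock is injectable so that clock rollback can be simulated.
        self._clock = clock or (lambda: int(time.time() * 1000))
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # Clock rollback: issuing an ID now could duplicate an old one.
                raise RuntimeError("clock moved backwards; refusing to generate IDs")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence exhausted in this millisecond: wait for the next one.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            return (
                ((now - EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS))
                | (self.worker_id << SEQUENCE_BITS)
                | self._sequence
            )


generator = SnowflakeGenerator(worker_id=1)
ids = [generator.next_id() for _ in range(3)]
print(ids)
```

Because the timestamp occupies the high bits, IDs from one worker increase over time, which keeps inserts ordered, while the worker bits keep IDs from different workers distinct.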

In summary, replacing auto-increment with an ordered, globally unique key is the mainstream design standard for primary keys today. If you are still using auto-increment as your primary key, your design may be falling behind current practice.
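As an illustration of what "ordered" means for a UUID, the sketch below rearranges the bytes of a version 1 (time-based) UUID so that the high-order timestamp bits come first. It mirrors the byte swap that MySQL 8.0's UUID_TO_BIN(uuid, 1) performs (the first and third groups are exchanged); the function name `to_ordered_bytes` is hypothetical, and doing the swap in the application rather than in MySQL is an assumption made for the example.

```python
import uuid


def to_ordered_bytes(u: uuid.UUID) -> bytes:
    """Move the timestamp's high bits to the front of a version 1 UUID.

    Raw layout:     time_low (4) | time_mid (2) | time_hi (2) | rest (8)
    Ordered layout: time_hi (2)  | time_mid (2) | time_low (4) | rest (8)
    """
    raw = u.bytes
    return raw[6:8] + raw[4:6] + raw[0:4] + raw[8:16]


# A 16-byte key that could be stored in a BINARY(16) primary key column.
order_key = to_ordered_bytes(uuid.uuid1())
print(order_key.hex())
```

With the raw layout, the fastest-changing timestamp bits lead the value, so consecutive keys scatter across the index; after the swap, keys generated later on the same host compare greater in byte order, and inserts land near the end of the index.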

Index Design #

By using the sharding key, SQL queries can be routed to the specified shard, but in real production environments, businesses also need to access tables through other indexes.

Using the “orders” table mentioned earlier as an example, if the business needs to query based on the “o_orderkey” field, such as querying the details of the order with ID 1:

SELECT * FROM orders WHERE o_orderkey = 1

As we can see, because the query condition is not the sharding key, we need to query all 4 shards to get the final result. With 1000 shards, this SQL statement would have to be executed 1000 times, which is very slow.

However, we know that “o_orderkey” is the primary key and there should only be one record returned, which means “o_orderkey” exists in only one shard. In this case, we can design in the following two ways:

For the same set of data, create a second set of sharded databases and tables that is sharded by the “o_orderkey” field of the “orders” table;
Add the sharding key information to the index.

The essence of both approaches is to trade space for time through redundancy; otherwise, all shards would need to be scanned, and efficiency would become very poor when there are many shards.

The first approach achieves redundancy by duplicating the table. For querying “o_orderkey”, we only need to query directly in the shard where “o_orderkey = 1” exists, which is the most efficient. However, the drawback of this design is that there is too much redundant data.

Therefore, one of the improved approaches is to implement an index table that only contains “o_orderkey” and the sharding key “o_custkey”, as follows:

CREATE TABLE idx_orderkey_custkey (
  o_orderkey INT,
  o_custkey INT,
  PRIMARY KEY (o_orderkey)
)

If the index table is large, it can also be sharded and partitioned. However, its sharding key is o_orderkey. In this case, if we query based on the o_orderkey field, we can achieve a table lookup similar to a secondary index implementation: first, query the index table to obtain the value of o_custkey associated with the record where o_orderkey = 1, and then query based on o_custkey to locate the desired data, as shown below:

SELECT * FROM orders WHERE o_orderkey = 1

=>

# step 1

SELECT o_custkey FROM idx_orderkey_custkey 
WHERE o_orderkey = 1

# step 2

SELECT * FROM orders 
WHERE o_custkey = ? AND o_orderkey = 1

In this example, we split one SQL query into two SQL queries. After the split, both queries can be performed based on the sharding key, which ensures that the query operation is performed in a single shard. Regardless of the number of shards, only the information from 2 shards needs to be queried, greatly improving the performance of the SQL query.
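The two-step lookup can be sketched at the application layer as follows. This is an in-memory simulation, not a real router: the shards are plain dictionaries, and the shard count, the modulo routing rule `shard_of`, and the helpers `insert_order` and `find_order` are all assumptions introduced for illustration.

```python
NUM_SHARDS = 4  # assumed shard count for both tables


def shard_of(key: int) -> int:
    """Hypothetical routing rule: modulo on the sharding key."""
    return key % NUM_SHARDS


# orders is sharded by o_custkey; each shard maps o_orderkey -> row.
orders_shards = [{} for _ in range(NUM_SHARDS)]
# idx_orderkey_custkey is sharded by o_orderkey; each shard maps o_orderkey -> o_custkey.
index_shards = [{} for _ in range(NUM_SHARDS)]


def insert_order(o_orderkey: int, o_custkey: int, row: dict) -> None:
    """Write the order and its index entry.

    A real system must keep these two writes consistent with each other.
    """
    full_row = dict(row, o_orderkey=o_orderkey, o_custkey=o_custkey)
    orders_shards[shard_of(o_custkey)][o_orderkey] = full_row
    index_shards[shard_of(o_orderkey)][o_orderkey] = o_custkey


def find_order(o_orderkey: int):
    """Return (row, touched), where touched lists every shard that was queried."""
    touched = []
    # Step 1: look up o_custkey in the index table, routed by o_orderkey.
    idx_shard = shard_of(o_orderkey)
    touched.append(("idx_orderkey_custkey", idx_shard))
    o_custkey = index_shards[idx_shard].get(o_orderkey)
    if o_custkey is None:
        return None, touched
    # Step 2: fetch the row from orders, routed by o_custkey.
    data_shard = shard_of(o_custkey)
    touched.append(("orders", data_shard))
    return orders_shards[data_shard].get(o_orderkey), touched


insert_order(1, 7, {"o_totalprice": "99.50"})
row, touched = find_order(1)
print(row, touched)
```

However many shards exist, `touched` never contains more than two entries, which is the property the index table is meant to provide.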

Compared with the redundant table, the index table stores far less redundant data and reduces the overall table capacity, but the design is still not elegant enough, because the index data must be stored under another sharding key.

Therefore, the optimal design is not to create an index table, but to save the sharding key information in the column being queried. This way, the sharding information can be directly determined by querying the column.

If we design the primary key of the orders table as a string, with the last part of the string containing the sharding key information, such as:

o_orderkey = string(o_orderkey + o_custkey)

Then, if we query based on o_orderkey:

SELECT * FROM orders
WHERE o_orderkey = '1000-1';

Since the o_orderkey value itself carries the sharding key information (here, o_custkey = 1), we can determine directly that this order is in shard 1 and query only shard 1.

Similarly, during insertion, since we know the corresponding value of o_custkey during insertion, we only need to perform string concatenation at the application layer and then insert it into the database.

Compared to the design of redundant tables and index tables, this implementation is more efficient. Queries can determine the shard information of the data in advance, and only one query is needed to obtain the desired results.
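A minimal sketch of this scheme at the application layer is shown below. The "-" separator follows the '1000-1' example above; the shard count, the modulo routing rule, and all function names are assumptions made for illustration.

```python
NUM_SHARDS = 4  # assumed shard count


def make_order_key(order_seq: int, o_custkey: int) -> str:
    """Build the primary key by appending the sharding key to the order number."""
    return f"{order_seq}-{o_custkey}"


def custkey_from_order_key(o_orderkey: str) -> int:
    """Recover the sharding key from the last part of the primary key."""
    return int(o_orderkey.rsplit("-", 1)[1])


def shard_for_custkey(o_custkey: int) -> int:
    """Hypothetical routing rule: modulo on the sharding key."""
    return o_custkey % NUM_SHARDS


def shard_for_order_key(o_orderkey: str) -> int:
    """Route a primary key lookup without any index table or broadcast."""
    return shard_for_custkey(custkey_from_order_key(o_orderkey))


key = make_order_key(1000, 1)
print(key, shard_for_order_key(key))  # 1000-1 1
```

A lookup by o_orderkey and a lookup by o_custkey resolve to the same shard by construction, so a single query on a single shard is enough.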

The downside of this implementation is that the primary key value will be slightly larger and the storage will increase accordingly. However, as we mentioned in Lesson 05, as long as the primary key values are ordered, the insertion performance will not deteriorate. By saving the sharding information in the primary key values, the query efficiency can be greatly improved, making this space-for-time design highly worthwhile.

Of course, the designs we discussed here are specifically for unique index designs. If it is a non-unique secondary index query, unfortunately, it still needs to scan all shards to obtain the final results, as shown below:

SELECT * FROM orders
WHERE o_orderdate >= ? AND o_orderdate < ?
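What "scan all shards" means in practice can be sketched as a scatter-gather: the same range query is sent to every shard, and the partial results are merged by the caller. The example below simulates this with in-memory lists; the sample rows and the helper names `query_shard` and `query_all_shards` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date

# Each inner list stands in for the orders rows stored on one shard.
shards = [
    [{"o_orderkey": 1, "o_orderdate": date(2021, 1, 5)},
     {"o_orderkey": 5, "o_orderdate": date(2021, 3, 1)}],
    [{"o_orderkey": 2, "o_orderdate": date(2021, 1, 20)}],
    [{"o_orderkey": 3, "o_orderdate": date(2020, 12, 31)}],
    [{"o_orderkey": 4, "o_orderdate": date(2021, 1, 1)}],
]


def query_shard(rows, start: date, end: date):
    """Run the range predicate on one shard: start <= o_orderdate < end."""
    return [r for r in rows if start <= r["o_orderdate"] < end]


def query_all_shards(start: date, end: date):
    """Scatter the query to every shard in parallel, then gather and sort."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda rows: query_shard(rows, start, end), shards)
        merged = [row for part in partials for row in part]
    return sorted(merged, key=lambda r: r["o_orderdate"])


result = query_all_shards(date(2021, 1, 1), date(2021, 2, 1))
print([r["o_orderkey"] for r in result])  # [4, 1, 2]
```

Running the shard queries in parallel hides some latency, but the total work still grows with the number of shards, which is why this access path should remain the exception.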

Therefore, I would like to remind you again that the requirement for designing a distributed database architecture is that the vast majority of business requests can be located on one shard based on the sharding key.

If most business requests require scanning all shard information to obtain the final result, then it is not suitable for distributed architecture transformation or design.

Finally, let’s review the design of the Taobao user order table that we discussed earlier:

The above image is my Taobao order information. As you can see, the last 6 digits of the order number are all ‘308113’, so we can speculate that:

The sharding key of the Taobao order table is the user ID.
In the Taobao order table, the primary key of the order table contains the user ID, which is the sharding information. Therefore, by querying based on the order number, we can obtain the shard information and query only one shard to obtain the final result.

Global Tables #

In distributed databases, there are sometimes tables for which it is not possible to provide a sharding key. However, these tables are very small and are generally used for querying global information that is updated infrequently.

For example, the “nation” table in the TPC-H database stores country information, and it is used frequently in the SQL join queries we discussed earlier. Such global tables can be stored in every shard so that queries do not need to cross shards. The design is as follows:
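The idea of a global table can be sketched as follows: every write to the small table is applied to all shards, so a join against it is always local. This is an in-memory simulation under assumptions: the shard count, the modulo routing rule, and the helper names are hypothetical, and the TPC-H customer table is used only as an example of a sharded table that joins to nation.

```python
NUM_SHARDS = 4  # assumed shard count

# Every shard holds its own slice of customer plus a full copy of nation.
shards = [{"nation": {}, "customer": {}} for _ in range(NUM_SHARDS)]


def upsert_nation(n_nationkey: int, n_name: str) -> None:
    """Global table write: apply the same change to every shard."""
    for shard in shards:
        shard["nation"][n_nationkey] = n_name


def insert_customer(c_custkey: int, c_name: str, c_nationkey: int) -> None:
    """Sharded table write: route by the sharding key c_custkey."""
    shards[c_custkey % NUM_SHARDS]["customer"][c_custkey] = (c_name, c_nationkey)


def customer_with_nation(c_custkey: int):
    """Join customer to nation using only the customer's own shard."""
    shard = shards[c_custkey % NUM_SHARDS]
    c_name, c_nationkey = shard["customer"][c_custkey]
    return c_name, shard["nation"][c_nationkey]


upsert_nation(18, "CHINA")
insert_customer(7, "Customer#000000007", 18)
print(customer_with_nation(7))
```

The cost is that each update to the global table is multiplied by the number of shards, which is acceptable only because such tables are small and rarely updated.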

Unique Indexes #

Lastly, let’s discuss the design of unique indexes. As with primary keys, a unique index created solely through the unique constraint of the database table cannot guarantee uniqueness across all shards, because each shard enforces the constraint only on its own data.

Therefore, in distributed databases, unique indexes also need to be implemented using a mechanism similar to the UUID of the primary key, using global uniqueness to replace local uniqueness. In fact, even in a single-instance MySQL database architecture, we recommend globally unique designs, because at some point your business may be upgraded and come to require global uniqueness.

Summary #

Today we have introduced the very important topic of index design in distributed databases. The content is very practical and is crucial for designing distributed architectures. We hope that you will read it repeatedly and grasp the key points of this lesson, which are summarized as follows:

Design the primary key of the distributed database using an ordered UUID that is globally unique.
Design unique indexes in the distributed database using a globally unique design similar to UUID, to avoid the uniqueness problems caused by local indexes.
If a unique index in the distributed database is not the sharding key, the sharding information can be saved in the indexed column during design, so that queries can be directly routed to one shard.
For global tables in distributed databases, redundancy mechanisms can be used to save them in each shard. This can avoid cross-shard queries during querying.