01 Structural Hazards: Your Database Tables May Become a Performance Bottleneck #

Hello, I’m Xu Changlong. Welcome to the first chapter of this course.

In this chapter, we will mainly discuss how to optimize systems with high concurrency but predominantly read operations. I will use the user center as an example to illustrate several key points of transformation.

The user center is a typical system with predominantly read operations and few write operations. It can be said that most of our systems belong to this type. Such systems can achieve significant performance improvements through caching. Moreover, when traffic increases, the user center is usually the first module to be optimized in system transformation because it is often heavily coupled with multiple systems. Therefore, it is essential to review and optimize this module, as it has a significant impact on subsequent high-concurrency transformations of the entire system.

In today’s lesson, I will guide you on how to optimize the data organization of a user center with predominantly read operations. This will make the data more cache-friendly. Data organization is an essential technique, and it is recommended to review the table structure of any legacy system when performing high-concurrency transformations.

Legacy systems often have many issues when using databases, such as having too many fields in entity tables, multiple query dimensions and purposes for tables, and complex relationships between tables, including many-to-many relationships. These issues make caching transformations extremely challenging and significantly slow down the transformation process.

If we start with data organization and perform transformations on specific scenarios before implementing caching, the subsequent transformations will become much simpler. Therefore, it is helpful to first review the database structure and then carry out high-concurrency transformations on the system.

In this lesson, I will explain several specific patterns and approaches to help you quickly determine whether the current table structure is suitable for high-concurrency scenarios, facilitating system upgrades and transformations in the future.

Simplifying Data Leads to Better Performance #

To facilitate the discussion, let me first give a brief introduction to the user center.

The primary function of the user center is to maintain user information, user permissions, and login status. Most of the data it stores is read-heavy and write-light. The common optimization approach for the user center is to decouple it completely from the business and add caching where appropriate to improve system performance.

Let me give you a simple example: at the time, the account table held close to 20 million user records. I decoupled and simplified the table’s functions and fields so that the account table in the user center retained only the account and password needed for user login:

CREATE TABLE `account` (
  `id` int(10) NOT NULL AUTO_INCREMENT,
  `account` char(32) COLLATE utf8mb4_unicode_ci NOT NULL,
  `password` char(32) COLLATE utf8mb4_unicode_ci NOT NULL,
  `salt` char(16) COLLATE utf8mb4_unicode_ci NOT NULL,
  `status` tinyint(3) NOT NULL DEFAULT '0',
  `update_time` int(10) NOT NULL,
  `create_time` int(10) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `login_account` (`account`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

We know that the database is the core of the system: if it is slow, every business built on it suffers, and our services can rarely outperform the core database. The point of trimming the account table’s fields is that shorter rows are faster to query, transmit, and process, which improves throughput, and the data is also easier to manage and cache.

The simplified table has fewer fields, and its business purpose is correspondingly straightforward: its main job is to check whether a user’s login account and password are correct, and under normal circumstances it is not accessed for other purposes or pulled into broader queries. As you can imagine, this table performs very well even though it stores 20 million accounts.
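
For example, a typical login check against this table is a single point lookup on the unique `login_account` index. Here is a minimal sketch of that access pattern; how the password hash is verified against `salt` happens in application code and is an assumption of the sketch, not part of the schema above:

-- Point lookup on the unique `login_account` index: at most one row is examined.
-- Password verification against `salt` is done in application code (assumed here).
SELECT `id`, `password`, `salt`, `status`
FROM `account`
WHERE `account` = 'rick_9527';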

However, note that while simplifying data improves response speed, I do not advocate over-trimming, because a table with no redundant fields at all can make the business logic more cumbersome to implement. For example, if we remove the nickname and avatar from the account table, every login needs one extra database read, and we also have to watch cache synchronization for the account table. If we keep the nickname and avatar in the account table instead, we can continue with the rest of the business logic immediately after login verification, with no additional database query.

As you can see, trimming away one or two fields can force an extra database query on many code paths and add cache-synchronization work, which is not worth it. We therefore need to find a balance between “more fields” and “fewer functions”.

Classification and Deep Organization of Data #

In addition to improving a table’s performance and maintainability by trimming its functions, we can also apply different caching strategies to different types of tables. Let’s take the user center tables shown in the figure below as an example:

[Figure: the four types of user center tables — entity, auxiliary, relationship, and history tables]

There are four main types of data: the main entity table, auxiliary query tables, entity relationship tables, and historical data. The caching strategy for each type is different, and if we cache data whose function is not clearly defined, we will run into many thorny problems when using it.

I have made this kind of mistake before: caching a continuously growing history of user visits. The purpose of those records was to count visits from friends and from strangers, but each record also carried a flag indicating whether the visitor was a friend. That meant whenever a friendship changed, the historical data had to be updated as well; otherwise the friend relationships stored in it would go stale.

It is unreasonable to mix historical records with friend status that is updated in real time. If we classify and organize the data, we should split it into three functional tables and manage them separately (a rough sketch of the split follows the list):

  • Visit history table: not cached; only the most recent records are displayed, with a short-lived temporary cache in extreme cases.
  • Friend relationship table: relationships are cached and used to count the number of friends.
  • Visit statistics table: kept in a temporary cache.
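
Here is a minimal sketch of what such a split might look like. The table and column names are illustrative assumptions, not the original schema:

-- Visit history: append-only, queried only for the most recent records.
CREATE TABLE `visit_history` (
  `id` bigint unsigned NOT NULL AUTO_INCREMENT,
  `uid` int(10) unsigned NOT NULL,
  `visitor_uid` int(10) unsigned NOT NULL,
  `create_time` int(10) unsigned NOT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_uid_time` (`uid`, `create_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

-- Friend relationship: the single source of truth for "is X a friend of Y".
CREATE TABLE `friend_relation` (
  `uid` int(10) unsigned NOT NULL,
  `friend_uid` int(10) unsigned NOT NULL,
  `create_time` int(10) unsigned NOT NULL,
  PRIMARY KEY (`uid`, `friend_uid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

-- Visit statistics: pre-aggregated counters, refreshed periodically and cached briefly.
CREATE TABLE `visit_stat` (
  `uid` int(10) unsigned NOT NULL,
  `friend_visits` int(10) unsigned NOT NULL DEFAULT 0,
  `stranger_visits` int(10) unsigned NOT NULL DEFAULT 0,
  `update_time` int(10) unsigned NOT NULL,
  PRIMARY KEY (`uid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

With this split, a friendship change only touches the friend relationship table, while the history table stays append-only.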

After understanding the importance of data classification and organization, let’s look at how to optimize caching for the above four types of data.

Entity Data Table #

First, let’s take a look at the user account table, which is an entity data table. Entity data tables are generally used as the main tables, where each row represents an entity, and each entity has a unique ID as its identifier. In this case, “entity” represents an abstract concept, and the specific fields represent the real-time status attributes of the entity.

This ID is crucial for caching in high-concurrency environments. After a user logs in, we need to fetch the corresponding orders, nickname, avatar, and friend list directly by the account ID. If our business relies on this kind of lookup, performance will be good and the data is well suited to long-term caching.

However, besides ID-based retrieval, there are also some queries that require multiple conditions, such as:

  • What are the orders placed on July 4th to purchase headphones?
  • How many new registered users are there among users from Tianjin? How many are old users?
  • Were there any account registrations yesterday with a username prefix “rick”?

Data produced by such conditional queries and statistics is not suitable for caching, because what high-concurrency services cache is usually data that can be matched directly and quickly by a hashed key. Caching the results of conditional queries and statistical operations, by contrast, leads to problems such as data inconsistency and unstable performance caused by unpredictable result sizes. Moreover, when the underlying data changes, it is hard to work out from the data itself which cached results need to be refreshed.

Therefore, this type of data is only suitable for keeping in the relational database, or for pre-calculating the results, caching them for direct use, and refreshing them on a schedule.

Besides the difficulty of caching conditional queries, real-time aggregate calculations on the data, such as count() and sum(), suffer from the same delayed updates: their results can only be cached and refreshed periodically, so they cannot serve frequent queries. In later development we should therefore avoid pushing calculations onto the database wherever possible.
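
As a rough illustration of “pre-calculate and refresh on a schedule”, the sketch below assumes a hypothetical daily registration statistics table and a hypothetical `user_profile` source table with `city` and `create_time` columns; a scheduled job rebuilds the summary so online requests never run count() over the raw data:

-- Hypothetical pre-aggregated table, rebuilt by a scheduled job rather than
-- computed per request. Table and column names are illustrative only.
CREATE TABLE `daily_register_stat` (
  `stat_date` date NOT NULL,
  `city` varchar(32) NOT NULL,
  `new_users` int(10) unsigned NOT NULL DEFAULT 0,
  PRIMARY KEY (`stat_date`, `city`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

-- The job runs the expensive aggregation once, off the hot path
-- (assumes a `user_profile` table with `city` and `create_time` columns):
REPLACE INTO `daily_register_stat` (`stat_date`, `city`, `new_users`)
SELECT DATE(FROM_UNIXTIME(`create_time`)), `city`, COUNT(*)
FROM `user_profile`
WHERE `create_time` >= UNIX_TIMESTAMP(CURDATE() - INTERVAL 1 DAY)
  AND `create_time` <  UNIX_TIMESTAMP(CURDATE())
GROUP BY DATE(FROM_UNIXTIME(`create_time`)), `city`;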

Going back to the previous topic, let’s continue discussing the design of common entity data tables. In fact, these tables are designed based on the main query requirements of the business. If we do not query the table according to this purpose, the performance will often be poor.

For example, if we use the account login table to check whether a user’s nickname contains the phrase “geek”, a lot of extra work is needed: we have to add an index on the nickname field, and this kind of LIKE query still ends up scanning the entire table to evaluate the condition.

If such queries are frequent, they will seriously affect other users’ logins. In addition, the newly added nickname index also reduces the table’s insert performance. That is why our back-end systems often run this kind of query against a separate replica (slave) database with its own special-purpose indexes.
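
To make the cost concrete, here is a sketch of such a query. It assumes the account table kept a `nickname` column; neither that column nor the index below appears in the schema shown earlier:

-- A leading-wildcard LIKE cannot use a B-tree index for range lookup,
-- so this degenerates into scanning and filtering a huge number of rows.
SELECT `id`, `nickname`
FROM `account`
WHERE `nickname` LIKE '%geek%';

-- Even an added index only helps queries anchored at the prefix,
-- and every extra index slows down inserts and updates on the table.
ALTER TABLE `account` ADD KEY `idx_nickname` (`nickname`);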

Generally speaking, when optimizing the performance of read operations with cache in high-concurrency scenarios, the cache mainly stores entity data. The common approach is to first fetch data by “key prefix + entity ID” (e.g., user_info_9527), and then obtain specific data through the relationships stored in the cache. For example, we can directly retrieve the user’s friend relationship key by using their ID and get a list of their friend IDs. Using a similar method, we can implement common relationship queries for users in Redis.

Overall, entity data is the main carrier of our business. Once we find the main entity, we can use the cache to find all related data for serving users. Now, let’s summarize the core ideas of organizing entity tables:

  • Reduce the total length of data.
  • Reduce the business functions carried by the table.
  • Reduce statistical query operations.
  • Entity data is more suitable for caching.
  • Enable entity retrieval through ID or relationships.
  • Reduce externally exposed queries that rely on real-time conditional filtering.

Now let’s continue to look at the structure of the other three types of tables, and you will find that they are not suitable for caching because maintaining their consistency is very cumbersome.

Auxiliary Entity Tables #

To streamline the data and facilitate management, we often split the main table based on different purposes, commonly known as vertical partitioning.

The purpose of vertical partitioning is generally twofold. One is to extract less frequently used data from the main table. The main table often has many fields, and by splitting it, we can simplify its functions. The primary key of the auxiliary table usually remains the same as the main table or is associated through a record ID. The common relationship between them is 1:1. The data stored in auxiliary tables are generally not used in primary business queries. These data are only used in very few scenarios, such as the user account table being used for user login, while the auxiliary information table stores information like home address, province, WeChat, and zip code that are not usually displayed.
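
As a minimal sketch (the names are illustrative, not the original schema), a 1:1 auxiliary table for this rarely displayed profile data could share the account table’s primary key:

-- 1:1 auxiliary table: shares the account id as its primary key and holds
-- fields that the login path never touches. Names are illustrative.
CREATE TABLE `account_profile` (
  `account_id` int(10) NOT NULL,
  `province` varchar(32) NOT NULL DEFAULT '',
  `address` varchar(255) NOT NULL DEFAULT '',
  `wechat` varchar(64) NOT NULL DEFAULT '',
  `zip_code` char(10) NOT NULL DEFAULT '',
  `update_time` int(10) NOT NULL DEFAULT '0',
  PRIMARY KEY (`account_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;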

Another purpose of auxiliary tables is to assist in querying. They can be used when the existing data structure cannot meet the entity queries from other dimensions.

For example, suppose a table is designed around “teacher” as the main entity, and every business query fetches student and class data by “current teacher ID + conditions”. When the system is used from the student’s perspective, however, frequently looking up teacher data by “student and class” is awkward and inefficient. In that case, the student and class data can be split out into an auxiliary table that stores the details keyed by this new query dimension, making such queries easy.
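
A hedged sketch of such a query-dimension auxiliary table, with hypothetical names: it duplicates the teacher/class/student association but is keyed by student, so student-side queries hit an index directly:

-- Auxiliary table keyed by the student dimension; its data is redundant with the
-- teacher-centric tables and must be kept in sync. Names are hypothetical.
CREATE TABLE `student_class_teacher` (
  `student_id` int(10) unsigned NOT NULL,
  `class_id` int(10) unsigned NOT NULL,
  `teacher_id` int(10) unsigned NOT NULL,
  `update_time` int(10) NOT NULL DEFAULT '0',
  PRIMARY KEY (`student_id`, `class_id`),
  KEY `idx_class` (`class_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;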

Furthermore, I would like to remind you that since the auxiliary table created by splitting may have a 1:n or even m:n relationship with the main table, we need to regularly organize and verify the data to ensure the synchronization and integrity of our redundant data.

However, maintaining auxiliary tables whose relationship to the main table is not 1:1 is not easy, because they are prone to data inconsistency and lag, and in some scenarios refreshing every related cache entry is time-consuming and labor-intensive. It is usually simpler to run a periodic verification script that finds data discrepancies and repairs them.

In addition, in many cases, in order to improve query efficiency, we duplicate the same data in multiple tables. When the data is updated, we need to synchronize the redundant tables and cached data.

It is worth mentioning that in the industry we also use open-source search engines to assist with this kind of relationship-based business query; for example, Elasticsearch for product search and OpenSearch for article search. These scalable services can greatly reduce the query pressure on the database. Their one drawback is that strong data consistency is hard to achieve, so we need additional checks to detect and reconcile discrepancies between the two systems.

Entity Relationship Tables #

Next, let’s talk about the relationships between entities.

[Figure: examples of entity relationships]

For relationship data, I strongly recommend recording the m:n associations between entities in a separate relationship table. This makes maintenance easier and avoids the difficulties caused by the two entities depending directly on each other.
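
As a minimal sketch with hypothetical names, an m:n association such as the user-to-role mapping from the permission module could live in its own relationship table, indexed in both directions:

-- Hypothetical m:n relationship table: one row per (uid, role_id) pair.
-- The primary key serves lookups by user; the secondary index serves lookups by role.
CREATE TABLE `user_role_relation` (
  `uid` int(10) unsigned NOT NULL,
  `role_id` int(10) unsigned NOT NULL,
  `create_time` int(10) NOT NULL DEFAULT '0',
  PRIMARY KEY (`uid`, `role_id`),
  KEY `idx_role` (`role_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;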

When caching data with 1:n or m:n relationships, we recommend estimating the potential data size in advance to prevent slow cache retrieval caused by oversized values. Usually the relationship is stored in the cache with the main entity’s ID as the key and the IDs of the related entities in the value. Only for business caches that are read very frequently would we go further, organize the data by relationship, and cache it as a whole for convenient querying and use.

It is important to note that this kind of associated data easily develops multi-level dependencies, which makes it hard to organize and maintain: whenever the related tables or conditions change, the corresponding cached data must be synchronized. Keeping such multi-level dependencies consistent in a high-concurrency system is difficult, so we often lower the consistency requirements to meet the needs of high-concurrency business scenarios.

In summary, data that can be matched precisely by ID is suitable for caching. Structures built with String, List, or Set commands, which hold multiple values, suit auxiliary or relationship queries (1:1, 1:n, m:n). Finally, although the Hash structure fits entity attributes and status well, the HGETALL command performs poorly on large hashes and can easily stall the cache, so it is not recommended.

[Figure: which cache data structures suit which kinds of data]

Action History Table #

At this point we have finished most of the organization work, and you should have a deeper sense of which data can be cached. To reinforce it, let me give you a counterexample.

Generally speaking, the action history data table records the actions or status changes of data entities, such as user login logs, user point consumption records, etc. These types of data constantly grow with time and are usually used to record and display the most recent information. However, they are not recommended for real-time statistical calculations in business.

You may have doubts about this suggestion, so let me give you a simple example. Suppose we want to count how many times a user has redeemed the gift with ID 15, using a points redemption history table that contains 20 million records:

CREATE TABLE `user_score_history` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `uid` int(10) NOT NULL DEFAULT '0',
  `action` varchar(32) NOT NULL,
  `action_id` char(16) NOT NULL,
  `status` tinyint(3) NOT NULL DEFAULT '0',
  `extra` text NOT NULL,
  `update_time` int(10) NOT NULL DEFAULT '0',
  `create_time` int(10) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`),
  KEY `uid` (`uid`, `action`)
) ENGINE=InnoDB AUTO_INCREMENT=1
DEFAULT CHARSET=utf8mb4
COLLATE=utf8mb4_unicode_ci;

-- the redeemed gift ID is stored in `action_id`
SELECT `uid`, COUNT(*) AS `action_count`, `action_id`
FROM `user_score_history`
WHERE `uid` = 9527 AND `action` = 'fetch_gift'
  AND `action_id` = '15' AND `status` = 1
GROUP BY `uid`, `action_id`;

It is easy to see that this table holds a huge amount of data recording entity action histories, and its fields and indexes are not designed for this kind of query. To count how many times a user has redeemed the gift with ID 15, we can only use the uid index first to narrow the scope, and even the filtered data set is still large. Worse, as time goes by the table keeps growing and its query efficiency keeps dropping.

Therefore, for conclusions that require statistical calculation over large amounts of data, I do not recommend exposing real-time statistical services to the outside. Such queries seriously slow down the database and hurt service stability, and even temporarily caching the statistical results is only a stopgap. It is better to let a purpose-built table do this job, for example a dedicated redemption record table that can be queried in real time; the results will be much better.
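
As a hedged sketch of that idea (all names hypothetical), the service could maintain a dedicated counter table at redemption time instead of aggregating the 20-million-row history table on every request:

-- Hypothetical per-user, per-gift redemption counter, updated when the
-- redemption happens rather than computed from the raw history rows.
CREATE TABLE `user_gift_count` (
  `uid` int(10) unsigned NOT NULL,
  `gift_id` char(16) NOT NULL,
  `redeem_count` int(10) unsigned NOT NULL DEFAULT 0,
  `update_time` int(10) NOT NULL DEFAULT '0',
  PRIMARY KEY (`uid`, `gift_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

-- On each successful redemption:
INSERT INTO `user_gift_count` (`uid`, `gift_id`, `redeem_count`, `update_time`)
VALUES (9527, '15', 1, UNIX_TIMESTAMP())
ON DUPLICATE KEY UPDATE
  `redeem_count` = `redeem_count` + 1,
  `update_time`  = UNIX_TIMESTAMP();

-- The business query then becomes a primary-key lookup:
SELECT `redeem_count` FROM `user_gift_count` WHERE `uid` = 9527 AND `gift_id` = '15';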

Summary #

In the early stages of a project, the function design of a data table is often relatively simple. However, as time goes on and business development changes, the table undergoes multiple modifications, resulting in significant changes in its usage and functions, making our system increasingly complex.

Therefore, when traffic exceeds what the database can bear and a caching transformation is needed, we recommend first classifying the data tables according to the current business logic. This helps you quickly identify which fields and functions do not belong in a given type of table, so the data you do put in the cache delivers better value for its cost.

Generally speaking, data can be divided into four categories: entity tables, entity auxiliary tables, relationship tables, and historical tables. The main criteria for judging whether they are suitable for caching are as follows:

  • Entities that can be quickly matched by ID and data that can be quickly queried by relationships are suitable for long-term caching.
  • Data filtered or computed from combined conditions can also be placed in a temporary cache, but its updates will lag.
  • Fast-growing tables, and data used in ways that deviate from the table’s original design, are not recommended for caching.

[Figure: summary of the four table types and their caching suitability]

Reflection Question #

Please take a moment to think about whether records of users inviting other users to register belong to historical records or relationship records.

Feel free to discuss and exchange ideas with me in the comments section. See you in the next class!