
11 NoSQL in High-Concurrency Scenarios - How Databases and NoSQL Complement Each Other #

Hello, I am Tang Yang.

In the previous lessons, I explained how to transform a traditional relational database into a distributed storage service to withstand high concurrency and high traffic in your vertical e-commerce project.

When it comes to storage services, we generally make improvements in two areas:

  1. Improving read and write performance, especially read performance, because most of the products we build serve far more queries than writes. For example, popular platforms like WeChat Moments, Weibo, and Taobao all have a query QPS much higher than their write QPS.

  2. Enhancing its scalability in terms of storage to meet the demands of large data volumes.

Previously, we learned about techniques like read-write separation and sharding to achieve these improvements in traditional relational databases. However, there are still some problems that cannot be solved.

For example, in a microblogging project, the follow-relationship data can reach hundreds of billions of rows. Even split into 1024 tables, each table would still hold hundreds of millions of rows, and the data keeps growing rapidly, so splitting into even more tables only postpones the bottleneck. This problem is difficult to solve with traditional databases because they are weak in terms of scalability. This is where NoSQL databases come in: they have built-in distributed capabilities and excellent read and write performance, complementing the shortcomings of traditional relational databases. So how do they achieve this?

In this lesson, I will continue using your vertical e-commerce system as an example to help you understand how to use NoSQL databases and relational databases together to handle high concurrency and high traffic.

First, let’s understand what a NoSQL database is.

NoSQL, No SQL? #

You are probably familiar with NoSQL. It is the collective term for database systems that differ from traditional relational databases: they generally do not use SQL as their query language, and they offer excellent horizontal scalability and read-write performance, making them a natural fit for the high-concurrency, large-volume storage needs of Internet projects. That is why major companies such as Xiaomi, Weibo, and Momo tend to use them for high-concurrency, high-capacity data storage.

After more than a decade of development, various types of NoSQL databases have emerged. Let me give you a few examples:

  • Key-Value (KV) stores like Redis and LevelDB. Compared to traditional databases, these stores offer extremely high read-write performance, and are often used in scenarios with high performance requirements.
  • Columnar storage databases like HBase and Cassandra. These databases store data by column rather than by row as traditional databases do, which suits scenarios such as offline data analytics.
  • Document-oriented databases like MongoDB and CouchDB. These databases are schema-free, allowing for flexible extension of fields in a data table. For example, in an e-commerce system, different categories of products may have varying numbers of fields. Using a relational database would require continuously adding support for new fields, whereas using a document-oriented database is much simpler.
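To make the "schema-free" point above concrete, here is a small illustrative sketch (plain Python dicts stand in for documents; the product data and `field_names` helper are made up for the example, not any real store's API):

```python
# Sketch: why schema-free documents suit products with different fields.
# Each "document" carries only the fields its category needs; adding a new
# category never requires an ALTER TABLE-style schema migration.

products = [
    {"id": 1, "name": "refrigerator", "capacity_l": 300, "energy_class": "A+"},
    {"id": 2, "name": "t-shirt", "size": "M", "color": "blue"},  # different fields
]

def field_names(doc):
    """List the fields one document actually carries."""
    return sorted(doc.keys())

print(field_names(products[0]))  # fields of the appliance
print(field_names(products[1]))  # fields of the clothing item
```

A relational table would need a column for every field any category might use, or repeated schema migrations; a document store simply accepts both shapes side by side.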

When NoSQL databases were first introduced, they were considered a silver bullet that could replace relational databases. In my opinion, this may be due to the following reasons:

  • Making up for the performance shortcomings of traditional databases.
  • Convenient database schema changes without altering the original data structure.
  • Suitable for scenarios with large data volumes commonly seen in Internet projects.

However, this view is a misconception, as we gradually discovered that in the context of business development, we still need the powerful query capabilities, transaction support, and flexible indexing features provided by SQL and traditional databases. NoSQL can only serve as a supplement for certain scenarios.

Next, let’s take a look at how NoSQL databases complement relational databases. Understanding this part will enable you to better utilize NoSQL databases to supplement the limitations of traditional databases in practical projects.

First, let’s focus on the write performance of databases.

Improving Write Performance with NoSQL #

Most database systems use traditional mechanical hard drives, which have two ways of accessing data: random I/O and sequential I/O. Random I/O requires expensive disk seek time, so in general, its read/write efficiency is two to three orders of magnitude lower than that of sequential I/O. Therefore, to improve write performance, it is important to minimize random I/O.

For example, in MySQL’s InnoDB storage engine, writes to the binlog, redo log, and undo log use sequential I/O, while updates to data files and index files require random I/O. Although relational databases make many optimizations to minimize random I/O, such as buffering writes in memory and flushing them to disk in batches, random I/O still occurs.

In the InnoDB engine, indexes are organized as a B+ tree (mentioned in the previous lesson), and MySQL’s primary key is a clustered index, meaning the row data is stored together with the index entries. Because of this, when inserting or updating data the engine must first find the right position for the row, which causes random I/O. And once a page split occurs, data has to be moved, which significantly hurts write performance.

How do NoSQL databases solve this problem?

There are multiple solutions, but let’s focus on the most common one: LSM-tree-based storage engines, which are widely used in many NoSQL databases.

The LSM tree (Log-Structured Merge Tree) sacrifices some read performance in exchange for high write performance. HBase, Cassandra, and LevelDB use this algorithm as their storage engine.

The idea is simple: data is first written to an in-memory structure called a MemTable, where entries are kept sorted by key. To avoid losing data on a power failure or restart, each write is also appended to a write-ahead log (WAL) on disk.

When the MemTable accumulates to a certain size, it is flushed to disk, generating a new file called an SSTable (Sorted String Table). When the number of SSTables reaches a certain threshold, they are merged to reduce the number of files. Since SSTables are all sorted, the merging process is fast.

When reading data from the LSM tree, we first search the MemTable, and if the data is not found, we search the SSTables. Because the stored data is sorted, the search efficiency is high, although slightly lower than that of B+ tree indexes due to the multiple SSTables that need to be searched.
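The write and read paths described above can be sketched in a few dozen lines. This is an illustrative toy, not a real storage engine: the `TinyLSM` class, its size threshold, and the in-memory "SSTables" are all simplifications (a real engine also keeps a WAL and merges SSTables in the background):

```python
import bisect

class TinyLSM:
    """Minimal sketch of the LSM-tree idea: sorted MemTable in memory,
    immutable sorted SSTables flushed when the MemTable fills up."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}   # in a real engine: a sorted structure plus a WAL
        self.sstables = []   # list of sorted (key, value) runs on "disk"
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value        # pure in-memory write
        if len(self.memtable) >= self.limit:
            self._flush()

    def _flush(self):
        # Writing one sorted run sequentially is the random-to-sequential
        # I/O conversion the text describes.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:          # newest data wins
            return self.memtable[key]
        for table in reversed(self.sstables):
            keys = [k for k, _ in table]
            i = bisect.bisect_left(keys, key)   # binary search: runs are sorted
            if i < len(keys) and keys[i] == key:
                return table[i][1]
        return None

db = TinyLSM()
for k, v in [("b", 1), ("a", 2), ("d", 3), ("c", 4), ("e", 5)]:
    db.put(k, v)
print(db.get("a"), db.get("e"))  # "a" found in an SSTable, "e" in the MemTable
```

Note how a read may have to consult several SSTables, which is exactly the read-performance cost the LSM tree trades away for its fast writes.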

[Figure: LSM-tree write path — MemTable, flush to SSTables, and SSTable merging]

There are many similar algorithms to LSM trees, such as the Fractal tree index structure used by TokuDB. The core idea of these algorithms is to convert random I/O into sequential I/O, thereby improving write performance.

In the following section on caching, I will also focus on how we use key-value NoSQL storage to improve read performance. As you can see, one way NoSQL databases complement relational databases is by improving read and write performance.

Complementing Specific Scenarios #

In addition to improving performance, NoSQL databases can also complement traditional relational databases in certain scenarios. Let’s take a specific example.

Suppose one day the CEO approaches you and tells you that he is planning to add a search function to your vertical e-commerce project. He wants to support fuzzy searching for products based on their names and asks you to research a solution as soon as possible.

At first, you think it’s very simple: just run a query like `select * from product where name like '%***%'` against the database, right? However, when you actually execute it, you run into a problem.

You realize that not all such queries can use an index: only queries whose wildcard comes at the end can. For example, `select * from product where name like '% refrigerator'` cannot use the index on the `name` field, whereas `select * from product where name like 'Sony %'` can. And without an index, scanning the entire table is unacceptable in terms of performance.
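The reason behind this asymmetry is worth spelling out: index entries are stored in sorted order, so a fixed prefix maps to one contiguous range that binary search can find, while a leading wildcard gives the sort order nothing to work with. A small sketch (the product names are made up for illustration):

```python
import bisect

# Index entries, like B+-tree keys, are kept sorted.
names = sorted(["Sony TV", "Sony refrigerator", "Haier refrigerator",
                "Sony camera", "Midea fan"])

def prefix_search(sorted_names, prefix):
    # Equivalent of `name LIKE 'Sony %'`: two binary searches bound the
    # contiguous range of matching entries -- no full scan needed.
    lo = bisect.bisect_left(sorted_names, prefix)
    hi = bisect.bisect_left(sorted_names, prefix + "\uffff")
    return sorted_names[lo:hi]

def suffix_search(sorted_names, suffix):
    # Equivalent of `name LIKE '% refrigerator'`: the sort order does not
    # help, so every entry must be checked -- a full scan.
    return [n for n in sorted_names if n.endswith(suffix)]

print(prefix_search(names, "Sony "))         # index-friendly range scan
print(suffix_search(names, "refrigerator"))  # unavoidable full scan
```

This is why databases can satisfy trailing-wildcard `LIKE` queries with an index range scan but fall back to a table scan for leading wildcards.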

So you do a search on Google for a solution and discover that everyone is using the open-source component Elasticsearch to support search requests. Elasticsearch itself is implemented based on an inverted index. So what exactly is an inverted index?

An inverted index maps each token generated from a record’s columns back to the records that contain it. For example, in your vertical e-commerce project, you have the following product records:

[Figure: sample product records with IDs and names]

You tokenize the product names and establish the mapping between tokens and product IDs, as shown below:

[Figure: mapping from tokens to product IDs]

This way, if a user searches for “refrigerator”, you can show them the products with the IDs 1 and 3.
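Building such an index is straightforward. Here is a minimal sketch; the product names are hypothetical, chosen so that "refrigerator" maps to IDs 1 and 3 as in the example above (real engines use proper tokenizers rather than whitespace splitting):

```python
from collections import defaultdict

products = {
    1: "Sony refrigerator",
    2: "Midea fan",
    3: "Haier refrigerator",
}

# Invert the mapping: token -> set of product IDs containing it.
inverted = defaultdict(set)
for product_id, name in products.items():
    for token in name.lower().split():   # naive whitespace tokenization
        inverted[token].add(product_id)

# A search for "refrigerator" is now a single dictionary lookup.
print(sorted(inverted["refrigerator"]))  # [1, 3]
```

The lookup cost no longer depends on scanning every product name, which is what makes full-text search over large datasets feasible.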

Elasticsearch, a widely used NoSQL component, builds on the inverted index to provide a distributed full-text search service, something that is hard to achieve with SQL queries in a traditional relational database. So you can see that in certain business scenarios, NoSQL can stand in for a traditional database as the data store.

Improving Scalability #

In terms of scalability, many NoSQL databases have inherent advantages. Taking your vertical e-commerce system as an example, you have added a commenting system to your e-commerce system. Initially, you made an optimistic evaluation and believed that the amount of comments in the e-commerce system would not grow rapidly. Therefore, you divided it into 8 databases, with each database divided into 16 tables.

However, after the commenting system went online, the amount of storage grew rapidly, and you had to split the database into more databases and tables. The process was very painful, and data migration was also prone to errors.

At this point, you consider whether to use NoSQL databases to completely solve the scalability problem. After research, you found that they were designed from the beginning to consider distributed and large data storage scenarios. For example, MongoDB has three scalability features.

  • The first is the Replica, also known as a replica set. You can think of it as master-slave replication: by keeping multiple copies of the data, it ensures no data is lost when the primary fails, and the replicas can also share read traffic. In a replica set, a primary node handles write requests and records data changes in the oplog (similar to MySQL’s binlog); the secondary nodes apply the oplog to keep their data consistent with the primary. If the primary fails, MongoDB elects one of the secondaries as the new primary and continues serving writes.
  • The second is the Shard: the data is split into multiple parts by some rule and stored on different machines. MongoDB’s sharding generally involves three roles. The Shard Server is the node that actually stores the data, an independent mongod process. The Config Server, also a group of mongod processes, stores metadata such as which shard holds which data. Finally, the Route Server stores no data itself; it fetches the metadata from the Config Server and routes each request to the correct Shard Server.

[Figure: MongoDB sharding architecture — Route Server, Config Servers, and Shard Servers]

  • The third is load balancing. When MongoDB discovers that the data distribution between shards is uneven, it will start the Balancer process to redistribute the data, making the data on different Shard Servers as balanced as possible. When our Shard Server’s storage space is insufficient and needs to be expanded, data will be automatically moved to a new Shard Server, reducing the cost of data migration and verification.
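The routing step in the sharding design above can be sketched as follows. This is an illustrative simplification: hash-based placement is one of MongoDB’s shard-key strategies, but the `SHARDS` list, the modulo placement, and the `route` function here are made-up stand-ins for the real metadata held by the Config Servers:

```python
import hashlib

SHARDS = ["shard-a", "shard-b", "shard-c"]   # the mongod processes holding data

def route(shard_key: str) -> str:
    """Pick the shard that owns a key via a stable hash (toy Route Server)."""
    digest = int(hashlib.md5(shard_key.encode()).hexdigest(), 16)
    return SHARDS[digest % len(SHARDS)]

# Every router computes the same placement, so any Route Server can serve
# any request without storing the data itself.
for comment_id in ["c-1001", "c-1002", "c-1003"]:
    print(comment_id, "->", route(comment_id))
```

Note that simple modulo placement makes adding a shard expensive (most keys move), which is one reason real systems use range- or chunk-based metadata plus a balancer instead.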

As you can see, the scalability features built into NoSQL databases spare you the pain of manual sharding and master-slave separation, which makes them a good supplement to traditional databases here as well.

You may think that NoSQL databases have matured enough to replace relational databases. However, currently, NoSQL can only exist as a supplement to traditional relational databases, filling the gaps in performance, scalability, and certain scenarios. Therefore, when using or choosing, you should flexibly apply them according to your own scenarios.

Summary of the Course #

In this class, I introduced you to the advantages of NoSQL databases in terms of performance and scalability, as well as some of their special features. The main points are as follows:

  1. In terms of performance, NoSQL databases use certain algorithms to transform random writes to disk into sequential writes, improving write performance.

  2. In certain scenarios, such as full-text search functionality, relational databases cannot efficiently support them and require the support of NoSQL databases.

  3. In terms of scalability, NoSQL databases inherently support distributed computing and have features such as data redundancy and data sharding.

All of these make them a good supplement to traditional relational databases. What you need to know is that there are many types of NoSQL databases to choose from, and each component has its own characteristics. When making a selection, you need to have a deep understanding of its implementation principles and be familiar with it from an operational perspective. This way, you can promptly find solutions when problems arise. Otherwise, blindly jumping onto a new NoSQL database may eventually lead to unsolvable issues and become a burden on the overall system.

In a previous project, I used Elasticsearch as a persistent storage to support the community’s feed stream feature. It was great during the initial development phase, as you could efficiently query any field in the feed and iterate on business functionality quickly. However, as traffic increased later on, due to a lack of mature operational capabilities with Elasticsearch, frequent failures occurred. Especially during peak periods, there would be issues with unavailable nodes. And because of the huge business pressure, it was impossible to allocate manpower and resources to deeply learn and understand Elasticsearch. Eventually, we had to make major modifications and switch back to familiar MySQL. So, when using open-source components, it’s not enough to just stay at the “hello world” stage; you should have sufficient operational control over them.