06 Practical Design - Designing the Simplest Distributed Database #

This lesson is a review and practical extension class. After learning from the previous lessons, I believe you have gained a clear understanding of distributed databases. Today, we will summarize the learning outcomes of Module 1 and deepen our understanding through a practical case. I will also answer some typical questions raised by students in the previous lessons.

Key Summary of Distributed Databases #

Let’s now summarize the key knowledge of Module 1.

This module introduced what a distributed database is. Starting from the perspective of historical development, it discussed the distributed models of traditional databases, analytical distributed databases in the big-data era, database middleware that arose during the "de-IOE" movement, and open-source database models. It then discussed Distributed SQL and NewSQL, and finally introduced HTAP-integrated databases, which are considered the future trend of distributed databases.

Through the learning in Lesson 1, I believe you not only understand the historical development of distributed databases from integration to distribution and then back to integration, but more importantly, you know what a distributed database fundamentally is.

Broadly speaking, a distributed database refers to a database running on different hosts or containers, which is why we see a rich product list. However, because of the extensive product line, it is not possible for me to cover all the knowledge points comprehensively. Furthermore, since the term “database” can be narrowly understood as OLTP-type transactional databases, this course focuses on the technical systems of Distributed SQL and NewSQL, which are OLTP-like distributed databases. In the subsequent modules, I will focus on introducing the related knowledge involved in them. Consider this as a preview.

At the same time, this module also highlighted that sharding and synchronization are two important characteristics of distributed databases.

We also learned about the evolution of SQL and what NoSQL is. This part mainly “sets things straight” in terms of some historical concepts, demonstrating that NoSQL is actually a marketing concept. Then we introduced the features of NewSQL and Distributed SQL. As mentioned earlier, these are actually the focal points of this course.

As I mentioned, SQL is of great importance and gives a database a very broad audience. If a database wants to attract more users and achieve breakthroughs in influence or commercially, SQL can be considered an essential feature. Within the specialized field of distributed databases, however, SQL matters less than the two core characteristics of sharding and synchronization.

In the lesson on sharding, we first learned about the significance of sharding, which is the key feature that lets distributed databases increase data capacity. We learned about the main sharding algorithms, including range sharding and hash sharding, as well as some optimization methods. Finally, we used Apache ShardingSphere as an example to demonstrate the application of sharding algorithms intuitively, including the generation of distributed unique IDs and other related content.
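To make the two main sharding strategies concrete, here is a minimal sketch. The shard counts, key types, and range boundaries are illustrative assumptions for demonstration, not ShardingSphere's actual API:

```python
# Minimal sketch of the two main sharding strategies.
# Shard counts and range boundaries are illustrative assumptions.

def hash_shard(user_id: int, shard_count: int = 4) -> int:
    """Hash sharding: spreads keys evenly across shards,
    but resharding moves most keys."""
    return user_id % shard_count

def range_shard(order_month: str,
                boundaries=("2020-01", "2021-01")) -> int:
    """Range sharding: keeps adjacent keys on the same shard,
    which is friendly to range scans."""
    for i, upper in enumerate(boundaries):
        if order_month < upper:
            return i
    return len(boundaries)

print(hash_shard(12345))        # -> 1
print(range_shard("2020-06"))   # -> 1 (falls in [2020-01, 2021-01))
```

Note the trade-off this illustrates: hash sharding balances load well but makes resharding expensive, while range sharding supports range queries but risks hotspots.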

Data sharding is one of the two core contents of distributed databases, but the concept is relatively straightforward. It is not as difficult to learn as data synchronization.

We often encounter this problem: design a sharding scheme that minimizes data migration. In the context of distributed databases, this requirement is essentially meaningless, because automatic, flexible scaling of database nodes should be a built-in feature of such databases. Relying on clever sharding algorithms to avoid data migration may improve performance, but it remains an incomplete technical solution with inherent flaws.

In the last part of Module 1, we learned about data synchronization. Synchronization is a combination of replication and consistency; these two concepts work together to create the various forms of data synchronization in distributed databases. Replication is the prerequisite and necessary condition for synchronization: if a piece of data does not need to be replicated, there is no notion of consistency, and therefore no synchronization technology.

In the lesson on synchronization, the first thing we encountered was asynchronous replication, which is essentially replication without consistency and is the simplest form of replication. The other synchronization methods discussed later, such as semi-synchronous replication, all involve the concept of consistency. Beyond replication patterns, we also need to pay attention to technical details such as replication protocols and replication methods. We then used the development history of MySQL replication technology to summarize the characteristics of the various replication techniques and pointed out that consensus algorithms, as the core of strongly consistent replication, are the future direction of development.

Next, we introduced consistency, the most abstract part of Module 1. CAP theory and consistency models are tools for abstractly evaluating distributed databases. One benefit is that they let us assess a database's consistency quickly: if a database claims to be a linearizable CP database, we immediately have a good sense of its characteristics and even its implementation. Another benefit is that when designing a database, you can choose its consistency features based on the problems you need to solve.

First of all, it is important to note that in CAP theory, "C" refers to the strongest consistency model, linearizability. It is precisely because of this strong consistency that a system cannot satisfy all three CAP properties at the same time. It is also important to distinguish availability from high availability. Availability is an abstract evaluation concept: after a network partition, if each partition holds only one replica, a partition that still serves requests is available, but we cannot call it highly available. Finally, I mentioned that there are really only two kinds of databases, CP and AP, because partitions (P) are an objective reality that cannot be eliminated; there are no CA databases.

After discussing CAP theory, I introduced consistency models. They originate in shared-memory design, but the theory applies to distributed databases and to distributed systems in general. Note that the three consistency models introduced in that part are all strong consistency; they address the replication latency mentioned earlier, making the system appear consistent no matter which node you write to or query from. These three models concern data consistency; there is also client consistency, which I will explain in the later modules on distributed systems.

Finally, for a database, an important concept is the transaction. What is the relationship between transactions and consistency? In fact, the ACID properties of transactions are the guarantees a database provides for consistency. Among them, I, namely isolation, is the key characteristic of transactions, and isolation addresses the problems of parallel transactions. Consistency models, by contrast, study single-object, single-operation problems, i.e., problems that do not involve parallel transactions. Therefore, isolation plus consistency models together characterize distributed database transactions.

So far, we have summarized the main content of Module 1. After learning this knowledge, what other uses are there besides helping you evaluate distributed databases? Now let’s try to design a distributed database.

Why implement a distributed database? #

Distributed databases, especially NoSQL and NewSQL databases, are the main direction of database development today, and both categories offer a wide variety of products. Many are designed for specific scenarios, such as Elasticsearch (NoSQL) for search and Redis for caching. NewSQL databases also offer many options: Chinese companies such as Didi and ByteDance have built their own NewSQL databases around their business characteristics, not to mention giants like BAT (Baidu, Alibaba, Tencent) and Google, which all have their own NewSQL databases.

The driving force behind this comes from internal demand and external environment, which together have contributed to the current situation.

The internal demand: as a specific business emerges and its usage grows, there is an increasing need for an underlying database to solve its problems. A database can guarantee certain consistency properties at the storage layer, and the inherently service-oriented nature of distributed databases offers great convenience to users, which accelerates the growth of these businesses.

The external environment: the technologies used in distributed databases have gradually matured, and there are many open-source components to choose from. In the early days, one of the difficulties in building a database was that almost everything involved had to be built from the ground up, such as SQL parsing, distributed protocols, and storage engines. Now there are many open-source projects and rich technical routes to choose from, which greatly lowers the barrier to building a distributed database.

These two factors interact with each other, leading many organizations and technical teams to start building their own distributed databases.

Designing a Distributed Database Case #

Some of you may know that I am also a founding member of Apache SkyWalking, an open-source APM system. Its architecture diagram can be found on the official website, as shown below.

(Figure: Apache SkyWalking architecture diagram, from the official website)

You can see the Storage Option section in the diagram, which lists the choices at the database level. Except for the in-memory, single-machine H2, all other production-grade options are distributed databases. Supporting multiple storage options demonstrates SkyWalking's strong adaptability, but more importantly it shows that no database in the industry currently satisfies its use cases well.

So let’s try to design a database for it. Here, I have simplified the design process and only provided requirements analysis and conceptual design to demonstrate the design approach and help you better understand the key points of distributed databases.

Requirements Analysis #

Let’s first introduce the data processing characteristics of SkyWalking.

As an APM system, SkyWalking is write-heavy. Both the previously used HBase and the current main storage, Elasticsearch, are write-friendly. To ensure fast and consistent writes, the OAP node layer already shards the metrics, meaning the same metric is always calculated on the same node. In addition, the application adopts a batch-write mode, performing bulk writes every 10 seconds.
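The batch-write mode described above can be sketched as a simple buffer that flushes on a time interval. This is an illustrative sketch, not SkyWalking code; `BatchWriter`, `flush_fn`, and the buffer policy are assumptions for demonstration:

```python
import time

class BatchWriter:
    """Illustrative sketch of batched writes: records are buffered
    and flushed to storage as one bulk request every `interval`
    seconds, instead of issuing one request per record."""

    def __init__(self, flush_fn, interval: float = 10.0):
        self.flush_fn = flush_fn          # called with a list of records
        self.interval = interval
        self.buffer = []
        self.last_flush = time.monotonic()

    def write(self, record):
        self.buffer.append(record)
        # Flush when the interval has elapsed since the last flush.
        if time.monotonic() - self.last_flush >= self.interval:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)    # one bulk write, many records
            self.buffer = []
        self.last_flush = time.monotonic()
```

A real implementation would also flush on buffer size and on shutdown, but the core idea is the same: trade a little write latency for much higher throughput.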

SkyWalking can thus be seen as a write-many, read-few system: queries are rare and can tolerate some latency. In terms of availability, it allows sacrificing some availability for performance; for example, the recommended number of replicas in Elasticsearch is 0, meaning no data replication is performed at all.

If replication is enabled, the consistency requirements are relatively low, because for the bulk of the write workload, the data is almost never queried at write time. Some low-volume operations do require consistency, however, such as writing alerting results, which must be queryable immediately after being written.

Since the data structure of the query protocol is non-relational and there are not many types of queries, there is no need to support SQL statements.

The above analysis revolves around the core content of the first module and identifies the characteristics the database for SkyWalking should have. Now let’s design a distributed database for SkyWalking based on the key points mentioned in the requirements analysis.

Conceptual Design #

First of all, the OAP node has already done hash sharding, so we can pair the database nodes with the OAP nodes in a one-to-one or even many-to-one (secondary hashing) structure to ensure that one metric is only written to one database node, thereby avoiding the hassle of data migration. We can even deploy the database nodes together with the OAP nodes to minimize network latency and maximize resource utilization.
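The metric-to-node pairing described above can be sketched as a stable hash from metric name to database node. The node addresses and the use of MD5 are illustrative assumptions; the point is that the hash must be deterministic across processes, so Python's built-in `hash()` (randomized per run) would not work:

```python
import hashlib

# Hypothetical database node addresses paired with the OAP layer.
DB_NODES = ["db-node-0", "db-node-1", "db-node-2"]

def node_for_metric(metric_name: str) -> str:
    """Route a metric to a fixed node via a stable hash, so one
    metric is only ever written to one database node."""
    digest = hashlib.md5(metric_name.encode("utf-8")).hexdigest()
    return DB_NODES[int(digest, 16) % len(DB_NODES)]

# The same metric always routes to the same node:
assert node_for_metric("service_resp_time") == node_for_metric("service_resp_time")
```

With a many-to-one (secondary hashing) layout, the same function would simply map several OAP nodes' metrics onto a smaller node list.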

For elastic scaling, since SkyWalking can tolerate some unavailable data, we can directly add shard nodes without migrating data. If we want to ensure that old data can be queried, we can record the expansion time point; then old data can be queried on old nodes, and new data can be queried on new nodes. Since all data in SkyWalking has a lifecycle, once the old data on a node is deleted, in the case of downsizing, the node can also be safely removed.
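The "record the expansion time point" idea above can be sketched as follows: each query carries the data's timestamp, and the router picks the cluster size that was in effect when that data was written, so old data is looked up on the old node set and new data on the expanded set. The timestamps and node counts here are made-up illustrations:

```python
# Each entry records when a cluster size took effect (unix seconds).
EXPANSIONS = [
    (0,          3),   # original cluster: 3 nodes
    (1700000000, 5),   # scaled out to 5 nodes at this point in time
]

def node_count_at(ts: int) -> int:
    """Return the cluster size in effect when data with timestamp
    `ts` was written."""
    count = EXPANSIONS[0][1]
    for effective_from, n in EXPANSIONS:
        if ts >= effective_from:
            count = n
    return count

def route(metric_hash: int, ts: int) -> int:
    """Route a query to a shard using the historical cluster size,
    so no data migration is needed after scaling out."""
    return metric_hash % node_count_at(ts)
```

Because all SkyWalking data has a lifecycle, once every record older than an expansion point has expired, that entry (and, when downsizing, the drained node) can be safely dropped.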

Although SkyWalking does not require high availability, losing some data can still degrade the user experience. In particular, for metrics such as the average response time within a day, if a node fails and has no replicas, that metric will show significant deviation.

Once data replication is enabled, what consistency should be used? This problem needs to be considered from different perspectives. For a large amount of write metric data, weak consistency is sufficient. Because writes and reads are initiated by different endpoints, and writes can be considered as single-object single operations, weak consistency satisfies the requirements.

However, the same is not true for the alerting scenario. After an alert is generated, the relevant personnel expect to be able to query the data immediately; with weak consistency, the query might not yet see it. We do not need particularly strong consistency here; causal consistency is enough. The implementation approach is to pass the timestamp of the alert's data to the user. When the user queries, they send the timestamp to a database node; if that node does not yet have data up to that timestamp, it attempts to synchronize from other nodes.

Lastly, regarding the query interface: since SQL is not required, we can implement querying and writing with a simple RESTful API. To make writing efficient, however, we can design a separate write protocol in an efficient binary, long-connection style.
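The timestamp-based causal read for the alerting scenario can be sketched as follows. This is a simulation under stated assumptions: `ReplicaNode` and `sync_fn` are hypothetical names, and "synchronizing from other nodes" is modeled as a callback that pulls missing writes:

```python
import time

class ReplicaNode:
    """Sketch of a causal-consistency read: the alert carries its
    write timestamp, and a replica serves the read only after it has
    applied all writes up to that timestamp."""

    def __init__(self):
        self.applied_ts = 0   # highest timestamp applied so far
        self.data = {}

    def apply(self, ts, key, value):
        """Called by the replication stream to apply a write."""
        self.data[key] = value
        self.applied_ts = max(self.applied_ts, ts)

    def read(self, key, min_ts, sync_fn, timeout=5.0):
        """Serve the read once this replica has caught up to min_ts,
        pulling missing writes from peers via sync_fn if needed."""
        deadline = time.monotonic() + timeout
        while self.applied_ts < min_ts:
            if time.monotonic() > deadline:
                raise TimeoutError("replica could not catch up")
            sync_fn(self)     # request missing writes from other nodes
        return self.data.get(key)
```

The client-visible contract is exactly the causal guarantee described above: a read that carries the alert's timestamp never returns "not found" merely because replication lagged.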

Case Summary #

Above is the design of a distributed database for the SkyWalking system based on the knowledge learned in Module 1 and the characteristics of its requirements. In terms of design, I only emphasized key design points without going into detail. As for the underlying storage engine, I believe you will have your own answer after completing Module 2.

Through this case, we can see that designing a distributed database only requires combining the characteristics of sharding and synchronization to roughly outline the appearance of a distributed database. In your work and studies, you can try designing a distributed database yourself to solve data problems with certain similarities.

Q&A #

Since the start of the course, I have received positive feedback from everyone, and I am pleasantly surprised by some of the very professional questions. I would like to express my sincere gratitude for your love of the course, and your positive feedback is what motivates me to continue writing.

Here, I have summarized some common questions and will answer them for you.

First, someone raised the question of whether the full name of a concept should be provided when it first appears.

First of all, I apologize. Out of personal habit, I directly output abbreviated or alias concepts that I am familiar with. This is indeed not very friendly for students who are encountering this knowledge for the first time. In future writings, I will try to avoid this issue as much as possible.

The second more concentrated question is whether MySQL InnoDB Cluster is a distributed database.

In the article, I mentioned that the definition of "distributed" is very broad; by that definition, InnoDB Cluster is a distributed database. But judged by the two key characteristics we identified, it lacks sharding, so strictly speaking it is not a distributed database, let alone NewSQL. However, we can add sharding to it, for example with sharding middleware, to build a distributed database, i.e., a NewSQL-style database, on top of InnoDB Cluster.

Here, I want to emphasize that you should not get stuck in the trap of differentiating concepts. This is not an exam, and real life is more complex than exams; grasping the key characteristics is what allows you to adapt to change.

Okay, let’s end the Q&A here. Once again, thank you for your positive feedback, and I hope to see your wonderful comments after the next module.

Summary #

In this lecture, I first reviewed the main content of Module 1 to help you connect various parts and form a complete knowledge puzzle. Then, through a case study, I introduced how to use this knowledge to design a distributed database and apply the learned knowledge to practical work and study.