01 Introduction - What Is a Distributed Database, and Its Past and Present #

Hello, welcome to the course on distributed databases. Let’s officially begin.

Before starting this course, I briefly discussed its outline with colleagues and friends. They all showed great interest and, almost in unison, asked the same question: what is a distributed database? Some "enthusiastic learners" even wanted to show off their "diligence and curiosity" by adding: "Which big company makes this product?"

Well, my friends, you really tickled my funny bone. But once the laughter faded, I couldn't help but wonder: why is awareness of distributed databases so low among the public, and even within the profession?

I can roughly attribute this to two things: the nature of database products, and the commercial atmosphere that surrounds them.

Firstly, database products are highly abstract. Most users interact with a database purely from a usage perspective: they know what functionality it provides, but do not care, or find it hard to care, about its internal principles. Some distributed databases take this abstraction even further, so that using them feels no different from using a traditional single-machine database, and may even feel easier.

Secondly, databases have always been steeped in a commercial atmosphere. Highly abstract and strategically important, they naturally become territory that capital pursues. The selling point of commercial products lies in the support and services they provide, and for many commercial databases this is the most profitable part of the business. As a result, these products deliberately or inadvertently hide technical details from end users, and behind this layer of commercial assurance, users have little incentive to explore the internals on their own.

This leads to a situation where, even if you encounter a distributed database at work, you may not realize how it differs from a traditional database. But "better late than never": when frequent technical issues arise from a lack of understanding of the underlying principles, users come to recognize that, similar as the two may seem, they are fundamentally different.

As distributed databases penetrate more and more fields, users can no longer choose database products blindly by their feature lists. The new capabilities brought by new architectures demand users' deep involvement and careful evaluation of a database's technical characteristics, and sometimes even a redesign of their own products to integrate with the database better.

Therefore, I have designed this column as a key to unlock the door to distributed databases. You can also think of this course as the starter quest in an online game: upon completing it you receive your initial equipment (principles and methodologies) and master the knowledge needed to delve deeper into this field.

As a faithful believer that "history shapes theory", I will introduce distributed databases in this lecture by tracing their developmental arc. I believe that after reading it, you will have your own answer to the opening question. So let's start with the basic concepts.

Basic Concepts #

Distributed database, as the name suggests, breaks down into two parts: distributed + database. In short, it is a database composed of multiple independent entities interconnected through a network.

The best way to grasp a new concept is through knowledge we already possess. The table below compares distributed databases with the centralized databases we are familiar with, across their five main differences.

[Table: the five main differences between centralized and distributed databases]

From the table, we can distill the two core traits of a distributed database: data partitioning and data synchronization.

1. Data Partitioning #

This feature is the technological innovation of distributed databases: it breaks through the capacity limits of centralized databases and spreads data across multiple nodes, handling it more flexibly and efficiently. It is a gift that distributed theory brings to databases.

There are two partitioning methods:

  • Horizontal partitioning (sharding): data is split by row into groups that are spread across different nodes (see the sketch below).
  • Vertical partitioning: data is split by column, dividing a table's schema into several smaller schemas.
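To make the horizontal case concrete, here is a minimal sketch of hash-based sharding in Python. The shard count, the `user_id` partition key, and the modulo routing rule are all illustrative assumptions, not the scheme of any particular product:

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster of four nodes

def shard_for(key: str) -> int:
    """Deterministically map a row's partition key to a shard."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Rows with the same user_id always land on the same node.
rows = [{"user_id": "u1001", "amount": 25}, {"user_id": "u2002", "amount": 40}]
shards = {i: [] for i in range(NUM_SHARDS)}
for row in rows:
    shards[shard_for(row["user_id"])].append(row)

print(shards)
```

Production systems usually prefer range partitioning or consistent hashing over plain modulo, so that adding a node does not force most rows to move.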

2. Data Synchronization #

This feature is the baseline requirement of a distributed database. Traditional database theory is built on the single-node database; once distribution is introduced, its consistency guarantee is broken, so data synchronization technology is needed to restore consistency.

In short, the goal of data synchronization is to make a distributed database behave like a "normal database"; the driving force behind it is our pursuit of data consistency. The two concepts complement and reinforce each other.
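As one classic illustration of how synchronization restores consistency, here is a minimal sketch of quorum replication. The N/W/R values and the in-memory "replicas" are illustrative assumptions; real systems (Dynamo-style stores, for example) layer versioning, failure handling, and repair on top:

```python
import random

N, W, R = 3, 2, 2  # replica count, write quorum, read quorum
assert W + R > N   # overlap condition: every read meets a fresh copy

replicas = [{"value": None, "version": 0} for _ in range(N)]

def write(value: str, version: int) -> None:
    # A write succeeds once W replicas acknowledge it;
    # the remaining replicas can be synchronized later.
    for rep in random.sample(replicas, W):
        rep.update(value=value, version=version)

def read() -> dict:
    # Query any R replicas and trust the newest version seen.
    # Because W + R > N, at least one queried replica is up to date.
    return max(random.sample(replicas, R), key=lambda rep: rep["version"])

write("hello", version=1)
print(read())  # always reflects version 1, even if one replica is stale
```

Tuning W and R trades write latency against read freshness, which is exactly the pursuit of consistency described above.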

Of course, distributed databases have other characteristics, but these two points are enough for our purpose of understanding them. Below, I will trace the development of distributed databases through the history of technology from the angle of these two characteristics. I will divide that history using relatively recent periods such as the Internet and cloud computing, because our focus is on the present and the future.

Commercial Databases #

Let's start with the databases that predate the Internet era, and especially the big data era. When it comes to the distributed databases of that time, there is no getting around Oracle RAC.

[Figure: Oracle RAC architecture]

Oracle RAC is the archetypal large-scale commercial solution, an integrated package of software and hardware. When I joined a top-tier telecom solutions company early in my career, I was struck by its powerful performance, and just as struck by its steep price. It was the benchmark, and the ceiling, of database performance in that era, the embodiment of a polished solution and commercial success.

Let's briefly analyze RAC with the two characteristics above: it does achieve data sharding and synchronization. Each layer of its stack is distributed; at the bottom in particular, ASM mirrored storage technology makes the whole cluster appear as one large, complete disk.

The advantage of this design is the ultimate user experience: there is no obvious difference between using a single-instance database and a RAC cluster. Its distributed storage layer provides complete disk semantics, keeping the distribution transparent to applications and striking a balance between scalability and other aspects of performance. Up to a certain data scale, it is even quite economical.

This design is known as the "shared-disk architecture". It is the key to RAC's strength, but also its "Achilles' heel": the maximum cluster size of 8 nodes that circulates among DBAs can be taken as RAC's practical scale limit.

That scale was entirely sufficient in the environment of the time, but with the rise of the Internet, a sweeping "movement" would pierce Oracle RAC's seemingly invincible armor.

Big Data #

We know that commercial databases such as Oracle and DB2 fused OLTP and OLAP in a single engine, and it was the OLAP side that first sought a breakthrough on the distributed path. In the early 2000s, big data technologies represented by Hadoop, built on a shared-nothing architecture, launched an assault on the relational databases represented by Oracle.

[Figure: the shared-nothing big data stack challenging relational databases]

This was a collision between several pairs of ideas: horizontal versus vertical scaling, commodity hardware versus dedicated, expensive systems, open source versus commercial. It marks the beginning of truly distributed databases.

Of course, strictly speaking, platforms like Hadoop should not be called databases. But judged by the two characteristics we summarized earlier, they fit very well, so we can classify them as distributed databases for early commercial analytics scenarios. From then on, OLAP databases began to evolve independently.

In addition to Hadoop, another category of database, MPP (Massively Parallel Processing), also developed rapidly during this period. A typical MPP architecture is shown below.

[Figure: MPP database architecture]

We can see that this type of database closely resembles Hadoop, the mainstay of big data, at the architectural level, though the concept behind it differs. In short, MPP is an evolution beyond hardware architectures such as SMP (Symmetric Multiprocessing) and NUMA (Non-Uniform Memory Access): it adopts a shared-nothing design and connects multiple SMP nodes over a network so that they collaborate.

MPP databases are characterized by support for PB-scale data and a wide variety of SQL analytical queries. The field has long been a battleground for commercial products, from independent vendors such as Teradata to big players' offerings like HP's Vertica and EMC's Greenplum.
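To see how shared-nothing nodes collaborate on one query, here is a minimal sketch of the scatter-gather pattern behind an MPP aggregate. The node count, the data, and the single-process "nodes" are illustrative assumptions; a real engine distributes the same two phases across a network with a full query planner:

```python
from collections import Counter

# Each "node" owns its own horizontal slice of the table (shared nothing).
node_data = [
    [("beijing", 10), ("shanghai", 5)],   # node 1's (city, amount) rows
    [("beijing", 7), ("shenzhen", 3)],    # node 2's rows
    [("shanghai", 4), ("shenzhen", 9)],   # node 3's rows
]

def local_partial_sum(rows):
    """Phase 1 (scatter): each node aggregates only its own slice."""
    partial = Counter()
    for city, amount in rows:
        partial[city] += amount
    return partial

def merge(partials):
    """Phase 2 (gather): the coordinator merges the partial results."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Equivalent of: SELECT city, SUM(amount) FROM orders GROUP BY city;
partials = [local_partial_sum(rows) for rows in node_data]
print(merge(partials))  # beijing=17, shanghai=9, shenzhen=12
```

Because each node touches only local data until the final merge, adding nodes scales the scan and aggregation work almost linearly, which is the core appeal of the shared-nothing design.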

The development of big data technology split OLAP analytical databases off from traditional relational databases, forming a complete branch of their own. And as the Internet wave rolled on, the OLTP field met its own opportunity for growth.

Internetization #

The first major event in the domestic database field upon entering the Internet era was the "de-IOE" campaign: removing IBM minicomputers, Oracle databases, and EMC storage.

[Figure: the "de-IOE" campaign]

Among these, the impact of "de-Oracle" was particularly far-reaching. The slogan Alibaba shouted ten years ago deeply influenced the domestic database field. I won't dwell on the details here or judge its positive and negative effects, but in terms of its impact on distributed databases, it brought at least two shifts in mindset.

  1. Applications become the core: after "de-O", open-source databases had to be paired with database middleware (proxies). But this combination cannot deliver some key features of traditional commercial databases, such as rich SQL support and ACID transactions, so application software must be carefully designed to fit the new platform. Application architecture became critical, and the whole technical stack began to shift from the somewhat derided "database-centric" programming toward being application-centric.
  2. Popularization of eventual consistency: although strong consistency is still in high demand, people gradually accepted that in specific scenarios eventual consistency can be used to solve throughput problems. This brought a further benefit: frontline developers and designers began to think seriously about what kind of consistency the business actually needs, rather than simply leaning on whatever the database provides (see the sketch after this list).
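As a minimal illustration of that trade-off, here is a sketch of asynchronous replication, the mechanism behind most eventually consistent reads. The single replica and the fixed lag are illustrative assumptions; the point is that the business must decide whether a briefly stale read is acceptable in exchange for write throughput:

```python
import threading
import time

# One primary, one replica; writes are acknowledged before they reach
# the replica, so reads from the replica may briefly be stale.
primary = {"balance": 100}
replica = {"balance": 100}

def async_replicate(key, value, lag_s=0.05):
    """Ship the write to the replica after a delay (replication lag)."""
    def apply():
        time.sleep(lag_s)
        replica[key] = value
    threading.Thread(target=apply).start()

primary["balance"] = 80        # write commits on the primary immediately
async_replicate("balance", 80)

print(replica["balance"])      # may still read 100: stale but tolerable
time.sleep(0.1)
print(replica["balance"])      # 80: the replicas converge, eventually
```

A page-view counter tolerates the stale read; an account balance usually does not, which is exactly why the choice of consistency level has to come from the business.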

Both mindsets emerged only after the superstition surrounding Oracle was broken, and both are positive in themselves; without this movement, though, they would have struggled to spread among ordinary users. Both have also had a positive influence on the later development of distributed databases, especially domestic ones.

At the same time, a global NoSQL wave emerged, which, together with the domestic "de-IOE" movement, pushed databases toward horizontal distribution. NoSQL will be introduced in detail in the next lecture.

Much like the big data technologies discussed in the previous section, the "de-IOE" movement, as the Internet developed, split OLTP databases off from traditional relational databases. Notably, this split did not build a complete database from scratch; it assembled existing open-source databases and mature distributed technologies into a kind of "quasi" database. Being application-specific, it cut away some features of traditional OLTP databases, even critical ones such as subqueries and ACID transactions.

The focus of NoSQL databases is support for unstructured data such as Internet indexes, GIS geographic data, and spatio-temporal data. Traditionally, such data was stored in relational databases, but forcing it into a relational structure is not only complex to design but also inefficient to use. NoSQL databases are therefore seen as a complement to the database field as a whole, and they made people realize that a database need not support only one data model.

As distributed databases developed, a distributed OLTP database redesigned from the ground up became ever more important at a fundamental level, and cloud computing injected new vitality into this category. The combination of the two would produce a wonderful chemical reaction for distributed databases.

Cloud-Native is the Future #

From the discussion so far, we can see that the most widely recognized kind of distributed database, the transactional OLTP kind, was still a missing piece of the distributed database landscape, and an important one. What characteristics should a true distributed OLTP database possess?

In fact, what people want is a database that retains the characteristics of a standalone relational database while adding the distributed characteristics of sharding and synchronization. Distributed SQL and NewSQL databases were designed for exactly this purpose. They share at least the following two eye-catching features:

  1. Complete SQL support
  2. Reliable distributed transactions

Typical representatives include Spanner, NuoDB, TiDB, and OceanBase. This course will focus on the key features of Distributed SQL, which form the foundation of modern distributed databases. I won't go into much detail here, as we will cover these features in depth in Lesson 02 | SQL vs NoSQL: Understanding Various "SQL" Together.

At the same time, with the vertical development of cloud computing, distributed databases are encountering a new wave of revolution - cloud-native databases.

Firstly, because cloud services are inherently "oversold", procurement costs are lower, making it easier for end users to try distributed databases.

Secondly, the support staff from cloud service providers can cooperate deeply with users, forming an efficient feedback mechanism. This feedback mechanism enables rapid iteration of cloud-native distributed databases, allowing them to actively respond to customer needs.

This is the change that cloud-native brings to distributed databases: it overtakes traditional commercial databases through the strength of its ecosystem. The DB-Engines analysis below suggests that the future database market belongs to distributed and cloud-native databases.

[Figure: DB-Engines database market trend analysis]

With the development of distributed databases, we are experiencing a new integration: the merging of OLTP and OLAP into HTAP (Hybrid Transactional/Analytical Processing) databases.

This trend stems mainly from the growing maturity of cloud-native OLTP distributed databases, combined with industry-wide growth in demand, from both customers and vendors, for real-time analytics. Traditional big data technologies, open-source systems and MPP databases alike, emphasize offline analysis.

If millisecond-level real-time processing is required, transactional data and analytical data must be brought closer together, minimizing the non-real-time ELT (Extract, Load, Transform) pipelines between them. This is what drives the merging of OLTP and OLAP into HTAP. The diagram below shows the HTAP architecture of Alibaba Cloud PolarDB.

[Figure: Alibaba Cloud PolarDB HTAP architecture]

Summary #

As the famous Chinese classic "Romance of the Three Kingdoms" opens: "The empire, long divided, must unite; long united, must divide." We can observe that the development of distributed databases, and even of databases themselves, follows this pattern.

The development of distributed databases is a process of integration and separation, and then integration again:

  1. In the early stages, the distributed capabilities of commercial relational databases could meet the needs of most users, leading to the emergence of a few giant database products such as Oracle.
  2. In the OLAP field, breakthroughs were sought first, resulting in the evolution of big data technologies and MPP-type databases, providing more powerful data analysis capabilities.
  3. Middleware was introduced to break the lock-in of IOE (IBM minicomputers, Oracle databases, and EMC storage); combined with application platforms and open-source standalone databases, it formed a new generation of solutions that knocked commercial relational databases off their pedestal, with NoSQL databases further challenging the dominance of the relational model.
  4. A new generation of distributed OLTP databases achieved full support for the core features of distributed databases, marking their maturity, with distributed technology victorious in both the OLAP and OLTP domains.
  5. The introduction of HTAP and multi-model data processing has merged OLAP and OLTP once again, pushing distributed databases toward a position much like that held by traditional commercial relational databases decades ago, but with an even more profound impact.

We review history in order to better grasp the future. In this course, we will analyze modern distributed databases, key technologies of OLTP-type databases, use cases, and application examples in detail. This will enable you to better evaluate and use distributed databases in the future.

The history of distributed databases also reflects pragmatism: their evolution is the product of the interplay between demand and technology, not of meticulous top-down design. This course takes the same pragmatic stance, so that you can apply what you learn as you go.