22 Development and Limitations - Exploring Traditional Databases in the Distributed Domain #

Starting from this lecture, we enter the practical (extension) module, which aims to deepen your understanding of modern distributed databases by applying the theoretical knowledge we have learned to practical cases.

This module is organized as follows.

  1. Traditional Distributed Databases: Traditional databases like Oracle, MySQL, and PostgreSQL have never stopped exploring the distributed domain. I will introduce how distributed technologies empower traditional databases, and where their limitations lie.
  2. Database Middleware: Although middleware is not strictly within the scope of databases, it plays an irreplaceable role as the entry point through which many users first encounter distributed databases. I will introduce the functions of database middleware, especially its way of handling transactions, which differs from the distributed transactions introduced in Module 3.
  3. Modern Distributed Databases: This part mainly introduces the NewSQL and DistributedSQL databases available today, focusing on their practices in areas such as cross-region active-active deployment and disaster recovery.
  4. Other Types of Databases and Database Selection: I will introduce other types of distributed databases to broaden your horizons. Finally, drawing on scenarios from industries such as finance, telecommunications, and e-commerce, I will explain how these industries choose distributed databases.

That is the overall structure of this module. Now, let’s move on to the first question and see how traditional databases can be transformed into distributed databases.

Distributed Transformation of Traditional Databases #

As I mentioned in Module 1, business application systems can be divided into two categories by transaction type: OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing) scenarios. OLTP is a transaction-oriented style of processing: although each transaction touches only a small amount of data, the result must be returned within a short time. Typical scenarios include shopping and fund transfers. OLAP scenarios, on the other hand, usually involve computation over large datasets, such as generating various reports.

OLTP and OLAP scenarios differ significantly. Although traditional databases initially served both, as they transitioned toward distributed architectures, especially via sharding, OLAP support was gradually abandoned and the focus was placed on OLTP.

OLTP scenarios usually have three characteristics:

  • More writes than reads: read operations are generally simple and rarely involve summary calculations over large datasets.
  • Low latency: users have little tolerance for delay; responses are expected within roughly 500 milliseconds, or at most a few seconds, and a delay exceeding 5 seconds is generally unacceptable.
  • High concurrency: concurrency grows with business volume and has no theoretical ceiling.

Traditional databases such as MySQL and Oracle are designed precisely for OLTP scenarios, yet they are generally not considered distributed databases. Why is that? Because these databases traditionally run on a single node, whereas the distributed databases we refer to are multi-node.

Traditional relational databases operate in single-machine mode: the main workload runs on one machine, so the database’s concurrent processing capacity is linearly related to that machine’s resources, and its upper limit is bounded by the machine’s maximum configuration. Scaling performance by upgrading a single machine’s configuration is referred to as vertical scaling (scale-up). As mentioned before, vertical scaling is limited, because the maximum configuration of a physical machine improves relatively slowly. This means that, over any given period, databases relying on vertical scaling will always hit a performance ceiling.

Can the single-node mode of traditional databases be transformed into a distributed mode? The answer is yes. These databases can keep their relational characteristics while adding machines, achieving a level of concurrency far beyond that of a monolithic database through horizontal scaling (scale-out), that is, sharding. This level of concurrency is almost unaffected by the performance limits of any single machine and is referred to as “high concurrency”. The term has no precise number attached to it, but as a rule of thumb, “high concurrency” should be at least 10,000 TPS (transactions per second).
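To make the idea of sharding concrete, here is a minimal sketch, in Java, of how a routing layer might map a shard key onto several monolithic databases. The modulo-hash strategy and the JDBC URLs are illustrative assumptions, not any particular product’s implementation.

```java
import java.util.List;

// A minimal illustration of hash-based shard routing: the shard key
// (e.g., a user ID) is mapped by modulo onto one of N database nodes.
// The URLs below are placeholders, not a real deployment.
public class ShardRouter {
    private final List<String> shardUrls;

    public ShardRouter(List<String> shardUrls) {
        this.shardUrls = shardUrls;
    }

    // Route a shard key to the JDBC URL of the node that owns it.
    public String route(long shardKey) {
        int index = (int) Math.floorMod(shardKey, (long) shardUrls.size());
        return shardUrls.get(index);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(List.of(
                "jdbc:mysql://db0.example.com:3306/app",
                "jdbc:mysql://db1.example.com:3306/app",
                "jdbc:mysql://db2.example.com:3306/app"));
        // User 42 always lands on the same shard, so single-key
        // transactions stay local to one monolithic database.
        System.out.println(router.route(42L));
    }
}
```

Real products use more robust strategies (consistent hashing, range or list sharding) and support resharding, but the core idea of key-to-node routing is the same.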

In addition to sharding, reliable replication techniques must be introduced to improve overall system availability, which is particularly important in financial-grade disaster recovery scenarios. These are the concepts I emphasized in Module 1: sharding and synchronization are the core of distributed databases.

Now that we have covered the basic principles of how traditional databases can be transformed into distributed databases, let’s take a look at how they are actually implemented.

Commercial Products #

As I mentioned in “Introduction: What is a Distributed Database? Its Past, Present, and Future”, commercial databases like Oracle achieve data distribution through distributed storage. In fact, these databases have never stopped exploring the distributed domain. Let me now introduce Oracle Sharding.

Starting from version 12.2, Oracle Database introduced the Sharding feature, combining the advantages of NoSQL with those of a mature relational database. Over multiple version iterations, it has become a mature distributed relational database solution. Oracle Sharding lets users distribute and replicate data across a pool of Oracle databases that only require network connectivity and share neither software nor hardware. Oracle Sharding provides linear scalability and full fault tolerance for applications.

Oracle Sharding mainly includes the following components.

  • Sharded database (SDB): the SDB is a single logical database that physically consists of multiple independent databases. It is similar to a database pool containing multiple databases (shards).
  • Shards: each physically independent database in the SDB is called a shard, and each shard resides on a different server. Every shard stores a different subset of the rows of the sharded tables, while all shards share the same columns; in other words, the data is partitioned by row.
  • Shard catalog: an Oracle database that centrally stores and manages the SDB’s configuration information; it is the core of the SDB. Any configuration change, such as adding or deleting a shard, is recorded in the shard catalog. When an application queries data from multiple shards, the shard catalog coordinates and distributes the requests. The shard catalog must be highly available (HA), because the data it holds is critical: if it is lost, the entire database becomes unavailable.
  • Shard directors: Global Data Services (GDS) provides centralized deployment and management for Sharding, and its core component, the GSM (Global Service Manager), acts as the shard director. The GSM functions like a listener, routing client requests to the corresponding shards and load-balancing client access. A sketch of direct client routing follows this list.
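To illustrate how a client interacts with the shard director, here is a hedged sketch using the standard JDBC 4.3 sharding-key API (java.sql.ShardingKey), which Oracle’s JDBC driver implements for direct routing. The DataSource configuration and the customer-ID key are assumptions for illustration, not a complete setup.

```java
import java.sql.Connection;
import java.sql.JDBCType;
import java.sql.SQLException;
import java.sql.ShardingKey;
import javax.sql.DataSource;

public class DirectShardAccess {
    // Sketch of direct routing via the JDBC 4.3 sharding-key API.
    // 'ds' is assumed to be a driver-specific DataSource already
    // configured to point at the shard director / GDS endpoint.
    public static Connection connectToShard(DataSource ds, String customerId)
            throws SQLException {
        ShardingKey key = ds.createShardingKeyBuilder()
                .subkey(customerId, JDBCType.VARCHAR)
                .build();
        // The driver uses the key to open a connection directly to
        // the shard that owns this customer's rows, instead of
        // consulting the shard catalog on every request.
        return ds.createConnectionBuilder()
                .shardingKey(key)
                .build();
    }
}
```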

The advantages of Oracle Sharding are as follows.

  • Linear scalability: Each shard is an independent database, and performance can be linearly scaled by adding new shard nodes. Data is automatically rebalanced.

  • Failure isolation: As Sharding adopts a shared-nothing architecture, each shard uses independent hardware. Therefore, if a shard node fails, it only affects the data stored in that shard, without impacting other shards.

  • Data distribution based on geographical location: data can be stored in different shards according to geographic location.

In addition to the advantages mentioned above, there are also obvious disadvantages.

  • Complex user design: unlike the traditional RAC mode, Sharding requires users to design their tables strictly in order to leverage the scalability and availability advantages of this mode. For migrating legacy systems, this often means modifying existing code.

  • Low performance for cross-shard operations: cross-shard transactions and aggregate queries perform relatively poorly, generally about 10% worse than single-shard operations.

The final disadvantage is the long-standing problem of commercial databases: low cost-effectiveness. I will elaborate on this later.

So the commercial solution looks good, but if you prefer open source, let’s look at how traditional open-source databases address the same problem.

Customization in Open-Source Databases #

For a monolithic open-source database to evolve into a distributed database, the problem of insufficient write performance needs to be addressed.

The simplest and most direct way is sharding. A sharding solution places proxy nodes in front of multiple monolithic databases, essentially adding SQL routing: the proxy node first parses a client request and then forwards it to the appropriate monolithic database based on the data distribution. Proxy nodes come in two modes: “client component + monolithic database” and “middleware + monolithic database”.

In the “client component + monolithic database” mode, an independent logical layer establishes the sharding and routing rules, providing basic management of the monolithic databases and allowing applications to connect to multiple monolithic databases for greater concurrency and storage capacity. Because the component is part of the application system, it is relatively intrusive to the business. A typical product in this category is the JDBC client mode of Apache ShardingSphere; the figure below shows its architecture.

[Figure: Architecture of Apache ShardingSphere’s JDBC client mode]
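As a usage illustration, the sketch below assumes ShardingSphere-JDBC 5.x’s driver mode, where the sharding rules live in a YAML file on the classpath (here hypothetically named sharding.yaml); the exact URL syntax varies by version, and the t_order table is an assumption.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ShardingSphereJdbcDemo {
    public static void main(String[] args) throws Exception {
        // ShardingSphere-JDBC runs inside the application process:
        // the "connection" below is a logical one, and the rules in
        // sharding.yaml (assumed to exist on the classpath) decide
        // which physical database each statement actually reaches.
        String url = "jdbc:shardingsphere:classpath:sharding.yaml";
        try (Connection conn = DriverManager.getConnection(url);
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT order_id, amount FROM t_order WHERE user_id = ?")) {
            ps.setLong(1, 42L);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong(1) + " " + rs.getBigDecimal(2));
                }
            }
        }
    }
}
```

Note that the application code is plain JDBC; the routing logic runs inside the application process, which is exactly why this is called the client mode.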

In the “middleware + monolithic database” mode, data and routing rules are managed by standalone middleware running as a separate process, isolated from both the application layer and the monolithic databases, which reduces the impact on applications. As such middleware matures, it also gains some distributed transaction processing capability. Typical products here are MyCat and the Proxy mode of Apache ShardingSphere.

[Figure: Architecture of Apache ShardingSphere’s Proxy mode]
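By contrast, connecting to the Proxy mode looks like connecting to an ordinary MySQL server, since the proxy speaks the MySQL wire protocol; the host, port (3307 is a common ShardingSphere-Proxy default), schema name, and credentials below are illustrative assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ProxyModeDemo {
    public static void main(String[] args) throws Exception {
        // The middleware runs as its own process, so any MySQL client
        // or driver can connect to it unchanged; the application is
        // unaware that the data is sharded behind the proxy.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://proxy.example.com:3307/sharding_db",
                     "root", "password");
             Statement stmt = conn.createStatement()) {
            // The proxy parses the SQL, routes it to the right
            // backend database(s), and merges the results.
            stmt.executeUpdate(
                "INSERT INTO t_order (user_id, amount) VALUES (42, 9.9)");
        }
    }
}
```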

Proxy nodes need to implement three main functionalities: client access, a simple query processor, and process management (access control). The sharding solution also has another important function, shard metadata management, which tracks how the data is distributed. Because shard metadata itself faces the problem of consistency across multiple replicas, it is handled separately in most cases. Obviously, if every transaction were limited to a single database, business scenarios would be greatly restricted.

Therefore, cross-database transactions become an essential feature, but the monolithic databases themselves are unaware of each other, so a distributed transaction component must be added to the proxy nodes (a toy sketch follows below). Likewise, simple sharding cannot satisfy global queries: each data node sees only part of the data, so certain operations, such as sorting and multi-table joins, cannot be performed locally. The proxy nodes therefore need stronger query computation capabilities that span multiple monolithic databases. I will discuss this further in the next lecture.
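To see what the added transaction component must do, here is a toy two-phase commit sketch; the Participant interface is an assumption for illustration, and real proxies drive XA or a comparable protocol against the underlying databases, with logging and failure recovery that this sketch omits.

```java
import java.util.List;

// Toy two-phase commit coordinator: it only illustrates the control
// flow that a proxy's transaction component must implement.
public class TwoPhaseCommit {
    public interface Participant {
        boolean prepare();   // phase 1: participant votes yes/no
        void commit();       // phase 2a: apply the transaction
        void rollback();     // phase 2b: undo the transaction
    }

    public static boolean run(List<Participant> participants) {
        // Phase 1: every shard must vote yes before anyone commits.
        for (Participant p : participants) {
            if (!p.prepare()) {
                participants.forEach(Participant::rollback);
                return false;
            }
        }
        // Phase 2: all voted yes, so the global decision is commit.
        participants.forEach(Participant::commit);
        return true;
    }
}
```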

At this point, the remaining important step toward a distributed database is logical clocks. We already explained their significance in the distributed systems module: they are a necessary condition for achieving data consistency. With this final piece, the functionality that differentiates these distributed databases from monolithic ones is complete: sharding, distributed transactions, cross-node queries, and logical clocks.
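As a refresher from the distributed systems module, here is a minimal Lamport logical clock sketch; it is a simplification, and products in this space often use more elaborate schemes such as hybrid logical clocks or a centralized timestamp service.

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal Lamport logical clock: events get monotonically increasing
// timestamps, and receiving a message pulls the local clock forward,
// which preserves causal ordering across nodes.
public class LamportClock {
    private final AtomicLong time = new AtomicLong(0);

    // Called for every local event, including sending a message.
    public long tick() {
        return time.incrementAndGet();
    }

    // Called when a message carrying the sender's timestamp arrives.
    public long onReceive(long remoteTime) {
        return time.updateAndGet(local -> Math.max(local, remoteTime) + 1);
    }

    public long current() {
        return time.get();
    }
}
```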

Generally, these databases are built on MySQL or PostgreSQL. MySQL-based solutions include TDSQL, Vitess, and ShardingSphere with JDTX. PGXC (PostgreSQL-XC) originally refers to an open-source distributed database based on PostgreSQL; thanks to PostgreSQL’s permissive open-source license, many vendors have developed their own products on top of PGXC. Since these modifications have not changed the main architectural style, I refer to such products as PGXC-style, including TBase, GaussDB, AntDB, and others.

So far, we have covered the open-source attempts of traditional databases in the distributed domain. However, these solutions have certain limitations; let’s see what they are.

Limitations #

The current exploration of traditional databases in the distributed domain can be summarized as “commercial success and open-source cooperation”, with each opening up its own territory. However, their long accumulation of technology brings not only rich functionality but also limitations that cannot be overcome.

  1. Cost-effectiveness: Commercial solutions like Oracle Sharding are functionally comprehensive and cover a wide range of scenarios, which is highly attractive to existing Oracle users. However, the benefits do not match the cost: support for cross-shard transactions is limited and cross-shard query performance is low, so the missing key functionality is out of proportion to the high price. As a result, commercial sharding solutions have never become mainstream.
  2. Transactions: Because existing storage nodes must be reused, most transaction solutions in traditional databases are atomic commitment protocols such as the two-phase commit introduced in Module 3. As those who have studied distributed transactions know, the traditional two-phase approach has significant limitations in performance and scale, and new transaction models are needed to break through them. But the underlying mechanisms of traditional databases are fixed, making it difficult to do better in this area.
  3. OLAP: Traditional databases supported OLAP well before being sharded, but sharding makes it increasingly difficult. With the rise of big data technologies, traditional databases have actively ceded this field. The new generation of HTAP architectures is dominated by NewSQL and cloud-native databases, which traditional distributed databases cannot match.

In summary, the limitations of traditional databases in the distributed domain boil down to this: their underlying storage engines limit the expansion of upper-layer distributed functionality. Only databases that use innovative storage engines, like those in the NewSQL category, can build modern distributed databases whose functionality and performance match. That said, thanks to years of development, traditional databases retain an unshakable advantage in stability and maintainability, and they still attract monolithic-database users despite these limitations.

Conclusion #

In this lecture, we introduced attempts to transform traditional monolithic databases into distributed databases. They generally undergo transformations such as sharding, replication, distributed transactions, and logical clocks, producing distributed databases that use monolithic databases as data nodes.

At the same time, we discussed the limitations of these databases. To truly advance distributed databases, it is necessary to build them from the ground up, as databases in the NewSQL category do.