Addendum02 Database Selection What Kind of Distributed Database Should We Use

Addendum02 Database Selection - What Kind of Distributed Database Should We Use #

After learning the basic knowledge from the previous 24 lessons and the last additional meal, I have introduced you to various aspects of distributed databases. I believe that this knowledge will help you in three main areas: database development, architectural thinking improvement, and product selection. In particular, the knowledge related to NewSQL databases is very helpful in understanding transaction-oriented scenarios.

With over a decade of experience in the telecommunications and e-commerce industries, I believe you are eager to know the current application status of distributed databases in these mainstream scenarios. In this lesson, I will introduce the use of distributed databases in the banking, telecommunications, and e-commerce sectors.

Let’s start with the e-commerce industry, which is my area of expertise.

E-commerce: From Middleware to NewSQL #

The concept of distributed databases, especially distributed middleware, originated from e-commerce, particularly Alibaba Group. Alibaba was one of the first companies to delve into this field. Let’s take Cobar, developed by Alibaba’s B2B platform department, as an example.

In 2008, Alibaba’s B2B platform established a platform technology department to provide underlying infrastructure for various business units’ products. These platforms cover multiple areas including web frameworks, messaging, distributed services, and database middleware. Some of them were derived from the common frameworks and systems accumulated during the long-term development of various product lines, while others were derived from the discovery of new requirements in existing products and operation processes.

One of these platforms is the database-related platform, which mainly addresses the following three issues:

  1. Provide high-performance, high-capacity, and high-availability access to massive front-end data.
  2. Provide near-real-time guarantees for data changes.
  3. Efficient cross-data center data synchronization.

As shown in the following architecture diagram, the application layer accesses the database through Cobar.

image.png

Access to the database is divided into read operations (select) and write operations (update, insert, and delete). Write operations generate change records in the database, called binlogs in MySQL and redologs in Oracle. The Erosa product parses these change records and caches them in Eromanga in a unified format. The latter is responsible for managing the relationship between producers, Erosa, and consumers of change data. Otter, responsible for cross-data center database synchronization, is one of the consumers of these change data.

Cobar can be considered as a pioneer in OLTP distributed database solutions, and its ideas can still be seen in today’s middleware and even NewSQL databases. However, after serving in Alibaba Group for three years, it gradually stopped being maintained due to personnel changes. At this time, the MyCAT open-source community took over the project and added many features and bug fixes to it, finally establishing its position in various industries.

But as I mentioned before, middleware products are not true distributed databases; they have their limitations. For example, SQL support, query performance, distributed transactions, operation and maintenance capability, and so on, all have limitations that cannot be exceeded. However, some middleware products have been fortunate enough to continue to evolve, eventually evolving into NewSQL or even cloud-native products. Alibaba Cloud’s PolarDB is such a representative. Its predecessor was Alibaba Cloud’s DRDS, a middleware product for database sharding and partitioning, which originated from the TDDL middleware in the Taobao system.

PolarDB differs from traditional middleware in that it adopts a shared storage architecture. The first company to adopt this architecture was a giant in e-commerce to cloud computing: Amazon. And the database product that adopted this architecture is Aurora.

Aurora adopts such an architecture. It pushes the shard boundary down to the underlying transaction and indexing systems. At this point, because the storage engine retains a complete transaction system, it is no longer stateless and typically retains separate nodes to handle services. The storage engine mainly preserves the computing-related logic, while the underlying storage is responsible for storage-related tasks such as redo logging, dirty flushing, and fault recovery. Therefore, this structure is commonly known as a compute-storage-separated architecture or shared disk architecture.

PolarDB takes a different path based on Aurora. The emergence and popularization of RDMA greatly accelerated the network transfer rate between different nodes, and PolarDB believes that the future network speed will approach the bus speed. Thus, the bottleneck will no longer be the network but the software stack. Therefore, PolarDB uses new hardware combined with Bypass Kernel to achieve efficient shared disk implementation and support efficient database services. Since PolarDB has a lower level of sharding, it achieves better ecological compatibility, making it possible to quickly achieve full coverage of community versions. For replica sets, PolarDB uses ParalleRaft to allow unordered confirmation, unordered commit, and unordered apply within a certain range. However, because PolarDB modifies MySQL’s source code and data format, it cannot be hybrid deployed with MySQL. It is more suitable as a cloud-native DBaaS service deployed in the cloud.

“Shared-nothing” NewSQL, represented by Spanner, achieves good scalability in TPCC-like scenarios due to its high level of sharding. However, it requires significant changes to the business, which is a major limitation. “Shared storage” NewSQL, represented by Aurora and PolarDB, tends to have good ecological compatibility with almost zero business modifications at the expense of some degree of scalability.

From Cobar to PolarDB, we have witnessed the growth of distributed databases in the e-commerce and cloud computing fields. Other typical representatives, such as JD Digits’ ShardingSphere, have similar development trajectories. It can be said that the e-commerce industry, and even the Internet industry, has entered a mode of self-developed intellectual property rights, which enables them to have a high degree of control over technology and create a good ecological environment. Above we have discussed the relatively technologically advanced industry of e-commerce. Now let’s take a look at the performance of some traditional industries.

Telecommunications: Moving Away from IOE #

Telecommunications, as an important IT application industry, has been leading the trend of technological development for the past 20 years. However, telecommunications is also an important industry related to the national economy and people’s livelihoods, so its technological approach is more conservative compared to e-commerce and the internet. Oracle databases have long dominated this field, while distributed computing scenarios are supported by many subsystems. Take China Unicom as an example, around 2010, there were hundreds of provincial systems within the Unicom Group, and their data was synchronized using ETL tools, ensuring data consistency at the application layer. Therefore, at that time, there wasn’t a strong demand for distributed databases.

However, around 2012, about 10 years ago, China Unicom began to try to introduce technology from Alibaba Group, and Cobar mentioned earlier was one of them. Why did China Unicom take the lead among the three major telecom operators? The reason is that Unicom had just completed the merger with the old China Netcom and urgently needed to integrate mobile network services with fixed network services. In addition, the Unicom Group headquarters wanted to create a centralized system nationwide, which required assistance from Alibaba Group.

According to people who participated in the project at that time, nearly a thousand people were involved on-site, with many vendor systems working together. Ultimately, the primary functionalities were only verified a few days before the promised launch date. Evaluated by current technological standards, this project was not very successful at that time. However, this project deeply ingrained Alibaba’s “moving away from IOE” concept into the genes of all participants in the telecommunications field, from operators to suppliers. Everyone believed that a solution using distributed databases was the future trend.

Afterward, MyCAT derived from Cobar was implemented in multiple product lines of China Unicom and China Telecom, and various suppliers began to imitate Cobar to develop their own middleware products. It can be said that it was the background described above that gradually led to the widespread acceptance of the database middleware pattern in the telecommunications field, with the main goal being to move away from “IOE”.

After 2016, the telecommunications industry did not give up its pace of evolution. Major suppliers started to attempt to build NewSQL databases, especially those based on the PGXC architecture. PGXC (PostgreSQL-XC) originally referred to an open-source distributed database with PostgreSQL as its core. Due to the influence of PostgreSQL and its open software copyright agreement (similar to BSD), many vendors made secondary developments based on PGXC and launched their own products. However, these modifications did not change the main architectural style, so I categorize these types of products as PGXC-style databases. Examples of such databases include AntDB by Pactera and KingbaseDB by Ruc DataVault. These databases began to quickly gain ground in the industry.

In recent years, the telecommunications industry has gradually been exposed to the most innovative NewSQL databases, but the options in this category are quite limited. Telecom operators are more inclined to cooperate with domestic vendors. For example, TiDB and OceanBase have already been implemented in the three major telecom operators and China Railway Corporation. However, we can see that the current generation of NewSQL databases takes over systems that are either in the non-innovative field or belong to edge businesses, and they have not yet reached the core systems. Nevertheless, we can assume that in the future, more scenarios will adopt innovative NewSQL databases.

Finally, let’s talk about the banking system.

Banking: Steadily Moving Forward #

Banking and telecommunications are very similar industries, but banks tend to have more conservative strategies. Banks have not pushed for a vigorous “IOE” movement internally, so they have always used traditional databases like Oracle and DB2 in the OLTP domain.

It wasn’t until about five years ago that significant changes were made to the IT architecture of banks. For example, Industrial and Commercial Bank of China (ICBC), a benchmark in the industry, implemented a large-scale architecture transformation in 2018, with research and pilot work conducted in earlier years, around 2016-2017. At that time, commercial NewSQL databases had just been launched, but the demanding requirements of financial scenarios meant that banks were unlikely to be the first to adopt them. Instead, a combination of “distributed middleware + open source monolithic database” became more popular, with the anticipation that the successor to this combination, a PGXC-style distributed database, would eventually emerge.

As a result, ICBC chose the combination of the open-source DBLE and MySQL. MySQL was chosen because of its wide prevalence, and DBLE was chosen because it was developed based on MyCat, an open-source database. Since MyCat had already been widely used, this gave DBLE an advantage. This model was well-implemented within ICBC, resulting in a MySQL cluster with thousands of nodes. Although it may seem conservative on the surface, the telecommunications industry had already begun implementing the PGXC architecture during the same period. ICBC, along with the entire banking industry, also embarked on the path of “going IOE.” It can be expected that the entire industry is also moving towards NewSQL databases. Furthermore, ICBC has announced a collaboration with OceanBase, indicating that the wave of NewSQL transformation in the industry is imminent.

In contrast to the relatively plain application of OLTP technology, ICBC’s technological innovation in OLAP is remarkable. During the same period, ICBC, in collaboration with Huawei, successfully developed GaussDB 200 and put it into production. This database was benchmarked against foreign OLAP databases such as Teradata and Greenplum. With the support of ICBC’s case, many other banks are planning to or are already using this product to replace Teradata.

Summary #

In this extra content, I have summarized several key industries’ use of distributed databases, especially OLTP databases. From these cases, we can observe some commonalities:

  1. The development progression goes from standalone databases to middleware to NewSQL, and even the e-commerce sector has started to adopt cloud computing.
  2. Each industry develops based on its own characteristics. Although they have all gone through these stages, there is some technological lag between them.
  3. Early adopters in one industry have driven other industries, especially the technology in the e-commerce sector, which has been utilized by other industries for collaborative development.
  4. The rise of domestic databases: In recent years, we can see the presence of domestic manufacturers in the procurement of new-architecture NewSQL databases. This is partly due to the policy support of state-owned enterprises such as telecommunications and banks, and also because the progress of domestic databases is widely recognized.

These three representative industries outline the development of distributed databases in China. I hope you can gain valuable insights from their development trajectory and apply distributed databases to your work and studies.