00 Preface - Upgrading to Distributed Databases to Enhance Your Competitiveness in the Workplace #

Hello, I’m Gao Hongtao, former Huawei Cloud technical expert, former Dangdang.net systems architect and Oracle DBA, as well as a member of the Apache ShardingSphere PMC. As a core member of the founding team, I have been deeply involved in Apache ShardingSphere, which is currently serving hundreds of domestic and foreign companies and has been widely recognized in the industry.

I have been working in the field of distributed database design and development for nearly 5 years, and I often participate in and organize industry conferences such as the China Database Conference and Oracle Carnival, where I exchange ideas with industry professionals on the latest trends and developments in distributed databases.

Over the past decade, the entire industry has poured into this field, greatly accelerating technological progress. In the past five years in particular, cloud vendors have released one heavyweight distributed database product after another, lowering the barrier to entry for ordinary users. More and more people are getting involved, and the field is flourishing with a wide variety of products.

Figure: Usage statistics of distributed databases released by Alibaba Cloud at the 2021 Data Conference

What opportunities can mastering distributed databases bring you? #

In production practice, however, many engineers still have only a partial understanding of distributed databases, which leads to questions like these:

  • I’ve heard that MongoDB is better than MySQL, but is it suitable for my business?
  • TiDB and Alibaba Cloud PolarDB both seem to support MySQL syntax. What are the differences between them, and how should I choose?

At root, these questions come from not understanding the basic principles of distributed databases, which easily leads to frequent problems when using them. For example, both Apache Cassandra and Azure Cosmos DB support multiple consistency levels, but if you don’t understand distributed consistency models, you are likely to choose the wrong one and end up with inconsistent business data.
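To make the consistency-level choice concrete, here is a minimal sketch of the quorum-overlap rule often used to reason about tunable consistency: with N replicas, a read is guaranteed to reach at least one replica that acknowledged the latest write when W + R > N. The level names ONE, QUORUM, and ALL are borrowed from Cassandra, but the helper functions are hypothetical. The model is a deliberate simplification that ignores multi-datacenter levels, node failures, and repair mechanisms.

```python
# Simplified model: N replicas, W write acknowledgements, R read acknowledgements.
# Assumption: a read sees the latest acknowledged write whenever the read set
# and the write set must share at least one replica, i.e. W + R > N.

def quorum(n_replicas: int) -> int:
    """Majority of the replicas."""
    return n_replicas // 2 + 1


# Hypothetical mapping from a level name to the number of replicas involved.
LEVELS = {
    "ONE": lambda n: 1,
    "QUORUM": quorum,
    "ALL": lambda n: n,
}


def reads_see_latest_write(n_replicas: int, write_level: str, read_level: str) -> bool:
    """Return True when every read set overlaps every write set."""
    w = LEVELS[write_level](n_replicas)
    r = LEVELS[read_level](n_replicas)
    return w + r > n_replicas


if __name__ == "__main__":
    for write_level, read_level in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ONE", "ALL")]:
        ok = reads_see_latest_write(3, write_level, read_level)
        print(f"N=3 write={write_level} read={read_level} overlap={ok}")
```

Under this model, writing and reading at ONE on three replicas gives no overlap guarantee, which is the kind of wrong choice that leads to stale or inconsistent reads.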

One typical misconception in the industry follows from this: that distributed databases can only follow the CAP theorem and cannot provide the ACID guarantees of traditional databases, and that business systems therefore cannot be migrated to a distributed database.

In fact, modern distributed databases (especially NewSQL databases) have already solved this problem to some extent. (I will discuss consistency models in detail in Lecture 5 and Lecture 15, so you will get the answers you want.)

Traditional databases mostly rely on replication and synchronization to improve query performance and availability, but these techniques are like a pile of “patches” applied to an already overloaded system. While solving some problems, they introduce others (for example, replication lag has long been a problem for MySQL’s replication-based high availability solutions).
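The replication lag problem can be illustrated with a small sketch. It is a toy model, not MySQL's actual replication protocol: the `Primary` and `AsyncReplica` classes are hypothetical, and the replica simply applies the primary's change log some time after the writes happen.

```python
class Primary:
    """Accepts writes and records them in an ordered change log."""

    def __init__(self):
        self.data = {}
        self.log = []  # ordered (key, value) changes

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class AsyncReplica:
    """Applies the primary's log later, so reads may be stale."""

    def __init__(self, primary):
        self.primary = primary
        self.data = {}
        self.applied = 0  # number of log entries applied so far

    def lag(self):
        """Number of log entries the replica has not applied yet."""
        return len(self.primary.log) - self.applied

    def apply(self, max_entries):
        """Apply at most max_entries pending log entries."""
        end = min(self.applied + max_entries, len(self.primary.log))
        for key, value in self.primary.log[self.applied:end]:
            self.data[key] = value
        self.applied = end

    def read(self, key):
        return self.data.get(key)


def demo():
    primary = Primary()
    replica = AsyncReplica(primary)
    primary.write("balance", 100)
    replica.apply(max_entries=1)      # replica is caught up
    primary.write("balance", 50)      # new write, not yet replicated
    stale = replica.read("balance")   # replica still returns the old value
    lag_before = replica.lag()
    replica.apply(max_entries=10)     # replica catches up
    fresh = replica.read("balance")
    return stale, lag_before, fresh


if __name__ == "__main__":
    print(demo())
```

A client that writes to the primary and immediately reads from the replica observes the old value until the pending log entries are applied.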

On the other hand, distributed databases are designed from the ground up for distributed scenarios, so they can solve some tricky problems of traditional databases at a fundamental level. Although the initial investment may be relatively large, it can ensure the healthy development of the subsequent technical system and has significant advantages in long-term costs.

In addition, distributed databases are like a “treasure chest” of distinctive design ideas, well-tested architectural patterns, and a wealth of algorithmic detail. As distributed databases develop rapidly, more and more developers, product managers, and operations staff will inevitably work with them to some degree. Mastering distributed databases will therefore strengthen your competitiveness in the workplace and become a highlight of your technical resume.

  • For database engineers, interviews often go beyond daily usage to cover designing database cluster architectures and ensuring horizontal and vertical scalability. Understanding the principles of mainstream distributed databases and related case studies will help you answer with confidence.
  • For cloud product managers, mastering the principles of mainstream commercial and open-source distributed databases is equally important, because it is a prerequisite for planning and designing cloud products in this area.
  • Even mobile app developers, who generally do not interact with backend databases directly, can draw inspiration from the principles of distributed databases when solving the problem of synchronizing shared data across multiple devices.
  • For operations engineers, understanding what actually happens inside a distributed database helps in designing sound support strategies and in handling specific problems more skillfully.

What are the difficulties in the learning process? #

The learning curve for distributed databases is very steep, though. Compared with other areas of knowledge, one difference stands out: learning materials are abundant, but most of them are difficult.

  • Database technology has a long history, and its evolution has branched in many directions. Each researcher explains distributed databases from their own professional background and area of expertise, so thoroughly understanding this complex background is a challenge for most people who want to go deeper into the field.
  • The field also has a strong academic tradition, so much of its core technology is described in papers. The content is often obscure and mostly written in English, which further raises the barrier to exploring the core theory.
  • Some courses often focus on training toward the DBA direction and are generally limited to a specific database (such as cloud database certification or Oracle DBA certification training), without abstracting some common characteristics that would facilitate everyone’s understanding of the core concepts of distributed databases.

To some extent, this also keeps “misunderstandings” about distributed databases alive. It has strengthened my determination to help you understand the design principles of general-purpose distributed databases and to re-examine your own business practices.

After studying this course, you will have a clearer approach to technology selection, system architecture design, and solving key technical problems. In promotion reviews, job interviews, and similar settings, you will also be able to handle related technical questions with confidence.

How did I design this course? #

Because distributed databases cover a broad and complex body of knowledge, I have designed this course around three principles so that you can understand and master the key information efficiently.

  1. Simplifying complexity. I remove outdated and unimportant technical details and go straight to the content that matters for distributed databases. At the same time, I will guide you toward the details behind the technology, so that you can continue learning on your own.
  2. Comprehensive knowledge. The content not only introduces the theory related to distributed systems but also introduces storage engines rarely mentioned in general materials. It is the combination of the two that creates the high performance and scalability characteristics of distributed databases.
  3. Emphasizing practicality. In line with the spirit of combining technical concepts with practical cases, when introducing technical details, I will connect them with relevant distributed databases, and help you build your knowledge system from multiple perspectives.

Based on the above design ideas, I divided the course into 4 modules with a total of 24 lessons.

  • Module 1: Historical Evolution and Core Principles of Distributed Data. Starting from the historical background, this module explains the problems to be solved by distributed databases, their application scenarios, and the core technical characteristics.
  • Module 2: High Performance Guarantee of Distributed Databases - Storage Engines. This is the highlight of the course, providing a brief introduction to modern database storage engines, such as typical storage engines, distributed indexes, data file and log-structured storage, and transaction processing. In particular, I will introduce the differences between distributed databases and traditional databases at the storage level. After completing this module, you will have a complete understanding of important features in distributed databases (such as consistency and distributed transactions), and understand why specific storage engines (such as log-structured storage) are more suitable for building distributed databases.
  • Module 3: High Scalability Guarantee of Distributed Databases - Distributed Systems. This module explains in detail the design principles and algorithms found in distributed databases, including but not limited to failure detection, leader election, reliable data propagation, distributed transactions, and consensus algorithms. Although distributed systems is a large topic, I will not cover everything. Instead, I will extract the essentials and build your knowledge system through examples.
  • Module 4: Knowledge Expansion. I will discuss the most successful distributed databases in use today (both traditional and new), explore the key reasons for their success, and map them to the technical principles introduced in the previous modules, enriching your knowledge system.

Instructor’s Message #

The design goal of this course is to solve as many of your practical problems as possible, enabling you to have a more professional understanding of database storage in distributed scenarios through various engineering practices, and to establish deep insights into technology trends.