02 SQL vs No SQL a Clear Understanding of SQL Through the Conflicting SQL Paradigm

02 SQL vs NoSQL - A Clear Understanding of SQL Through the Conflicting SQL Paradigm #

In this lesson, we will begin discussing the past and present of SQL and its variations, as well as the complex relationship it has with distributed databases.

Developers in the 21st century often find themselves in a dilemma: they have to make difficult choices among numerous databases. They would like to follow in the footsteps of their predecessors and choose Oracle or DB2 without thinking, as these databases were proven to be “error-free” choices that would not jeopardize their jobs.

However, times have changed, and even if we choose mature products from reputable companies, we might still fall into a “pit” if we don’t understand the internal mechanisms of various databases. Therefore, choosing the right database has become an essential skill in our daily work.

Of course, there are various ways to classify databases. Over the past 20 years, one classification has been widely adopted: SQL (relational databases) vs NoSQL (other types of databases). As time goes on, new species have emerged, such as NewSQL and DistributedSQL. Looking at their names, these databases all have some connection to SQL, so what role does SQL play?

Let’s start with the conclusion: SQL is the “core” of all databases, even if they claim to say “no” to SQL. How can we understand this? Let’s follow the development of databases to explain and gradually validate this viewpoint.

The Golden Age of SQL #

Let’s start with a simple definition: a SQL database is a database that supports SQL, which is a specific domain language used for querying and manipulating “data” in relational databases. The “relations” in relational databases refer to the “relational model” of data management designed by IBM researcher E.F. Codd in the early 1970s, which gained popularity in System R and many subsequent database systems.

So what are the advantages and disadvantages of SQL and relational databases?

Let’s talk about the advantages first: due to the pre-defined schema, databases have compact storage, which leads to excellent performance. Then there’s the classic ACID, which brings controllability to businesses, and the data access patterns based on standardized SQL bring more benefits to enterprise applications because “standards are productivity.”

The disadvantages are: it requires high requirements for early-stage design because modifying the schema later often requires downtime; it does not consider distributed scenarios and lacks support in terms of scalability and availability. However, distributed systems are essential skills for applications in the 21st century. Please pay attention to this, as it is an important differentiating point between new and old databases.

Since the late 1970s, SQL and relational databases have been the industry standard. Most popular “enterprise” systems are direct descendants of System R and inherit SQL as their query language. The significance of SQL lies in providing a set of access standards for structured data. It is an objective standard free from specific vendor constraints. Although different databases will extend and modify standard SQL, the most commonly used parts remain consistent with the original design.

With the development of SQL, it has been widely used in various commercial and open-source databases. Long-term production practice combined with its excellent design has produced the following two phenomena.

1. Excellent Mass Base

Due to the wide usage of SQL, a group of people who are proficient in this technology has formed. This group is large and includes not only technical professionals but also individuals from other industries such as finance, logistics, and data analysis who consider SQL as an essential skill for their work. Therefore, SQL has a solid mass base.

2. Rich Application Ecology

Objectively, SQL is not conducive to program development. This is because application systems need to write a large amount of raw code to interact with SQL systems, resulting in the existence of an objectively necessary data access layer. This layer adds complexity to the system and has a negative impact on maintenance. In response to this issue, application systems often introduce abstraction layers to shield the complexity of their data access layer, making it easier for business developers to use SQL-like databases. From data access frameworks, ORM, to database middleware, a wave of such technology components frequently enter people’s field of vision, thus building a rich ecosystem.

The interaction of the above two points together creates the golden age of SQL and relational databases. During its peak, almost all types of applications needed to interact with databases, and some even jokingly referred to this as “programming towards databases.” But with the quiet rise of the Internet, the situation slowly began to change.

The Emergence of NoSQL #

NoSQL databases were developed around 2009 and are a type of non-relational database. They focus on data storage and querying in distributed scenarios, do not require pre-defined schemas, generally do not support joins, and are easy to scale. Developers are usually allowed to frequently change schemas online, making it easier to meet business requirements.

NoSQL databases are often used for big data and consumer-facing internet applications due to their large data storage requirements. For example, companies like Twitter, Facebook, Alibaba, and Tencent utilize NoSQL databases to collect tens or even hundreds of terabytes of user data every day.

So, how are NoSQL databases different from SQL databases? The differences are shown in the table below.

Table: Differences between NoSQL and SQL databases

Besides the fact that NoSQL is not SQL, another widely accepted interpretation is “Not Only SQL.” The underlying implication is: we don’t have SQL, but there’s something more attractive than SQL, and that is - distributed systems.

Before the advent of NoSQL, deploying multi-node commercial databases was difficult and costly, often requiring dedicated hardware. Although in theory the scale should be large enough, that wasn’t the case. Furthermore, most NoSQL databases, which appeared later, naturally considered scalability using inexpensive hardware at the design level. Additionally, because they gave up ACID properties, their performance did not decrease as the system scaled.

Of course, the disadvantages of NoSQL are also quite apparent: due to the lack of ACID, data consistency issues need to be handled with extreme care when developing applications. Additionally, since the data models of NoSQL databases often target specific scenarios, it is generally not feasible to use a single NoSQL database to build an entire application, leading to design complexity and maintenance difficulties.

When we examine NoSQL databases, we will notice an interesting fact: their most significant commonality is actually that they have no commonality, and “no” becomes their greatest common factor. This leads me to suspect that NoSQL is essentially a concept created for publicity - it turns a novelty into an image that is anti-traditional and anti-authority, thus achieving the goal of marketing.

As a result, the concept of NoSQL is larger than its content. Although it has attracted a lot of criticism, its significance is also significant. I believe it is mainly reflected in the following three aspects.

First, it breaks inherent thinking

Many applications actually find it cumbersome to use databases, such as social applications in the internet scene dealing with a large amount of unstructured data. This type of scenario can actually be implemented using relational databases, but it requires designing highly scalable applications to support the scenario, as well as experienced DBAs to cooperate, both of which are necessary to ensure the stable operation of the system.

Using MongoDB can solve the scenario problem well, simplify development, and achieve smooth system expansion under a certain number of accesses, reducing operation and maintenance pressure. This has brought new choices to internet companies, especially startups, with limited funds. It can also be seen that not every system needs to be oriented towards relational databases and SQL. It can be said that NoSQL has broken through the shackles of the entire industry, allowing technology to return to human nature, and return to its original intention. Second, Breaking the Monopoly of Big Companies

At that time, the entire industry was under the control of large database service providers such as Oracle and IBM, and their presence could be seen in most commercial scenarios. However, the emerging internet industry, starting from solving practical problems, had fewer constraints and could take bigger steps. Through trial and error and iteration, various databases in the NoSQL category were validated and played a role in real-world business scenarios.

This trend has to some extent broken the monopoly, revitalized the industry, and forced large database service providers to accelerate their progress, resulting in a win-win situation.

Third, Introducing Distributed Systems into Databases

Since then, the concept of distributed databases has become popular, and even the entire technology circle cannot avoid the idea of “distributed data,” giving rise to types such as NewSQL, which we will introduce later.

NoSQL, as a means of promotion, revealed the dilemma of that generation of innovative data storage: they were actually far from SQL, but had to have a profound relationship with it. This proves the argument I made at the beginning: the only core and common point of NoSQL databases is actually SQL.

But in the past decade, with the development of NoSQL, some databases have overcome the shadow of SQL, such as Elasticsearch and Redis. When talking about them, people often do not associate them with the concept of NoSQL. Obviously, they have gained recognition over time and eventually established their own reputation.

The Incursion of NewSQL #

People often criticize NoSQL for “throwing the baby out with the bathwater.” SQL databases are so widely used that it is not worth abandoning SQL for the sake of distributed features.

Therefore, some organizations started building distributed databases based on SQL. On the surface, they all support SQL, but based on their implementation, they have developed two routes: NewSQL and Distributed SQL. In this lecture, I will first introduce the former.

NewSQL is a distributed database built based on the NoSQL model. It typically uses existing SQL-like relational databases as the underlying storage or self-developed engines, and adds a distributed system on top of them, thereby shielding end users from the details of distributed management. Citus and Vitess are two famous examples of this type, which I will discuss specifically in the fourth module.

In addition, some database middleware, such as MyCAT and Apache ShardingShpere, cannot be called NewSQL databases because they fully expose the underlying relational databases, but they can be considered alternative representatives of this pattern.

Around the beginning of 2010, people began to attempt to build databases of this kind. Later, Matthew Aslett of 451 ResEArch coined the term “NewSQL” in 2011 to define these new “scalable” SQL databases.

There are generally two types of NewSQL databases.

The first type provides an automatic data sharding management layer on top of independently running SQL database instances. For example, Vitess uses MySQL, while Citus uses PostgreSQL. Since each independent instance is still a single-node relational database, some critical features cannot be fully supported, such as local failover/repair and distributed transactions across shards. Even worse, some features of single-node databases cannot be used, such as Vitess only supporting a “subset” of subqueries.

The second type includes NuoDB, VoltDB, and Clustrix, which build new distributed storage engines. Although they still have some feature limitations, they can provide users with a complete SQL database experience.

The original purpose of building NewSQL databases was to solve the challenges of writing to SQL databases in distributed scenarios. They can use multiple traditional single-node SQL databases as their storage nodes and build scalable distributed databases on top of them. During the time they were developed, cloud technology was still in its early stages, so these types of NewSQL databases reached a certain level of development. However, as multi-availability zones, multi-regions, and multi-cloud deployments have become the standard for modern applications, these databases have also begun to struggle to keep up. Meanwhile, the rise of Distributed SQL databases like Google Spanner and TiDB has further challenged the position of NewSQL databases. The latter is designed to leverage the characteristics of cloud computing and is an “cloud-native” database that can operate stably in unreliable infrastructure.

We can see that NewSQL has returned to a state where SQL is at its core, showcasing the charm of SQL that can transcend decades. However, this revolution is not complete, as we can still see the presence of traditional single-node databases and the castration of SQL functionality. Furthermore, the revolutionaries themselves often come from the application field rather than professional database institutions. Therefore, NewSQL is more like a user-side carnival. It can solve a class of problems but is not complete, requiring careful evaluation and usage.

The Rise of Distributed SQL #

I mentioned Distributed SQL databases earlier, which use special underlying storage engines to build horizontally scalable databases. In addition to the functionality provided by NewSQL, Distributed SQL databases often offer “geographic distribution” features, allowing users to distribute data across availability zones, regions, or even on a global scale. CockroachDB, Google’s Spanner, OceanBase, and PingCAP’s TiDB are good examples, and these engines typically have higher goals than NewSQL.

It is important to emphasize that NoSQL and NewSQL are built on the assumption that building a fully functional distributed database comes at a high cost and requires some compromises. On the other hand, commercial Distributed SQL databases aim to build such a database at a reasonable cost, and their philosophies are confrontational.

Compared to typical NewSQL databases, a Distributed SQL database looks more like a complete solution. Its features generally include scalability, data consistency, high availability, geographic-level distribution, and SQL support. These are not a combination of tools. A qualified Distributed SQL database should be able to achieve the above functionality without the need for additional tools.

In addition, since Distributed SQL is naturally suitable for integration with cloud computing, some cloud-native databases can also be classified under this category, such as AWS’s Aurora. Whether it is a cloud or non-cloud database, Distributed SQL is almost always a commercial database, while NewSQL, due to its nature as a tool, has a significant open-source presence.

This reflects the great potential and commercial value of Distributed SQL, and from another perspective, it also demonstrates that it is the most orthodox inheritor of the golden age of SQL relational databases.

The new generation of SQL has risen from the old era. However, besides SQL, it has also undergone earth-shattering changes internally. But this is precisely the charm of SQL: it transcends time, remains at the core of databases, and is one of the few legacies of classic database theory.

Summary #

This concludes our discussion, in which we reviewed the past and present of databases and SQL, explored the relationship between current distributed databases and SQL, and understood the significance of the complexity implicit in SQL.

SQL was introduced by relational databases in the 1970s and has since been viewed as the standard query interface for databases, forming a strong mass base. The NoSQL trend that emerged around 2000 is fundamentally unrelated to SQL, but ironically, they had to rely on SQL, their “rival,” to define their own value, causing us to marvel at the resilient vitality of SQL. With the development of NewSQL and Distributed SQL in the past decade, SQL has returned to its origins and transformed from an old-era overlord to a pioneer of the new era.

SQL has certainly undergone changes during this long period and is even significantly different from its earliest version. However, its core principles have not changed, so we still call it SQL rather than giving it a new name.

Therefore, through this review, we can be confident that any successful database needs a natural connection to SQL, and the wonderful design of SQL will help the new generation of distributed databases forge ahead.

Mutual Learning #

After studying the content of this lesson, I hope you will consider the following question: Should InnoDB Cluster, introduced in MySQL 8.0, be classified as which type of distributed database?