Addendum 01 Concept Analysis - Cloud-Native, HTAP, Graph, and In-Memory Databases #

We have completed the main content of our course. Having studied the 24 lessons, you should now have mastered the core concepts of modern distributed databases; the distributed databases discussed so far mainly refer to NewSQL databases. In this additional lesson, I will introduce some terms related to distributed databases and the ideas behind them. After studying it, when you encounter a new database, you will be able to understand where it stands.

First, I will introduce cloud-native databases, which are directly related to NewSQL databases. Second, I will cover the hot topic of Hybrid Transactional/Analytical Processing (HTAP) databases. Beyond relational databases there are many other data models, and I will use graph databases as an example to show the advantages non-relational databases have on certain typical problems. Finally, I will introduce in-memory databases.

Cloud-Native Databases #

Cloud-native databases are built around the concept of cloud native, which generally covers two ideas: first, the database is offered to users as a service, on the cloud; second, the database is built on a cloud-native architecture, so an application can be deployed locally, in the cloud, or switched freely between the two.

The first idea is the widely known one. This type of cloud-native database is built, deployed, and distributed on top of cloud infrastructure, and that cloud-native attribute is its biggest distinguishing feature. Delivered as Platform-as-a-Service (PaaS), such databases are often referred to as Database-as-a-Service (DBaaS). Users can use the platform for various purposes, such as storing, managing, and extracting data.

Using this type of cloud-native database generally brings several benefits.

Instant Recovery

This refers to the database's ability to withstand a crash, or to be stopped and started, without warning. Despite advances in technology, failures such as disk faults, network partitions, and virtual machine anomalies remain inevitable. They are particularly challenging for traditional databases: because the whole database runs on a single machine, even a small problem can affect every function.

Cloud-native databases are designed with strong instant-recovery capability, meaning they can immediately restart or reschedule database workloads. In fact, this disposability has been extended from a single virtual machine to the entire data center. As the environment evolves toward ever greater stability, cloud-native databases will reach a state where users are simply unaware of such failures.

Security

DBaaS runs in a closely monitored, secured environment, protected by anti-malware and antivirus software and by firewalls. Beyond round-the-clock monitoring and regular software upgrades, the cloud environment itself provides additional safety. Traditional databases, in contrast, are more exposed to data loss and unrestricted access. By leveraging the instant snapshot copies a service provider offers, users can achieve a small Recovery Point Objective (RPO, how much data may be lost at most) and Recovery Time Objective (RTO, how long restoring service may take at most).

Elastic Scalability

The ability to scale on demand at runtime is a prerequisite for the growth of any enterprise. It lets businesses focus on pursuing their goals without worrying about storage limitations. Unlike traditional databases, which keep all files and resources on the same host, cloud-native databases can store data in different ways and are insulated from individual storage failures.

Cost Saving

Building a data center is a complete project in its own right, requiring significant hardware investment as well as well-trained operations and maintenance personnel to run it reliably, and the ongoing maintenance adds considerable financial pressure. With a cloud-native DBaaS platform, you get a scalable database at a much lower upfront cost, freeing your hands for more optimized resource allocation.

The cloud-native databases described above generally adopt a brand-new, NewSQL-style architecture. The most typical representative is Amazon Aurora, which reimplements InnoDB's storage as a distributed, log-based storage service, effectively producing a distributed MySQL. Major cloud providers now offer RDS databases with similar advantages, such as Alibaba Cloud PolarDB.
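To give a feel for the "log is the database" idea behind such systems, here is a minimal sketch of my own, not Aurora's actual design or API (all names are invented): the write path only ever appends redo records, and a storage node materializes a page on demand by replaying the records that touch it.

```python
from dataclasses import dataclass


@dataclass
class RedoRecord:
    """A logical redo entry: apply `value` at `offset` of page `page_id`."""
    lsn: int       # log sequence number, monotonically increasing
    page_id: int
    offset: int
    value: bytes


class LogStructuredStorage:
    """Toy storage node: the log is the only thing ever written."""

    def __init__(self) -> None:
        self.log: list[RedoRecord] = []

    def append(self, record: RedoRecord) -> None:
        # The write path is just an append; no data pages are written.
        self.log.append(record)

    def materialize(self, page_id: int, page_size: int = 16384) -> bytearray:
        # A page is reconstructed on demand by replaying its redo records.
        page = bytearray(page_size)
        for rec in sorted(self.log, key=lambda r: r.lsn):
            if rec.page_id == page_id:
                page[rec.offset:rec.offset + len(rec.value)] = rec.value
        return page


storage = LogStructuredStorage()
storage.append(RedoRecord(lsn=1, page_id=0, offset=0, value=b"hello"))
storage.append(RedoRecord(lsn=2, page_id=0, offset=5, value=b" aurora"))
print(storage.materialize(0)[:12])  # bytearray(b'hello aurora')
```

The design choice this illustrates is that expensive page writes disappear from the write path: durability comes from the log alone, and pages become a cache that can be reconstructed on any storage node.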

The second type of cloud-native database can in principle be deployed inside the enterprise (private cloud) or with cloud service providers (public cloud), and many companies try a hybrid of the two. Databases designed for this scenario do not maintain infrastructure themselves; instead, they make their products runnable on a variety of cloud platforms. ClearDB is a typical representative. Such cross-cloud deployment improves a database's overall stability and can improve data-access responsiveness for globally deployed applications.

Of course, cloud-native databases are not limited to relational ones. Databases such as Redis, MongoDB, and Elasticsearch already occupy important positions among the services of major cloud providers. In its broad sense, the definition of a cloud-native database is very wide indeed.

The above covered the innovations cloud-native databases bring on the delivery side. Next, let me introduce the new ideas that HTAP databases bring to data processing.

HTAP #

Before introducing HTAP, let's review OLTP and OLAP. As I mentioned earlier, OLTP focuses on transactional processing: each transaction touches little data, but results must be returned within a short time. OLAP scenarios, by contrast, typically involve computation over large datasets. In the big-data era, the two have diverged onto separate development paths.

However, OLAP data usually originates in OLTP systems, and the two are connected through ETL. To improve OLAP performance, a great deal of precomputation is done during the ETL process, including restructuring the data and applying business logic. The advantage of this approach is that it keeps OLAP access latency under control and improves the user experience. However, to avoid burdening the OLTP system with data extraction, the ETL process must start during periods of low traffic, typically after midnight for e-commerce systems.
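As a rough illustration of such a nightly job (the `orders` table, its columns, and the SQLite stand-ins are all invented for this sketch), the ETL step extracts day N's transactions, pre-aggregates them, and loads the result into the warehouse, where it becomes queryable on day N+1:

```python
import sqlite3
from datetime import date, timedelta

# Stand-ins for the OLTP database and the analytical warehouse; in practice
# these are two different systems connected by an ETL tool or scheduler.
oltp = sqlite3.connect("oltp.db")      # assumed to hold orders(product_id, amount, order_date)
olap = sqlite3.connect("warehouse.db")

yesterday = (date.today() - timedelta(days=1)).isoformat()

# Extract + transform: pre-aggregate order totals per product for day N.
rows = oltp.execute(
    "SELECT product_id, COUNT(*), SUM(amount) "
    "FROM orders WHERE order_date = ? GROUP BY product_id",
    (yesterday,),
).fetchall()

# Load: the aggregate only becomes visible in OLAP on day N+1.
olap.execute(
    "CREATE TABLE IF NOT EXISTS daily_product_sales "
    "(day TEXT, product_id INTEGER, order_count INTEGER, revenue REAL)"
)
olap.executemany(
    "INSERT INTO daily_product_sales VALUES (?, ?, ?, ?)",
    [(yesterday, pid, cnt, total) for pid, cnt, total in rows],
)
olap.commit()
```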

As a result, the data latency between the OLAP and OLTP systems is usually about one day, commonly expressed as N+1: N is the date on which data is generated in the OLTP system, and N+1 is the date on which it becomes available in the OLAP system, one day later. You may have noticed that the main problem with this setup is exactly that latency: a 24-hour delay is too long. In the big-data era, business decisions rely ever more on data, and analysis keeps penetrating frontline operations, which demands that business changes be reflected in the OLAP system faster.

This is where HTAP comes in. HTAP (Hybrid Transactional/Analytical Processing) first appeared in a 2014 Gartner report. Gartner used the term to describe a new type of database that breaks down the barrier between OLTP and OLAP, supporting both transactional and analytical scenarios within a single database system. The concept is attractive: HTAP can eliminate the cumbersome ETL process and enable analysis of the freshest data.

Since the data originates in OLTP systems, the HTAP concept quickly became popular among OLTP databases, especially NewSQL-style distributed and cloud-native databases, which use HTAP as a way into the OLAP field. A typical representative is TiDB: its 4.0 architecture, with the addition of TiFlash, realizes the HTAP pattern.

TiDB separates computing from storage. The underlying storage keeps multiple replicas of the data, and some of those replicas can be converted into columnar form. OLAP requests can then be routed directly to the columnar replicas, the TiFlash replicas, which provide high-performance columnar analysis. This is TiDB's architectural innovation and breakthrough: the same set of data serves both real-time transactions and real-time analysis.
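To make this concrete, here is a small sketch of how an application might use TiFlash. The SQL statements follow TiDB's documented TiFlash syntax; the connection details and the `orders` table are placeholders of my own:

```python
import mysql.connector  # pip install mysql-connector-python; TiDB speaks the MySQL protocol

# Connect to a TiDB cluster (4000 is TiDB's default SQL port).
conn = mysql.connector.connect(host="127.0.0.1", port=4000, user="root", database="test")
cur = conn.cursor()

# Ask TiDB to maintain one columnar (TiFlash) replica of the table.
cur.execute("ALTER TABLE orders SET TIFLASH REPLICA 1")

# Pin this session's reads to the columnar replicas, so the analytical
# scan below never touches the row-based TiKV replicas serving OLTP.
cur.execute("SET SESSION tidb_isolation_read_engines = 'tiflash'")
cur.execute("SELECT product_id, SUM(amount) FROM orders GROUP BY product_id")
for product_id, revenue in cur.fetchall():
    print(product_id, revenue)

conn.close()
```

This is the evolution logic behind HTAP databases, together with a typical representative. Now let's continue expanding the boundaries of knowledge and look at other forms of distributed databases.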

In-Memory Database #

Traditional databases were designed around disk storage because of the hardware constraints of their era: a single CPU, a single core, and limited available memory. Storing an entire database in memory was not feasible, so it had to live on disk. In-memory databases are NewSQL databases that introduce a completely new architecture at the storage engine level.

As technology has advanced, memory has become cheaper and larger; a single machine can now be configured with hundreds of gigabytes or even terabytes of it. For many database applications, that is already enough to hold all business data in memory. Structured data typically does not reach enormous scale: the combined transaction data of a bank over 10 to 20 years may amount to only a few tens of terabytes. When such structured data sits in a disk-based DBMS, the limited I/O performance of disks often makes the database the performance bottleneck of the whole application under large-scale SQL queries and transaction processing.

If we give a database server a sufficiently large amount of memory, can we keep the original architecture and simply load all the structured data into the buffer pool to solve the performance problem? That does help to a degree, but the logging mechanism and the flushing of updates back to disk remain bound by disk read/write speeds, so the advantages of a large-memory system are not fully exploited. In both architectural design and memory usage, in-memory database management systems differ fundamentally from traditional disk-based ones.

A typical in-memory database needs to be optimized in the following aspects:

  1. Elimination of the write buffer: the traditional buffer mechanism no longer applies. Locks and data need not be kept in two separate places, but concurrency control is still required, and a strategy different from traditional lock-based pessimistic concurrency control has to be adopted.
  2. Minimization of runtime overhead: disk I/O is no longer the bottleneck; the new bottlenecks are computing performance and function-call overhead, so runtime performance must be improved.
  3. Scalable, high-performance index construction: although an in-memory database does not read data from disk, it still writes logs to disk, and log writing can fall behind. The volume of logged data can be reduced, for example by dropping undo information and writing only redo records, or by logging data changes without logging index updates. If the system crashes and data must be reloaded from disk, indexes can be rebuilt concurrently: as long as the base table survives, indexes can always be reconstructed, and rebuilding them in memory is relatively fast (see the sketch after this list).

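Here is a minimal sketch of the redo-only idea from point 3, under simplifying assumptions of my own (a toy table with one secondary index and no concurrency control): only redo records ever reach disk, the index is never logged, and recovery replays the log and then rebuilds the index from the base table.

```python
import json


class InMemoryTable:
    """A toy in-memory table with redo-only logging and a rebuildable index."""

    def __init__(self, log_path: str):
        self.log_path = log_path
        self.rows: dict[int, dict] = {}          # base table: key -> row
        self.index_by_name: dict[str, int] = {}  # secondary index, never logged

    def put(self, key: int, row: dict) -> None:
        # Persist only a redo record -- no undo, no index entries.
        with open(self.log_path, "a") as log:
            log.write(json.dumps({"key": key, "row": row}) + "\n")
        self.rows[key] = row
        self.index_by_name[row["name"]] = key

    @classmethod
    def recover(cls, log_path: str) -> "InMemoryTable":
        table = cls.__new__(cls)
        table.log_path = log_path
        table.rows = {}
        table.index_by_name = {}
        # Replay redo records to restore the base table ...
        with open(log_path) as log:
            for line in log:
                rec = json.loads(line)
                table.rows[int(rec["key"])] = rec["row"]
        # ... then rebuild the index from the base table (fast in memory).
        for key, row in table.rows.items():
            table.index_by_name[row["name"]] = key
        return table


t = InMemoryTable("redo.log")
t.put(1, {"name": "alice", "balance": 100})
recovered = InMemoryTable.recover("redo.log")
print(recovered.index_by_name["alice"])  # 1
```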
In conclusion, building an in-memory database is not simply a matter of removing the disk; it requires optimization on multiple fronts to realize the advantages of memory. Graph databases, by contrast, achieve high performance by changing the data model. Let's explore their specific features now.

Graph Database #

A graph database uses a graph structure for semantic queries, representing and storing data as nodes, edges, and properties. The key concept of the system is the graph itself, which directly links stored data items through the nodes and edges that represent their relationships. Because these relationships connect data directly, retrieval can often be done with a single operation. Graph databases put relationships between data first: querying relationships is fast because they are persisted in the database itself, and highly interconnected data can be displayed intuitively.
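To see why relationship queries are cheap, consider the toy sketch below (the data and names are invented): each node stores direct references to its neighbors, so a friends-of-friends query is two hops through memory rather than two relational joins.

```python
from collections import defaultdict

# Adjacency list: each person points directly at their friends.
friends: dict[str, set[str]] = defaultdict(set)

def add_friendship(a: str, b: str) -> None:
    friends[a].add(b)
    friends[b].add(a)

add_friendship("alice", "bob")
add_friendship("bob", "carol")
add_friendship("carol", "dave")

def friends_of_friends(person: str) -> set[str]:
    # Two hops along stored edges -- no join, no per-hop index lookup.
    result: set[str] = set()
    for friend in friends[person]:
        result |= friends[friend]
    return result - friends[person] - {person}

print(friends_of_friends("alice"))  # {'carol'}
```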

If in-memory databases innovate in the underlying storage, graph databases modify the data model. When building a graph database, the underlying storage can be a KV store such as an LSM-tree-based engine, or the in-memory storage introduced in the previous section; some graph databases also use storage structures of their own. Because the algorithms graph data relies on were not designed with distributed scenarios in mind, two schools of thought have emerged for distributed graph processing.

Node-Centric Approach

This applies traditional distributed-computing theory to graph databases. Processing is node-centric: each node exchanges data with its adjacent nodes, and graph algorithms follow this uniform pattern, which yields good parallelism but lower efficiency. It is suitable for batch-parallel execution of simple graph computations; a typical representative is Apache Spark. Systems of this type are therefore also known as graph processing engines.
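This node-centric style is often summarized as "think like a vertex", the model popularized by Google's Pregel and offered by Spark's GraphX. Below is a single-machine sketch of my own, purely for illustration: in each superstep, a vertex folds incoming messages into its state and, if the state improved, sends messages to its neighbors. This instance computes single-source shortest paths.

```python
import math

# Directed weighted edges: (src, dst, weight).
edges = [("a", "b", 1.0), ("b", "c", 2.0), ("a", "c", 5.0)]
neighbors: dict[str, list[tuple[str, float]]] = {}
for src, dst, w in edges:
    neighbors.setdefault(src, []).append((dst, w))

# Vertex state: current best-known distance from the source.
dist = {v: math.inf for v in {x for e in edges for x in e[:2]}}

# Superstep 0: the source vertex receives distance 0.
inbox: dict[str, list[float]] = {"a": [0.0]}

while inbox:  # run supersteps until no vertex receives a message
    outbox: dict[str, list[float]] = {}
    for vertex, messages in inbox.items():
        candidate = min(messages)
        if candidate < dist[vertex]:       # the vertex's state improves ...
            dist[vertex] = candidate
            for dst, w in neighbors.get(vertex, []):
                # ... so it sends new candidate distances to its neighbors.
                outbox.setdefault(dst, []).append(candidate + w)
    inbox = outbox

print(sorted(dist.items()))  # [('a', 0.0), ('b', 1.0), ('c', 3.0)]
```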

Algorithm-Centric Approach

This kind of database is designed specifically for graph algorithms: its underlying data format matches the algorithms' characteristics, so it can process more graph data with fewer resources. It is, however, difficult to scale such a database out to very large sizes. A typical representative of this approach is Neo4j.
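For a flavor of working with such a database, here is a small example using Neo4j's official Python driver and its Cypher query language; the connection details and the `Person`/`FRIENDS_WITH` data model are placeholders:

```python
from neo4j import GraphDatabase  # pip install neo4j

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Create two people and a relationship between them.
    session.run(
        "MERGE (a:Person {name: $a}) "
        "MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:FRIENDS_WITH]->(b)",
        a="Alice", b="Bob",
    )
    # Traverse the relationship: who are Alice's friends?
    result = session.run(
        "MATCH (:Person {name: $name})-[:FRIENDS_WITH]->(f) RETURN f.name AS friend",
        name="Alice",
    )
    for record in result:
        print(record["friend"])  # Bob

driver.close()
```

Note how the query expresses the relationship pattern directly, rather than as a join over foreign keys.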

In practice, the two processing modes are often combined: the first for large-scale data preprocessing and integration, the second for implementing graph algorithms and integrating with graph applications.

In summary, graph databases are still at an early stage, with great potential in areas such as social networks, anti-fraud, and artificial intelligence. During the recent COVID-19 pandemic, for instance, many teams used graph databases for epidemiological analysis, demonstrating the strengths of this class of database.

Graph databases, like document databases and time-series databases, are designed around a specific data model. Beyond the recognition these domains have earned, the most important underlying reason for their rising popularity is the maturation of storage engines and of distributed-systems theory and tooling. It is foreseeable that more databases with domain-specific features will emerge in popular fields.

Summary #

In this extra lesson, I introduced several kinds of distributed databases. Cloud-native and HTAP are both branches of NewSQL development: cloud-native starts from the delivery side, giving users an out-of-the-box database, while HTAP expands the boundary of NewSQL by taking on OLAP, capturing part of the traditional big-data and analytics market.

In-memory databases are an innovation at the lowest layer of the database. Beyond them, more and more databases began supporting object storage in 2021, thanks to increased S3 bandwidth and continuously falling unit storage prices. In the future, as more innovative hardware arrives, we may see more software-hardware co-designed database solutions emerge.

Finally, I introduced graph databases, which as domain-specific databases have achieved successes that general-purpose relational databases cannot match. As storage engines and distributed theory develop further, such databases will become more and more prevalent.

With so many distributed databases available, choosing one can be overwhelming. In the next extra lesson, I will offer some guidance by walking through several typical domains.