25 Distributed Database Architecture Selection Database Sharding vs Table Sharding or Middleware

25 Distributed Database Architecture Selection - Database Sharding vs Table Sharding or Middleware #

In the previous few lectures, we learned about sharding design, table structure design, and index design for distributed databases. I believe you now have the ability to build a distributed database system.

But now that the data has been divided and the indexes have been designed, how do we access this data and indexes? That’s the topic we will discuss in this lecture.

There are two ways to access a distributed database:

  • Directly accessing MySQL database nodes by sharding and partitioning;
  • Accessing through middleware.

Let’s first take a look at the scenario of directly accessing a distributed database.

Direct Access via Sharding and Partitioning #

When designing sharding, we have already determined the sharding key information for each table. Therefore, the business or service can directly access the underlying MySQL data nodes based on the sharding key. For example, in the code, we can do something like this:

void InsertOrders(String orderKey, int userKey...) {

  int shard_id = userKey % 4;

  if (shard_id == 0) {
    conn = MySQLConncetion('shard1',...);
    conn.query(...);
  } else if (shard_id == 1) {
    conn = MySQLConncetion('shard2',...);
    conn.query(...);   
  } else if (shard_id == 2) {
    conn = MySQLConncetion('shard3',...);
    conn.query(...);   
  } else if (shard_id == 3) {
    conn = MySQLConncetion('shard4',...);
    conn.query(...);   
  }

}

From this code snippet, we can see that the sharding routing logic is embedded in the business code, and the corresponding shard’s information is calculated on the business layer, and then the database is accessed:

  • The advantage of this processing method is that it is not much different from accessing a single-instance database, except for an additional sharding calculation step, without additional overhead, and the performance is very good (I heard that Alipay’s distributed database uses this direct sharding access method to pursue ultimate performance).
  • The drawback of this processing logic is that the business needs to be aware of the sharding information and any changes in the shards. For the example above, if the shard1 sharding changes or if there is a scaling operation, the business needs to make corresponding modifications.

To solve this drawback, a better approach is to use a naming service instead of directly accessing the shards via IP. This way, when shards switch or scale, the business does not need to make significant changes.

However, since there are many businesses that need to access the sharding logic of the distributed database, it is also reasonable to store the sharding information in the cache and automatically load it when the business starts. For example, in Memcached or Redis, you can store the following sharding information, and the cache key can be the table name of the sharding or partitioning, and the value can store the sharding information in JSON or dictionary format:

{
   "order_table": {
       "shard1": "192.168.0.1",
       "shard2": "192.168.0.2",
       "shard3": "192.168.0.3",
       "shard4": "192.168.0.4"
   },
   "user_table": {
       "shard1": "192.168.0.2",
       "shard2": "192.168.0.3",
       "shard3": "192.168.0.4",
       "shard4": "192.168.0.1"
   },
   ...
}

This way, when the business needs to access a specific shard, it can first query the cache to obtain the corresponding shard’s information and then connect to the corresponding MySQL data node for access.

{
  "key": "orders",
  "shard_info" : {
    "shard_key" : "o_custkey",
    "shard_count" : 4,
    "shard_host" : ["shard1.xxx.com","shard2.xxx.com","..."],
    "shard_table" : ["tpch00/orders01","tpch01/orders02","..."]
  }
}

If you need to access across shards, you need to handle the relevant logic yourself. However, as mentioned before, the design of a distributed database requires unitization, and the majority of operations should be completed within one shard. If this is not possible, then modifying the distributed database may not be recommended.

In summary, the direct access method of sharding and partitioning requires the business to control all operations related to the distributed database, and to have a clear understanding of the specific information for each shard to ensure end-to-end control.

Using middleware technology #

Another popular way to access a distributed database is through a distributed database middleware. The middleware itself simulates a MySQL database and communicates using the MySQL protocol. The way the business accesses the MySQL database is the same way it accesses the distributed database middleware.

The advantage of this approach is that the business doesn’t need to be concerned with the shard information in the distributed database and can simply treat it as a standalone database. This is the way most people think a distributed database should be, as shown in the following graph:

01.jpg

As can be seen, with the use of a distributed MySQL middleware, users only need to access the middleware, and the data routing and implementation of distributed transactions are all handled by the middleware. Therefore, the distributed database middleware becomes a critical core component.

Some well-known distributed MySQL database middleware products in the industry include: ShardingSphere, DBLE, TDSQL, etc.

ShardingSphere became a top-level project of the Apache Software Foundation on April 16, 2020. It has a mature community and provides support for various functions, especially for distributed transaction support with multiple options (ShardingSphere official website).

DBLE is a MySQL middleware product open-sourced by the well-known MySQL service provider ActionSky. It has been used in core businesses of the four major banks, perfectly supporting the exploration of transitioning from IOE to a distributed architecture. In addition to middleware technology, ActionSky also has comprehensive experience in MySQL database, distributed database design, and other aspects.

TDSQL MySQL Edition (TDSQL for MySQL) is a distributed database product developed by Tencent. It features strong consistency, high availability, global deployment architecture, distributed horizontal scaling, high performance, enterprise-level security, and provides supporting facilities such as intelligent DBA, automated operation, monitoring, and alerts. It provides customers with a complete solution for distributed databases.

Currently, TDSQL has provided public and private cloud services for database to over 500+ governmental and financial institutions. Its customers cover banking, insurance, securities, internet finance, billing, third-party payment, Internet of Things, government affairs, and other fields. TDSQL has obtained international and national certifications due to its high-quality products and services, and has received unanimous recognition from customers and the industry.

Please note that while using a database middleware is beneficial, it has an obvious drawback of adding an extra layer of access. The access time for a single transaction will increase, which needs to be considered for business operations that are sensitive to performance.

It is important to note that although the use of a distributed database middleware increases the access time for a single transaction, the overall throughput remains unchanged. By increasing the concurrency, the overall performance of the distributed database can be effectively improved.

How to choose #

So, should you choose direct access to a distributed database or access through a database middleware? This is a question to consider when selecting an architecture.

In my experience, for smaller businesses (with peak transaction rate not exceeding 1000 transactions per second), accessing the distributed database through a database middleware is a better approach.

Because this type of business is usually in the growth stage, meeting the various features of the business may be the main goal. By hiding the shard information with the middleware, developers can focus on business development.

On the other hand, using the distributed transaction provided by the middleware can satisfy simple cross-shard transactions and solve the most difficult problem in distributed databases.

However, if your business is a massive Internet business, the bottleneck of the middleware will become apparent, and the access time for a single transaction will increase, resulting in decreased performance at low concurrency. Moreover, the 2PC distributed transaction performance provided by the middleware may not meet the needs of the business. Therefore, businesses like Alipay and Alibaba do not use the architecture with a distributed database middleware, but instead adopt a direct access mode.

Many people may wonder how to solve problems like JOIN without using a database middleware. It is indeed difficult for the business layer to implement this part of the functionality. Indeed, the middleware can accomplish this functionality. However, if the data volume is large and there are cross-shard scenarios, believe me, the middleware will not meet your requirements.

Therefore, using a distributed database architecture is a compromise, where you need to learn to let go of many things in order to gain more.