17 High Availability Design - How to Effectively Use Three Major Architectural Solutions #

In the previous lessons, we covered the principles and optimization of MySQL replication, as well as the read/write separation solution built on replication at the business layer. All of this lays the foundation for designing a high availability architecture for MySQL. Replication is the foundation of high availability, but simply synchronizing data through replication is far from enough: you also need to design high availability around your own business.

At the same time, high availability is not only about databases. To design a truly robust high availability architecture, you need to consider the entire business process.

Now, let’s first take a look at what high availability is and why it is so important.

High Availability Concept #

First, let’s take a look at the definition of high availability on Wikipedia:

High availability (HA) is a characteristic of a system which aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period.

From the above description, high availability is the ability of a system to provide uninterrupted service. Simply put, it is to avoid service unavailability caused by server downtime.

We all know that high availability is a key point that developers must consider when designing each business system. For example, how does your system perform when it becomes unavailable? Can users tolerate the downtime?

The industry also has a common standard for measuring high availability: count the downtime over a year and express the resulting uptime as a number of 9s, which indicates how robust the high availability architecture is. The specific values are shown in the following table:

[Table image: availability level (number of 9s) and the corresponding annual downtime]

Generally speaking, the system should reach at least 4 nines (99.99%), which means the annual downtime should not exceed 52.56 minutes, otherwise the user experience will be very poor and the system will be perceived as unstable.

99.99% = 1 - 52.56 / (365 × 24 × 60)

However, 52 minutes of downtime at 4 nines still has a significant impact on a production environment, while requiring 5 nines is too demanding for most systems. Therefore, some cloud service providers propose a 99.995% availability target, which means the maximum downtime for a system in a year is:

Downtime = (1 - 99.995%) × 365 × 24 × 60 = 26.28 (minutes)

That is, the maximum time that affects the service in a year is 26.28 minutes.
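
The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `max_downtime_minutes` is made up for illustration, and a year is assumed to have 365 days, as in the formulas above.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def max_downtime_minutes(availability_percent: float) -> float:
    """Return the maximum yearly downtime (in minutes) allowed by an availability target."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Print the downtime budget for a few common targets.
    for target in (99.9, 99.99, 99.995, 99.999):
        print(f"{target}% -> {max_downtime_minutes(target):.2f} minutes per year")
```

Running it for 99.99% and 99.995% reproduces the 52.56 and 26.28 minutes quoted in this section.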

After a brief understanding of how important “high availability” is, let’s move on to how to design a high availability architecture.

High Availability Architecture Design #

To achieve high availability, it is necessary to ensure redundancy in both software and hardware and eliminate single points of failure (SPOF).

Redundancy is the foundation of high availability. Generally speaking, the more hardware resources are invested in a system, the more redundancy it has and the higher its availability.

In addition to ensuring redundancy, the system must also handle failover: detect failures as soon as possible and switch the business to the redundant resources.

After clarifying the basic concepts of high availability design mentioned above, let’s take a look at the types of high availability architecture design: high availability design for stateless services and database high availability architecture design.

High Availability Design for Stateless Services #

The high availability design for stateless services (such as Nginx) is very simple. If a problem is detected, simply switch over. You can even remove it directly through the load balancing service when there is a problem:

[Figure: a load balancing service in front of several Nginx servers, removing the failed one]

In the above diagram, when the first Nginx server encounters a problem and becomes unavailable, the load balancing service will detect it and remove it directly.

Upper-layer users will only encounter an access problem for a few seconds before the service recovers. High availability design for stateless services is that simple.
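
The removal behavior described above can be sketched as a toy load balancer. Everything here is hypothetical and for illustration only: the `LoadBalancer` class, the `probe` callback, and the `max_failures` threshold are assumptions, not the API of Nginx or of any real load balancing product.

```python
from itertools import count
from typing import Callable, Dict, List


class LoadBalancer:
    """Toy round-robin load balancer that skips backends failing health checks."""

    def __init__(self, backends: List[str], probe: Callable[[str], bool],
                 max_failures: int = 3) -> None:
        self.probe = probe                # hypothetical health probe: True means healthy
        self.max_failures = max_failures  # consecutive failures before removal
        self.failures: Dict[str, int] = {b: 0 for b in backends}
        self._counter = count()

    def run_health_checks(self) -> None:
        """Probe every backend once; a success resets its failure count."""
        for backend in self.failures:
            if self.probe(backend):
                self.failures[backend] = 0
            else:
                self.failures[backend] += 1

    def healthy_backends(self) -> List[str]:
        """Backends that have not yet reached the failure threshold."""
        return [b for b, n in self.failures.items() if n < self.max_failures]

    def pick(self) -> str:
        """Choose the next healthy backend in round-robin order."""
        healthy = self.healthy_backends()
        if not healthy:
            raise RuntimeError("no healthy backend available")
        return healthy[next(self._counter) % len(healthy)]
```

Because the servers hold no state, a removed backend can simply rejoin the pool once a probe succeeds again; nothing has to be copied or reconciled first. That is the property databases lack.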

Database High Availability Architecture Design #

So the real difficulty and pain point of high availability design for a system lies in the database. This is because:

The database persists data, which makes it a stateful service;
Databases hold a relatively large amount of data, so recovery from a failure takes longer than for stateless services;
Some systems, such as databases in financial scenarios, cannot tolerate any data loss, which makes high availability harder to implement.

From an architectural perspective, database high availability is itself business high availability, so we need to think about database high availability design from the perspective of the entire business process.

I will provide three methods for database high availability architecture design here. They are not only applicable to MySQL databases, but also to other databases.

Data-based Database High Availability Architecture #

Data-based database high availability architecture is based on data synchronization technology. When the master server (Master) goes down, the service fails over to a slave server (Slave).

For MySQL databases, this is based on the replication technology introduced earlier. For the read/write separation architecture mentioned in the previous topic, if the master server fails, the following operations can be performed:

[Figure: replication topology after failover, with Slave3 promoted to the new Master]

It can be seen that the original Slave3 server is promoted to the new master and a new replication topology is established, with the remaining slave servers connected to the new Master to synchronize data.

To keep the failover transparent to the Service layer, virtual IP (VIP) technology needs to be introduced: when a failure occurs, the VIP is moved to the new master server as well.

So the real difficulties of this architecture are:

How to ensure data consistency;
How to detect a failure of the master server;
How to handle the failover logic.

We can guarantee “data consistency” through the lossless replication technology provided by MySQL. “Detecting a master server failure” and “handling the failover logic” are the job of a database high availability suite, which we will learn about in the next 20 lessons.
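
To make the failover steps concrete, here is a small in-memory sketch of what such a suite has to do: pick a slave, promote it, repoint the other slaves, and move the VIP. It is not how any particular suite works. The `Node` and `Cluster` classes and the `applied_position` counter are hypothetical, and promoting the slave that has applied the most data is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    alive: bool = True
    applied_position: int = 0             # hypothetical: how much replicated data was applied
    replicates_from: Optional[str] = None


@dataclass
class Cluster:
    master: Node
    slaves: List[Node]
    vip_owner: str = ""

    def __post_init__(self) -> None:
        # Initially the VIP points at the master and every slave replicates from it.
        self.vip_owner = self.master.name
        for slave in self.slaves:
            slave.replicates_from = self.master.name

    def failover(self) -> Node:
        """Promote a live slave if the master is down, then move the VIP."""
        if self.master.alive:
            return self.master  # nothing to do
        candidates = [s for s in self.slaves if s.alive]
        if not candidates:
            raise RuntimeError("no live slave to promote")
        # Assumption: the most up-to-date slave is the safest choice.
        new_master = max(candidates, key=lambda s: s.applied_position)
        self.slaves = [s for s in candidates if s is not new_master]
        new_master.replicates_from = None
        for slave in self.slaves:
            slave.replicates_from = new_master.name  # rebuild the replication topology
        self.master = new_master
        self.vip_owner = new_master.name             # clients keep using the same address
        return new_master
```

In a real deployment each of these steps touches a live system, and deciding that the master is really down is itself the hard part, which is why a dedicated suite is normally used.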

Business-level Database High Availability Architecture #

The second type, “business-level database high availability architecture design”, is implemented entirely in the business layer, with the database used only for storing data.

When a primary database server becomes unavailable, the business can directly write to another primary database server. Let’s take a look at this architecture:

[Figure: the Service writes directly to Master2 when writing to Master1 fails]

From the above image, we can see that when a write from the Service to Master1 fails, the Service does not wait for a failover program to perform a master-slave switch; it writes the data directly to Master2.

This may seem like a very simple and crude way to implement high availability, but few businesses can use it, because the design requires that the state can be modified.

Take the order service of an e-commerce business as an example. Its basic job is to store the information of each order, and the core logic is to insert data into the “Orders” table, like this:

INSERT INTO Orders(o_orderkey, ... ) VALUES (...)

Here, “o_orderkey” is the primary key. In order to achieve database high availability based on the business layer, additional information can be added to the primary key generation process, such as the server number. In this way, the primary key design for the order becomes:

PK = Ordered_UUID-Server Number

In this case, if writing to server number 1 fails, the business layer changes the server number in the order’s primary key to 2 and writes the order to server number 2 instead. This way, high availability is achieved at the business layer. This way of handling order numbers in e-commerce is also called “skipping orders”.

When querying order information, because the server number is included in the primary key, the business knows which server the order is stored on and can quickly route to the specified server.

But the prerequisite for this design is that all primary key writes across the service can be designed to skip orders, and that all queries look up data by primary key.

Seeing this, do you think it resembles the key-value (KV) access pattern of NoSQL? Don’t forget about the Memcached Plugin introduced earlier.
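
The write and routing logic described above can be sketched as follows. This is a minimal illustration under stated assumptions: `OrderStore` is an in-memory stand-in for a primary database server, `ServerDown`, `make_order_key`, `write_order`, and `read_order` are hypothetical names, and `uuid.uuid4()` is used only as a placeholder for the ordered UUID in the key format.

```python
import uuid
from typing import Dict, List


class ServerDown(Exception):
    """Raised when a primary database server cannot be reached."""


class OrderStore:
    """In-memory stand-in for one primary database server."""

    def __init__(self, server_no: int) -> None:
        self.server_no = server_no
        self.available = True
        self.rows: Dict[str, dict] = {}

    def insert(self, key: str, payload: dict) -> None:
        if not self.available:
            raise ServerDown(f"server {self.server_no} is down")
        self.rows[key] = payload

    def get(self, key: str) -> dict:
        if not self.available:
            raise ServerDown(f"server {self.server_no} is down")
        return self.rows[key]


def make_order_key(server_no: int) -> str:
    """Build PK = <UUID>-<server number>; uuid4 stands in for an ordered UUID."""
    return f"{uuid.uuid4().hex}-{server_no}"


def write_order(stores: Dict[int, OrderStore], payload: dict,
                preferred: List[int]) -> str:
    """Try each server in order; on failure, skip to the next server number."""
    for server_no in preferred:
        key = make_order_key(server_no)
        try:
            stores[server_no].insert(key, payload)
            return key
        except ServerDown:
            continue  # "skip the order" to the next server
    raise RuntimeError("all primary servers are unavailable")


def read_order(stores: Dict[int, OrderStore], key: str) -> dict:
    """Route the query using the server number embedded in the primary key."""
    server_no = int(key.rsplit("-", 1)[1])
    return stores[server_no].get(key)
```

Note that `read_order` still raises `ServerDown` for keys that point at a crashed server, so writes stay available while reads of existing orders on that server do not.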

Integrated High Availability Architecture Design #

In the aforementioned “business-level database high availability architecture”, the skipping order design makes business writes highly available, but the query function of the order service is greatly affected. In the example above, when server number 1 crashes, the orders with server number 1 cannot be queried.

Therefore, I propose a high availability design that integrates the business and data layers. This architecture can solve the problem of limited query services after a crash. The architecture diagram is as follows:

[Figure: integrated high availability architecture, with DB1 and DB2 on two primary servers]

In the above image, orders with different numbers are stored in different databases. For example, orders with server number 1 are stored in the DB1 database, and orders with server number 2 are stored in the DB2 database.

In addition, this design uses the partial replication feature of MySQL replication: the upper-left primary server synchronizes only the data in DB1 to the upper-right server, and the upper-right primary server synchronizes only the data in DB2 to the upper-left server. The two lower servers are unchanged and continue to synchronize data from their original MySQL instance.

The benefits of doing this are:

Under normal circumstances, the two upper MySQL instances are dual-active and both accept writes, which greatly improves the performance of the business;
The order data is complete: the data for both server number 1 and server number 2 is available on a single MySQL instance;
More importantly, when a crash occurs, Service writes are not affected, because orders that would have used server number 1 are written to DB2 through the skipping order design;
At the same time, the reading of orders is also not affected because the data is all on one instance, as shown below:

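[Figure: the multi-active architecture after one primary server crashes; reads and writes continue on the surviving instance]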
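
The combination of skipping orders and partial replication can be simulated in memory. This is a sketch only: the `Instance` class and the helper functions are hypothetical, replication is modeled as an immediate copy to the peer (real replication lags behind), and the slave servers in the lower half of the diagram are left out.

```python
from typing import Dict, Optional


class Instance:
    """In-memory stand-in for one MySQL instance holding schemas DB1 and DB2."""

    def __init__(self, name: str, owned_db: str) -> None:
        self.name = name
        self.owned_db = owned_db  # schema this instance accepts writes for
        self.available = True
        self.data: Dict[str, Dict[str, dict]] = {"DB1": {}, "DB2": {}}
        self.peer: Optional["Instance"] = None

    def write(self, key: str, row: dict) -> None:
        if not self.available:
            raise ConnectionError(f"{self.name} is down")
        self.data[self.owned_db][key] = row
        # Partial replication: only the owned schema is shipped to the peer.
        if self.peer is not None and self.peer.available:
            self.peer.data[self.owned_db][key] = row


def pair(left: Instance, right: Instance) -> None:
    """Make the two instances replicate their owned schema to each other."""
    left.peer, right.peer = right, left


def write_order(instances: Dict[int, Instance], order_id: str, row: dict) -> str:
    """Write to server 1 if possible, otherwise skip the order to server 2."""
    for server_no in sorted(instances):
        inst = instances[server_no]
        if inst.available:
            key = f"{order_id}-{server_no}"
            inst.write(key, row)
            return key
    raise RuntimeError("no instance available")


def read_order(instances: Dict[int, Instance], key: str) -> Optional[dict]:
    """Read from any live instance; each one holds both DB1 and DB2."""
    db = f"DB{key.rsplit('-', 1)[1]}"
    for inst in instances.values():
        if inst.available:
            return inst.data[db].get(key)
    raise RuntimeError("no instance available")
```

After the instance that owns DB1 is marked unavailable, new orders get server number 2, and orders written earlier with server number 1 can still be read from the surviving instance's copy of DB1.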

Summary #

In this lesson, we learned about high availability design, one of the most important considerations when designing a business system. A system without high availability simply cannot go live in a production environment.

I recommend rereading this lesson several times to deepen your understanding of high availability system design, because these concepts are not limited to MySQL: they apply to all databases and business systems.

Finally, let’s summarize today’s content:

High availability is the ability of a system to provide uninterrupted service, measured by the number of 9s;
The high availability target for online systems should not be lower than 99.995%, otherwise the system will be down too often, resulting in a poor user experience;
The foundation for achieving high availability is: redundancy + failover;
High availability design for stateless services is relatively simple: fail over directly or remove the failed node;
As databases are stateful services, their design is more complex (redundancy is achieved through replication technology, and failover requires a corresponding high availability suite);
There are three major architectural designs for database high availability, and it is important to remember them.