
04 System Design Goals II - How to Achieve High Availability in the System #

Hello, I’m Tang Yang.

Since the start of the course, some students have given feedback that the lectures lean more toward theory and that they would like to see more practical examples. I have been paying attention to these voices, and I appreciate the suggestions. Here, at the start of Lecture 04, I would like to respond to them.

When designing the course, my intention was to use the first five lectures of the foundational section to help you understand the basic concepts behind designing high-concurrency systems. I want to give you an overall framework first, which will make it easier to dig into and extend the knowledge points covered in the evolution and practical sections of the course. For example, this lecture mentions degradation, and in the operations section I will use case studies to introduce the different types of degradation strategies and the scenarios they apply to in detail. The idea behind this design is to introduce each topic briefly up front and then expand on it step by step.

Of course, different opinions serve as my motivation to continuously optimize the course content. I will take every suggestion seriously and continuously improve the course. Let’s work together and progress together.

Now, let’s officially begin the course.

In this lecture, I will continue to guide you through the second goal of designing high-concurrency systems: high availability. What you need to take away is a clear understanding of the ideas and methods for improving system availability. That way, when these topics are elaborated later, you can readily grasp them, and when your own system runs into availability problems, you can apply these methods to optimize it.

High Availability (HA) is a term you often hear when designing systems. It refers to the ability of a system to run without failure for an extended period of time.

The HA solutions you see in the documentation of many open-source components are designed to improve the availability of those components and prevent the system from becoming unavailable. For example, in Hadoop 1.0 the NameNode is a single point of failure; once it fails, the entire cluster becomes unavailable. Hadoop 2 therefore introduced a NameNode HA solution: two NameNodes are started at the same time, one in the active state and the other in standby, and they share storage. If the active NameNode fails, the standby NameNode can be switched to the active state and continue providing services. This improves Hadoop’s ability to run without failure, that is, its availability.

Generally speaking, in a high-concurrency system with high traffic volume, system failures are more damaging to user experience than low system performance. Imagine a system with over a million daily active users; a one-minute failure could affect thousands of users. Moreover, as the number of daily active users increases, the number of users affected by a one-minute failure also increases, which means the system’s requirements for availability will be higher. So today, I will guide you through ensuring high availability in high-concurrency systems, providing you with some ideas for designing your system.

Measurement of Availability #

Availability is an abstract concept, and you need to know how to measure it. The related concepts are: MTBF and MTTR.

MTBF (Mean Time Between Failures) is the average time between failures, that is, the average time the system runs without failing. The longer this time, the more stable the system.

MTTR (Mean Time To Repair) represents the average time it takes to repair a failure, also known as the average downtime. The smaller this value, the less impact failures have on users.

Availability is closely related to the values of MTBF and MTTR, and we can use the following formula to represent their relationship:

Availability = MTBF / (MTBF + MTTR)

The result of this formula is a ratio that represents the system’s availability. Generally, we use several nines to describe the system’s availability.
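For instance, suppose, purely for illustration, that a system runs for 1,000 hours on average between failures (MTBF = 1,000 hours) and that each failure takes about 0.1 hours, or 6 minutes, to repair (MTTR = 0.1 hours). Then:

Availability = 1000 / (1000 + 0.1) ≈ 99.99%

In other words, shortening MTTR is just as effective as lengthening MTBF when it comes to pushing the availability number up.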

[Figure: availability levels expressed in “nines” and the corresponding annual downtime]

In fact, by looking at this figure, you can see that one or two nines of availability is relatively easy to achieve: barring freak accidents such as an excavator digging up your network cables, it can be maintained through manual operations alone.

Going from two nines to three nines means the system’s annual downtime drops from about 3 days to about 8 hours; at four nines it drops to less than 1 hour per year. At these levels of availability, you may need to establish a comprehensive on-call system, a fault-handling process, and a change-management process, and you will need to think harder about the system design itself. For example, during development you should consider whether failures can be recovered automatically without human intervention, and you will also need better tooling so that problems can be located and the system restored quickly.

Once you reach five nines, the annual downtime budget is only about 5 minutes, so failures can no longer be fixed by hand. Imagine that by the time you receive an alert, open your laptop, and log in to the server, ten minutes may already have passed. Therefore, at this level, availability is assessed mainly on the system’s disaster-recovery and automatic-recovery capabilities: letting machines handle failures is what pushes availability higher.

Generally, our core business systems need to reach four nines of availability, while three nines is usually acceptable for non-core systems. You may have heard similar statements in practice, but the exact availability requirements differ across system tiers and business scenarios.

At this point, you have gained a certain level of understanding of the evaluation indicators of availability. Next, let’s take a look at the factors to consider in the design of highly available systems.

Approach to Designing Highly Available Systems #

The availability of a mature system relies on both system design and system operation. Both aspects are equally important and cannot be overlooked. So how can we address the issue of high availability from these two perspectives?

1. System Design #

“Design for failure” is the first principle we uphold when designing highly available systems. In a high-concurrency system handling millions of queries per second, a cluster may contain hundreds or even thousands of machines, and individual machine failures are routine: in a cluster that size, something fails almost every day.

Being prepared in advance is what allows us to prevail. When designing a system, we must treat failure as an important consideration and plan ahead for how to detect and resolve failures automatically. Beyond this mindset, we also need to master some concrete techniques, such as failover, timeout control, degradation, and throttling.

In general, there are two situations in which failover can occur:

  1. Failover between equivalent nodes.

  2. Failover between non-equivalent nodes, meaning the system has both primary and backup nodes.

Failover between equivalent nodes is relatively simple. In such systems, every node handles read and write traffic and stores no state, so each node can act as a mirror of the others. If one node fails, we can simply retry the request on any other node chosen at random.

For example, Nginx can be configured so that when a request to one Tomcat node fails or returns a response code of 500 or above, the request is retried on another Tomcat node, like this:

[Figure: example Nginx retry configuration]
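Since the original configuration screenshot is not reproduced here, below is a minimal sketch of what such a configuration might look like, based on Nginx’s proxy_next_upstream directive; the upstream name and server addresses are placeholders for illustration only:

```nginx
upstream tomcat_cluster {
    # Two equivalent Tomcat nodes (addresses are placeholders)
    server 192.168.1.101:8080;
    server 192.168.1.102:8080;
}

server {
    listen 80;

    location / {
        proxy_pass http://tomcat_cluster;
        # On an error, a timeout, or a 5xx response from one node,
        # retry the request on the next node in the upstream group
        proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
        proxy_next_upstream_tries 2;
    }
}
```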

The failover mechanism between non-equivalent nodes is more complex. For example, there may be one primary node and multiple backup nodes, where the backups can be hot standbys (also serving online traffic) or cold standbys (used only as backups). In this case, we need to handle in code how to detect failures of the primary and backup nodes, and how to switch between primary and backup.

The most widely used failure-detection mechanism is the “heartbeat”. You can have the client send heartbeat packets to the primary node periodically, or have the backup nodes do so. If no heartbeat response is received within a certain period of time, the primary node can be considered failed, and the election of a new primary can be triggered.

The result of selecting a new primary node needs to be agreed upon by multiple backup nodes, so a distributed consensus algorithm such as Paxos or Raft may be used.
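To make the idea concrete, here is a minimal Java sketch of heartbeat-based failure detection. The class, the thresholds, and the placeholder methods are all hypothetical; a real system would pair this with a consensus protocol such as Raft rather than relying on a single detector.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical heartbeat monitor: a backup node records the time of the last
// successful heartbeat to the primary and triggers an election if the primary
// has been silent for too long.
public class HeartbeatMonitor {
    private static final long HEARTBEAT_INTERVAL_MS = 1_000;  // send every second
    private static final long FAILURE_THRESHOLD_MS  = 5_000;  // 5 silent seconds => failed

    private final AtomicLong lastHeartbeatOkAt = new AtomicLong(System.currentTimeMillis());
    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);

    public void start() {
        // Periodically ping the primary; on success, record the timestamp.
        scheduler.scheduleAtFixedRate(() -> {
            if (pingPrimary()) {
                lastHeartbeatOkAt.set(System.currentTimeMillis());
            }
        }, 0, HEARTBEAT_INTERVAL_MS, TimeUnit.MILLISECONDS);

        // Periodically check whether the primary has been silent for too long.
        scheduler.scheduleAtFixedRate(() -> {
            long silentFor = System.currentTimeMillis() - lastHeartbeatOkAt.get();
            if (silentFor > FAILURE_THRESHOLD_MS) {
                triggerPrimaryElection();
            }
        }, FAILURE_THRESHOLD_MS, HEARTBEAT_INTERVAL_MS, TimeUnit.MILLISECONDS);
    }

    private boolean pingPrimary() {
        // Placeholder: send a heartbeat packet to the primary and wait for a reply.
        return true;
    }

    private void triggerPrimaryElection() {
        // Placeholder: start an election among the backup nodes, e.g. via Raft.
        System.out.println("Primary considered failed, triggering election...");
    }
}
```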

In addition to failover, controlling timeouts for inter-system calls is also an important consideration in designing highly available systems.

Complex high-concurrency systems usually consist of many system modules and rely on various components and services, such as caching components and queuing services. The worst thing that can happen during these inter-module calls is a delay rather than a failure, because failures are usually transient and can be resolved through retries. However, once a call to a certain module or service experiences a significant delay, the calling process will be blocked on this call, and the resources it has occupied will not be released. When there are a large number of such blocking requests, the calling process may crash due to resource exhaustion.
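As a concrete illustration (not from the original text), here is a minimal Java sketch of bounding a downstream call with a timeout so that a slow dependency cannot hold the calling thread forever; the dependency call itself is a placeholder:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutBoundedCall {
    private static final ExecutorService POOL = Executors.newFixedThreadPool(10);

    public static String callDependencyWithTimeout(long timeoutMs) {
        Callable<String> task = () -> slowDependencyCall();
        Future<String> future = POOL.submit(task);
        try {
            // Wait at most timeoutMs; a slow call fails fast instead of
            // blocking the caller and pinning its resources indefinitely.
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);          // release the worker thread
            return "fallback-response";   // degrade instead of hanging
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
    }

    private static String slowDependencyCall() throws InterruptedException {
        Thread.sleep(5_000); // placeholder for a slow remote call
        return "real-response";
    }

    public static void main(String[] args) {
        System.out.println(callDependencyWithTimeout(300)); // prints fallback-response
        POOL.shutdownNow();
    }
}
```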

In the early stages of system development, timeout control is often not emphasized, or there is no way to determine the correct timeout period.

I once worked on a project where modules called each other through an RPC framework with a default timeout of 30 seconds. The system usually ran very stably, but when a surge of traffic coincided with a certain number of slow requests on the RPC server side, the RPC client threads would be blocked on those slow requests for up to 30 seconds, exhausting the client’s calling threads and crashing the client. After identifying this issue in the post-mortem analysis, we adjusted the timeouts for RPC calls, databases, caches, and third-party services, so that a slow request would simply time out instead of bringing down the whole system.

Since we need timeout control, how do we determine the timeout value? This is a tricky question. If the timeout is too short, a large number of requests will fail with timeout errors and hurt the user experience; if it is too long, it provides no protection at all. My suggestion is to collect the invocation logs between systems, calculate, say, the 99th-percentile response time, and set the timeout based on that. If there are no invocation logs, you can only set the timeout based on experience. Either way, the timeout needs to be revisited and adjusted as the system is maintained and evolves.
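For example, here is a minimal Java sketch, with made-up sample data, of deriving a timeout from collected response times by taking the 99th percentile and adding some headroom on top:

```java
import java.util.Arrays;
import java.util.Random;

public class TimeoutFromP99 {
    // Compute the given percentile (e.g. 0.99) from a set of response times.
    static long percentile(long[] responseTimesMs, double p) {
        long[] sorted = responseTimesMs.clone();
        Arrays.sort(sorted);
        int index = (int) Math.ceil(p * sorted.length) - 1;
        return sorted[Math.max(index, 0)];
    }

    public static void main(String[] args) {
        // Placeholder data standing in for response times parsed from invocation logs.
        Random random = new Random(42);
        long[] samples = new long[10_000];
        for (int i = 0; i < samples.length; i++) {
            samples[i] = 20 + (long) Math.abs(random.nextGaussian() * 30);
        }

        long p99 = percentile(samples, 0.99);
        long timeoutMs = (long) (p99 * 1.5); // add headroom on top of the 99th percentile
        System.out.println("p99 = " + p99 + " ms, suggested timeout = " + timeoutMs + " ms");
    }
}
```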

Timeout control is really about not letting requests linger indefinitely: a request should fail after a certain time and release its resources for the next requests to use. This hurts some users, but it is necessary because it sacrifices a small number of requests to preserve the overall availability of the system. Degradation and throttling are two other “lossy” strategies that likewise help ensure high availability.

Degradation is sacrificing non-core services to ensure the stability of core services. For example, when we post a Weibo, it goes through an anti-spam service to check if the content is an advertisement, and only after passing the check will the logic such as writing to the database be completed.

Anti-spam detection is a relatively heavy operation because it involves matching many policies. Although it is somewhat time-consuming, it can still respond in time under normal traffic; under high concurrency, however, it may become a bottleneck, and it is not part of the core flow of publishing a Weibo post. Therefore, we can temporarily switch off the anti-spam check to keep the core flow stable.
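A minimal Java sketch of what such a degradation switch might look like is shown below; the switch flag, the anti-spam check, and the service class are all hypothetical names, and in practice the flag would usually be flipped through a configuration center:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class PostWeiboService {
    // Hypothetical degradation switch; true means the anti-spam check is skipped.
    private final AtomicBoolean antiSpamDegraded = new AtomicBoolean(false);

    public void postWeibo(String content) {
        // Non-core step: only run the heavy anti-spam check when not degraded.
        if (!antiSpamDegraded.get() && isSpam(content)) {
            throw new IllegalArgumentException("content rejected by anti-spam check");
        }
        // Core step: always executed, so posting keeps working under degradation.
        saveToDatabase(content);
    }

    public void setAntiSpamDegraded(boolean degraded) {
        antiSpamDegraded.set(degraded);
    }

    private boolean isSpam(String content) {
        // Placeholder for the heavy policy-matching anti-spam service call.
        return false;
    }

    private void saveToDatabase(String content) {
        // Placeholder for the core write path.
    }
}
```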

Throttling takes a completely different approach: the system protects itself by limiting the rate at which it accepts requests.

For example, for a web application, I limit a single machine to handle a maximum of 1,000 requests per second. If there are more requests, an error is returned directly to the client. Although this approach damages the user’s experience, it is a necessary action under extreme concurrency and is a short-term measure, so it is acceptable.
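As an illustration, here is a minimal fixed-window rate-limiter sketch in Java that rejects requests beyond 1,000 per second on a single machine. All names are made up, and a production system would more likely use a token-bucket or sliding-window implementation from a mature library:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Simple fixed-window limiter: allow at most `limit` requests per one-second window.
public class FixedWindowRateLimiter {
    private final int limit;
    private final AtomicLong windowStartMs = new AtomicLong(System.currentTimeMillis());
    private final AtomicInteger counter = new AtomicInteger(0);

    public FixedWindowRateLimiter(int limit) {
        this.limit = limit;
    }

    public boolean tryAcquire() {
        long now = System.currentTimeMillis();
        long windowStart = windowStartMs.get();
        // Start a new window if the current one is older than one second.
        if (now - windowStart >= 1_000 && windowStartMs.compareAndSet(windowStart, now)) {
            counter.set(0);
        }
        // Admit the request only if the window still has budget left.
        return counter.incrementAndGet() <= limit;
    }

    public static void main(String[] args) {
        FixedWindowRateLimiter limiter = new FixedWindowRateLimiter(1_000);
        int rejected = 0;
        for (int i = 0; i < 1_500; i++) {
            if (!limiter.tryAcquire()) {
                rejected++; // in a web app, return an error to the client here
            }
        }
        System.out.println("rejected " + rejected + " of 1500 requests"); // ~500 rejected
    }
}
```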

In practice, both degradation and throttling have many details worthy of exploration, which I will delve into in later courses as the system evolves. I won’t explain too much in this basic course.

2. System Operations and Maintenance #

In the system design phase, several methods can be adopted to ensure system availability. What can be done in terms of system operations and maintenance? In fact, we can consider how to improve system availability from two aspects: canary release and fault drills.

As you may know, during the smooth operation of business systems, system failures are rare, with 90% of failures occurring during the online change phase. For example, when you introduce a new feature, due to design flaws, the number of slow requests to the database doubles, resulting in system failures.

If there were no changes, how could the database produce so many slow requests? Therefore, to improve system availability, it is particularly important to pay attention to change management. In addition to providing necessary rollback plans for quickly rolling back and recovering in case of problems, another major means is canary release.

Canary release means that a change is not pushed to all of production at once, but rolled out gradually, a certain proportion at a time. Canary release is generally done at the machine level: for example, we start with 10% of the machines and watch the system’s performance metrics and error logs on a dashboard. If the metrics remain stable and no large volume of error logs appears after the change has run for a while, we proceed with the full rollout.

Canary release provides developers and operations personnel with an excellent opportunity to observe the impact of changes on production traffic, and it is an important checkpoint for ensuring system high availability.

Canary release is an operational means to ensure system high availability under normal operating conditions. So, how do we know the system performance when a failure occurs? Here is where another means comes into play: fault drills.

Fault drills refer to the use of destructive means to observe the overall system performance when local failures occur, thus finding potential availability issues in the system.

A complex high-concurrency system relies on too many components, such as disks, databases, network cards, etc. These components may fail at any time. Once they fail, will they cause the overall service to be unavailable like the butterfly effect? We don’t know. Therefore, fault drills are particularly important.

In my opinion, fault drills are in the same spirit as the now-popular idea of “chaos engineering.” Netflix, the originator of chaos engineering, introduced the “Chaos Monkey” tool in 2010, and it is an excellent tool for fault drills: it simulates failures by randomly shutting down online nodes, allowing engineers to understand what impact such failures would actually have.

Of course, all of this is based on the premise that your system can withstand some exceptional situations. If your system has not yet achieved this, I suggest setting up a separate offline system that is identical to the production deployment structure, and conducting fault drills on this system to avoid impacting the production system.

Course Summary #

In this class, I have introduced you to how to measure system availability and how to ensure high availability when designing high-concurrency systems.

After all that has been said, you can see that there are different methods of improving availability from the perspectives of development and operations:

  • From the development perspective, the focus is on how to handle failures, and the keywords are redundancy and trade-offs. Redundancy means having backup nodes or clusters that can take over from a failed service, such as the failover discussed above or an active-active architecture; trade-offs mean sacrificing a pawn to save the king, that is, giving up non-core functionality to protect the core service.
  • From the operations perspective, the approach is more conservative and focuses on how to prevent failures, for example through careful change management and fault drills.

Only by combining the two can a comprehensive high-availability system be formed.

Another point you need to pay attention to is that improving system availability sometimes comes at the expense of user experience or system performance, and it also requires a great deal of manpower to build corresponding systems and refine mechanisms. Therefore, we need to strike a balance and not over-optimize. Just as I mentioned in the text, a 99.99% availability for the core system is already sufficient to meet the requirements, so there is no need to blindly pursue 99.999% or even 99.9999% availability.

Furthermore, most systems and components pursue extreme performance, but are there systems that do not chase performance and instead pursue extreme availability? The answer is yes. For example, a configuration-delivery system only needs to hand out configuration when other systems start up, so a millisecond-level response is fine and even a ten-second response is acceptable; it merely slows down the startup of those systems a little. However, this system has extremely high availability requirements, possibly even up to 99.9999%, because configuration can be fetched slowly, but it must never be unavailable. I give you this example so you understand that availability and performance sometimes need to be traded off against each other, but how to make the trade-off depends on the specific system and cannot be generalized.