
18 Chaos Engineering: Antifragility in the Software Domain #

Hello, I am Shi Xuefeng.

There is a famous book in the field of economics called “Antifragile.” Its core idea is that, in the face of pervasive and unpredictable uncertainty, the right approach lets us not only avoid major risks but also turn risk into returns beyond expectations. Moreover, by actively experimenting while keeping the cost of failure under control, we can keep improving the payoff whenever uncertain events occur.

It sounds magical, doesn’t it? In fact, in the field of software engineering, there are similar thoughts and practices that can help us effectively deal with unforeseen failures when facing extremely complex and large-scale distributed systems. We can not only handle them calmly but also benefit from them. By conducting frequent and extensive experiments, we can identify and address potential risk points, thereby increasing confidence in complex systems. This is the topic I want to share with you today: Chaos Engineering.

What is Chaos Engineering? #

Chaos Engineering is an emerging discipline in the field of software that, just like its name suggests, appears very “chaotic” to many people. So, where does Chaos Engineering come from, and what problems does it aim to solve?

Let’s first take a look at the definition of Chaos Engineering provided by the Chaos Engineering Principles website:

Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

In simple terms, Chaos Engineering aims to address the antifragility of distributed systems operating in complex environments. So what does this “complex and distributed” real world we face actually look like?

Let me give you an example. A large platform may run tens of thousands of online activities a day and serve millions of users. To support business at this scale, the client side alone has more than 300 components, to say nothing of the countless backend services.

It can be imagined that in such a complex system, even a small problem in any part can potentially cause online incidents.

In addition, with the rise of microservices and containerization, teams organized around business lines release independently and ever more frequently. Coupled with constantly evolving architecture, hardly anyone can fully grasp a system's service invocation relationships anymore; the complex system becomes a “black hole”: no matter how much you tinker at the periphery, it is difficult to get at the core.

To give you a more intuitive understanding of complex real systems, let me share a microservice invocation graph released by Netflix in 2014, which you can refer to.

Image source: https://www.slideshare.net/BruceWo

Availability Practices #

At this point you may be thinking: isn't this just the day-to-day work of keeping a system available? Your company probably has similar practices too, such as fault drills, service degradation plans, and end-to-end load testing, all of which are standard preparation before major promotional events.

Indeed, these practices have similarities with chaos engineering, as chaos engineering was developed from these practices. However, the approach is slightly different.

Most formal companies have a complete set of data backup mechanisms and service emergency response plans in place to ensure system availability and the security of core data in the event of a disaster.

For example, fault drills are targeted simulations of problems that have occurred before. By defining the scope of the drill in advance and simulating the incident, the emergency response plan is triggered, enabling quick fault localization and service switchover. Throughout the process, the drill tracks how long each step takes and how the various data indicators behave.

Fault drills mostly focus on foreseeable problems, such as physical machine abnormalities like unexpected shutdowns or power outages, device-level issues such as full disk space or slow I/O, and network-related problems like network latency or DNS resolution exceptions. Although these problems can be discussed in great detail, they all have clear paths, defined triggering factors, and specific monitoring and resolution methods.

In addition, it is difficult to cover every type of fault in a drill, so representative failures are usually chosen for verification. In real incidents, however, multiple variables often fail at the same time, making the investigation time-consuming and resource-intensive.

To simulate real online scenarios, many companies have introduced end-to-end load testing. This is particularly important for e-commerce, where major promotions put systems under intense pressure.

For a complete load test, the general process is as follows:

  • First, prepare a load testing plan, debug the testing scripts and environment, and estimate the testing capacity and scope.
  • Next, to ensure that online traffic is not affected, complete the switching of data center lines to ensure that no real online traffic is introduced during the test.
  • Then, execute the load testing plan based on pre-defined testing scenarios, observe the peak flow, and dynamically adjust the testing.
  • Lastly, after the load testing is completed, switch the traffic back and summarize the results, identifying any issues encountered during the test.

During load testing, in addition to monitoring QPS, it is also necessary to watch TP99, CPU usage, CPU load, memory usage, TCP connection counts, and other metrics in order to objectively reflect the availability of services under heavy traffic.
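To make these indicators more concrete, here is a minimal sketch of how QPS and TP99 could be computed during a simple load run. It is not a real load-testing tool; the target URL, concurrency, and request volume are illustrative assumptions.

```python
import time
import statistics
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Illustrative assumptions: the target URL, worker count, and request volume
# are placeholders, not values from the article.
TARGET_URL = "http://localhost:8080/health"
WORKERS = 50
TOTAL_REQUESTS = 2000


def one_request(_):
    """Issue a single request and return its latency in milliseconds."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(TARGET_URL, timeout=5) as resp:
            resp.read()
    except Exception:
        pass  # failed requests still count toward elapsed time
    return (time.perf_counter() - start) * 1000


def main():
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        latencies = sorted(pool.map(one_request, range(TOTAL_REQUESTS)))
    elapsed = time.perf_counter() - wall_start

    qps = TOTAL_REQUESTS / elapsed
    tp99 = latencies[int(len(latencies) * 0.99) - 1]  # 99th-percentile latency
    print(f"QPS={qps:.1f}  TP99={tp99:.1f}ms  mean={statistics.mean(latencies):.1f}ms")


if __name__ == "__main__":
    main()
```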

From a business perspective, given ever-changing environmental factors, well-developed service degradation plans and system fallback mechanisms are essential. During periods of high business pressure, it may be necessary to temporarily shield services that users barely perceive, such as recommendations, auxiliary tools, log printing, and status indicators, to guarantee the availability of the most critical flows. In addition, introducing queuing mechanisms can, to some extent, spread out instantaneous pressure.
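As an illustration of the idea, the sketch below shows one way a degradation switch and a bounded queue might be wired together. The switch name, queue size, and service functions are hypothetical, not part of the original text.

```python
import os
import queue

# Hypothetical degradation switch: in a real system this would come from a
# configuration center rather than an environment variable.
DEGRADED = os.environ.get("DEGRADE_NON_CRITICAL", "0") == "1"

# Bounded queue that absorbs bursts on the critical path; requests beyond
# the limit are rejected quickly instead of piling up downstream.
order_queue = queue.Queue(maxsize=10_000)


def render_page(order_id: str) -> dict:
    page = {"order": place_order(order_id)}  # critical path always runs
    if not DEGRADED:
        # Non-critical feature that can be shed under pressure.
        page["recommendations"] = fetch_recommendations(order_id)
    return page


def place_order(order_id: str) -> str:
    try:
        order_queue.put_nowait(order_id)      # queue up instead of overloading downstream
        return "queued"
    except queue.Full:
        return "busy, please retry"           # fast, explicit rejection


def fetch_recommendations(order_id: str) -> list:
    return []                                 # placeholder for a non-critical service call


print(render_page("order-42"))
```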

Okay, after talking about various methods for ensuring service availability, can we ensure everything will be foolproof as long as we implement these methods correctly? The answer is no. That is because these activities are all preparation for a known enemy. However, in reality, many problems are unpredictable.

Since the existing practices cannot help us expand our understanding of unavailability, we need an effective experimental method to help us discover potential risks beforehand by combining various elements.

For example, Netflix’s famous “Chaos Monkey” is a tool used to randomly shut down instances in the production environment. Allowing a “monkey” to create chaos in a production environment — isn’t that insane? Well, not really. The power of Netflix’s “Monkeys” is enormous; they can even take down an entire availability zone of a cloud service.
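To make the idea tangible, here is a minimal Chaos-Monkey-style sketch. It is not Netflix's actual implementation; it assumes AWS and the boto3 SDK, picks one running instance from a hypothetical opt-in tag group, and issues a dry-run termination.

```python
import random
import boto3                                   # AWS SDK for Python; an assumption, not Netflix's stack
from botocore.exceptions import ClientError

TAG_KEY, TAG_VALUE = "chaos-opt-in", "true"    # hypothetical tag marking eligible instances

ec2 = boto3.client("ec2")

# Find running instances that have explicitly opted in to chaos experiments.
resp = ec2.describe_instances(
    Filters=[
        {"Name": f"tag:{TAG_KEY}", "Values": [TAG_VALUE]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)
instances = [
    inst["InstanceId"]
    for reservation in resp["Reservations"]
    for inst in reservation["Instances"]
]

if instances:
    victim = random.choice(instances)
    print(f"Selected {victim} for termination")
    try:
        # DryRun=True asks AWS to validate the request without acting on it;
        # remove it deliberately when running a real experiment.
        ec2.terminate_instances(InstanceIds=[victim], DryRun=True)
    except ClientError as err:
        print(f"Dry run result: {err.response['Error']['Code']}")
```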

The reasoning is that even a cloud provider cannot guarantee that its services will always be reliable. Therefore, never build your availability assumptions on the premise that the services you depend on will never fail.

Of course, Netflix doesn’t have the authority to actually shut down an availability zone in a cloud service. They simply simulate this process, which encourages engineering teams to build multi-region availability systems, promotes architectural designs that can cope with failures, and continuously sharpens engineers’ understanding of resilient systems.

In the words of Nora Jones, a Chaos Engineer at Netflix:

Chaos engineering isn’t about creating problems, it’s about revealing them.

It must be emphasized that before introducing chaos engineering practices, the existing services must already possess resilient patterns and be capable of resolving potential issues as quickly as possible with the support of emergency response plans and automation tools.

If the existing services don’t even meet the basic requirement of recoverability, then chaos experiments would be meaningless. Let me share with you a decision tree for chaos engineering that you can refer to:

Chaos Engineering Decision Tree

Image Source: https://blog.codecentric.de/en/2018/07/chaos-engineering/

Principles of Chaos Engineering #

Chaos Engineering is unlike traditional tools and practices. As a discipline, it has a rich body of content and scope. Before entering this field, you need to understand the five principles of Chaos Engineering: build a steady-state hypothesis, vary real-world events, run experiments in production, automate experiments to run continuously, and minimize the scope of impact.

Let’s take a look at how to practice these five principles.

1. Build a steady-state hypothesis

A system's steady state is defined by the indicators that can prove the system is currently normal and healthy. In practice, existing monitoring systems already cover both technical and business indicators well enough that even a slight variation can be detected in time.

For technical indicators, the ones mentioned in the load-testing section (QPS, TP99, CPU usage, and so on) are representative. Business indicators, on the other hand, vary with each company's specific business.

For example, for a game, the number of online users and average online duration are important; for e-commerce, various arrival rates, completion rates, as well as more macroscopic indicators such as GMV and new user acquisition can demonstrate the health of the business.

Compared with technical indicators, business indicators matter even more, especially in industries such as e-commerce with frequent promotional activities. Business indicators may be influenced by those activities, but analysis of historical data usually makes the overall trend apparent.

When a significant fluctuation occurs in business indicators (such as instant decrease or increase), it means that the system has anomalies. For example, a few days ago, WeChat Pay had problems and the success rate of payments was noticeably affected according to the monitoring.

In the real world, describing a steady state requires a model built from a set of indicators rather than a single one. Whether or not you practice Chaos Engineering, recognizing the health status reflected by these indicators is crucial, and a comprehensive set of data collection, monitoring, and alerting mechanisms should be built around them.

I have summarized some reference indicators in the table below.

| Category | Reference indicators |
| --- | --- |
| Technical indicators | QPS, TP99 latency, CPU usage, CPU load, memory usage, TCP connection count |
| Business indicators | Online user count, average online duration, arrival rate, completion rate, payment success rate, GMV, new user acquisition |
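As a rough illustration of how a steady-state hypothesis can be checked automatically, the sketch below compares current indicator values against a baseline with per-indicator tolerances. All numbers are made up for illustration.

```python
# A minimal steady-state check sketch: the baselines and tolerances below
# are illustrative, not recommended values.
BASELINE = {
    "qps": 12_000,             # requests per second
    "tp99_ms": 250,            # 99th-percentile latency
    "payment_success": 0.995,  # business indicator: payment success rate
}
TOLERANCE = {"qps": 0.20, "tp99_ms": 0.30, "payment_success": 0.01}


def steady_state_ok(current: dict) -> bool:
    """Return True if every indicator stays within its allowed deviation."""
    for name, baseline in BASELINE.items():
        deviation = abs(current[name] - baseline) / baseline
        if deviation > TOLERANCE[name]:
            print(f"steady state violated: {name} deviates {deviation:.1%}")
            return False
    return True


# Example: values an experiment might observe (made up for illustration).
print(steady_state_ok({"qps": 11_500, "tp99_ms": 410, "payment_success": 0.996}))
```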

2. Vary real-world events

Many problems in the real world arise from past “pitfalls”. Even seemingly insignificant events can have serious consequences.

For example, one failure that left a deep impression on me was a server that froze and stopped responding because its CPU was saturated while handling concurrent tasks. Investigation showed that when the problem occurred, the system's I/O wait was very high, indicating a disk I/O bottleneck. Further analysis revealed that the battery on the disk's RAID card had drained, causing the RAID card to fall back to a degraded mode that crippled disk I/O.

In cases like this, it is difficult to avoid problems by monitoring the battery capacity of all RAID cards, and it is not feasible to deliberately replace the battery with a drained one during each simulation of failures.

Therefore, since we cannot simulate every abnormal event, the best cost-benefit ratio comes from selecting the important dimensions (such as device availability, network latency, and various server failures) and running targeted experiments against them. In addition, combined with methods such as end-to-end load testing, we can exercise the system's overall operation from a global perspective and compare the results against the steady-state hypothesis indicators to uncover potential problems.

3. Run experiments in production

Similar to the “shift-right” mindset in testing, chaos engineering encourages experimenting close to, or even directly in, the production environment.

This is because real-world problems only arise in the production environment. A small pre-production environment is mainly used to verify system behavior and functionality in accordance with product design, in other words, to verify if there are any new defects or quality regressions from a functional perspective.

However, system behavior can change based on real traffic and user behavior. For example, a piece of breaking news from a popular celebrity may cause the Weibo system to crash, which is difficult to reproduce in a testing environment.

Objectively speaking, though, experimenting in production carries risk. It requires a controllable experiment scope and the ability to stop the experiment at any time. As mentioned earlier, if the system has not been built with resilience patterns, production experiments should not be conducted at all.

Taking load testing as an example, we can randomly select a subset of business modules, define a set of experiment nodes, and then perform ongoing load tests. By periodically directing online traffic at the business under test, we can observe how the indicators behave under sudden bursts of traffic, whether a cascading failure is triggered, whether the circuit breaker takes effect, and so on. Often, real problems only surface when we are unprepared. This approach, as a chaos engineering practice, has been widely applied to the online systems of large companies.
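Since the effectiveness of the circuit breaker is one of the things such an experiment observes, here is a minimal circuit-breaker sketch. Real systems would more likely rely on libraries such as the ones mentioned under the next principle; the threshold and timeout here are illustrative.

```python
import time


class CircuitBreaker:
    """Minimal circuit-breaker sketch: open after N consecutive failures,
    then allow a probe request once a cool-down period has passed."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, func, fallback):
        # While open, short-circuit to the fallback until the cool-down expires.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()
            self.opened_at = None  # half-open: let one request probe the service

        try:
            result = func()
            self.failures = 0      # success closes the breaker
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback()


breaker = CircuitBreaker()
print(breaker.call(lambda: 1 / 0, fallback=lambda: "degraded response"))
```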

4. Automate experiments to run continuously

Automation is the best answer to repetitive activities. Through automated experiments and automated analysis of their results, many chaos engineering practices can be executed at low cost and without manual effort. This is also why more and more tools are emerging under the banner of chaos engineering.

For example, the commercial chaos engineering platform Gremlin supports scenarios such as unavailable dependencies, unreachable networks, and traffic spikes. This year, Alibaba also open-sourced its chaos tool ChaosBlade, lowering the barrier to adopting chaos engineering and covering more practical scenarios. In addition, open-source resilience libraries such as Resilience4j and Hystrix are very useful. Whether you build your own tooling or use these directly, they can help you get started quickly.
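These tools each have their own interfaces, so rather than guessing at them, here is a rough sketch of the kind of scenario they automate: injecting network delay with Linux tc/netem for a fixed window and always rolling it back afterwards. The interface name, delay, and duration are assumptions, and the commands require root privileges.

```python
import subprocess
import time

# Illustrative parameters; adjust to the environment under test.
INTERFACE = "eth0"
DELAY = "200ms"
DURATION_SECONDS = 60


def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


def inject_delay():
    # Add an artificial delay to all egress traffic on the interface.
    run(["tc", "qdisc", "add", "dev", INTERFACE, "root", "netem", "delay", DELAY])


def rollback():
    # Remove the netem qdisc, restoring normal network behavior.
    run(["tc", "qdisc", "del", "dev", INTERFACE, "root", "netem"])


if __name__ == "__main__":
    inject_delay()
    try:
        time.sleep(DURATION_SECONDS)  # observe steady-state indicators during this window
    finally:
        rollback()                    # the experiment must always clean up after itself
```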

I believe that with the maturity of more and more tools, chaos engineering will also become part of the CI/CD pipeline and be incorporated into daily work.

5. Minimize the scope of impact

The principle of chaos engineering practice is not to interfere with real users’ usage. Therefore, it is necessary to initially control the scope of the experiment to a smaller range to avoid larger problems caused by experiment failure.

For example, by defining a small group of users or services, we can objectively evaluate the feasibility of the experiment. If we want to test an API's error-handling capability, we can deploy a separate experiment cluster for that API and adjust routing so that 0.5% of traffic is redirected to it for online experiments. Within this cluster, fault injection can then verify whether the API handles error scenarios in that traffic correctly. This is somewhat similar to a gray-release environment or a dark launch.
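A hash-based routing rule is one simple way to carve out such a small slice of traffic deterministically. The sketch below sends roughly 0.5% of users to a hypothetical experiment cluster; the cluster names and bucket count are illustrative assumptions.

```python
import hashlib

# Send ~0.5% of users to a hypothetical experiment cluster, the rest to the
# stable cluster; the same user always lands in the same cluster.
EXPERIMENT_PERCENT = 0.5
EXPERIMENT_CLUSTER = "api-experiment"   # hypothetical cluster names
STABLE_CLUSTER = "api-stable"


def route(user_id: str) -> str:
    # Hash the user id into a bucket in [0, 10000): stable per user, uniform overall.
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 10_000
    if bucket < EXPERIMENT_PERCENT * 100:   # 0.5% -> buckets 0..49
        return EXPERIMENT_CLUSTER
    return STABLE_CLUSTER


routed = [route(f"user-{i}") for i in range(100_000)]
print(routed.count(EXPERIMENT_CLUSTER) / len(routed))  # roughly 0.005
```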

This kind of setup, besides being used to verify new features or run A/B tests, is equally well suited to fault injection for chaos engineering.

These five principles sketch the full picture of chaos engineering: run real-world experiments in the production environment, while controlling the scope of impact and introducing automation so that experiments can run continuously. As a young engineering field, chaos engineering still has a long road of technological evolution ahead of it.

References:
  • Netflix Chaos Engineering Maturity Model
  • Curated Chaos Engineering Resources
  • Netflix Chaos Engineering Handbook

Summary #

In this lecture, I introduced you to a new discipline for tackling the availability challenges of complex distributed systems: chaos engineering. Chaos engineering takes a completely new approach: it proactively injects chaos into the system and experiments on it in order to surface real-world problems in advance. On the service-availability front, we have long been putting related practices into action, such as fault drills, service degradation, and end-to-end load testing, which have become standard for large-scale systems. Finally, I walked you through the five principles of chaos engineering, hoping to help you build a more complete picture of the field.

Undeniably, chaos engineering practice in China is still at an exploratory stage, but as system complexity keeps growing, chaos engineering is bound to bridge that gap in technological evolution and become a powerful tool for solving availability problems in complex systems.

Thought question #

What memorable experiences have you had with unexpected events in the real world? In light of chaos engineering practice, do you have any new insights?

Please feel free to write your thoughts and answers in the comments section. Let’s discuss together and learn from each other. If you find this article helpful, please share it with your friends.