
34 Degradation and Circuit Breaking - How to Shield the System from the Impact of Non-Core Failures #

Hello, I’m Tang Yang.

So far, your e-commerce system has built comprehensive server-side and client-side monitoring and completed end-to-end load testing. Through these efforts, you have found and resolved many performance issues and hidden risks in your vertical e-commerce system. But even with careful preparation, some gaps remain.

Originally, you were confident about facing the challenge of “Double Eleven” (Singles’ Day), but due to a lack of experience with massive traffic, the system became briefly unavailable several times during promotions, leaving some users with a poor experience. Afterwards, you conducted a detailed retrospective and traced the root causes of the failures, which fall mainly into two categories.

The first category is the unavailability of dependent resources or services, which ultimately brings the whole system down. For example, in your e-commerce system, slow database access can make the overall service unavailable.

The other category is an overly optimistic estimate of the potential traffic: when traffic exceeding the system’s capacity arrives, the system is overwhelmed and refuses service.

So how can you prevent these two types of problems from happening again? I recommend implementing degradation, circuit breaking, and rate limiting. Rate limiting is the main approach to the second type of problem (which I will focus on in the next lesson). Today, I will mainly discuss the approach to the first type of problem: degradation and circuit breaking.

Before that, however, let me explain why this problem exists in the first place, because only by understanding the principles behind the failures can you fully appreciate the benefits that circuit breaking and degradation bring.

How Avalanches Occur #

A local failure eventually leads to a global failure; the technical term for this is an “avalanche”. So why do avalanches occur? We know that a running system consumes resources, including the CPU, memory, and thread resources required to execute business logic.

Take the containers in which business logic executes: they usually define thread pools to allocate threads for task execution. Web containers such as Tomcat, for example, define thread pools to handle HTTP requests, and RPC frameworks likewise initialize thread pools on the server side to handle RPC requests.

The thread resources in these pools are limited; if they are exhausted, the service naturally cannot handle new requests, and the service provider goes down. For example, suppose your vertical e-commerce system has four services A, B, C, and D, where A calls B, and B calls C and D. Services A, B, and D are core services (such as the order service or the payment service), while C is a non-core service (such as an anti-spam or review service).

Therefore, once traffic at the entry point A increases, you may scale out services A, B, and D while ignoring C. As a result, C may be unable to handle such high traffic and starts processing requests slowly, which in turn causes B’s requests to C to block while waiting for responses, so the thread resources occupied in B cannot be released.

Over time, B becomes unable to handle subsequent requests because its thread resources are fully occupied, so requests from A queue up in B’s thread pool, A’s calls to B take longer and longer, and A is eventually brought down as well. You see, merely because the response time of the non-core service C grew, the entire system can crash. This is a common scenario of service avalanches.


So how can we avoid such situations? From my earlier introduction, you can see that cascading reactions occur because the service caller waits too long for the response from the service provider, and its resources are exhausted, triggering an avalanche.

Therefore, in a distributed environment, what a system fears most is not the outright failure of a single service or component but a slow response. The failure of one service or component may only affect part of the system’s functionality, whereas a slow response can trigger an avalanche that brings down the entire system.

When we discuss solutions within the department, we pay special attention to this issue. The solution is this: when abnormal response times are detected for a service, cut off the connection between the caller and that service so that calls fail fast and the resources held by those requests are released. This approach is known as degradation and circuit breaking.

So how do we implement degradation and circuit breaking? What are their similarities and differences? And how would you apply them in your own projects?

How the Circuit Breaker Mechanism Works #

First, let’s look at how the circuit breaker mechanism is implemented. It is inspired by the protection that fuses provide in electrical circuits: when the circuit is overloaded, the fuse breaks the circuit so that the circuit as a whole is not damaged. In service governance, the circuit breaker mechanism means that when the number of errors or timeouts for calls to a service exceeds a certain threshold, subsequent requests are no longer sent to the remote service; instead, an error is returned immediately.

This implementation is also known as the circuit breaker pattern in the field of cloud computing. In this pattern, the service caller maintains a finite state machine for each service it calls. The state machine has three states: closed (calling the remote service normally), half-open (probing the remote service), and open (returning an error immediately). The transitions between these three states work as follows.

When the number of failed calls accumulates to a certain threshold, the circuit breaker switches from closed state to open state. In general, the failed call counter would be reset if a call is successful.

When the circuit breaker is in the open state, a timeout timer is started; when the timer expires, the state switches to half-open. Alternatively, you can set a periodic timer to detect whether the service has recovered.

When the circuit breaker is in a half-open state, requests can reach the backend service. If a certain number of successful calls are accumulated, the state switches to a closed state. If there is a failed call, the state switches to an open state.
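
To make the state transitions concrete, here is a minimal sketch of such a three-state breaker. It is only an illustration, not our exact implementation, but its method names match the breaker object used in the Redis client code later in this lesson:

// Illustrative three-state breaker; a real implementation would also manage the success/failure counters.
public class Breaker {
    private enum State { CLOSED, HALF_OPEN, OPEN }

    private volatile State state = State.CLOSED;

    public boolean isClose()    { return state == State.CLOSED; }
    public boolean isHalfOpen() { return state == State.HALF_OPEN; }
    public boolean isOpen()     { return state == State.OPEN; }

    public void setClose()    { state = State.CLOSED; }    // backend recovered: call normally again
    public void setHalfOpen() { state = State.HALF_OPEN; } // let probe requests through
    public void setOpen()     { state = State.OPEN; }      // too many failures: fail fast
}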


In fact, the circuit breaker mechanism is useful not only for service invocation between microservices; it can also be introduced when accessing resources such as Redis and Memcached. In my team’s self-developed Redis client, we implemented a simple circuit breaker. First, during system initialization, we define a timer that, while the breaker is in the open state, periodically checks whether the Redis component is available:

new Timer("RedisPort-Recover", true).scheduleAtFixedRate(new TimerTask() {
    @Override
    public void run() {
        if (breaker.isOpen()) {
            Jedis jedis = null;
            try {
                jedis = connPool.getResource();
                jedis.ping(); // Verify that Redis is available
                successCount.set(0); // Reset the counter for consecutive successes
                breaker.setHalfOpen(); // Switch to the half-open state
            } catch (Exception ignored) {
            } finally {
                if (jedis != null) {
                    jedis.close();
                }
            }
        }
    }
}, 0, recoverInterval); // Schedule the timer to periodically check whether Redis is available

When operating on data in Redis through this client, we incorporate the circuit breaker logic. For example, when the breaker for a node is in the open state, we return null directly, and we handle the transitions between the breaker’s three states. The example code looks like this:

if (breaker.isOpen()) {
    return null; // Return null directly if the circuit breaker is open
}

K value = null;
Jedis jedis = null;
try {
    jedis = connPool.getResource();
    value = callback.call(jedis);
    if (breaker.isHalfOpen()) { // Half-open state
        if (successCount.incrementAndGet() >= SUCCESS_THRESHOLD) { // The number of successes reaches the threshold
            failCount.set(0); // Clear the failure count
            breaker.setClose(); // Switch to the closed state
        }
    }
    return value;
} catch (JedisException je) {
    if (breaker.isClose()) { // Closed state
        if (failCount.incrementAndGet() >= FAILS_THRESHOLD) { // The number of failures reaches the threshold
            breaker.setOpen(); // Switch to the open state
        }
    } else if (breaker.isHalfOpen()) { // Half-open state
        breaker.setOpen(); // Switch to the open state immediately
    }
    throw je;
} finally {
    if (jedis != null) {
        jedis.close();
    }
}

This way, when a Redis node runs into problems, the circuit breaker in the Redis client detects it in real time and stops sending requests to the problematic node, avoiding a system-wide avalanche caused by the failure of a single node.
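
For illustration only, if the snippet above were wrapped in a hypothetical execute method that takes the callback as a parameter, a read through this client might look roughly like the following (the method name and the fallback helper are assumptions, not the client’s actual API):

// Hypothetical usage: the try/catch block above would be the body of execute().
String detail = redisClient.execute(jedis -> jedis.get("product:1001:detail"));
if (detail == null) {
    // Either the key is missing or the breaker is open: fall back to the database or a default value.
    detail = loadDetailFromDatabase("1001"); // hypothetical fallback
}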

How to Implement a Degradation Mechanism #

Apart from circuit breaking, the most common fault tolerance strategy we hear about in industry discussions is degradation. So, how is degradation done? And how does it relate to circuit breaking?

In my opinion, degradation is a broader concept than circuit breaking. It sacrifices non-core functions or services to guarantee the availability of the overall system; it is a form of fault tolerance that makes certain compromises. In that sense, circuit breaking is one type of degradation. Besides circuit breaking, other forms include rate-limiting degradation and switch-based degradation. (I will cover rate-limiting degradation in the next lesson; this lesson focuses on switch-based degradation.)

Switch-based degradation means embedding “switches” in the code to control the return values of service calls: when the switch is off, the remote service is called normally; when the switch is on, a degradation strategy is executed instead. The switch values can be stored in a configuration center, so when the system runs into trouble and needs to degrade, you can change them dynamically through the configuration center and degrade remote services quickly without restarting the service.

Take the e-commerce system as an example. Besides product data, a product detail page may also need to display comments. Comments, however, are non-core data and can be degraded when necessary. So you can define a switch called “degrade.comment” and write it into the configuration center. The code around this switch is fairly simple, as shown below:

boolean switcherValue = getFromConfigCenter("degrade.comment"); // Get the switch value from the configuration center
List<Comment> comments;
if (!switcherValue) {
    comments = getCommentList(); // Switch is off: retrieve comment data normally
} else {
    comments = new ArrayList<>(); // Switch is on: return empty comment data directly
}
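
As a rough sketch of how getFromConfigCenter can stay up to date without restarting the service, the switch value can be cached locally and refreshed by a configuration-center listener. The ConfigCenterClient type and its listener API below are assumptions for illustration; a real configuration center provides its own client and callback interfaces:

// Illustrative sketch: cache switch values locally and refresh them when the configuration center pushes a change.
private static final ConcurrentHashMap<String, Boolean> SWITCHES = new ConcurrentHashMap<>();

// Called once at startup; ConfigCenterClient and addListener are assumed for illustration.
public static void initSwitches(ConfigCenterClient configCenter) {
    SWITCHES.put("degrade.comment", configCenter.getBoolean("degrade.comment", false));
    configCenter.addListener("degrade.comment",
            newValue -> SWITCHES.put("degrade.comment", Boolean.parseBoolean(newValue)));
}

public static boolean getFromConfigCenter(String key) {
    return SWITCHES.getOrDefault(key, false); // default: switch off, so the remote service is called normally
}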

Of course, when designing a switch-based degradation plan, we need to first distinguish between core and non-core services. We can only apply degradation to non-core services. Then, we can formulate different degradation strategies based on specific business scenarios. Let me list some common degradation strategies for you to refer to in your actual work.

For read scenarios, a common strategy is to return degraded data directly. For example, if the database is under heavy pressure, we can read only from the cache instead of the database while degraded; if a non-core interface has problems, we can return a “service busy” message or fixed degraded data directly.

For scenarios involving periodic query data, such as polling for unread counts every 30 seconds, the frequency of data retrieval can be reduced (down to once every 10 minutes).

For scenarios involving data writing, it is common to convert synchronous writes into asynchronous writes, sacrificing some data consistency and timeliness to ensure system availability.
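
For the write path, a minimal sketch of this idea looks like the following. The “degrade.comment.write” switch, the commentDao, and the in-memory buffer are simplified assumptions; in practice the buffer would more likely be a message queue:

// Illustrative sketch: degrade synchronous writes to asynchronous ones behind a switch.
private final BlockingQueue<Comment> pendingComments = new LinkedBlockingQueue<>(10000);

public void saveComment(Comment comment) {
    if (getFromConfigCenter("degrade.comment.write")) {
        // Switch on: enqueue and return immediately; offer() drops the write if the buffer is full.
        pendingComments.offer(comment);
        return;
    }
    commentDao.insert(comment); // Switch off: write synchronously as usual
}

// A background thread drains the buffer and writes with weaker timeliness guarantees.
private void startAsyncWriter() {
    Thread writer = new Thread(() -> {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                commentDao.insert(pendingComments.take());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }, "comment-async-writer");
    writer.setDaemon(true);
    writer.start();
}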

It is worth emphasizing that only tested switches are useful switches. Some colleagues add switches to the system but do not test them. When they need to use the switches in reality, they find that the switches are not effective. Therefore, when adding degradation switches to the system, it is essential to validate and test them during low traffic periods or occasional stress tests to ensure the availability of the switches.

Course Summary #

That concludes the content of this lesson. In this lesson, I have taken you through the causes of avalanches, the implementation of service circuit breakers, and the strategies for service degradation. The key points you need to understand are:

In a distributed environment, the most dangerous situation is when services or components respond slowly, because this prevents the resources held by callers from being released and can ultimately bring down the entire service.

The implementation of service circuit breakers is a finite state machine, with the key being the transition between the three states.

The implementation strategies for degradation switches mainly include returning degraded data, rate limiting, and asynchronous approaches.

In fact, switches are useful beyond degradation strategies. In my projects, whenever a new feature is launched, a switch is always added to control whether the business logic runs the new code path or the old one. This way, if any unknown issue (such as a performance problem) appears after the launch, we can roll back quickly by flipping the switch, shortening the duration of the problem.

In summary, circuit breakers and degradation are important means to ensure system stability and availability. When accessing third-party services or resources, it is necessary to consider adding degradation switches or circuit breaker mechanisms to ensure that when resources or services encounter problems, they will not have a catastrophic impact on the overall system.