
15 Throttling and Limiting: How to Implement Self-Protection in Business #

Hello, I'm He Xiaofeng. In the previous lecture, I explained graceful startup in RPC frameworks, focusing on warm-up and delayed exposure. Today we'll discuss a new topic: how a business can protect itself when using RPC.

Why do we need self-protection? #

In the Introduction, I mentioned that RPC is a powerful tool for solving communication problems in distributed systems, and one key characteristic of distributed systems is high concurrency. RPC therefore also faces high-concurrency scenarios. In such cases, any service node may run into a series of problems caused by excessive traffic: slow business processing, high CPU usage, frequent Full GCs, and even service process crashes. In a production environment, however, we must keep services stable and highly available. So we need to build self-protection into the business, ensuring that application systems stay stable and services stay available even under high traffic and high concurrency.

So how can businesses achieve self-protection when using RPC?

The most common way is through flow control, which is simple and effective. However, flow control is not the only self-protection mechanism in RPC frameworks, and there are various ways to implement flow control in RPC frameworks.

Let’s break down the RPC framework and analyze it. RPC calls consist of a server and a client, with the client initiating the call to the server. Below, I will share how the server and the client each achieve self-protection.

Self-protection for the server #

Let’s first look at the server. Take an RPC service as an example: the server receives requests from clients. If one of the server nodes comes under high load, how do we protect that node?

The idea is quite simple: if the load pressure is high, we restrict the number of requests the node accepts. Once fewer requests come in, the load pressure on the node naturally drops.

Isn’t that just flow control? Yes. In the context of RPC calls, the server’s self-protection strategy is flow control. But have you ever wondered how to implement it? Should we put the flow control logic in the server’s business code, or is there a more elegant way?

Flow control is a commonly needed feature, so we can build it into the RPC framework and let users configure the flow control threshold. We then add flow control logic on the server: when a request arrives, the server runs the flow control logic before the business logic. If the traffic exceeds the threshold, the server throws a flow control exception straight back to the client; otherwise, it executes the business logic as usual.

So, how do we implement the flow control logic on the server?

There are many ways, such as a simple counter, or more advanced methods like sliding window, leaky bucket algorithm, and token bucket algorithm. Among them, the token bucket algorithm is the most commonly used. I won’t go into the details of these flow control algorithms here, as there are many resources available online for you to refer to if you’re not familiar with them.
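To make the token bucket concrete, here is a minimal sketch of one. The class name and the lazy-refill design (topping up tokens on each acquire call rather than with a background timer) are my own assumptions, not taken from any particular RPC framework:

```java
// Minimal token-bucket limiter sketch. Tokens refill continuously at a
// fixed rate, up to the bucket's capacity; each request consumes one token.
public class TokenBucketLimiter {
    private final long capacity;          // maximum tokens the bucket can hold
    private final double refillPerNano;   // tokens added per elapsed nanosecond
    private double tokens;                // current token count
    private long lastRefillNanos;

    public TokenBucketLimiter(long capacity, long refillPerSecond) {
        this.capacity = capacity;
        this.refillPerNano = refillPerSecond / 1_000_000_000.0;
        this.tokens = capacity;           // start with a full bucket
        this.lastRefillNanos = System.nanoTime();
    }

    // Returns true if a token was available; false means the request
    // should be rejected with a flow-control exception.
    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        // Refill lazily based on how much time has passed since the last call.
        tokens = Math.min(capacity, tokens + (now - lastRefillNanos) * refillPerNano);
        lastRefillNanos = now;
        if (tokens >= 1) {
            tokens -= 1;
            return true;
        }
        return false;
    }
}
```

Because the bucket starts full, short bursts up to the capacity are allowed even when the steady refill rate is low — that burst tolerance is the main reason the token bucket is preferred over a plain counter.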

Let’s consider a scenario: we provide a service that is called by multiple applications, and one of them sends far more requests than the others. In this case, we should apply flow control to the requests from that particular application. Therefore, when implementing flow control, we need to support an application-level dimension and even an IP-level dimension, so we can limit not only the requests from a specific application but also the requests from a specific IP.
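One way to sketch such multi-dimension flow control is to keep a counter per caller key, where the key encodes the dimension (the `app:`/`ip:` key scheme and all names here are illustrative assumptions, not a real framework's API):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of multi-dimension flow control: one per-second counter per
// caller key, e.g. "app:order-service" or "ip:10.0.0.7". A real RPC
// framework would plug this into its server-side filter chain.
public class KeyedRateLimiter {
    private final Map<String, int[]> counters = new ConcurrentHashMap<>();
    private final Map<String, Integer> thresholds = new ConcurrentHashMap<>();

    // Configure a per-second threshold for one caller dimension.
    public void setThreshold(String key, int limit) {
        thresholds.put(key, limit);
    }

    // Returns false when the caller identified by this key has already
    // used up its threshold in the current window.
    public boolean allow(String key) {
        Integer limit = thresholds.get(key);
        if (limit == null) return true; // no rule configured for this caller
        int[] count = counters.computeIfAbsent(key, k -> new int[1]);
        synchronized (count) {
            if (count[0] >= limit) return false;
            count[0]++;
            return true;
        }
    }

    // In a real implementation a timer would call this once per second.
    public void resetWindow() {
        counters.clear();
    }
}
```

A production limiter would use a sliding window or token bucket per key rather than this fixed-window counter, but the keying idea — one independent limit per application or IP — is the same.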

You may wonder how users can configure application-level and IP-level flow control. Is it convenient to configure it in code? As I mentioned earlier, the real power of an RPC framework lies in its governance features, most of which rely on a registry or configuration center. Through the management interface of the RPC framework, we can configure the flow control threshold and then deploy it to each server node via the registry or configuration center to achieve dynamic configuration.

By now, you may have noticed that when implementing flow control on the server, the configured flow control threshold applies to each service node. For example, if I configure a threshold of 1000 requests per second, it means that each machine can handle 1000 requests per second. If my service cluster has 10 service nodes, the ideal flow control threshold for the service is 10,000 requests per second.

Let’s consider another scenario: I provide a service that relies on a MySQL database for its business logic. Due to performance limitations of the MySQL database, we need to protect it. For example, if the MySQL database can handle 10,000 SQL statements per second, the access volume of our service cannot exceed 10,000 requests per second. If we have 10 service nodes, the flow control threshold should be set to 1000 requests per second. What if we need to scale up this service in the future? Suppose we expand to 20 service nodes. Do we need to adjust the flow control threshold to 500 requests per second? Calculating and reconfiguring it every time is obviously inconvenient.

We can let the RPC framework do the calculation: when the flow control threshold is pushed from the registry or configuration center, we also send the total number of service nodes to each node, and each node computes its own share of the threshold. That solves the problem, right?
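The per-node calculation itself is trivial; a sketch of what each node might run when a new configuration arrives (the class and method names are mine):

```java
// Sketch: each node derives its own per-node limit from the service-wide
// threshold pushed by the configuration center plus the current node count.
public class ThresholdCalculator {
    // e.g. serviceLimit = 10_000 req/s across 20 nodes -> 500 req/s per node
    public static int perNodeLimit(int serviceLimit, int nodeCount) {
        if (nodeCount <= 0) {
            throw new IllegalArgumentException("nodeCount must be positive");
        }
        // Round up so no node's limit is truncated to zero by integer division.
        return (serviceLimit + nodeCount - 1) / nodeCount;
    }
}
```

When the cluster scales from 10 nodes to 20, the configuration center only needs to push the new node count; each node recomputes its limit from 1000 to 500 without anyone editing the threshold by hand.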

Part of the problem is solved, but there is still another issue. In practice, the volume of requests received by each service node is not necessarily evenly distributed. For example, suppose there are 20 nodes and the flow control threshold for each node is 500. Some nodes may have already reached the threshold, while others may have received only 450 requests in a second. Even though the total number of requests from the client has not reached 10,000, flow control may still be triggered. This means that the flow control is not accurate. Is there a more accurate way to implement flow control?

The reason why the flow control methods I mentioned earlier are not accurate is that the flow control logic is executed independently by each node in the service cluster. It is a form of single-machine flow control, and the incoming traffic received by each service node is not evenly distributed.

We can provide a dedicated flow control service that every node depends on. When traffic comes in, the service node triggers its flow control logic and calls this service to check whether the threshold has been reached. We can even move the check to the client side: before sending a request, the client calls the flow control service, and if the threshold has already been reached, the request is never sent; instead, the dynamic proxy returns a flow control exception to the caller directly.
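A rough sketch of that centralized design follows. `FlowControlService` is a hypothetical interface — in practice it might be backed by something like Redis (an atomic increment with an expiring window) or a dedicated limiter cluster:

```java
// Hypothetical interface to a shared, cluster-wide flow control service.
public interface FlowControlService {
    // Atomically counts one request against the given service interface
    // and reports whether it is still under the cluster-wide threshold.
    boolean tryAcquire(String serviceKey);
}

// Client-side hook: consult the shared service before sending the request.
class FlowControlledInvoker {
    private final FlowControlService flowControl;

    FlowControlledInvoker(FlowControlService flowControl) {
        this.flowControl = flowControl;
    }

    Object invoke(String serviceKey, java.util.function.Supplier<Object> call) {
        if (!flowControl.tryAcquire(serviceKey)) {
            // Fail fast: the dynamic proxy surfaces this to the caller
            // without the request ever reaching the network layer.
            throw new IllegalStateException("flow control triggered for " + serviceKey);
        }
        return call.get(); // under the threshold: proceed with the real call
    }
}
```

Note the trade-off this code makes visible: every invocation now pays an extra round trip to the flow control service, which is exactly the latency penalty discussed next.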

This method makes flow control across the entire service cluster more accurate. However, it has a drawback: it depends on a dedicated flow control service and therefore pays a performance and latency penalty compared with single-machine flow control. Which method to choose depends on the specific application scenario.

Self-protection on the caller side #

Earlier, I explained how the server protects itself, and the simplest and most effective way is flow control. But what about the caller? Does the caller need self-protection too?

Let’s consider an example. Suppose I am about to release Service B, which depends on Service C. When Service A calls Service B, the business logic of Service B calls Service C. However, if Service C responds with a timeout, the dependency of Service B on Service C will cause the business logic of Service B to wait indefinitely. At the same time, Service A is calling Service B frequently, which may cause Service B to crash due to a large number of accumulated requests.

When Service B calls Service C, any exception in Service C’s business logic will affect Service B and may even bring it down. And this is just the case of A -> B -> C. What if the chain is A -> B -> C -> D -> …? If any service in the call chain has a problem, it can cascade into every upstream service and even take down the entire chain. That is a frightening prospect.

Therefore, when a service acts as a caller to another service, to prevent problems in the called service from affecting the calling service, the calling service also needs self-protection. And the most effective self-protection measure is circuit breaking.

Let’s understand how circuit breaking works.

A circuit breaker switches between three states: closed, open, and half-open. Under normal circumstances it is closed. When the downstream service starts failing, the circuit breaker collects and evaluates exception metrics; once they reach the threshold, it opens, intercepting the caller’s requests and failing fast. After a certain period of time, it switches to half-open and lets a request through to the server. If the response is normal, the breaker closes again; otherwise, it reopens.
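The three-state machine just described can be sketched as follows. The failure threshold and open-state timeout are illustrative parameters, and this consecutive-failure count is a simplification of the metric windows real breakers use:

```java
// Minimal circuit-breaker sketch with the three states described above.
public class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int failureCount = 0;
    private final int failureThreshold;   // failures before opening
    private final long openTimeoutMillis; // how long to stay open
    private long openedAt;

    public CircuitBreaker(int failureThreshold, long openTimeoutMillis) {
        this.failureThreshold = failureThreshold;
        this.openTimeoutMillis = openTimeoutMillis;
    }

    // Called before sending a request; false means fail fast.
    public synchronized boolean allowRequest() {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt >= openTimeoutMillis) {
                state = State.HALF_OPEN; // let one probe request through
                return true;
            }
            return false;
        }
        return true; // CLOSED or HALF_OPEN
    }

    // A normal response closes the breaker and clears the failure count.
    public synchronized void recordSuccess() {
        failureCount = 0;
        state = State.CLOSED;
    }

    // A failure in HALF_OPEN, or enough failures in CLOSED, opens the breaker.
    public synchronized void recordFailure() {
        failureCount++;
        if (state == State.HALF_OPEN || failureCount >= failureThreshold) {
            state = State.OPEN;
            openedAt = System.currentTimeMillis();
        }
    }
}
```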

Now that we understand the circuit breaking mechanism, you will realize that adding circuit breakers in business logic is not elegant enough. So, how do we integrate circuit breakers in an RPC framework?

The circuit breaker mechanism primarily protects the caller, and every request the caller sends passes through it. Let’s recall the RPC invocation process.

Thinking through that process, can you spot a suitable step at which to integrate the circuit breaker?

I suggest the dynamic proxy, because in the RPC invocation process it is the first gateway a request passes through. On each request, the proxy consults the circuit breaker: if the state is closed, the request is sent normally; if the state is open, the breaker’s failure strategy is executed instead.
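Here is one way that wiring could look with the JDK’s built-in `java.lang.reflect.Proxy`. The `EchoService` interface and the boolean-supplier stand-in for the breaker state are illustrative; a real framework would consult its breaker’s full state machine here:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

// Sketch of placing a circuit-breaker check inside the dynamic proxy
// that an RPC framework hands to the caller.
public class BreakerProxyDemo {
    public interface EchoService { String echo(String msg); }

    public static EchoService wrap(EchoService target,
                                   java.util.function.BooleanSupplier breakerOpen) {
        InvocationHandler handler = (Object proxy, Method method, Object[] args) -> {
            if (breakerOpen.getAsBoolean()) {
                // Open state: fail fast, the request never reaches the network.
                throw new IllegalStateException("circuit breaker open");
            }
            // Closed state: forward the call to the real invocation path.
            return method.invoke(target, args);
        };
        return (EchoService) Proxy.newProxyInstance(
                EchoService.class.getClassLoader(),
                new Class<?>[] { EchoService.class },
                handler);
    }
}
```

Because every RPC call already funnels through the proxy, adding the check here requires no change to business code — which is exactly why the proxy is the natural integration point.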

Summary #

Today we mainly explained how an RPC framework implements self-protection for business.

The server mainly protects itself through flow control. When implementing it, we should support application-level and IP-level dimensions, so that during service governance we can reasonably restrict applications that generate heavy traffic. A server-side flow control threshold applies to a single machine, which is inconvenient in some scenarios, such as setting a threshold for the whole service or scaling the service out. In those cases, the registry or configuration center can send the total number of service nodes along with the threshold configuration and let the RPC framework compute each node’s share. We can also have the framework’s flow control module depend on a dedicated flow control service to control the service-wide threshold precisely, although this approach depends on that service and is at a disadvantage in performance and latency compared with single-machine flow control.

The caller can protect itself with a circuit breaker mechanism, which prevents exceptions or excessive response times in downstream services from affecting its own business logic. The RPC framework can integrate the breaker into the dynamic proxy logic to provide circuit breaking as a framework feature.

After-class Reflection #

When using RPC, a business needs to protect itself. Can you think of any other solutions to this problem?

Feel free to leave a comment with your thoughts, and also feel free to share this article with your friends and invite them to join the discussion. See you in the next class!