16 Business Isolation How to Isolate Traffic

Hello, I am He Xiaofeng. In the previous lecture, we covered two protection measures commonly used in RPC: circuit breaking and rate limiting. Circuit breaking is protection adopted by the caller, so that a failing service provider does not drain the caller’s own resources. Rate limiting is protection adopted by the service provider, so that it is not overwhelmed by sudden traffic. The two measures protect different parties, but both are forms of self-protection, so whenever they kick in, some business requests are sacrificed.

Speaking of sudden traffic, rate limiting is one way to handle it, but in complex business scenarios with high concurrency we have another way to keep the business available as much as possible: traffic isolation. That is today’s main topic, so let’s look at how business grouping is applied in RPC.

Why do we need grouping? #

In our daily development, don’t we advocate making things as simple as possible for users? If we add a grouping dimension to interface management, won’t that make things more complicated?

Actually not. Let’s take an example. In the days before cars, roads were simple: a single road shared by pedestrians and horse-drawn carts. As cars became common, roads grew wider and gradually split into highways, side roads, sidewalks, and so on. Building out the road network in this way not only made travel more efficient but also kept pedestrians safer.

The same reasoning applies to RPC governance. Suppose you are the person in charge of a service provider application. In the case of low business volume in the early stage, the calling relationship between applications is not complicated, and the request volume is not large. Our application has enough capability to handle all the daily traffic. We don’t need to spend too much time governing the incoming request traffic. We usually choose the simplest method, which is to uniformly manage service instances and handle all requests using a shared “big pool.” This is similar to the “simple road period,” where the call topology between the service caller and the service provider is shown in the following diagram:

In the later stage, as the business becomes more diversified, more and more callers will call your interface, and the traffic will gradually increase. One day, an unwelcome “surprise” may arrive: the traffic from one caller suddenly surges, putting your entire cluster under high load, which in turn affects all the other callers and lowers their availability. As the person in charge of the application, you have to turn into a “fire captain” and try every possible means to keep the application stable.

After a series of firefighting operations, we certainly need to think of better coping strategies. Returning to the root of the problem, the key lies in the fact that in the early stage, for the sake of convenience in management, we put all interfaces under the same group, and all service instances provide capabilities as a whole.

However, as the business grows, this crude management mode no longer works. It’s like “the cars have arrived, and the road network urgently needs to be built” so that people and vehicles can be separated. Here, the people and vehicles on the road correspond to the callers of our application. We can divide the big pool provided by the application into small pools of different sizes and allocate them to different callers. The boundary between the small pools is what we call grouping in RPC, and it is what achieves traffic isolation.

How to implement grouping? #

Now that we understand what grouping is, how do we implement it in RPC?

Since different calling applications should see different pools of nodes, the natural place to look is service discovery: in the RPC process, service discovery is the logic that determines which service nodes a calling application obtains.

In [Lesson 08], we mentioned that the calling application completes service discovery by finding all service nodes through the interface name. However, this approach is not suitable here because the calling application would obtain all service nodes. Therefore, in order to implement grouping isolation logic, we need to refactor the logic of service discovery. When the calling application retrieves service nodes, it needs to carry both the interface name and a group parameter. Similarly, service providers need to include the group parameter when registering.
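To make the idea concrete, here is a minimal sketch of group-aware registration and discovery. It is only an illustration under simplifying assumptions: the registry is an in-memory dictionary, and the class name `GroupAwareRegistry`, the interface name, the group names, and the addresses are all hypothetical rather than taken from any real RPC framework.

```python
from collections import defaultdict


class GroupAwareRegistry:
    """Toy in-memory registry keyed by (interface name, group)."""

    def __init__(self):
        # (interface, group) -> set of node addresses
        self._nodes = defaultdict(set)

    def register(self, interface, group, address):
        # The provider carries the group parameter when registering.
        self._nodes[(interface, group)].add(address)

    def discover(self, interface, group):
        # The caller carries both the interface name and the group,
        # so it only ever sees the nodes of that one group.
        return sorted(self._nodes.get((interface, group), set()))


registry = GroupAwareRegistry()
registry.register("com.example.ProductService", "core", "10.0.0.1:9090")
registry.register("com.example.ProductService", "core", "10.0.0.2:9090")
registry.register("com.example.ProductService", "default", "10.0.0.3:9090")

core_nodes = registry.discover("com.example.ProductService", "core")
default_nodes = registry.discover("com.example.ProductService", "default")
```

The only real change from interface-only discovery is the lookup key: once the group is part of the key, a caller configured with the `core` group never receives the `default` nodes, and vice versa.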

With the refactored grouping logic, we can divide all service provider instances into several groups, and each group can be used by one or multiple different calling applications. So how do we determine the grouping? Is there a unified standard?

To be honest, there is no measurable standard for grouping, but I have summarized a rule for your reference, which is to divide them based on the importance level of the applications.

Non-core applications should not be grouped with core applications, and core applications should be properly isolated. One important principle is to ensure that core applications are not affected. For example, the interface that provides product information for the e-commerce ordering process should definitely be placed in a separate group to avoid contamination from other calling applications. With grouping, the call topology between our calling applications and service providers is as shown in the following diagram:
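As a rough illustration of this rule, the sketch below plans groups from a list of calling applications and their importance levels. The application names, the two importance levels, and the policy of giving every core caller its own group while non-core callers share one are assumptions made for the example; in practice the split depends on your own business.

```python
from collections import defaultdict

# Hypothetical callers and their importance levels.
CALLERS = {
    "order-service": "core",
    "cart-service": "core",
    "report-job": "non-core",
    "crawler-admin": "non-core",
}


def plan_groups(callers):
    """Return a mapping of group name -> calling applications.

    Assumed policy: each core caller is isolated in its own group,
    and all non-core callers share a single group.
    """
    groups = defaultdict(list)
    for app, level in sorted(callers.items()):
        if level == "core":
            groups[f"core-{app}"].append(app)
        else:
            groups["shared"].append(app)
    return dict(groups)


plan = plan_groups(CALLERS)
```

With this plan, a traffic surge from `report-job` can only load the provider instances registered under the `shared` group; the instances serving `order-service` are untouched.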

By isolating the traffic of the calling applications through grouping, we can avoid the impact on other calling applications’ availability caused by a surge in traffic from one calling application. For service providers, this is a frequently used means in our daily service governance process. However, does this grouping and traffic isolation have any impact on the calling applications?

How to achieve high availability? #

After grouping and isolation, the number of service nodes that a single caller can choose from when making an RPC request is reduced compared to before grouping. This increases the probability of errors for a single caller. For example, if a centralized switch suddenly fails and all the service nodes of this caller are under this switch, the caller’s requests cannot reach the service provider in this case, resulting in a loss of business for the caller.

Is there a more highly available solution? Going back to the road example, under normal circumstances cars must stay in their lanes and people must walk on the sidewalks. But when a sidewalk or a lane is under repair, we generally let people and vehicles share the remaining road for a while, if conditions permit, until the repair is finished.

We can also apply this feature to our RPC system. How can we achieve this?

As we mentioned earlier, when the calling application performs service discovery, it passes a specific group name in addition to the interface name. It therefore never receives the service nodes of other groups, which means it cannot establish connections to them or send them requests.

So the core problem becomes: how can the caller obtain the service nodes of other groups? It must not simply obtain all service nodes, otherwise grouping would lose its meaning. One simple way is to let the caller configure multiple groups. However, if we stop there, all the nodes it obtains look the same to the caller, and it is free to send requests to any of them. We would lose the isolation that grouping provides, and we would not get the “sharing only when necessary” effect we are after.

Therefore, we need to distinguish between primary and secondary groups in the configuration. Only when all the nodes in the primary group are unavailable, should we select nodes from the secondary group. As long as the nodes in the primary group return to normal, we must switch all the traffic back to the primary nodes. This entire switching process is completely transparent to the application layer, thus ensuring the high availability of the caller’s application to some extent.
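The selection rule can be sketched roughly as follows. This is a simplified illustration, not a production load balancer: the class `GroupedNodeSelector`, its method names, and the way health is tracked (a set of nodes explicitly marked as down) are all assumptions made for the example.

```python
import random


class GroupedNodeSelector:
    """Prefer primary-group nodes; use secondary-group nodes only as a fallback."""

    def __init__(self, primary_nodes, secondary_nodes):
        self.primary = list(primary_nodes)
        self.secondary = list(secondary_nodes)
        self.unhealthy = set()

    def mark_down(self, node):
        self.unhealthy.add(node)

    def mark_up(self, node):
        self.unhealthy.discard(node)

    def candidates(self):
        healthy_primary = [n for n in self.primary if n not in self.unhealthy]
        if healthy_primary:
            # As long as any primary node is healthy, all traffic stays in the primary group.
            return healthy_primary
        # Only when every primary node is unavailable do we borrow the secondary group.
        return [n for n in self.secondary if n not in self.unhealthy]

    def select(self):
        nodes = self.candidates()
        if not nodes:
            raise RuntimeError("no healthy node in primary or secondary group")
        return random.choice(nodes)


selector = GroupedNodeSelector(primary_nodes=["p1", "p2"], secondary_nodes=["s1"])
```

Because `candidates()` is evaluated on every call, traffic moves back to the primary group as soon as one of its nodes is marked up again, and the business code that calls `select()` never needs to know which group served the request.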

Summary #

Today, using the analogy of road planning, we introduced grouping in RPC: deliberately partitioning the provider’s instances into small clusters for different callers, so that traffic is isolated and our core business is not disrupted by non-core business. At the same time, we should not look at a problem from one side only: adding a new capability such as grouping must not compromise the stability of the existing system, which is why we also discussed how callers stay highly available after grouping.

In fact, grouping is not only useful for dividing service providers into small clusters of different sizes; it can also be used to provide multiple implementations of the same interface. Normally, to keep services manageable, we recommend that each interface has one well-defined function. However, in some special scenarios two interface definitions may be exactly the same, differing only slightly in their implementations. In such cases, we can expose both implementations in the service provider application under the same interface, distinguished only by their group.
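As a small illustration, the sketch below exports two implementations of one interface under different groups and dispatches on the pair (interface name, group). The names `ProviderRegistry`, `PricingService`, `StandardPricing`, and `VipPricing`, as well as the 10% discount, are hypothetical and exist only to show the idea.

```python
class ProviderRegistry:
    """Toy provider-side table: (interface, group) -> implementation object."""

    def __init__(self):
        self._impls = {}

    def export(self, interface, group, impl):
        self._impls[(interface, group)] = impl

    def invoke(self, interface, group, method, *args):
        # The group in the request decides which implementation handles the call.
        impl = self._impls[(interface, group)]
        return getattr(impl, method)(*args)


class StandardPricing:
    def quote(self, amount):
        return amount


class VipPricing:
    def quote(self, amount):
        # Hypothetical variation: same method signature, 10% discount.
        return amount * 90 // 100


providers = ProviderRegistry()
providers.export("PricingService", "standard", StandardPricing())
providers.export("PricingService", "vip", VipPricing())

standard_quote = providers.invoke("PricingService", "standard", "quote", 200)
vip_quote = providers.invoke("PricingService", "vip", "quote", 200)
```

Callers use the same interface and method in both cases; only the group they are configured with changes which implementation answers.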

After Class Reflection #

In our actual work, testing and development generally run in parallel, which often leads to a problem: developers need to start their own copy of an application while developing, and testers need to deploy the same application in the testing environment to verify its functionality. If the developers and the testers happen to use the same interface group name, requests from others taking part in integration testing may be routed to the wrong instance, which interferes with their verification and hurts overall efficiency. Do you have any good solutions to this situation?

Feel free to leave a comment and share your thoughts with me. You are also welcome to share this article with your friends and invite them to join the learning. See you in the next class!