
10 Routing Strategy - How to Route Requests to Different Nodes According to Preset Rules #

Hello, I am He Xiaofeng. In the previous lecture, we introduced the role of health checks in RPC: in simple terms, they help the calling application manage its connections to all service providers and dynamically maintain the status of each connection, so that the service caller gets a usable connection every time it initiates a request. With that recap out of the way, let's move on to today's topic: routing strategies in RPC.

Why do we need routing strategies? #

As mentioned earlier, in a real environment service providers run as a cluster. For a service caller, this means that behind a single interface there are multiple provider nodes offering the same service. So when our RPC framework initiates a request, it must pick one of those provider nodes to send the request to.

Since any of these nodes can be used to complete the request, we can simply consider these nodes as homogeneous. What does homogeneous mean here? It means that no matter which node in the cluster the request is sent to, the returned result will be the same.

Because providers run as a cluster, we also have to face some practical issues. Our application does not run on a single server, and deploying it means making changes; any change can cause a previously healthy program to misbehave, and major changes in particular introduce many factors that destabilize the application.

To reduce this risk, we generally choose to perform a gradual rollout of our application instances. For example, we can first deploy a small number of instances and observe if there are any anomalies. Based on the observations, we can decide whether to deploy more instances or roll back already deployed instances.

However, even with a gradual rollout, a problem in production can still have a sizable impact, because a service provider serves many callers at the same time, especially when those callers include critical applications such as product and pricing services. If a newly deployed instance has a problem, it affects the business of every caller.

So, what methods can our RPC framework use to reduce the risk caused by deployment changes? This leads us to the application of routing in RPC. We will continue to explore the specific benefits and implementation methods next.

How do we implement a routing strategy? #

You might ask: why not simply retest all the scenarios before going live? That is certainly one approach, and testing is definitely an important step before deployment. But in my experience the production environment is too complex; testing can only reduce the probability of risk, and it is practically impossible to verify every scenario exhaustively.

So, if we can't eliminate risk entirely, what can we do? I think the only option is to minimize the impact a problem can have on the business. Following this idea: after a deployment completes, can we first route a small portion of request traffic to the new instances to validate the logic, and only gradually open access to the other callers once no problems appear, thereby achieving traffic isolation? How can an RPC framework implement this?

We discussed service discovery earlier: in RPC, service callers obtain the IP addresses of all service providers through service discovery. Could we leverage that? When we validate functionality in stages, instead of having the registry push every provider IP to every caller indiscriminately, could it push selectively? That is, the registry would push the IP address of a newly deployed instance only to designated callers, and other callers would not see that IP through service discovery at all.

Achieving logical isolation through service discovery looks feasible, but the registry's role in RPC is to store data and keep it consistent. If we push such complex computation into the registry, it will come under heavy pressure as the number of cluster nodes grows. Moreover, in most cases we build the registry on open-source software, and meeting such a requirement would demand custom development. From a practical standpoint, then, implementing request isolation by altering service discovery is not cost-effective.

Are there any other more suitable solutions? Before I present my solution, you can take a moment to think about your own approach.

Let's go back to how a caller initiates an RPC call. When a real request is made, there is a step that selects a suitable node from the set of provider nodes (also known as load balancing). Could we add a "filtering" step before node selection, keeping only the nodes that meet our criteria? And what would the filtering rule be? Exactly the rule we want to validate during the staged rollout.

A concrete example may make this clearer. Suppose we want only a specific caller IP to reach the newly deployed node. The registry distributes this rule to the service callers. After receiving the rule, a caller first filters the node set according to it before selecting the specific node to send the request to; in this example, the designated caller is ultimately left with exactly one node, the newly deployed one. With this modification, the RPC call process looks like this:

In RPC, this filtering process has a formal name: the "routing strategy". The example above uses an IP-based routing strategy, which restricts which caller IPs are allowed to reach a given service provider. With IP routing in place, the cluster's call topology looks like this:
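To make the mechanism concrete, here is a minimal sketch of such a pre-load-balancing filter in Python. It is illustrative only, not code from any particular RPC framework; the `Node` and `IpRouteRule` types and the `apply_ip_route` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

@dataclass(frozen=True)
class Node:
    """One service provider instance, as seen via service discovery."""
    ip: str
    port: int

@dataclass
class IpRouteRule:
    """Rule pushed by the registry: which caller may reach which providers."""
    allowed_caller_ip: str          # the designated caller
    target_provider_ips: Set[str]   # the newly deployed instances

def apply_ip_route(caller_ip: str, rule: Optional[IpRouteRule],
                   nodes: List[Node]) -> List[Node]:
    """Filter the provider node set *before* load balancing runs."""
    if rule is None:
        return nodes  # no rule in effect: all nodes are candidates
    if caller_ip == rule.allowed_caller_ip:
        # The designated caller is routed only to the new instances.
        return [n for n in nodes if n.ip in rule.target_provider_ips]
    # Every other caller is kept away from the new instances.
    return [n for n in nodes if n.ip not in rule.target_provider_ips]
```

Load balancing then picks one node from whatever this filter returns, so routing and load balancing remain cleanly separated steps.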

Parameter Routing #

With IP routing, we can ensure during rollout that only certain callers reach the newly launched instances. Compared with a traditional gray release, this approach minimizes the cost of trial and error.

In some scenarios, however, we need finer-grained routing. For example, when upgrading and transforming an application, a common approach to let callers transition smoothly to the new application logic is to run the new and old applications in parallel for a while. We then gradually increase the percentage of traffic sent to the new application until it handles 100% of the traffic, and only after it has run stably for some time do we take the old application offline.

During the traffic switch, to keep the whole process consistent, we must ensure that all requests concerning a given subject object are handled by the same application. Suppose we are transforming the product application: the subject object is the product ID, and while traffic is being shifted, every operation on a given product must be handled entirely by either the new application or the old one.

Obviously, the IP routing described above cannot meet this requirement: it only restricts where the caller comes from, and does not route requests to designated provider nodes based on request parameters.

So how can we use routing strategies to implement this requirement? Actually, as long as you understand the essence of routing strategies, it is not difficult to understand the implementation of this parameter routing.

We can label every provider node to distinguish new application nodes from old ones. When a caller makes a request, it can easily read the request parameters, such as the product ID in our example, and, based on the rules issued by the registry, decide whether the request for that product ID should be filtered to the new nodes or the old ones. Because every caller applies the same rules, requests for the same product ID are guaranteed to land either all on new application nodes or all on old ones. With the parameter routing strategy in place, the cluster's call topology looks like this:
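A sketch of such a parameter router in Python may help; it is again purely illustrative. The node `tag` field, the percentage-based rule, and the choice of hash are my assumptions about one possible implementation.

```python
import hashlib
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class TaggedNode:
    """A provider node labeled as belonging to the new or old application."""
    ip: str
    tag: str  # "new" or "old"

def route_by_product_id(product_id: str, nodes: List[TaggedNode],
                        new_traffic_percent: int) -> List[TaggedNode]:
    """Route a stable slice of product IDs to the new application nodes.

    A cryptographic hash is used instead of Python's built-in hash(),
    which is randomized per process, so that every caller maps the same
    product ID into the same bucket.
    """
    bucket = int(hashlib.md5(product_id.encode()).hexdigest(), 16) % 100
    tag = "new" if bucket < new_traffic_percent else "old"
    return [n for n in nodes if n.tag == tag]
```

Raising `new_traffic_percent` from 0 toward 100 shifts ever more product IDs to the new nodes, while each individual product ID always stays on one side of the split.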

Compared with IP routing, parameter routing offers finer granularity and gives provider applications another means of service governance. Gray release is a typical application scenario for RPC routing, and combining routing strategies lets service providers manage their own traffic more flexibly, further reducing the risks that deployments may introduce.

Summary #

In our daily work, we make changes to production almost every day, and every change carries the potential for an incident. To reduce the probability of incidents, we need both sound operating procedures and infrastructure that lowers the cost of mistakes.

Gray release functionality is a typical application scenario for RPC routing. Through routing, we can implement advanced service-governance features such as targeted invocation and blacklists/whitelists. Whatever routing strategy an RPC framework uses, the core idea is the same: send requests to target nodes according to the rules we set, thereby achieving traffic isolation.

After-class Reflection #

In your use of RPC, besides implementing features such as gray release and targeted invocation, have you used routing strategies to accomplish anything else?

Please feel free to leave a comment and share your thoughts with me. You are also welcome to share this article with your friends and invite them to join the study. See you in the next class!