
22 Dynamic Grouping - An efficient way to scale groups up and down in seconds #

Hello, I’m He Xiaofeng. In the previous lesson, we talked about how to support traffic playback in RPC. Once RPC is introduced, all requests are taken over by the RPC framework, and the reason for adding playback at the RPC layer is simple: we want to use online traffic to validate that an application still behaves correctly after a change. Compared with manually maintained test cases, online traffic is far more diverse, so testing with it gives much broader coverage.

After reviewing the key points of the previous lesson, let’s move on to today’s topic and take a look at the application of dynamic grouping in RPC.

In [Lesson 16], we discussed what happens when the caller side gets complex: if we still let every caller call the same cluster, a sudden spike in calls from a non-core business can make the entire cluster unavailable, and the callers of core businesses are affected as a result. To avoid this, we divide the large cluster into smaller clusters by caller, isolating the traffic of different callers so that businesses do not affect one another.

Capacity Evaluation after Grouping #

By grouping manually, we can indeed help the service provider isolate traffic from its callers, giving different callers their own dedicated clusters so that they do not affect each other. However, this brings us, as the service provider, a new problem: how large a cluster should be allocated to each caller?

We touched on this issue in [Lesson 16]. How should clusters be divided? Ideally, each caller would get its own group, but if the provider has many callers, managing all of those relationships becomes difficult. So when dividing a cluster into groups, we generally merge selected callers into the same group, which means the provider must decide how to merge and which callers to merge.

Since there is no unified standard for this, the advice I gave back then was to divide by application importance: non-core applications should not share a group with core applications, and ideally core applications should not share a group with one another either. Note, however, that this is only a principle for dividing the cluster into groups; it tells you how to plan the number of groups, not how large each group should be.
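To make the principle concrete, here is a minimal sketch of that assignment rule. The caller names and tiers are hypothetical examples, not from the lesson: each core caller gets a dedicated group, while non-core callers are merged into one shared group.

```python
# Hypothetical sketch of the "group by importance" principle. The caller
# names and tiers below are illustrative, not from the lesson.
from collections import defaultdict

def assign_groups(callers):
    """Map each (name, tier) caller to a group.

    Core callers each get a dedicated group; non-core callers are merged
    into one shared group, so they never sit alongside core traffic.
    """
    groups = defaultdict(list)
    for name, tier in callers:
        if tier == "core":
            groups[f"group-{name}"].append(name)  # one group per core caller
        else:
            groups["group-shared"].append(name)   # non-core callers merged
    return dict(groups)

callers = [("order", "core"), ("payment", "core"),
           ("report", "non-core"), ("crawler", "non-core")]
print(assign_groups(callers))
# {'group-order': ['order'], 'group-payment': ['payment'],
#  'group-shared': ['report', 'crawler']}
```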

Following the above principle, after logically dividing the entire cluster into different groups, the next thing we need to do is allocate the corresponding number of machines to each group. So how do we calculate the number of machines for each group? I believe this question cannot stump you. Here, I will first share the common practice used by our team. We generally evaluate the maximum QPS that a single machine of the service provider can handle through load testing, and then calculate the total number of calls for each caller in each group. With these two values, we can easily calculate the number of machines required for this group.

Calculating a group’s machine count from the QPS of all its callers is generally objective and accurate. However, each caller’s call volume is not constant. If a merchant invites an internet celebrity to host a live sales stream, for example, orders are likely to jump noticeably compared with the day before. Because of such uncertainties, service providers usually add a certain percentage on top of the current call volume when sizing capacity for callers, and that percentage mostly comes from historical experience.
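The sizing rule above can be sketched in a few lines. Everything here is an illustrative assumption: the 20% default buffer stands in for the experience-based percentage, and the QPS figures are made up.

```python
import math

def machines_needed(caller_qps, per_machine_qps, buffer_ratio=0.2):
    """Estimate the machine count for one group.

    caller_qps:      peak QPS of each caller assigned to the group
    per_machine_qps: max QPS one provider machine sustained in load tests
    buffer_ratio:    extra headroom for surges (20% is an assumed default;
                     in practice it comes from historical experience)
    """
    total = sum(caller_qps) * (1 + buffer_ratio)
    return math.ceil(total / per_machine_qps)

# Two callers peaking at 3000 and 1500 QPS, 500 QPS per machine:
print(machines_needed([3000, 1500], 500))  # 11  (4500 * 1.2 / 500, rounded up)
```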

In summary, when calculating the machines needed for each group, we have to add extra machines so that every small cluster has some buffer against pressure, and how much buffer it has depends on how many machines are reserved for it. As the service provider, we would of course like to reserve as many machines as possible for each cluster, but reality does not allow it: reserving too many machines drives up the team’s overall cost.

Problems caused by grouping #

Reserving a small number of machines per group does give each small cluster some buffer against pressure, and in general this mechanism works well. It becomes hard to handle, though, during a large traffic surge: because of machine cost we do not reserve many machines for each group, so once a surge exceeds what the reserved machines can absorb, that group’s cluster is in a dangerous state.

At this point, the only thing we can do is scale the group out with new machines. But provisioning machines on short notice usually takes quite a long time, and the longer it takes, the greater the impact on the business.

So is there a more convenient solution? As mentioned above, we add some redundancy when assessing each group’s capacity. In other words, while still guaranteeing quality for their own callers, the providers in other groups can absorb some extra traffic. What we need is a way to use that existing spare capacity quickly.

But because we have implemented traffic isolation and split the cluster into groups, the caller in trouble cannot send requests to machines in other groups. You might ask: since applying for temporary machines takes so long, why not take those redundant machines, redeploy the application on them under the problematic group, and restart? That would indeed increase the machine count of the problematic group’s provider.

In terms of results, this approach does solve the problem. The catch is that it still takes a long time to carry out, and once the group’s traffic returns to normal, you have to hand the temporarily borrowed machines back to their original groups.

Having analyzed the problem to this point, it is time for dynamic grouping to come into play.

Application of Dynamic Grouping #

The root cause of the problem above is that traffic to one group surges while that group’s reserved capacity cannot meet the demand, even though providers in other groups have plenty of spare capacity. That spare capacity, however, is walled off by our grouping, and we cannot simply abandon grouping, or the old problem resurfaces.

In this situation, all we can do is temporarily borrow capacity from other groups when trouble strikes. But changing the grouping and restarting the application is not only slow, it also has to be undone afterwards, so this heavy-handed approach is clearly unsuitable.

Consider this: the purpose of modifying the application grouping and restarting it is to enable the service caller with the problem to find more service provider machines through service discovery. The data for service discovery comes from the registry. So, can we solve this problem by modifying the data in the registry?

We only need to change the group aliases of certain instances in the registry to the values we want; through service discovery, this changes the set of provider instances that each caller can call.

For example, suppose there are three provider instances: two in group A and one in group B, with caller 1 calling group A and caller 2 calling group B. If we change the group of one group-A instance from A to B in the registry, then once service discovery takes effect, caller 1 can reach only the remaining group-A instance, while caller 2 can reach two instances: the original group-B instance plus the relabeled one.

By directly modifying the data in the registry, we can instantly give any group a cluster of whatever scale we need. We can not only change an instance’s group name to a different one, but also let an instance carry several group names at once. These are the two most common operations in dynamic grouping: replace and append.
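Here is a minimal sketch of what registry-driven dynamic grouping looks like. The `Registry` class, the addresses, and the group names are all illustrative assumptions; a real RPC framework would also push these changes out to callers through the registry’s notification mechanism rather than having them re-query.

```python
class Registry:
    """Toy registry: each provider instance carries a set of group names."""

    def __init__(self):
        self.instances = {}  # address -> set of group names

    def register(self, addr, group):
        self.instances.setdefault(addr, set()).add(group)

    def discover(self, group):
        """Service discovery: which instances are visible to callers of `group`."""
        return sorted(a for a, gs in self.instances.items() if group in gs)

    def replace_group(self, addr, old, new):
        """Replace: relabel an instance from one group to another."""
        self.instances[addr].discard(old)
        self.instances[addr].add(new)

    def append_group(self, addr, extra):
        """Append: let an instance serve an additional group."""
        self.instances[addr].add(extra)

reg = Registry()
reg.register("10.0.0.1:8080", "A")
reg.register("10.0.0.2:8080", "A")
reg.register("10.0.0.3:8080", "B")

# Group B is overloaded: relabel one group-A instance as B, registry-side only.
reg.replace_group("10.0.0.2:8080", "A", "B")
print(reg.discover("A"))  # ['10.0.0.1:8080']
print(reg.discover("B"))  # ['10.0.0.2:8080', '10.0.0.3:8080']
```

Note that the instances themselves never restart: only the registry data changes, and service discovery does the rest.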

Summary #

In [Lesson 16], we discussed the benefits of grouping: it helps service providers isolate callers from one another. But caller traffic is not constant, and an unexpected event can overflow a particular group; in that case, grouping isolation leaves the troubled group helpless even though the large cluster as a whole still has spare capacity.

To solve this problem of sudden traffic overflow, dynamic grouping gives us a more efficient way to scale a group quickly. In fact, we can also use dynamic grouping to rethink the redundant machines reserved for each group: rather than distributing all the redundancy across the groups, we can keep the reserved machines in a shared pool, thereby reducing the total number of reserved instances.
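Some back-of-the-envelope arithmetic shows why a shared pool needs fewer reserved machines. All numbers here are assumptions for illustration, including the premise that at most one group surges at a time:

```python
import math

base = [10, 8, 6, 4]       # machines each group needs for normal traffic
per_group_reserve = 0.3    # 30% headroom reserved inside every group
peak_surge = 0.3           # assume at most one group surges 30% at a time

# Static approach: every group carries its own reserve.
static_extra = sum(math.ceil(n * per_group_reserve) for n in base)

# Shared-pool approach: reserve only enough for the worst single surge,
# since dynamic grouping can hand the pool to whichever group needs it.
shared_extra = math.ceil(max(base) * peak_surge)

print(static_extra, shared_extra)  # 10 3
```

Under these assumptions, the shared pool cuts the reserve from 10 machines to 3; the saving grows with the number of groups, as long as surges rarely hit several groups at once.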

Reflections #

In service governance, we often group services logically, but a particular group can still run into trouble under a sudden surge of calls. In this lecture, I presented dynamic grouping as a solution. During dynamic grouping, however, we only change the data in the registry; the service provider’s actual group name is not modified. Requests from callers using the dynamic group name may therefore fail, because the provider validates whether the caller’s group name matches its own.

To address this problem, can you think of any solutions?

Please feel free to leave a comment and share your answer with me. You can also share this article with your friends and invite them to join the learning. See you in the next class!