05 Scalability How to respond to traffic peaks and valleys #

Hello, I’m Jingyuan.

Today, I want to share with you the core feature of Serverless function computing: dynamic scaling.

Before introducing the implementation principles of scaling, let’s first look at two real-life scenarios.

Scenario 1: A near-real-time, log-driven recommendation service for vehicles. During the day, traffic is heavy and the request volume is high; at night, traffic falls off and the workload shrinks with it.

Scenario 2: A monitoring program for online services that polls critical interfaces every 10 seconds to verify that they are available.

The difference between the two scenarios lies in how smooth the workload is. If we implemented Scenario 1 on a conventional PaaS or K8s platform, we would have to provision resources for the peak daytime request volume to keep the service responsive, leaving most of those resources idle at night and utilization low. Scenario 2, with its steady workload, is a reasonable fit for a PaaS service, because resource utilization stays fairly even.

Thanks to its elastic scaling capability, Serverless solves the problem in Scenario 1 neatly, and for such lightweight workloads function computing is the best choice: it can scale instances down to 0 and scale automatically with the request volume, which greatly improves resource utilization.

It is fair to say that extreme dynamic scaling is the crown of Serverless function computing; it is this capability, above all, that sets Serverless apart from PaaS platforms.

So, what is the underlying implementation mechanism behind this “difference”?

Today, I will take you deep into the architectural principles of automatic scaling in its different forms. We will start with K8s' Horizontal Pod Autoscaler (HPA), then look at the Knative Pod Autoscaler (KPA) of Knative, an open-source Serverless framework, and see how it differs from HPA. Finally, we will distill the core ideas of automatic scaling, so that you know how to design a system when you face this kind of requirement.

Basic Framework #

The cores of open-source Serverless function computing engines are mostly built on K8s HPA. Cloud vendors, by contrast, already have a range of well-packaged underlying services, such as container services, container image services, and cloud servers, on which they can build their function computing products; that said, we cannot rule out that some vendors have also done secondary development on top of open-source frameworks.

In the evolution of cloud vendors' container scheduling services, there are usually two scheduling forms: one schedules at the node level, the other at the container-instance level (usually called "Serverless," meaning you are not aware of node maintenance). You can think of the former as taking over an entire house and managing both the house and its rooms, while the latter only cares about running the individual rooms.

Since cloud vendors' function computing is usually built on top of their container services, I will now walk you through Node-level and Pod-level scheduling separately.

Node Dimension #

First, let’s take a look at how the function computing framework dynamically scales in a Node scheduling scenario.

[Image: scaling and scheduling of container PaaS services and cloud-service Nodes]

As shown in the above diagram, this is an illustration of the scaling and scheduling of container PaaS services and cloud service Nodes. Let me explain the role of each part.

  • Scheduler: The scheduling module is responsible for routing requests to the designated function instance (Pod), and also for marking the status of Nodes in the cluster, which is recorded in etcd;
  • Local-controller: The local controller on the Node is responsible for managing the lifecycle of all function instances on the Node, in the form of a DaemonSet;
  • AutoScaler: The automatic scaling module periodically checks the usage of Nodes and Pods in the cluster and scales up or down according to a custom strategy. When scaling up, it usually requests resources from the underlying PaaS, such as Baidu Intelligent Cloud's Cloud Container Engine (CCE) or Alibaba Cloud's ACK/ASK;
  • Pod: This refers to the two types of containers shown in the diagram, Cold and Warm. Cold means the Pod is not being used, while Warm means it is being used or waiting to be reclaimed;
  • Node: It has two states, idle and busy. If all Pods on a Node are “Cold,” then the Node is in the idle state, otherwise it is busy.

When the system is running, the AutoScaler periodically inspects all Nodes in the cluster. If it finds that the cluster needs to scale up, it does so according to the defined strategy. The strategy can, of course, be customized; a typical one checks whether the ratio of Nodes in use to the total number of Nodes falls within a reasonable range.

In this scenario, Pods are usually used as the general form of function instances. The code and different runtimes are often injected into the container as mounted volumes. The AutoScaler also resets a Warm Pod to a Cold Pod based on the idle time, thereby achieving a higher level of reusability for cluster resources.

When scaling down, ideally the AutoScaler can scale the Node down to 0. However, in order to handle sudden traffic requests, a certain buffer is usually reserved.
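To make the Node-level mechanism above more concrete, here is a minimal Python sketch of one pass of such an AutoScaler. It assumes a hypothetical `cluster` object with `list_nodes()`, `request_nodes()`, and `release_nodes()` methods; the thresholds, buffer size, and idle timeout are illustrative assumptions, not values from any real engine.

```python
import math
import time

# Illustrative thresholds -- assumptions, not values from a real engine.
SCALE_UP_BUSY_RATIO = 0.8    # scale up when >80% of Nodes are busy
SCALE_DOWN_BUSY_RATIO = 0.3  # scale down when <30% of Nodes are busy
MIN_BUFFER_NODES = 2         # keep a small buffer for sudden traffic
POD_IDLE_TIMEOUT_S = 300     # reset Warm Pods to Cold after 5 minutes idle
CHECK_INTERVAL_S = 10

def reconcile(cluster):
    """One periodic pass of a hypothetical Node-level AutoScaler."""
    nodes = cluster.list_nodes()          # hypothetical API
    now = time.time()

    # 1. Reclaim Warm Pods that have been idle too long (Warm -> Cold).
    for node in nodes:
        for pod in node.pods:             # pod.state / pod.last_used are hypothetical fields
            if pod.state == "Warm" and now - pod.last_used > POD_IDLE_TIMEOUT_S:
                pod.reset_to_cold()

    # 2. A Node is busy if any of its Pods is still Warm.
    busy = sum(1 for n in nodes if any(p.state == "Warm" for p in n.pods))
    total = len(nodes)
    busy_ratio = busy / total if total else 1.0

    # 3. Scale the Node pool, never dropping below a small buffer.
    if busy_ratio > SCALE_UP_BUSY_RATIO:
        cluster.request_nodes(math.ceil(total * 0.2) or 1)   # grow by ~20%
    elif busy_ratio < SCALE_DOWN_BUSY_RATIO and total > MIN_BUFFER_NODES:
        idle_nodes = [n for n in nodes if all(p.state == "Cold" for p in n.pods)]
        releasable = max(0, len(idle_nodes) - MIN_BUFFER_NODES)
        cluster.release_nodes(idle_nodes[:releasable])

# Driven by a loop such as: while True: reconcile(cluster); time.sleep(CHECK_INTERVAL_S)
```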

You will find that the Node scheduling method is more flexible for the function computing scheduling engine. However, “with great power comes great responsibility.” In addition to managing Node scheduling, a Node-based solution also needs to handle the security isolation and usage of Pods within the Node. In this scenario, we can use “empty Pods” to preemptively occupy and pre-load a portion of Pods, thereby speeding up the creation of “real Pods”.

Pod Dimension #

As you may have guessed, the advantage of using Pod as the scaling unit is that it allows for more fine-grained control over the number of function instances. Now, let’s start with the “ancestor” of scaling, HPA (HorizontalPodAutoscaler), and understand the scaling process of the Pod dimension.

Before we begin, you can think about this question: since HPA is used for automatic scaling, can it be directly used for Serverless? Don’t rush to answer, let’s first look at the scaling mechanism of HPA.

Conventional Scaling #

On the official Kubernetes website, there is a definition of HPA, the horizontal autoscaling controller.

The HorizontalPodAutoscaler is implemented as a Kubernetes API resource and a controller. The resource determines the behavior of the controller. The horizontal pod autoscaling controller, running within the Kubernetes control plane, periodically adjusts the desired scale of its target (for example, a Deployment) to match observed metrics such as average CPU utilization, average memory utilization, or any other custom metric you specify.

You don't need to digest every word of this; roughly speaking, the autoscaling controller does two things:

First, it periodically retrieves the metrics data (such as CPU utilization and memory utilization) of the resources from the Kubernetes control plane.

Second, based on these metrics data, it controls the number of resources within a target range.

[Image: HPA architecture diagram from the official Kubernetes documentation]

From the official Kubernetes documentation you can see the basic shape of HPA: it controls the actual number of Pods by controlling a Deployment or ReplicationController (RC). The official architecture diagram is not the easiest to read, so let's add the metrics-collection process and look at it again:

[Image: HPA workflow with the metrics-collection process added]

What is the specific workflow? You can refer to the diagram to understand the process I am about to describe.

In Kubernetes, the various metrics are continuously collected by the corresponding metrics component (Heapster or a custom Metrics Server). HPA periodically pulls these metrics (such as CPU and memory usage) through the Metrics API or the aggregated API server, computes the desired number of Pods according to the scaling rules you define, and finally, based on the current actual number of Pods, adjusts the RC/Deployment to reach that desired number.
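The calculation HPA applies to those metrics is documented by Kubernetes as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). Here is that rule as a small Python sketch, with made-up numbers for the example:

```python
import math

def hpa_desired_replicas(current_replicas: int,
                         current_metric: float,
                         desired_metric: float) -> int:
    """Kubernetes HPA scaling rule:
    desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)
    """
    return math.ceil(current_replicas * current_metric / desired_metric)

# Example: 4 replicas running at 90% average CPU with a 60% target
# -> ceil(4 * 90 / 60) = 6 replicas.
print(hpa_desired_replicas(current_replicas=4, current_metric=90, desired_metric=60))
```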

At this point you should be able to answer the earlier question quickly: no, HPA cannot be used directly for Serverless.

This is because, in Serverless semantics, dynamic scaling must be able to take a service all the way down to 0, which HPA cannot do. HPA works by scaling a Deployment based on metrics reported for its Pods; once the Deployment's replica count reaches 0 and there is no traffic, there are no metrics left to act on, so HPA has no way to bring the service back up.

Extreme Scaling #

Precisely because of this, an additional mechanism is needed to scale from 0 to 1. Since the community and the major cloud vendors attach great importance to Knative, and some traditional enterprises choose it as the preferred underlying platform for private deployment during their transformation, I will explain this using Knative's KPA.

Knative’s scaling mainly includes three aspects: collecting traffic metrics, adjusting the number of instances, and the process from 0 to 1.

[Image: Knative KPA scaling architecture]

  • Collecting Traffic Metrics

The most important step in scaling is to collect the metrics we need. Let’s understand it by referring to the diagram.

In Knative, a Revision is an immutable snapshot of code and configuration at a point in time. Each Revision references a specific container image and any objects it needs to run (such as environment variables and volumes). A Deployment controls the number of function instances, and the instances I am referring to are the User Pods in the diagram.

Unlike the function instances in the scaling mechanism at the Node dimension we introduced earlier, each function instance (User Pod) here has two containers: Queue Proxy and User Container. I believe you have already guessed that the User Container deploys and runs our own business code, but what about the Queue Proxy?

In fact, when each function instance is created, the Queue Proxy is injected as a Sidecar, which means adding an abstraction layer next to the original business logic. Queue Proxy serves as the traffic entry point for each User Pod, responsible for rate limiting and traffic statistics. AutoScaler collects the traffic statistics from Queue Proxy at regular intervals, which will be an important basis for subsequent scaling.

In simple terms, it means adding a twin, the Queue Proxy, next to the container that runs our own business code; the way the twin is added is the Sidecar pattern. Whatever goes on inside this house (the User Pod), the twin records it all, and the AutoScaler on the outside collects everything from the twin to prepare for subsequent scaling.
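Conceptually, the "twin" only needs to keep count of the requests passing through it. The toy Python sketch below shows that bookkeeping idea; it is not Knative's actual implementation (the real Queue Proxy is written in Go), and the names used here are made up for illustration.

```python
import threading

class ConcurrencyTracker:
    """Toy stand-in for the Queue Proxy's traffic statistics: it counts
    in-flight requests so a scraper can read the current concurrency."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = 0
        self._total = 0

    def __enter__(self):           # wrap each proxied request in `with tracker:`
        with self._lock:
            self._in_flight += 1
            self._total += 1

    def __exit__(self, *exc):
        with self._lock:
            self._in_flight -= 1

    def snapshot(self):
        """What the AutoScaler would scrape periodically."""
        with self._lock:
            return {"concurrency": self._in_flight, "request_count": self._total}

tracker = ConcurrencyTracker()

def handle(request):
    with tracker:                  # the request would be forwarded to the User Container here
        return "ok"
```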

  • Adjusting the Number of Instances

After collecting the traffic metrics, the AutoScaler needs to adjust the number of instances based on them: it computes the desired instance count and applies it by modifying the Deployment behind the instances, thereby determining how many Pods to scale out or in.

The simplest algorithm is to spread the current total concurrency evenly across instances so that each instance stays at the configured per-instance concurrency target.

Here's a simple example: if the current total concurrency is 100 and the target concurrency per instance is 10, the adjusted instance count is 100 / 10 = 10. Of course, the actual number of instances added or removed also takes into account factors such as the current system load and the scheduling cycle.
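A minimal Python sketch of that concurrency-based calculation might look like the following; the min/max instance bounds are illustrative assumptions added to show where scale-to-zero and upper limits would plug in.

```python
import math

def desired_instances(total_concurrency: float,
                      target_per_instance: float,
                      min_instances: int = 0,
                      max_instances: int = 1000) -> int:
    """Divide the total observed concurrency evenly across instances so that
    each instance stays at or below the configured target concurrency."""
    if total_concurrency <= 0:
        return min_instances          # allows scale-to-zero when min_instances == 0
    desired = math.ceil(total_concurrency / target_per_instance)
    return max(min_instances, min(desired, max_instances))

print(desired_instances(100, 10))     # -> 10, matching the example above
```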

  • From 0 to 1

Seeing this, you may have a question: the process above covers the case where the Revision still has instances to receive traffic. But what happens if the Revision's instances have been scaled down to 0? Is incoming traffic simply dropped, or is it buffered somewhere?

To solve this problem, Knative specifically introduces a component called the Activator to handle traffic buffering and proxy load balancing.

[Image: traffic path through the Activator when the Revision has been scaled to 0]

Let's look at this picture. When the AutoScaler scales the function instances down to 0, it makes the Activator the traffic entry point for the instances-at-0 case, shown as the red line in the figure. On receiving traffic, the Activator temporarily buffers the requests and their information and actively notifies the AutoScaler to scale up; only once new function instances have been created does the Activator forward the buffered traffic to them.
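The Activator's behavior can be pictured as a small buffering proxy: hold requests while there are no instances, nudge the AutoScaler, and flush once Pods are ready. The following is a conceptual Python sketch of that flow, not Knative's real implementation; `request_scale_up`, `ready_instances`, and `forward` are hypothetical methods used only for illustration.

```python
import queue

class Activator:
    """Conceptual sketch of the Activator's role when a Revision is at 0."""

    def __init__(self, autoscaler, revision):
        self.autoscaler = autoscaler
        self.revision = revision
        self.buffer = queue.Queue()

    def receive(self, request):
        if self.revision.ready_instances() == 0:
            # Buffer the request and ask the AutoScaler to scale from 0 to 1.
            self.buffer.put(request)
            self.autoscaler.request_scale_up(self.revision)
        else:
            self.revision.forward(request)

    def on_instances_ready(self):
        # Called once new function instances are up: flush the buffered traffic.
        while not self.buffer.empty():
            self.revision.forward(self.buffer.get())
```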

This raises another question: when exactly does traffic switch between the Activator and the Revision's Pods? We will discuss this in the section on traffic forwarding. In the meantime, you can read ahead on the relevant components and parameters, such as the ServerlessService (SKS) resource and the Target Burst Capacity (TBC) setting.

Design Ideas for Scaling Model #

By now, we have learned about the typical scaling mechanisms of Serverless from two different dimensions: Node and Pod. So, if you were to design a scaling system, how would you implement it?

The Node and Pod solutions surely gave you some inspiration. Below, I will give you a general design idea from the three core points: metrics, decision-making, and quantity.

[Image: the three core points of a scaling model: metrics, decision-making, and quantity]

  • Metrics

Whether it is at the Node or Pod level, we find that the system ultimately scales based on some monitored metrics. Moreover, these metrics are usually collected through periodic polling.

For example, in the Node scaling case, the number of idle or occupied Nodes is used. In HPA, CPU utilization and memory usage are used as the metrics for Pod scaling. Knative uses a Collector to track the concurrency of Pods for scaling decisions.

Therefore, choosing the right metrics is the first step in designing scaling: based on your platform's characteristics, you need to decide which metrics to collect and how to put them to use.

  • Decision-making

After obtaining the metrics, a specialized module (such as AutoScaler) is used to make decisions based on these metrics. The decision-making process takes these metrics as input and outputs whether to scale up or down, and by how much.

The decision-making process is entirely up to you. For example, you can scale proportionally, by a fixed amount, or based on an aggregate of the metrics such as a windowed average.

You can also choose different strategies depending on conditions, such as scaling up immediately for sudden traffic spikes, or scaling with a delay when traffic changes smoothly. Knative, for example, has two modes for this: Stable mode and Panic mode.
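As an illustration of that idea, here is a rough Python sketch loosely modeled on the stable/panic split: a long averaging window drives normal decisions, while a short window reacting to spikes can override it. The window lengths and the 2x panic threshold are assumptions chosen for the example, not necessarily Knative's exact defaults.

```python
import math

def decide_replicas(stable_window_concurrency: list[float],
                    panic_window_concurrency: list[float],
                    target_per_instance: float,
                    current_replicas: int,
                    panic_threshold: float = 2.0) -> int:
    """Pick a replica count from two averaging windows: a long 'stable' window
    for smooth traffic and a short 'panic' window for sudden spikes."""
    stable_avg = sum(stable_window_concurrency) / max(len(stable_window_concurrency), 1)
    panic_avg = sum(panic_window_concurrency) / max(len(panic_window_concurrency), 1)

    stable_desired = math.ceil(stable_avg / target_per_instance)
    panic_desired = math.ceil(panic_avg / target_per_instance)

    if panic_desired >= panic_threshold * max(current_replicas, 1):
        # Spike detected: scale up immediately and never scale down while panicking.
        return max(panic_desired, current_replicas)
    return stable_desired

# e.g. steady ~30 concurrency over the last minute, but a burst in the last few seconds
print(decide_replicas([30] * 60, [120] * 6, target_per_instance=10, current_replicas=3))
```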

  • Quantity

Once the decision is made, the rest is simple: adjust the resource count according to the result. You rarely need to intervene in this step yourself; at the Pod level Kubernetes applies the new replica count, and at the Node level the underlying container or cloud service does the provisioning.

Introducing Predictive Capability to Services #

Finally, do you still remember the log-processing scenario we mentioned at the beginning? Its log traffic fluctuates significantly, for example dropping after 11 pm and picking up again after 6 am. To handle sudden surges more smoothly, we can predict load and pre-scale the system, making scaling more gradual.

[Image: adding a predictive module as a bypass alongside the Node-level AutoScaler]

Let's continue with Node-level scaling as the example. We can add a predictive module alongside the scaling module as a bypass: it periodically collects information on Node status and counts and makes periodic predictions based on that history. Of course, the historical data, prediction window, and prediction period all require extensive data training and human tuning before they work well in a production environment.
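As a bypass, the predictive module only needs to propose a Node count ahead of time, and the scaling module can take the larger of the reactive and predicted values. Below is a minimal Python sketch using a "same hour on previous days" average as the forecast; the history format, safety margin, and the max-of-both rule are all assumptions for illustration.

```python
from statistics import mean

def predict_nodes(history: dict[int, list[int]], hour: int, margin: float = 1.2) -> int:
    """Predict the Node count needed for a given hour of day from historical samples.

    history maps hour-of-day -> Node counts observed at that hour on past days.
    A safety margin is added so pre-scaling errs on the side of capacity.
    """
    samples = history.get(hour, [])
    if not samples:
        return 0                      # no history: leave the decision to reactive scaling
    return round(mean(samples) * margin)

def reconcile_with_prediction(reactive_nodes: int, predicted_nodes: int) -> int:
    # Prediction is only a bypass hint: never let it shrink below the reactive decision.
    return max(reactive_nodes, predicted_nodes)

history = {6: [40, 44, 38], 23: [8, 10, 9]}     # made-up samples
print(reconcile_with_prediction(reactive_nodes=12,
                                predicted_nodes=predict_nodes(history, 6)))
```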

However, we should also note that this is not a panacea, and it is not always accurate in all situations. It can only provide some assistance within a certain threshold.

At the same time, this is a constant trade-off between performance and cost, especially for cloud providers, and striking the best balance is what every developer ultimately pursues.

Summary #

Lastly, let’s summarize what we have learned today.

In this lesson, we started from two common scenarios and introduced the advantages that dynamic scaling, the core feature of Serverless, offers over traditional PaaS services. The advantage shows up not only in resource utilization but also in cost.

We then walked through the scaling mechanisms commonly used in the industry, from the angles of practical application and historical evolution, focusing on Kubernetes HPA and the KPA mechanism of the popular Serverless engine Knative. We explained the principles of scaling in three parts: collecting traffic metrics, adjusting the number of instances, and scaling from 0 to 1.

The purpose of the above explanations is to abstract a set of design ideas for building a scaling system. The key points to grasp are metrics, decision-making, and quantity.

Lastly, if we are providing services from a platform’s perspective, we can make them more intelligent by introducing predictive systems to make scaling even smoother and more optimal.

Discussion Question #

Alright, we've reached the end of this lesson, and I'll leave you with a discussion question.

Think about it: are there better ways to avoid frequent scaling back and forth while still guaranteeing a stable supply of resources?

Feel free to write down your thoughts and answers in the comments section. Let’s exchange and discuss together. Thank you for reading, and feel free to share this lesson with more friends to progress together.