04 Cold Start How to accelerate the first invocation process of a function #

Hello, I’m Jingyuan.

In the previous few lectures, we mainly discussed the “benefits” of serverless or how to “make good use” of serverless. But is serverless perfect without any flaws?

With this question in mind, today I will talk with you about one flaw of serverless: cold start. Why is this issue so important? Because whether I am facing private enterprises or customers on public clouds, the first thing that usually comes up is the performance issue caused by cold start. Cold start affects a service’s latency and stability, and thereby the user’s experience of the service. In this “real-time era,” that is an issue that cannot be ignored.

In this lecture, I will analyze the reasons behind cold start and the process of cold start. From the perspective of both the platform and developers, I will help you understand how to accelerate function startup and master the optimization techniques involved.

What is Cold Start? #

In previous lessons, we learned that when a request is scheduled to a function instance, if the instance has not been recycled since the last execution, the request can simply reuse this instance for code execution. This process is called a warm start.

If the service is requested for the first time, or if the container instance is recycled after a service request, a cold start is triggered. So, how does a cold start work?
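
Before walking through the steps, here is a minimal sketch, in Python, of the reuse-or-provision decision just described. Every name in it (idle_instances, schedule, cold_start) is hypothetical and only models the idea; it is not how any particular platform’s scheduler is written.

# A minimal, hypothetical model of the reuse-or-provision decision.
idle_instances = {}   # function name -> list of idle, already-initialized instances

def schedule(function_name):
    pool = idle_instances.get(function_name, [])
    if pool:
        return pool.pop(), "warm start"          # reuse a live instance as-is
    return cold_start(function_name), "cold start"

def cold_start(function_name):
    # Stand-in for the full cold start path described in the next section:
    # create a container, download the code, prepare the environment,
    # initialize the runtime, and initialize the user code.
    return {"function": function_name, "ready": True}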

What are the steps in a cold start? #

In the industry, different optimization methods used by cloud providers or open-source projects can result in different cold start times, but the basic principles are similar. Let’s take a look at the steps involved in a cold start.

[Figure: the six steps of a cold start, split into platform-side and user-side stages]

Step 1: Container Creation. This step is usually reflected in the scaling process. When all container instances are busy processing requests, new containers need to be created by requesting them from the cluster.

It’s worth mentioning that serverless computing platforms usually support multiple language runtimes, and these runtimes are generally packaged into images and run in the Kubernetes cluster as a DaemonSet. During a cold start, the runtime libraries required for execution are dynamically mounted to the corresponding runtime paths based on the relevant parameters. I did not include this process in the diagram, as I consider it part of resource scheduling.

Step 2: Code and Layer Dependency Download. This step is usually the most time-consuming part of a cold start. Serverless computing itself does not have the capability to persist code packages and layer dependencies. These are usually retrieved from other storage servers. Therefore, the duration of this step is influenced by factors such as the size of the code package uploaded by the user, and the network speed. Code packages are usually in compressed form, so they need to be downloaded and then extracted.
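
As a rough illustration of this step, the sketch below downloads a zipped code package and extracts it into the path the runtime loads from. The URL and target directory are placeholders; real platforms pull from their own object storage with their own credentials and caching.

import io
import urllib.request
import zipfile

def fetch_code_package(package_url, target_dir):
    # Download the compressed code package from the storage service ...
    with urllib.request.urlopen(package_url) as resp:
        data = resp.read()
    # ... and extract it into the path the runtime will load code from.
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        zf.extractall(target_dir)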

Step 3: Environment Variables and Parameter File Preparation. Mainstream serverless computing platforms often provide the ability to inject environment variables, which occurs during the cold start phase. Additionally, runtime and container may require preparation of parameter configuration files, but we can temporarily ignore the optimization of this step as it usually takes less time.

Step 4: User VPC and Related Resource Preparation. If the user has connected the function to a private network, some initialization work needs to be done to establish VPC network connectivity for the container. Additionally, if the user is using features such as distributed file systems, they need to be mounted, which also adds extra time.

Step 5: Runtime Initialization. This refers to the startup process of standard runtime environments provided by cloud providers. This step is heavily influenced by the programming language type. For example, Java code startup is slower compared to other languages due to the need to start the JVM.

Step 6: User Code Initialization. This step is sometimes confused with runtime initialization. For compiled languages, the code is already a complete executable after packaging, so part of this work has effectively happened by the time the runtime starts; for interpreted languages, this step is essentially the loading of the user’s code package.

In other words, this step covers loading the user’s custom business logic and running any initialization it performs, such as establishing database connections outside the Handler, or building caches inside the Handler method.
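
For example, the following sketch shows the pattern this step covers. It uses sqlite3 purely as a stand-in for whatever database the business code actually talks to: the connection at module level is created once during user code initialization, while the cache inside the handler is filled per request.

import sqlite3

# Runs once, during user code initialization (still part of the cold start).
db_connection = sqlite3.connect("/tmp/app.db")
db_connection.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)")
local_cache = {}

def handler(event, context):
    # Runs on every request: reuses the connection created at init time and
    # memoizes results in a cache built up inside the handler.
    key = event["key"]
    if key not in local_cache:
        row = db_connection.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)
        ).fetchone()
        local_cache[key] = row[0] if row else None
    return local_cache[key]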

At this point, you may wonder: on a public cloud serverless platform, does a large code package mean extra cost on every cold start, since it increases the overall latency? In fact, it does not. As you can see in the diagram above, I deliberately distinguished between platform-side and user-side time consumption. Billing starts only when the program begins executing, that is, from Step 6. Even if a runtime such as Java starts more slowly, it only feels slower to you; the cost stays the same.

What factors affect the duration of a cold start? #

Based on the six steps outlined above, the majority of cold start duration is influenced by the following four aspects:

  1. Resource scheduling and container creation process: Usually completed within seconds.
  2. Code and layer dependency download process: Depends on the code size and whether acceleration is available, typically ranging from milliseconds to seconds.
  3. VPC network establishment process: Mainly the time taken for elastic network interface and route distribution, usually in seconds.
  4. Runtime and user code initialization process: More closely related to the user, depending on the programming language, startup time will vary, generally ranging from milliseconds to seconds.

From the above analysis, we can see that optimizing cold start is not a one-sided effort: the platform and the user need to work together, following one principle, making the slow parts fast. Next, let’s do exactly that.

Optimization Measures on the Platform Side #

As an observant reader, you may have noticed that in this diagram, I have labeled the different stages with “Platform Side” and “User Side”. So first, let’s take a look at the optimization measures on the platform side in terms of function instance loading and execution time.

[Figure: overview of platform-side optimization measures for cold start]

Resource Scheduling and Container Creation #

We know that when a container starts, it usually checks if the relevant image is available locally. If not, it needs to pull from a remote repository. The container environment for functions is typically based on a basic image like Alpine or Ubuntu, while the application layer resource objects are packaged on top of the base image. If remote pulling is required, it can be time-consuming.

In a traditional container runtime, the entire image has to be downloaded and then extracted before the container can start, which lengthens startup time. Moreover, in a large cluster, the downloading and extraction put significant pressure on the network and disk I/O, delaying large-scale container startup. In reality, however, only a portion of the image data may actually be used when the container starts.

Image acceleration aims to solve the issue of long startup time caused by the need to download the image before starting the container. In the industry, there are usually two approaches:

  • On-demand loading: Using an accelerated image version, configuring acceleration rules and tags, and loading the image on demand to avoid the waste of bandwidth and the impact on distribution efficiency caused by downloading the full image.
  • P2P acceleration: Leveraging the intranet bandwidth resources of compute nodes to distribute images among nodes, reducing the pressure on the image repository.

In addition, major cloud providers also alleviate storage and service pressure on a single repository by enabling multiple image repositories, thus further enhancing resource scheduling and download speed.

Code and Layer Dependency Download #

After preparing the basic resources and containers, let’s take a look at the acceleration methods related to code and layer dependencies. These methods are somewhat related to the users. Here, I will share three commonly used methods with you.

  • Method 1: Cache the compressed package

Although the platform cannot directly manipulate the user’s code, we can improve the speed of obtaining code and layer dependencies by implementing local caching.

When a function instance handles a request during its first cold start, the code is downloaded to the Node and tagged with a unique identifier. When other function instances on that Node need the same code, it can be mounted directly into the instance’s corresponding path based on that tag, eliminating the need to download the code package again.
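
A minimal sketch of this idea is shown below. The cache directory, the hash-based key, and the final mount step are assumptions made for illustration, not the layout of any specific platform.

import hashlib
import os
import urllib.request

NODE_CACHE_DIR = "/var/cache/faas/code"   # hypothetical node-level cache directory

def ensure_code_on_node(package_url, code_sha256):
    # The package is keyed by its content hash, so every instance of the same
    # function version on this node shares a single cached copy.
    cached_path = os.path.join(NODE_CACHE_DIR, code_sha256 + ".zip")
    if not os.path.exists(cached_path):
        os.makedirs(NODE_CACHE_DIR, exist_ok=True)
        data = urllib.request.urlopen(package_url).read()
        assert hashlib.sha256(data).hexdigest() == code_sha256
        with open(cached_path, "wb") as f:
            f.write(data)
    # The cached package is then mounted (or extracted) into the instance's
    # code path instead of being downloaded again.
    return cached_path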

  • Method 2: Built-in commonly used dependencies

The platform can also pre-build some commonly used dependencies based on typical usage scenarios. For example, dependencies like NumPy and Pandas for data processing in Python, and Xpdf for PDF processing.

These commonly used dependencies are usually built together when constructing the container base image. This ensures that these common dependencies can be directly used without the need for additional downloads in any container instance.

  • Method 3: Extreme compression

We can further compress the code package when the user creates a function to make it smaller. A common approach is to use a highly compressed file system such as squashfs, which can pack several gigabytes of files into a few hundred megabytes.
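
As a rough sketch, and assuming squashfs-tools and mount privileges are available on the node, the packing and mounting could look like the following; the paths are placeholders.

import subprocess

def build_squashfs(code_dir, image_path):
    # Pack the extracted code directory into a compressed, read-only image.
    subprocess.run(["mksquashfs", code_dir, image_path, "-noappend"], check=True)

def mount_squashfs(image_path, mount_point):
    # Loop-mount the image read-only at the runtime's code path (needs privileges).
    subprocess.run(
        ["mount", "-t", "squashfs", "-o", "loop,ro", image_path, mount_point],
        check=True,
    )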

Acceleration across VPC networks #

From the platform’s perspective, functions run in a dedicated VPC cluster. If our cloud function needs to interact with resources in the user’s VPC cluster, such as Elasticsearch or Redis, what should we do?

The traditional way in the industry is to use dynamic Elastic Network Interface (ENI) creation to achieve cross-VPC access. However, this approach has three obvious drawbacks:

  1. Creating an elastic network interface takes a long time, which increases cold start time.
  2. Elastic network interfaces consume IP resources from the user’s VPC. If there are no available resources, creation will fail.
  3. Elastic network interfaces that sit idle lead to significant resource waste.

[Figure: the traditional approach of dynamically creating elastic network interfaces for cross-VPC access]

So how can we improve? Currently, I recommend creating a proxy within the cluster’s VPC. The core of this approach lies in the use of IP tunneling technology, which allows the platform to reach the user’s VPC through proxy nodes while maintaining the function’s original target IP address.

In practice, the function platform creates lightweight virtual machines (VMs) as proxy nodes and binds the elastic network interfaces to these proxy VMs. All traffic destined for the user’s VPC is then routed through the proxy.

To ensure service stability, the proxy nodes are usually implemented in a master-slave configuration. Under normal circumstances, the node’s egress traffic is routed through the master proxy for request forwarding. However, if the master proxy fails, the platform switches to the slave proxy for traffic forwarding.

[Figure: proxy-based cross-VPC access with master and slave proxy nodes]

The proxy can be created in advance by listening for the user’s VPC-association operations in the console. As a result, even the first request the function processes benefits directly from the faster forwarding path.

However, it is important to note that this approach also has a drawback: it requires additional node resources for the proxy. For now, think about how you would implement this method yourself; later on, I will walk through a hands-on implementation in more detail.

Preloading user instances #

The preloading process is generally related to algorithms and strategies. Based on my frontline experience, I will provide you with some ideas. However, before applying these ideas, you need to train and fine-tune them according to the actual usage and scenarios of your platform to achieve good results. Let’s take a look at some commonly used methods for preloading user code instances.

The first method makes predictions based on functions calling each other. You can refer to Tencent Cloud’s chained-call pre-warming approach for functions. As shown in the diagram below, function A calls B and C, so we can pre-start functions B and C based on the call topology.

[Figure: pre-warming functions B and C based on function A’s call topology]
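
A minimal sketch of this idea, with a hypothetical call topology and a stubbed pre_warm helper: when a request for function A arrives, its downstream functions are warmed before the chained calls reach them.

# Hypothetical call topology learned from historical invocations: A calls B and C.
CALL_TOPOLOGY = {"function_a": ["function_b", "function_c"]}

def pre_warm(function_name):
    # Stand-in for asynchronously starting an instance of the function.
    print(f"pre-warming {function_name}")

def on_request(function_name):
    # Before handling a request for A, warm its downstream functions so that
    # the chained calls to B and C land on instances that are already ready.
    for downstream in CALL_TOPOLOGY.get(function_name, []):
        pre_warm(downstream)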

The second method is based on function versions. If version 1 does not carry 100% of the traffic, we can infer that whichever version X carries the remaining traffic should also be loaded. Tencent Cloud has also explained this method publicly for its function compute service.

[Figure: pre-warming based on the traffic split between function versions]

The third method is to deploy preloaded images onto nodes. This is especially useful for very large images, such as those around 10 GB: they can be distributed in advance and marked by the agent on each node, so that when traffic arrives it can be dispatched directly to a node where the image is already in place and the container can respond to the function execution quickly.
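
A sketch of how a scheduler might prefer such nodes. The node fields (preloaded_images, load) and the image name are assumptions for illustration only.

def pick_node(nodes, image_name):
    # Prefer nodes whose agent has already pulled and marked the large image;
    # fall back to all schedulable nodes when none has it preloaded.
    preloaded = [n for n in nodes if image_name in n.get("preloaded_images", [])]
    candidates = preloaded or nodes
    return min(candidates, key=lambda n: n["load"])

# Example: the large image is only preloaded on node-2, so node-2 is chosen
# even though node-1 is less loaded.
nodes = [
    {"name": "node-1", "load": 0.2, "preloaded_images": []},
    {"name": "node-2", "load": 0.6, "preloaded_images": ["big-model-image:v1"]},
]
print(pick_node(nodes, "big-model-image:v1")["name"])   # node-2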

There are many other methods along these lines; you can further explore ideas such as machine-learning-based prediction and prefetching, and I am happy to discuss them with you at any time.

So far, we have discussed the optimization methods from the platform side. Serverless technology is constantly evolving, so if you are a platform developer, you can check if there are any other methods that can better optimize your function computing platform. Of course, when we optimize, we should not forget about ROI. Operating costs are also a key focus for every platform developer.

Optimization methods on the user side #

If you are a developer using serverless functions to build business code and have no access to the underlying infrastructure, is there anything you can do? One option is to choose a cloud provider platform that is feature-rich, high-performance, and cost-effective. Yes, this should be your first consideration. The platform will handle many optimization measures, including those mentioned above, and save you a lot of trouble.

Apart from that, what else can we optimize ourselves? Here are 5 directions for you to consider:

  • Control the size of code packages appropriately;
  • Choose high-performance runtimes;
  • Make reasonable use of reserved instances within a cost-controlled range;
  • Use scheduled tasks to keep latency-sensitive function instances active;
  • Make reasonable use of local caching.

You can take a moment to think about it, and then we can discuss each of these directions one by one.

Controlling the size of code packages #

First of all, the fetching of code packages is the most time-consuming part of the entire cold start process. Even if the package is downloaded from the internal network, if the code package is overly bloated, it will significantly increase the end-to-end response time.

Therefore, before deploying the function, remove files and dependencies that are not needed at execution time, keeping the code package as lightweight as possible.

Choosing high-performance runtimes #

From the analysis of where cold start time goes, we know that the choice of programming language affects the program’s startup speed. Generally speaking, interpreted languages such as Python and Node.js start up quickly, and compiled languages such as Golang, which produce a self-contained executable binary, also load quickly. Languages like Java, however, which must start a JVM, are comparatively slower.

Here, I will use the Alibaba Cloud Function Compute platform as an example to show the difference runtime choice makes with a “Hello World” program. I chose Golang and Java for comparison, and the trace results are shown in the following images:

[Figures: cold start traces for the Java and Golang “Hello World” functions]

As you can see, the cold start time for Java is 282.342ms, while for Golang it is 67.771ms. The cold start time for Java is 4.166 times that of Golang. Additionally, we can also observe from the invocation process in the call chain that Golang has obvious advantages.

Making reasonable use of reserved instances within a cost-controlled range #

Currently, mainstream public cloud function compute platforms (such as Alibaba Cloud FC, Tencent Cloud SCF, Baidu Cloud CFC, etc.) provide reserved instance functionality.

Users can request a fixed number of reserved instances for a function based on actual needs. Each instance is pre-prepared with the function code and remains in a resident state until it is manually released by the user. This allows the function to be ready to respond to incoming requests in a hot start manner.

Under this mechanism, the function platform will reserve instances for the function based on the user’s request. When a function receives a request, it will be scheduled to a reserved instance and processed in a hot start manner. Only when all the reserved instances of the function are in a working state, will the request be scheduled to a non-reserved instance and executed in a cold start manner.
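
In scheduling terms, the behavior described above roughly corresponds to the sketch below; the two idle pools are hypothetical in-memory stand-ins for the platform’s reserved and elastic instance pools.

def route_request(function_name, reserved_idle, elastic_idle):
    # reserved_idle / elastic_idle: function name -> list of idle instances.
    if reserved_idle.get(function_name):
        return reserved_idle[function_name].pop(), "hot start (reserved)"
    if elastic_idle.get(function_name):
        return elastic_idle[function_name].pop(), "hot start (elastic)"
    # All reserved instances are busy and no idle elastic instance exists,
    # so a brand-new instance must be created: a cold start.
    return None, "cold start"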

However, when using reserved instances to solve the cold start problem, you also need to consider the idle cost. Because the platform needs to reserve additional resources for users, there will be additional costs for the users. Therefore, users need to have a reasonable estimate of the expected request volume.

Warming up with scheduled tasks #

If you find that reserved instances are not cost-effective for you, there is another method: you can pre-warm your service through a scheduled trigger, for example once every minute. Since the function platform usually keeps containers “hot” for a while after each invocation, this can speed up your function’s response to some extent.

One thing to note, and something business customers often ask me about: “My function performs real business processing and cannot be invoked arbitrarily.”

So, how do we handle this? Actually, it can be solved with a simple logical statement.

def do_something():
    # Placeholder for the real business logic.
    pass

def handler(event, context):
    # Pre-warm requests do not carry the "exec" flag, so they return
    # immediately without running any business logic.
    if event.get("exec") is True:
        return do_something()
    return

In this example, I differentiate between pre-warm and non-pre-warm requests with an “exec” flag. Of course, with the pre-warming method, some cloud providers have also provided configuration capabilities for scheduled triggering, which further saves you costs. You can also compare this method with the aforementioned “reserved instances” in practical scenarios.

Making reasonable use of local caching #

Finally, some AI scenarios regularly run data training tasks. Typically, the approach is to fetch data from a remote server to the local machine and then run the training task. However, we can avoid the delay of fetching the data on every invocation by adding cache logic that first checks the local temporary disk space to see whether the data is already there.
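
For example, using the instance’s local temporary disk (commonly /tmp), the fetch can be guarded by a simple existence check, as in the sketch below; the cache directory and dataset URL are placeholders.

import os
import urllib.request

CACHE_DIR = "/tmp/datasets"   # instance-local temporary disk space

def load_dataset(url, name):
    # Reuse the copy left behind by a previous invocation on this instance;
    # download only when the local cache misses.
    local_path = os.path.join(CACHE_DIR, name)
    if not os.path.exists(local_path):
        os.makedirs(CACHE_DIR, exist_ok=True)
        urllib.request.urlretrieve(url, local_path)
    return local_path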

Summary #

So far, we have discussed various optimization methods for cold start. Finally, let me summarize today’s content.

Today, starting from the reasons and working principles of cold start, I summarized some core factors causing cold start. Based on these factors, from the perspective of the platform and the user, I also proposed relevant optimization methods. I believe that the experience I have accumulated from working on the front line will definitely be helpful to you.

The process of cold start consists of six parts: container creation, download of code and layer dependencies, preparation of environment variables and parameter files, preparation of the VPC network and related resources, runtime initialization, and user code initialization.

If you are a platform developer, you can optimize the duration of cold start through four methods:

  • Optimize resource scheduling and container creation through image acceleration methods: If you are using a Kubernetes scheduling service provided by a cloud platform vendor, you can further parallelize the processing of your container images by configuring preloading scripts.
  • Cache packages, compress them to the utmost, and embed common dependencies: Reduce the dependencies at runtime for users.
  • Use proxy mode for VPC network acceleration: VPC network acceleration is usually only used in cross-VPC scenarios. By using IP tunneling technology and proxy mode, the network latency problem that has been criticized for years can be reduced to milliseconds, which is a significant breakthrough in the field of serverless function computing.
  • Use prediction algorithms for function instances to help users mitigate cold start issues to a certain extent.

If you are a business developer, I have also provided you with five directions in this lesson:

  • Reasonably control the size of code packages: Here, we need to change the bad habits in traditional coding and minimize the code as much as possible.
  • Choose runtimes with higher performance: I believe that through the comparison between Golang and Java, you can directly feel the performance difference brought by the choice of programming language.
  • Reserve instances and perform scheduled warm-ups: These two directions require cooperation between users and platforms and require you to use them flexibly.
  • Make reasonable use of local cache: For scenarios that fetch large amounts of data, use caching appropriately to avoid repeated fetches, which consume network bandwidth and add execution time. That not only hurts the performance of your service but also increases cost.

Serverless technology is constantly evolving and updating. I hope that my introduction today can stimulate your thinking, inspire you to think creatively, and come up with more optimization methods.

Reflection #

Alright, that’s the end of this class. Finally, I have a reflection question for you.

In your actual work, what optimization techniques have you used? Or, which scenarios have you encountered that require further discussion and exchange?

Feel free to write down your thoughts and answers in the comment section. Let’s discuss and exchange ideas together. Thank you for reading, and please feel free to share this class with more friends to learn together.