
58 Q&A How to Perform Performance Analysis of Container Cold Starts #

Hello, I’m Ni Pengfei.

Since the update of this column, we are coming to the end of the final part - the comprehensive case module. I’m glad to see that you have been actively learning, thinking, practicing, and enthusiastically sharing your analysis and optimization methods for various performance problems you have encountered in real environments.

Today is the sixth Q&A session for performance optimization. As usual, I have selected some typical questions from the comments in the comprehensive case module to answer in today's session. For your convenience, they are not strictly arranged in the order of the articles. For each question, I have attached a screenshot of the corresponding comment; if you want to review the original content, you can scan the QR code at the bottom right of the question.

Question 1: Analysis of Container Cold Startup Performance #

In the article Why does containerization slow down application startup?, we analyzed the problem of slow application startup caused by containerization. To briefly review the case, Docker uses cgroups to set a memory limit for the container, but the application inside the container is not aware of this limit and still allocates too much memory; as a result, the system kills it with an Out of Memory (OOM) error.

Based on this case, Tony had a more in-depth question.

We know that containers bring great convenience for managing applications, and new software architectures such as Serverless (which focuses on the application runtime rather than on servers) and FaaS (Function as a Service) are also built on container technology. Although container startup is already fast, the time needed to start a brand-new container, known as a cold start, is still relatively long compared with the performance requirements of applications.

So, how should we analyze and optimize cold start performance?

The most crucial point of this question is to understand where the startup time is spent. Generally, the startup of a Serverless service includes:

  • Event trigger (e.g., receiving a new HTTP call request);

  • Resource scheduling;

  • Image pulling;

  • Network configuration;

  • Application launch, and so on.

The time consumed by these processes can be monitored through distributed tracing, which can then help identify the process with the longest execution time.

Next, for the process with the longest execution time, we can use application monitoring or dynamic tracing methods to pinpoint the most time-consuming submodules, thus identifying the bottleneck that needs to be optimized.

For example, in the image pulling process, we can reduce the image pulling time by caching hot images. In the network configuration process, we can accelerate it by preallocating network resources. And for resource scheduling and container startup, we can optimize them by reusing pre-created containers.
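If you want a quick, local feel for where that time goes before setting up full distributed tracing, you can time the individual phases by hand. The commands below are only a rough sketch: they assume Docker is available and use the nginx:alpine image purely as an example, separating the image pulling time from the container creation and startup time.

# Time the image pull (with a cold local image cache)
time docker pull nginx:alpine

# Time container creation and startup (the image is now cached locally)
time docker run --rm nginx:alpine true

If the pull dominates, image caching will help the most; if creation and startup dominate, pre-created containers and preallocated network resources will help the most.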

Question 2: What is the difference between CPU Flame Graph and Memory Flame Graph? #

In the case of high CPU utilization caused by kernel threads, we used perf and the flame graph tools to generate a flame graph, a dynamic vector graphic of the hot kernel call stacks, and identified the kernel functions that executed most frequently when the performance problem occurred.

Since the focus of that case analysis was on how busy the CPU was, the flame graph generated there is called an on-CPU flame graph. Besides this, there are also off-CPU flame graphs, memory flame graphs, and others, which show where the CPU is blocked and where memory is allocated and freed, respectively.

Therefore, Li Xiaoyao raised a very good question: both are flame graphs, so what is the difference between the data behind a CPU flame graph and the data behind a memory flame graph?

This question hits the nail on the head. The biggest difference between CPU flame graphs and memory flame graphs is the source of the data, that is, the function call stacks being sampled; the format of the flame graph itself stays the same.

  • For CPU flame graphs, the collected data mainly consists of functions that consume CPU.

  • For memory flame graphs, the collected data mainly consists of memory management functions such as allocation, deallocation, and page swapping.

To give an example, when you run perf record, the default sampling event is cpu-cycles, which collects on-CPU data and produces a CPU flame graph. If you switch the sampling event to page faults with perf record -e page-faults, you collect data on memory page faults instead, and the resulting flame graph naturally becomes a memory flame graph.
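As a concrete sketch, here is how the two kinds of flame graphs can be generated with perf plus Brendan Gregg's FlameGraph scripts. The script paths are assumptions; adjust them to wherever you cloned the FlameGraph repository.

# CPU flame graph: sample on-CPU call stacks (cpu-cycles by default) for 30 seconds
perf record -a -g -F 99 -- sleep 30
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > cpu-flamegraph.svg

# Memory flame graph: sample page-fault call stacks instead
perf record -a -g -e page-faults -- sleep 30
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl --color=mem > mem-flamegraph.svg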

Question 3: What should I do if perf probe fails? #

In the article How to Use Dynamic Tracing, we have learned how to use dynamic tracing tools like perf and bcc through several examples. These dynamic tracing methods allow us to dynamically understand the execution process of applications or the kernel without modifying code or restarting services. This is particularly useful for troubleshooting complex and hard-to-reproduce issues.

When using dynamic tracing tools, hexadecimal function addresses are not easy to understand. In that case, we need debugging information to convert them into intuitive function names. For the kernel, as I have mentioned several times, you need to install debuginfo. But what should we do for applications?

There are actually two methods.

The first method: if the application provides a debugging information package, you can install it directly. For example, for bash in our case, you can install its debugging information with the following commands:

# Ubuntu
apt-get install -y bash-dbgsym

# CentOS
debuginfo-install bash

The second method is to recompile the application from source and turn on the compiler's debugging information switch, for example, by adding the -g option to gcc.
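To make both methods concrete, here is a minimal sketch. The gcc line recompiles a hypothetical app.c with debug information; the perf probe lines mirror the bash readline example from the dynamic tracing article and only resolve correctly once bash's debugging information is installed.

# Method 2: rebuild your own application with debug information
gcc -g -O2 -o app app.c

# With bash's debuginfo in place, trace the return value of readline()
perf probe -x /bin/bash 'readline%return +0($retval):string'
perf record -e probe_bash:readline__return -aR sleep 5
perf script
perf probe --del probe_bash:readline__return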

Question 4: Monitoring Microservices Applications with the RED Method #

In the article System Monitoring, which describes a comprehensive approach to monitoring the system, I introduced the commonly used USE method for monitoring the performance of system resources. The USE method simplifies the performance metrics of system resources into three categories: utilization, saturation, and errors. When any of these metrics is too high, it indicates a potential performance bottleneck in the corresponding system resource.

However, these metrics are obviously not suitable for monitoring applications, because the core metrics of an application are the number of requests, the number of errors, and the response time. So what should we do? This is exactly the RED method that Adam mentioned in the comments.

The RED method is a monitoring approach proposed by Weave Cloud for microservices, designed to be used together with Prometheus. It focuses on each microservice's request rate (Rate), number of errors (Errors), and response time (Duration). In short, the RED method is suitable for monitoring microservice applications, while the USE method is suitable for monitoring system resources.
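To make the three RED metrics concrete, here is a minimal sketch that queries them from a Prometheus server through its HTTP API. The server address and the metric names (http_requests_total and http_request_duration_seconds_bucket) are assumptions; in practice they depend on how your services are instrumented.

# Rate: per-second request rate over the last 5 minutes
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=sum(rate(http_requests_total[5m]))'

# Errors: per-second rate of requests that returned a 5xx status
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m]))'

# Duration: 99th-percentile response time from a latency histogram
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))'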

Question 5: Methods for diving into the kernel #

When locating performance issues, we often use methods such as perf, eBPF, and SystemTap to troubleshoot, and we may find that the hotspots lie in certain functions inside the kernel. The question from Qing Shi and Xfan is how to understand and dig into the principles of the Linux kernel, and in particular how to figure out the meaning of the kernel functions shown by performance tools.

In fact, the best way to understand what a kernel function does is to read the source code of the kernel version you are using. Here, I recommend the website https://elixir.bootlin.com. It is quite simple to use: select the kernel version on the left, and then search for the kernel function by name.

The reason I recommend this site is that it not only lets you quickly locate a function, it also lets you jump to the definition of any function, variable, or macro. So when you run into an unfamiliar function or variable, you can click on it and jump straight to its definition.

In addition, for eBPF, besides understanding it through the kernel source code, I recommend starting with the BPF Compiler Collection (BCC) project. BCC provides many short examples that can help you quickly understand the working principle of eBPF and become familiar with the development approach for eBPF programs. Once you understand these basic usage patterns, diving into the internals of eBPF will be much easier.
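For example, on Ubuntu you can install the packaged BCC tools and try the ready-made tracing examples right away. The package and tool names below are the Ubuntu ones (with a -bpfcc suffix); other distributions name them slightly differently.

# Install the packaged BCC tools and the kernel headers they need
apt-get install -y bpfcc-tools linux-headers-$(uname -r)

# Trace every new process executed on the system
execsnoop-bpfcc

# Trace file opens, with process name and PID
opensnoop-bpfcc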

Today I mainly answered these questions. As always, you are welcome to keep writing your questions and thoughts in the comments, and I will keep the conversation going with you there. I hope that, through each Q&A session and each exchange, we can together turn the knowledge in this column into your own abilities.