13 Observable Metrics: How to Construct a Multi-Dimensional Monitoring System for Serverless #

Hello, I’m Jingyuan.

Today I want to share with you the topic of observability, another advanced capability that a function computing platform must have.

We all know that quickly identifying online failures is a problem almost every developer faces. In function computing, a production application can include dozens or even hundreds of cloud functions. Combined with the statelessness of functions and the fact that their underlying resources are invisible to users, this makes troubleshooting considerably harder.

Observability is therefore an extra pair of eyes for business-oriented cloud service platforms: it helps operations staff and developers quickly troubleshoot failures and locate problems.

Perhaps you have encountered the following problems during function development:

  • Why did my function’s response time suddenly become so long?
  • Why can’t my function be accessed suddenly?
  • Why does my function keep returning 4xx and 5xx errors?
  • Why does my function keep timing out?

With these questions in mind, I will use the three pillars of observability (metrics, logs, and traces) to walk you through how an observability system is built in a Serverless function computing platform, from several different perspectives.

I hope these two lessons give you a clear picture of the problems that can arise during function development and platform operation in function computing, along with the corresponding solutions. By building a multi-dimensional Serverless monitoring system, we can help businesses run faster and more stably, and each of the questions above will be answered one by one.

The Importance of Observability in Serverless #

In serverless function computing, problems can surface not only while developing functions but also while operating and maintaining the function computing platform itself. I group these problems into two broad categories: function execution and platform operation.

Image

For these two categories of problems, we often cannot immediately find the root cause of the problem based on the result returned by the function or the apparent symptoms of the problem. Instead, we need to dig deeper with each layer of the invocation chain.

It can be said that the invisibility of resources makes the function computing platform look like a “black box” from the user’s perspective. After sending a request to the “black box” and receiving the execution result, the user has no idea how the request was scheduled onto a function instance or how the code was actually executed. Problems such as code exceptions, execution timeouts, or exceeding concurrency limits, like the issues listed in the image, can only be inferred from the output of the “black box”.

On the other hand, the function computing platform involves interactions among multiple control services such as scheduling, scaling, and metadata management. The function instances are distributed discretely in a massive Kubernetes cluster, and there is a possibility of single node failures. In addition, the downstream services it depends on may also be unstable. This complex architecture makes it more difficult to troubleshoot issues that occur online.

All these problems highlight the importance of observability in the function computing platform. So, how should a multi-dimensional observability system for function computing be constructed?

We can refer to the observability approach in cloud native, which includes providing monitoring metrics, logs, and tracing data. Then, based on the actual needs of developers and platform operators, we can construct a multi-dimensional observability system.

It is important to note that when constructing the observability system, we should also account for the characteristics of the function computing platform. On the user side, developers care mainly about the business-level processing flow; just as with other PaaS platforms, they do not care at which internal step a request gets stuck. The observable data exposed to users should therefore be decoupled as much as possible from the details of the scheduling layer. On the platform side, we also need to account for the complex cluster environment and the relationships between multiple services.

Overall Solution #

The following mind map summarizes the key points of building an observability system for Serverless function computing, combining the user and platform perspectives. For the different scenarios of function computing, we usually need to think in terms of the three data pillars of observability: metrics, logs, and traces.

Image

Next, I will guide you through the process of building an observable system for function computing based on the entire outline.

Metrics #

When something goes wrong with a service, the overall picture of request errors is the primary concern for both users and operators, and the fastest way to get that picture is through monitoring metrics.

Defining Metrics #

To observe the overall running status of a service through monitoring metrics, the first problem we face is defining the types of metrics. Users of serverless computing care mostly about business logic, so the commonly used user-side function monitoring metrics include the following (see the sketch after this list for how they might be expressed in code):

  • Total number of function invocations;
  • Average execution time of functions;
  • Number of occurrences of critical errors (timeout, exceeding concurrency limits, 5xx system failures, etc.);
  • Resource usage of function executions;
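As a hedged illustration, the sketch below maps these user-side metrics onto Prometheus metric types: counters for total invocations and critical errors, a histogram for execution time, and a gauge for resource usage. The metric and label names (for example faas_function_invocations_total and functionName) are my own assumptions rather than names defined by any particular platform.

package metrics

import "github.com/prometheus/client_golang/prometheus"

// User-side function metrics, sketched with the Prometheus Go client.
// All metric and label names here are illustrative assumptions.
var (
    // Total number of function invocations, partitioned by function name.
    InvocationsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "faas_function_invocations_total",
            Help: "Total number of function invocations.",
        },
        []string{"functionName"},
    )

    // Execution time of functions; the average can be derived from the
    // histogram's _sum and _count series.
    ExecutionSeconds = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "faas_function_execution_seconds",
            Help:    "Function execution time in seconds.",
            Buckets: prometheus.DefBuckets,
        },
        []string{"functionName"},
    )

    // Critical errors, partitioned by type (timeout, concurrency limit, 5xx, ...).
    CriticalErrorsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "faas_function_critical_errors_total",
            Help: "Number of critical errors by type.",
        },
        []string{"functionName", "errorType"},
    )

    // Memory currently used by function instances, in bytes.
    MemoryUsageBytes = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "faas_function_memory_usage_bytes",
            Help: "Memory used by function instances in bytes.",
        },
        []string{"functionName"},
    )
)

// MustRegister registers all of the metrics above with the given registry.
func MustRegister(r *prometheus.Registry) {
    r.MustRegister(InvocationsTotal, ExecutionSeconds, CriticalErrorsTotal, MemoryUsageBytes)
}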

On the other hand, for the platform side, we are more interested in whether the function computing service is working properly and whether the underlying cluster resources are healthy. Therefore, system monitoring on the platform side usually focuses on the following aspects:

  • Total number of requests processed by the service;
  • Total number of failed requests processed by the service;
  • Number of successful/failed instances during each scaling operation;
  • Overall trend of the total number of cluster nodes;
  • Resource usage of cluster nodes;
  • Resource usage of service nodes;

Collecting and Reporting Metrics #

Once the types of metrics are determined, we can begin collecting and reporting the metrics. There are already many good solutions available in the industry for this part of the process. Next, I will use Prometheus, a relatively mature open-source monitoring tool, as an example to explain. For the installation process of Prometheus, you can refer to the official documentation for setup instructions.

If you have some understanding of Prometheus, you should know that its basic working principle is to continuously pull data from exporters (metric reporters). For resource metrics such as CPU, disk, and memory, you can use the node exporter provided by the Prometheus ecosystem to collect them. There are also mature high-availability solutions for both cluster and single-node deployments.
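For reference, a minimal scrape configuration for these resource metrics might look like the snippet below, assuming node exporter runs on each node’s default port 9100 (the job name and target addresses are placeholders).

scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["node1:9100", "node2:9100"]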

Here, we will mainly focus on business metrics. Taking the function invocation count as an example, we will use the SDK provided by Prometheus to simulate a simple metric reporting and see how the custom metric data is displayed on the monitoring dashboard.

package main
 
import (
    "math/rand"
    "net/http"
    "time"
 
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)
 
func main() {
    // Create a custom metric named "function_x_invocation_times"
    // to simulate the function invocation count
    invocationTimes := prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "function_x_invocation_times",
            Help: "counts the invocation times of function x",
        },
        []string{"functionName"},
    )
    // Register the metric with a dedicated registry
    registry := prometheus.NewRegistry()
    registry.MustRegister(invocationTimes)
 
    // Function invocation count simulator: refresh the value every 2 seconds
    mockFunction := func(functionName string) {
        for {
            invocationTimes.WithLabelValues(functionName).Set(float64(100 + rand.Intn(10)))
            time.Sleep(time.Second * 2)
        }
    }
    // Simulate the invocation of two functions
    go mockFunction("functionA")
    go mockFunction("functionB")
 
    // Start the custom Exporter and expose the metrics on /metrics
    http.Handle("/metrics", promhttp.HandlerFor(
        registry,
        promhttp.HandlerOpts{Registry: registry},
    ))
    http.ListenAndServe(":8080", nil)
}

As shown in the code above, we create a custom metric called function_x_invocation_times to simulate the number of function invocations and register it with a Prometheus registry. We then simulate function traffic by starting two goroutines, one per function, and start a custom Exporter for the collector to scrape.

After the program starts, accessing http://localhost:8080/metrics locally will display the reported number of function invocations.
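In text form, the exposed output looks roughly like the following (the sample values will differ, since they are randomly generated):

# HELP function_x_invocation_times counts the invocation times of function x
# TYPE function_x_invocation_times gauge
function_x_invocation_times{functionName="functionA"} 104
function_x_invocation_times{functionName="functionB"} 107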

Image

Next, we need Prometheus to scrape the data from the /metrics endpoint. Before starting the collector, a simple configuration is required: add the function metrics Exporter as a scrape target in the prometheus.yml file.

scrape_configs:
  - job_name: "serverless"
    static_configs:
      - targets: ["localhost:8080"]

After completing the configuration, we can start Prometheus by executing its binary and then open the default address http://localhost:9090. On the main interface, we can query the function invocation metric that was just reported.

Image

Once monitoring data has been collected, it needs to be stored for downstream services to use. This kind of time-ordered data is typically stored in a time-series database (TSDB), which is specifically designed for data that is generated frequently and in many forms and whose persistence strongly depends on time order.

Prometheus actually has a built-in TSDB, which is sufficient for small data volumes. However, if you need to build a production-grade observability system, I strongly recommend using an external, specialized time-series database.
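As a hedged example, Prometheus can forward its samples to such an external TSDB through the remote_write configuration; the endpoint URL below is only a placeholder for whatever remote-write-compatible database you choose.

remote_write:
  - url: "http://your-tsdb.example.com/api/v1/write"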

In addition, in function computing, the instances of a single function may be distributed across any node in the cluster, so relying solely on Exporters deployed as a DaemonSet to report information is not very efficient. Besides scraping, Prometheus also has to aggregate the metrics for the same function across different nodes, and once the number of requests surges, the pressure on the aggregation service becomes very high.

To make Prometheus’s collection work more stable, we can write the metrics for each processed request into a message queue and pre-aggregate them on the consumer side. The pre-aggregated metrics are then reported to Prometheus for a second round of aggregation. Combined with the persistence solution above, this gives us a lightweight monitoring architecture for function computing, shown in the following diagram.

Image

You will notice that the Scheduler in the diagram is the traffic scheduling module introduced in the lessons on scaling. On the return path of each request, it reports function metrics to Kafka. The Function Exporter then consumes this data and performs a first round of aggregation across the records produced by different nodes. Once that is done, the metrics are exposed to Prometheus for a second round of aggregation, and the data is finally written to the TSDB.
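To make this concrete, here is a minimal sketch of such a Function Exporter using the segmentio/kafka-go client. It assumes the Scheduler publishes one plain-text message per invocation, containing just the function name, to a topic called function-metrics; the topic name, message format, and broker address are all assumptions for illustration.

package main

import (
    "context"
    "log"
    "net/http"
    "strings"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
    "github.com/segmentio/kafka-go"
)

func main() {
    // First-round aggregation result: invocation count per function.
    invocations := prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "faas_function_invocations_total",
            Help: "Invocations aggregated from Scheduler records.",
        },
        []string{"functionName"},
    )
    registry := prometheus.NewRegistry()
    registry.MustRegister(invocations)

    // Consume invocation records reported by the Scheduler.
    reader := kafka.NewReader(kafka.ReaderConfig{
        Brokers: []string{"localhost:9092"}, // assumed broker address
        Topic:   "function-metrics",         // assumed topic name
        GroupID: "function-exporter",
    })
    go func() {
        for {
            msg, err := reader.ReadMessage(context.Background())
            if err != nil {
                log.Printf("read message: %v", err)
                continue
            }
            // Assumed message format: the function name as plain text.
            functionName := strings.TrimSpace(string(msg.Value))
            invocations.WithLabelValues(functionName).Inc()
        }
    }()

    // Expose the pre-aggregated metrics for Prometheus to scrape.
    http.Handle("/metrics", promhttp.HandlerFor(
        registry,
        promhttp.HandlerOpts{Registry: registry},
    ))
    log.Fatal(http.ListenAndServe(":8081", nil))
}

In a real platform the messages would carry richer fields (latency, error code, node ID) and the consumer would batch and pre-sum them before exposing the results, but the overall shape stays the same.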

As you can see, besides scraping metrics from the Function Exporter, Prometheus may also scrape resource metrics from the Node Exporter on each node. This is why, as I mentioned earlier, the more metrics there are, the greater the pressure on the aggregation service.

Finally, I would like to emphasize that in small-scale clusters this is indeed a relatively lightweight solution. However, when the number of cluster nodes is large, I recommend deploying separate Prometheus instances for business metrics and resource metrics.

In addition, if you use Knative as your serverless engine, as mentioned in earlier lessons, the OpenCensus protocol used internally by Knative exposes trace and metrics data, which makes collection easier. You can configure the istio-ingressgateway, the Knative Activator, and the queue-proxy to report this data. For user container instances running Java, you can also report data non-intrusively with the otel-javaagent provided by OpenTelemetry.

Displaying the Monitoring Dashboard #

After collecting the function metrics, how can we visualize and clearly observe the changes in the data? Next, I will use Grafana to create a monitoring dashboard for the function invocation count.

After installing Grafana, we first need to configure the data source for Grafana.

Click on “Configuration->Data sources->Add data source” on the left-hand menu bar. Then, select Prometheus as the configured data source and click “Save & Test” at the bottom. When the prompt “Data source is working” appears, it means that Grafana can connect to Prometheus normally.

Image

Next, we create a monitoring panel. Select “Create->Dashboard->Add panel->Add a new panel” in the left-hand menu to open a new panel. In the query editor, enter function_x_invocation_times{functionName="functionA"} to display the reported invocation count for functionA.

Image

In the same way, we can also see the number of invocations for functionB.

Image

At this point, we have completed the monitoring of a simple function invocation count.

I believe that you now have a clear understanding of the process of reporting, collecting, and displaying function computing business metrics. For metrics such as execution time, QPS, and scaling quantity, the implementation of monitoring is similar. You can try implementing them yourself, and I welcome you to interact with me in the comments.

Finally, one more tip. If an issue occurs in production, such as a function’s return results not meeting expectations or the scaling count staying at 0 for a stretch of time, we can also make use of the visualization tool’s time-range selection. When choosing a time range, start with a coarse granularity: first look at the overall picture over a day or a week to identify the approximate period when the exception occurred, then narrow the range around that point for fine-grained troubleshooting. This can significantly improve troubleshooting efficiency.

Summary #

In this lesson, we have built up an understanding of why observability matters in FaaS-style Serverless and how it can be addressed.

Under a Serverless architecture, building observability is harder than in traditional microservice architectures because of black-box scheduling and the integration of cloud components. We not only need a basic understanding of how to build observability, but also need to be very familiar with the Serverless products themselves.

We have also worked through one of the three data pillars, “metrics.” First, determine the types of metrics and get familiar with the commonly used user-side and platform-side function monitoring metrics. When collecting and aggregating metrics, pay attention to factors such as the scale of aggregation, the storage format, and the collection tools. Finally, put the collected metrics to use: turning them into a monitoring dashboard maximizes the value of the data and provides data support for future optimizations.

Alright, this lesson will end here. In the next lesson, I will talk to you in more detail about the content of logs and traces, the other two parts of the three data pillars.

Reflection Questions #

Think about how logs and traces should be handled. Can they be integrated into a single architecture for collecting information?

If you have time, you can practice collecting metrics based on today’s example.

Feel free to share your thoughts and answers in the comments. Let’s discuss and learn together. Thank you for reading, and feel free to share this lesson with more friends for further discussion and learning.