14 Observable Metrics: How to Build a Multi-Dimensional Monitoring System for Serverless #

Hello, I’m Jingyuan.

In the previous lesson, we discussed why observability matters in a Serverless context and the key points of building an observability system. We also looked at how metrics are collected and at the architectural design and considerations for metric reporting in the FaaS (Function as a Service) model.

In today’s lesson, we will continue to explore two other data pillars of observability: logs and distributed tracing.

Logs #

We know that when operating a system, once we have a general picture of a problem from the monitoring dashboard, we usually turn to the logs for the specific error details.

The purpose of logs is to record discrete events; by analyzing these records we can understand the overall behavior of the program, such as which key data was processed and which methods were called. In other words, logs help us pinpoint the root cause of issues.

In the Function Compute scenario, we need to consider two types of logs: user logs and system logs. User logs record the business flow inside the user’s function code; they are collected independently at the function level, and users can view them through the frontend console. System logs record events that occur across the platform and are ultimately aggregated for platform operators or developers to troubleshoot issues.

Log Data Source #

So when and how should logs be printed?

First, we need to choose the log level. Taking system logs as an example, common levels include Error, Info, and Warn, representing error, information, and warning logs respectively. During development and debugging, the Debug level may also be used. We need to set the appropriate level according to the actual execution logic.
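
For illustration, here is a minimal sketch of level-based logging, assuming Go and its standard log/slog package; the messages and field names are only examples.

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// Emit structured logs at Info level and above; Debug entries are
	// filtered out here and would only be enabled during development.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
		Level: slog.LevelInfo,
	}))

	logger.Debug("dumping raw request payload") // dropped at Info level
	logger.Info("scheduler started", "service", "scheduler")
	logger.Warn("concurrency close to the configured limit", "used", 95, "limit", 100)
	logger.Error("failed to fetch function metadata", "function", "hello-faas")
}
```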

Second, when adding logs, we should consolidate error information as much as possible and avoid printing duplicate logs. Frequent I/O not only increases the workload of log collection but also hurts service performance.

Third, logs should be printed at the entry and exit points of each service module; method invocations and errors inside the service should be propagated back through return values rather than logged in every method.

For example, the scheduling module of Function Compute may run a series of serial steps for each request: fetching metadata, authentication, concurrency control, and obtaining function instance information. Each step may involve method invocations, and a failure at any step causes the whole scheduling to fail. So during development we should, as much as possible, give each method an error return value. When an error occurs, we simply return it step by step up the call stack and print a single log at the entry point, as in the sketch below. This effectively reduces duplicate log output.
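
Below is a minimal Go sketch of this pattern, assuming each scheduling step is a plain function; the step names (getMetadata, checkConcurrency) are hypothetical and only illustrate “return errors upward, log once at the entry point”.

```go
package main

import (
	"errors"
	"fmt"
	"log/slog"
)

var errQuotaExceeded = errors.New("concurrency quota exceeded")

// Each step returns an error instead of logging on its own.
func getMetadata(fn string) error { return nil }

func checkConcurrency(fn string) error {
	return fmt.Errorf("check concurrency for %q: %w", fn, errQuotaExceeded)
}

// scheduleRequest is the entry point: it chains the steps and is the only
// place where the final error is logged.
func scheduleRequest(fn string) error {
	if err := getMetadata(fn); err != nil {
		return fmt.Errorf("get metadata: %w", err)
	}
	if err := checkConcurrency(fn); err != nil {
		return fmt.Errorf("concurrency control: %w", err)
	}
	return nil
}

func main() {
	if err := scheduleRequest("hello-faas"); err != nil {
		// A single log entry carries the whole wrapped error chain.
		slog.Error("schedule request failed", "err", err)
	}
}
```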

In addition, I have two usage recommendations regarding user logs used by function developers.

On the one hand, reduce the use of print statements to keep the overall log volume under control. To quickly locate problems inside the function “black box”, some developers habitually use print as a debugging tool. However, since the platform places limits on the amount of log output per function execution, unnecessary printing should be minimized during function development.

On the other hand, pay attention to the Event object. As the basic parameter of the function entry point, Event carries key information about the request source, and keeping track of it makes subsequent tracing easier.
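
As a hedged illustration, the sketch below shows a handler that inspects its Event parameter; the Event fields here (Source, RequestID, and so on) are assumptions, since each platform defines its own event schema.

```go
package main

import (
	"context"
	"log/slog"
)

// Event is a hypothetical envelope; real FaaS platforms define their own
// schemas, but most carry the request source and an ID that helps tracing.
type Event struct {
	Source    string            `json:"source"`     // e.g. HTTP trigger, message queue, timer
	RequestID string            `json:"request_id"` // correlates logs and traces for this invocation
	Headers   map[string]string `json:"headers"`
	Body      []byte            `json:"body"`
}

func handler(ctx context.Context, event Event) (string, error) {
	// Logging the source and request ID once makes the invocation traceable later.
	slog.Info("invocation received", "source", event.Source, "request_id", event.RequestID)
	return "ok", nil
}

func main() {
	_, _ = handler(context.Background(), Event{
		Source:    "http-trigger",
		RequestID: "41bae18f-a083-493f-af43-7c3aed7ec53c",
	})
}
```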

Log Collection and Cleansing #

Once we have log data, we can start collecting it.

System logs are written to files at fixed paths, while user logs are usually collected by DaemonSet-based log agents. Log files inside a function instance are therefore usually written to a host path mounted from the node or to persistent volumes within the cluster.

As mentioned earlier, user logs are collected at the function level. To relieve the pressure that growing log data puts on node disks, each request usually gets its own log file, which is reported and then deleted once the request completes. System logs, on the other hand, can be handled with scheduled cleanup tasks.

When choosing an open-source log collector, commonly used options include Logstash, Fluentd, Fluent-Bit, and Vector, which all have their own advantages.

For a detailed comparison, you can refer to Stela Udovicic’s December 2021 post on the ERA Software blog, where she points out that it is hard to find a perfect log collector and that the right choice mainly depends on your specific needs. For example, if you need a collector with low resource consumption, Vector or Fluent-Bit is a better choice than the more resource-hungry Logstash. If you need a vendor-neutral collector, Fluentd and Fluent-Bit are good options.

When building a serverless computing platform, we usually deploy a combination of these tools; you can also choose according to your specific business situation. To help you understand their respective strengths, I have drawn a diagram of the data collection process.

Image

When collecting data, because Fluent-Bit performs well in containerized environments such as Kubernetes clusters, we usually use the lightweight Fluent-Bit for overall log reporting. For cluster-level log collection, it is deployed as a DaemonSet.

Logstash has powerful filtering capabilities but consumes more resources, so it cannot be deployed across the whole cluster as a DaemonSet the way Fluent-Bit is. Instead, a small number of virtual machine instances is enough, and Logstash handles the overall data cleansing.

If peaks need to be considered, such as a sharp increase in log volume during a request spike, you can also use Kafka as a buffer. Finally, Logstash delivers the filtered data to the corresponding storage services.

Log Storage and Retrieval #

Lastly, let’s talk about log storage and retrieval.

Image

Logstash supports a wide range of plugins. Besides multiple input sources such as Kafka and local files, its output side integrates with various storage services such as Elasticsearch and object storage.

Among them, Elasticsearch is a distributed, highly scalable, near-real-time search and analytics engine. Combined with a data visualization tool such as Kibana, it lets us quickly search and analyze the logs we upload.

To make searching in Kibana easy, we should emit key information as key-value pairs when printing logs, so that we can filter by field in Kibana. For example:

{
    "level": "info",                          // log level
    "ts": 1657957846.7413378,                 // timestamp
    "caller": "apiserver/handler.go:154",     // caller file and line
    "msg": "service start",                   // log message
    "request_id": "41bae18f-a083-493f-af43-7c3aed7ec53c",
    "service": "apiserver"                    // service name
}
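
The sample above is typical output of a structured logger. As a sketch, assuming Go and the go.uber.org/zap library (whose production encoder emits level, ts, caller, and msg fields like these), the entry could be produced roughly like this:

```go
package main

import "go.uber.org/zap"

func main() {
	// NewProduction emits JSON with level, ts, caller, and msg fields,
	// similar to the sample above.
	logger, err := zap.NewProduction()
	if err != nil {
		panic(err)
	}
	defer logger.Sync()

	logger.Info("service start",
		zap.String("request_id", "41bae18f-a083-493f-af43-7c3aed7ec53c"),
		zap.String("service", "apiserver"),
	)
}
```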

Additionally, if you need to retain the original log files or want the serverless computing platform to support long-term data retention and later analysis, you can also have Logstash write to an object storage service.

Distributed Tracing #

Having covered metrics and logs, let’s look at the third pillar of observability: distributed tracing.

Besides locating code errors, performance analysis is essential in latency-sensitive scenarios, especially in architectures with heavy cross-module interaction such as Function as a Service (FaaS). This is where distributed tracing comes in handy.

In the FaaS context, distributed tracing not only improves the observability of the system, helping administrators detect and diagnose performance issues to keep the expected service level, but also helps developers follow the execution path of a function, quickly analyze invocation relationships and performance bottlenecks in the function computing architecture, and improve development and diagnostic efficiency.

Let’s first discuss how trace data is acquired. For users, what matters most is end-to-end latency. Apart from the execution of the code itself, most of the remaining latency comes from the cold-start preparation phase. Therefore, the platform can by default provide users with the total latency of a function invocation and the latency of the cold-start process, including steps such as preparing the code and initializing the runtime.
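
As a rough sketch of how a platform might record these phases, the Go snippet below wraps the cold-start steps in nested OpenTelemetry spans; the span names and phase boundaries are assumptions, not any specific platform’s implementation.

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
)

// handleColdStart sketches how a platform might wrap cold-start phases in
// nested spans; the span names here are hypothetical.
func handleColdStart(ctx context.Context) {
	tracer := otel.Tracer("faas-platform")

	ctx, total := tracer.Start(ctx, "invoke-function")
	defer total.End()

	ctx, cold := tracer.Start(ctx, "cold-start")
	_, prepare := tracer.Start(ctx, "prepare-code")
	// ... download and unpack the code package ...
	prepare.End()
	_, initRT := tracer.Start(ctx, "init-runtime")
	// ... start the runtime and load the user handler ...
	initRT.End()
	cold.End()

	// ... execute the user function ...
}

func main() {
	// Without a registered TracerProvider the spans are no-ops; see the
	// exporter setup sketch later in this lesson.
	handleColdStart(context.Background())
}
```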

Complex business scenarios often involve invocations between functions, or between functions and other cloud services. Here we can offer developers custom tracing support: record the trace context in the corresponding structure (such as HTTP headers) to stitch the external invocation chain together, while the internal invocation chain is handled through the SDK using context.
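
For example, here is a minimal Go sketch that uses the OpenTelemetry propagation API to inject the current trace context into an outgoing HTTP request’s headers; the downstream endpoint is hypothetical.

```go
package main

import (
	"context"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

func callDownstream(ctx context.Context) error {
	// Use the W3C Trace Context format so downstream functions and services
	// can continue the same trace.
	otel.SetTextMapPropagator(propagation.TraceContext{})

	req, err := http.NewRequestWithContext(ctx, http.MethodGet,
		"http://downstream-function.example.com/invoke", nil) // hypothetical endpoint
	if err != nil {
		return err
	}

	// Inject the current span context into the outgoing request headers.
	otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

func main() {
	_ = callDownstream(context.Background())
}
```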

Combined with the platform’s built-in traces, this effectively helps users locate timeouts, performance bottlenecks, and faults that span multiple cloud services.

At the platform level, traces can be constructed according to the relationships between modules and the actual system architecture; the overall approach is similar to building traces on the user side. Note, however, that more detailed trace data is not necessarily better, because tracing itself consumes resources. It is best to build traces according to actual operational needs.

So how do we collect trace data and present it in a user-friendly way? The common solution is based on the standard OpenTelemetry protocol: use the SDK and Otel Agent provided by OpenTelemetry to generate, propagate, and report spans, then collect them with a distributed tracing system (such as Jaeger) and visualize the trace topology.

Taking the characteristics of function computing into account, here is a basic architecture diagram of the tracing feature for your reference. As the diagram shows, OpenTelemetry can report trace data in three ways: directly through the SDK, through an Agent sidecar, or through an Agent DaemonSet. You can choose one or combine several based on your business needs.

Image

Taking Jaeger as an example, node services can use the OpenTelemetry SDK or Agent to report trace data, which is then collected by the Jaeger Collector, stored in Elasticsearch, and finally queried and displayed through Jaeger-Query.
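
As a hedged sketch of the SDK reporting path, the Go snippet below initializes a tracer provider with an OTLP/gRPC exporter pointing at an agent or collector endpoint; the endpoint address and the tracer/span names are assumptions.

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// Export spans over OTLP/gRPC to an Otel Agent/Collector sitting in
	// front of the Jaeger Collector; the endpoint here is an assumption.
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("otel-agent:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatalf("create exporter: %v", err)
	}

	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
	defer func() { _ = tp.Shutdown(ctx) }()
	otel.SetTracerProvider(tp)

	// Spans created anywhere in the process are now batched and reported.
	tracer := otel.Tracer("faas-node-service")
	_, span := tracer.Start(ctx, "schedule-request")
	// ... handle the request ...
	span.End()
}
```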

Now, let’s answer the thought question from the previous lesson: can metrics be collected the same way? Absolutely. Through the Otel Agent, we can also export metrics to backends such as Kafka or Prometheus for storage.

Summary #

Finally, let me summarize these two lessons. We have been discussing how to build an observability system for a function computing platform, based on the three pillars of metrics, logs, and traces.

First, metrics are statistical aggregations of a certain kind of information in the system. When monitoring a function computing platform, we need to cover not only common resource metrics such as CPU and memory utilization, but also the business metrics users care about, such as invocation count, error count, and execution duration.

Next, log construction. The role of logs is more like “preserving the scene”: through logs we can analyze the program’s behavior, such as which methods were called and which data was operated on. When printing logs, we also need to pay attention to the output at key points.

Finally, distributed tracing tags, propagates, and reconstructs the complete request path. Tracing is mainly used for troubleshooting and optimization, such as analyzing which part of the call chain or which method causes errors or blocking, and whether inputs and outputs meet expectations. Trace context between services can be passed through headers, and the data pipeline is actually similar to log collection. In form, though, logs record discrete events, while traces describe a continuous process.

Reflection Question #

Alright, that’s the end of this lesson, and I’ll leave you with one reflection question.

With the popularity of OpenTelemetry, in principle a single library or SDK can automatically collect all three kinds of telemetry data and hand them to a unified collector for processing. In practice, however, you may be constrained by legacy systems and by the maturity of OpenTelemetry itself. How would you handle this?

Feel free to write your thoughts and answers in the comments section. Let’s exchange and discuss together. Thank you for reading, and you are welcome to share this lesson with more friends for mutual learning and discussion.