54 How-tos General Thoughts on Application Monitoring #

Hello, I’m Ni Pengfei.

In the previous section, I taught you how to monitor system performance using the USE method. Let’s do a quick recap.

The core of system monitoring lies in the usage of resources. This includes hardware resources such as CPU, memory, disk, file systems, and network, as well as software resources such as file descriptors, connections, and connection tracking. The simplest and most effective method to describe the bottlenecks of these resources is the USE method.

The USE method simplifies the performance indicators of system resources into three categories: utilization, saturation, and errors. When any of these indicators in any category is too high, it indicates a potential performance bottleneck in the corresponding system resource.

After establishing performance indicators based on the USE method, we also need a comprehensive monitoring system to collect, store, query, process, alert, and visually display these indicators. This way, not only can we quickly expose the bottlenecks of system resources, but we can also use historical monitoring data to trace and pinpoint the root cause of performance issues.

In addition to monitoring system resources discussed in the previous section, it is also essential to monitor the performance of application programs. Today, I will show you how to monitor the performance of application programs.

Metrics Monitoring #

Just like system monitoring, before building a monitoring system for an application, it is necessary to determine which metrics to monitor. In particular, it is important to understand which metrics can be used to quickly identify performance issues in an application.

The USE method, while simple and effective for monitoring system resources, is not necessarily suitable for monitoring applications. For example, low CPU usage does not mean an application is free of performance bottlenecks: the application may still respond slowly because of lock contention, slow RPC calls, and so on.

Therefore, the core metrics of an application are no longer the usage of resources, but rather request volume, error rate, and response time. These metrics not only directly affect the user experience, but also reflect the overall availability and reliability of the application.

With these three key metrics of request volume, error rate, and response time, we can quickly determine whether the application is experiencing performance issues. However, these metrics alone are clearly not enough. After a performance issue occurs, we also want to quickly identify the “performance bottleneck”. Therefore, in my opinion, the following metrics are also essential when monitoring an application:
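To make this concrete, here is a minimal sketch, in Python, of tracking those three key metrics in-process. The class and field names are illustrative assumptions; in practice you would use the client library of your monitoring system (such as the Prometheus client) rather than rolling your own:

```python
import statistics

class RequestMetrics:
    """Minimal in-process tracker for the three key application metrics:
    request volume, error rate, and response time."""

    def __init__(self):
        self.total = 0
        self.errors = 0
        self.latencies_ms = []

    def observe(self, latency_ms, ok=True):
        # Record one completed request.
        self.total += 1
        if not ok:
            self.errors += 1
        self.latencies_ms.append(latency_ms)

    def snapshot(self):
        # Summarize for scraping or periodic export.
        return {
            "request_count": self.total,
            "error_rate": self.errors / self.total if self.total else 0.0,
            "p50_ms": statistics.median(self.latencies_ms) if self.latencies_ms else 0.0,
        }

metrics = RequestMetrics()
metrics.observe(12.0, ok=True)
metrics.observe(250.0, ok=False)
metrics.observe(18.0, ok=True)
print(metrics.snapshot())
# → {'request_count': 3, 'error_rate': 0.3333333333333333, 'p50_ms': 18.0}
```

A real monitoring client would additionally export these counters over HTTP for scraping and use histogram buckets rather than storing every latency sample.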

Firstly, the resource usage of the application process, such as CPU, memory, disk I/O, and network. Using too many system resources can lead to slow application response or increased error rate, which is a common performance issue.

Secondly, the calling patterns between different application components, such as calling frequency, error rate, and latency. Since applications are not isolated, if the performance of other applications they depend on is poor, it will also affect the performance of the application itself.

Thirdly, the runtime status of the core logic within the application, such as the execution time of key steps and any errors that occur during execution. Since this is the internal state of the application, it is usually not directly accessible from the outside. Therefore, when designing and developing an application, these metrics should be exposed so that the monitoring system can understand its internal operation.

  • With the metrics for resource usage of the application process, you can associate the bottleneck of system resources with the application, thereby quickly pinpointing performance issues caused by insufficient system resources.

  • With the metrics for calling patterns between application components, you can quickly analyze which component in the call chain of a request processing is the root cause of performance issues.

  • With the performance metrics of the core internal logic of the application, you can go further and directly enter the internals of the application to identify which function in the processing phase is causing the performance issue.
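As an illustration of the third point, exposing the runtime status of core internal logic, here is a hedged Python sketch: a decorator that records the wall-clock duration of key steps. The `timed_step` name and the plain-dict registry are assumptions for illustration; a real application would export these timings through its monitoring client instead:

```python
import functools
import time

# Hypothetical in-process registry of step timings; a real application
# would export these through its monitoring client, not a plain dict.
step_timings = {}

def timed_step(name):
    """Record the wall-clock duration (ms) of a key internal step under `name`."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                step_timings.setdefault(name, []).append(elapsed_ms)
        return wrapper
    return decorator

@timed_step("load_user")
def load_user(uid):
    time.sleep(0.01)  # stand-in for a database lookup
    return {"id": uid}

load_user(42)
print(step_timings["load_user"])  # e.g. [10.3] -- one sample, roughly 10 ms
```

Because the timing is recorded in a `finally` block, failed steps are measured too, which is exactly the error context you want when a core step starts misbehaving.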

Based on these ideas, I believe you can build performance metrics that describe an application's running status. You can then feed these metrics into the monitoring system described in the previous section (for example, Prometheus + Grafana), so that issues are promptly reported to the relevant teams through alerting, while the application's overall performance is displayed dynamically on an intuitive dashboard.

In addition, since business systems usually involve a chain of interconnected services forming a complex distributed call chain, you can use open-source tools such as Zipkin, Jaeger, and Pinpoint to build end-to-end tracing systems and quickly locate performance bottlenecks that span multiple applications.

For example, the following image is an example of a Jaeger call chain tracing.

- (Image from Jaeger documentation)

End-to-end tracing helps you quickly identify the root cause of problems in a request processing. For example, from the above image, you can easily see that this issue was caused by a Redis timeout.

In addition to helping you locate performance issues across applications, end-to-end tracing can also generate call topology diagrams for online systems. These intuitive diagrams are especially useful when analyzing complex systems (such as microservices).
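The data model behind such tracing systems can be sketched in a few lines. The following Python is illustrative only, not the actual Jaeger or Zipkin API: each span records one timed operation, shares a trace id with the whole request, and points back to its parent, which is what lets a tracer reassemble both the call chain and the topology diagram:

```python
import time
import uuid

class Span:
    """A minimal record of the kind a tracer such as Jaeger collects: one
    timed operation, linked to its parent span within a shared trace."""

    def __init__(self, operation, trace_id=None, parent_id=None):
        self.operation = operation
        self.trace_id = trace_id or uuid.uuid4().hex  # shared by the whole request
        self.span_id = uuid.uuid4().hex               # unique per operation
        self.parent_id = parent_id
        self._start = time.perf_counter()
        self.duration_ms = None

    def child(self, operation):
        # A downstream call inherits the trace id and points back here.
        return Span(operation, trace_id=self.trace_id, parent_id=self.span_id)

    def finish(self):
        self.duration_ms = (time.perf_counter() - self._start) * 1000
        return self

# Hypothetical request: a checkout handler that calls Redis downstream.
root = Span("GET /checkout")
redis_call = root.child("redis.get").finish()
root.finish()
```

Collecting finished spans from every service and grouping them by `trace_id` is, in essence, how a tracing backend draws the timeline in which the Redis timeout above stands out.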

Log Monitoring #

Monitoring performance metrics allows you to quickly identify bottleneck locations, but metrics alone are often not enough. For example, the same interface can cause completely different performance issues when different parameters are passed in. Therefore, in addition to metrics, we also need to monitor the contextual information of these metrics, and logs are the best source of this context.

In comparison,

  • Metrics are numerical measurement data for specific time periods, usually processed in time series format, suitable for real-time monitoring.

  • Logs, on the other hand, are entirely different. Logs are string messages at a certain point in time, usually requiring indexing by a search engine before they can be queried and summarized for analysis.
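One practical consequence: emitting each log entry as a single JSON object per line makes it far easier for a search engine to index. Here is a minimal sketch using only the Python standard library (the field names are illustrative assumptions, not a required schema):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object per line, so a collector such
    as Logstash or Fluentd can index it without custom parsing rules."""

    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order created")  # prints one JSON line to stderr
```

Structured logs like this also make it trivial to attach the request context mentioned above (parameters, trace ids) as extra fields, so metrics and logs can be correlated.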

For log monitoring, the most classic approach is to use the ELK technology stack, which consists of Elasticsearch, Logstash, and Kibana.

The following diagram shows a classic ELK architecture:

- (Image from elastic.co)

In this architecture,

  • Logstash is responsible for collecting logs from various log sources, then performing preprocessing, and finally sending the preprocessed logs to Elasticsearch for indexing.

  • Elasticsearch is responsible for indexing logs and provides a complete full-text search engine, making it easy for you to retrieve the desired data from the logs.

  • Kibana is responsible for visualizing and analyzing logs, including log search, processing, and displaying stunning dashboards.
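A minimal Logstash pipeline reflecting this division of labor might look like the following sketch (the Beats port and the Apache grok pattern are illustrative assumptions; adjust them to your own log sources):

```conf
input {
  beats { port => 5044 }          # e.g. Filebeat shipping application logs
}
filter {
  grok {                          # parse each raw line into structured fields
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}
output {
  elasticsearch { hosts => ["localhost:9200"] }
}
```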

The following image is an example of a Kibana dashboard, which provides an intuitive overview of Apache’s access logs.

- (Image from elastic.co)

It is worth noting that Logstash in the ELK stack consumes a relatively large amount of resources. Therefore, in resource-constrained environments, we often use Fluentd, which consumes fewer resources, to replace Logstash (also known as the EFK stack).

Summary #

Today, I have outlined the basic approach to application monitoring for you. Application monitoring can be divided into two main parts: metric monitoring and log monitoring.

  • Metric monitoring measures performance indicators over a period of time, then processes, stores, and alerts on them as time series.

  • Log monitoring provides more detailed contextual information and is usually collected, indexed, and graphically displayed using the ELK stack.

In complex business scenarios involving multiple different applications, you can also build a distributed tracing system. This allows for dynamic tracing of the performance of each component in the call chain, generating a call topology diagram of the entire process, thus accelerating the discovery of performance issues in complex applications.

Reflection #

Finally, I’d like to invite you to discuss how you monitor the performance of your applications. What performance metrics do you typically monitor and how do you set up a tracing and logging monitoring system to identify application bottlenecks? You can summarize your thoughts based on what I’ve discussed.

Feel free to discuss with me in the comments section, and please share this article with your colleagues and friends. Let’s practice in real scenarios and improve through communication.