53 How-tos: Comprehensive Thoughts on System Monitoring #

Hello, I’m Ni Pengfei.

In the previous sections, I introduced the principles, ideas, and tools of performance analysis. In practice, however, a common situation is that a performance bottleneck clearly exists, yet by the time you log in to the server to troubleshoot, it has already disappeared. In other words, performance issues occur intermittently, and it is hard to find their pattern or to reproduce them.

In such scenarios, you may find that all the tools and methods introduced earlier seem to “fail”. Why? Because they are only effective while the performance issue is actually occurring, and in these after-the-fact scenarios we cannot make full use of them.

So what should we do? Ignore it? In the past, many teams only discovered performance problems in their systems or applications after users complained about slow responses or crashes. Although the problems were eventually found, this approach is clearly undesirable, as it seriously hurts the user experience.

To solve this problem, we need to build a monitoring system that tracks the running status of the system and applications, together with a set of strategies that raise alerts and notifications when problems occur. A good monitoring system not only exposes problems in real time, but also, based on the monitored status, automatically analyzes and locates the approximate source of the bottleneck, so that the issue can be reported accurately to the team responsible for fixing it.

To achieve effective monitoring, the most crucial element is a comprehensive, quantifiable set of metrics, covering both the system and the application.

On the system side, monitoring should cover the usage of system resources as a whole, including the various resources we discussed before: CPU, memory, disk and file system, and network.

On the application side, monitoring should cover the application’s internal running state. This includes not only the process’s overall resource usage, such as CPU and disk I/O, but also internals such as the latency of interface calls, errors during execution, and the memory usage of internal objects.

Today, I will show you how to monitor a Linux system. In the next section, I will continue explaining the ideas behind application monitoring.

The USE Method #

Before you start monitoring your system, you will naturally want a concise way to describe the usage of system resources. You could certainly use the various performance tools covered in this column to collect the usage of each resource individually. But don’t forget that each resource has many performance metrics. Tracking too many metrics not only costs time and effort, it also makes it hard to form an overall picture of how the system is running.

Here, I will introduce you to a method specifically designed for performance monitoring, called the USE (Utilization, Saturation, and Errors) method. The USE method simplifies the performance metrics of system resources into three categories: utilization, saturation, and errors.

  • Utilization is the percentage of time or capacity that a resource spends serving work. A utilization of 100% means the capacity is exhausted or the resource was busy the entire time.

  • Saturation indicates how busy a resource is, usually reflected by the length of its wait queue. A saturation of 100% means the resource cannot accept any additional requests.

  • Errors is the count of error events. The more errors there are, the more serious the system’s problem.

These three categories cover the common performance bottlenecks of system resources, so they can be used to locate bottlenecks quickly. Whether for hardware resources such as CPU, memory, disks, and file systems, or for software resources such as file descriptors, connection counts, and connection-tracking entries, the USE method helps you quickly identify which system resource has become the bottleneck.
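To make this concrete, here is a minimal sketch in Python of applying the USE method to the CPU on a Linux machine: utilization is derived from /proc/stat, saturation from the load average relative to the CPU count, and errors are left as a placeholder since CPUs rarely report them through /proc. The exact metric choices are illustrative, not definitive.

```python
# A minimal sketch of the USE method applied to the CPU on Linux.
# Assumes the standard /proc layout; the metric choices are illustrative.
import os
import time

def cpu_times():
    # First line of /proc/stat: "cpu  user nice system idle iowait irq softirq steal ..."
    with open("/proc/stat") as f:
        fields = [float(x) for x in f.readline().split()[1:]]
    idle = fields[3] + fields[4]  # idle + iowait
    return idle, sum(fields)

def cpu_use_snapshot(interval=1.0):
    idle1, total1 = cpu_times()
    time.sleep(interval)
    idle2, total2 = cpu_times()

    # Utilization: share of time the CPUs were doing work during the interval.
    utilization = 1.0 - (idle2 - idle1) / (total2 - total1)

    # Saturation: 1-minute load average relative to the number of CPUs;
    # a ratio above 1 suggests tasks are queueing for CPU time.
    saturation = os.getloadavg()[0] / os.cpu_count()

    # Errors: CPUs rarely report errors via /proc; in practice you would read
    # EDAC counters or kernel logs, so this sketch simply reports 0.
    errors = 0
    return utilization, saturation, errors

if __name__ == "__main__":
    util, sat, err = cpu_use_snapshot()
    print(f"utilization={util:.1%} saturation={sat:.2f} errors={err}")
```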

Now, recall the performance metrics we have discussed for each type of system resource; it is not hard to map them to these three categories. Here I have compiled a table of common performance metrics for your reference.

(Image: table of common performance metrics for each type of system resource)

Note, however, that the USE method focuses only on the core metrics that reflect resource bottlenecks. That does not mean other metrics are unimportant: system logs, per-process resource usage, cache usage, and so on also need to be monitored. They are mainly used to aid performance analysis, whereas the USE metrics point directly at the system’s resource bottlenecks.

Monitoring System #

After understanding the USE method and the performance metrics to monitor, the next step is to build a monitoring system that stores these metrics, automatically analyzes and locates the approximate source of bottlenecks based on the monitored status, and finally, through an alerting system, promptly reports issues to the team responsible.

As you can see, a complete monitoring system usually consists of several modules: data collection, data storage, data querying and processing, alerting, and visualization. Building one from scratch is therefore a sizable engineering project.

Fortunately, there are already many open-source monitoring tools available for direct use, such as the most common ones, Zabbix, Nagios, Prometheus, and so on.

Now, let’s take Prometheus as an example to see how these modules work. The following figure shows Prometheus’s basic architecture:

(Image from prometheus.io)

Let’s first look at the data collection module. The Prometheus targets on the left are the objects being collected, and Retrieval is responsible for collecting their data. As the figure shows, Prometheus supports two collection modes, Pull and Push (a minimal sketch of both modes follows the list below).

  • In Pull mode, the collection module on the server side triggers the collection; any target that exposes its metrics over an HTTP endpoint can be integrated. This is also the most commonly used collection mode.

  • In Push mode, each collection target actively pushes its metrics to the Push Gateway (used to prevent data loss), and the server side then pulls the metrics from the Gateway. This is the collection mode most commonly used by mobile applications.
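To make the two modes concrete, here is a minimal sketch using the official Python client library, prometheus_client; the port, metric names, job name, and Pushgateway address are illustrative assumptions, not fixed choices.

```python
# A minimal sketch of the two collection modes with the Python client library
# (pip install prometheus-client). Port, job name, and gateway address are examples.
import random
import time

from prometheus_client import (
    CollectorRegistry, Gauge, push_to_gateway, start_http_server,
)

# Pull mode: expose an HTTP endpoint (http://localhost:8000/metrics)
# that the Prometheus server scrapes on its own schedule.
queue_depth = Gauge("demo_queue_depth", "Current depth of the work queue")
start_http_server(8000)

# Push mode: actively push metrics to a Pushgateway, from which the
# Prometheus server later pulls them.
registry = CollectorRegistry()
last_run = Gauge("demo_batch_last_run_timestamp",
                 "Unix time of the last batch run", registry=registry)

while True:
    queue_depth.set(random.randint(0, 100))  # pretend measurement
    last_run.set_to_current_time()
    push_to_gateway("localhost:9091", job="demo_batch", registry=registry)
    time.sleep(15)
```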

Since the objects that need to be monitored usually change dynamically, Prometheus also provides a service discovery mechanism that can automatically discover the objects that need to be monitored based on pre-configured rules. This is very effective in container platforms such as Kubernetes.

The second module is data storage. To keep monitoring data persistent, the TSDB (time series database) module in the figure persists the collected data to storage such as SSDs. A TSDB is a database designed specifically for time series data; it is characterized by time-based indexing, large data volumes, and append-mostly writes.

The third module is data querying and processing. The TSDB just mentioned not only stores data but also provides querying and basic processing capabilities, and this is where the PromQL language comes in. PromQL offers concise querying and filtering, supports basic data processing, and serves as the foundation for both alerting and visualization.
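For example, here is a minimal sketch of issuing a PromQL query through Prometheus’s HTTP API from Python; the server address and the node_exporter-style metric name are assumptions for illustration.

```python
# A minimal sketch of running a PromQL query through Prometheus's HTTP API.
# Assumes a Prometheus server at localhost:9090 scraping node_exporter.
import requests

# Average CPU utilization over the last 5 minutes, as a percentage.
query = '100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100'

resp = requests.get("http://localhost:9090/api/v1/query", params={"query": query})
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    timestamp, value = result["value"]
    print(f"{result['metric']} -> {float(value):.1f}% at {timestamp}")
```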

The fourth module is alerting. The AlertManager in the upper right corner provides the alerting functions: trigger conditions are written as PromQL expressions in alerting rules, the rules themselves are managed as configuration, and AlertManager handles alert notification. However, while alerts are necessary, overly frequent alerts are not, so AlertManager also supports grouping, inhibition, and silencing to aggregate similar alerts and reduce their volume.
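For illustration, the sketch below writes out a minimal alerting rule whose trigger condition is a PromQL expression; the rule name, threshold, and file name are assumptions, and such a rule file would be referenced from prometheus.yml under rule_files.

```python
# A minimal sketch of an alerting rule whose trigger condition is a PromQL
# expression. Thresholds and names are illustrative; the rule file is evaluated
# by the Prometheus server and firing alerts are routed through AlertManager.
from pathlib import Path

ALERT_RULE = """\
groups:
  - name: node-alerts
    rules:
      - alert: HighCpuUtilization
        expr: 100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU utilization above 80% for 5 minutes"
"""

# Write the rule file; prometheus.yml would reference it under `rule_files`.
Path("node_alerts.yml").write_text(ALERT_RULE)
```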

The last module is visualization. Prometheus’s web UI provides a simple interface for running PromQL queries, but its result display is fairly plain. Combined with Grafana, however, you can build very powerful graphical dashboards.

Now that we have introduced these components, you probably have a clear understanding of each module. Next, let’s further understand the overall functionality of these components when combined.

For example, applying the USE method described earlier, I can use Prometheus to collect the utilization, saturation, and error metrics of resources such as CPU, memory, disk, and network on a Linux server, and then use Grafana with PromQL queries to display them graphically.
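As a starting point, the sketch below maps the three USE categories to a few common node_exporter metrics expressed in PromQL; the exact expressions are illustrative and will vary with your environment and exporter version.

```python
# Illustrative PromQL expressions mapping node_exporter metrics to the USE
# categories; adjust them to your own environment and exporter version.
USE_QUERIES = {
    # Utilization: how much of each resource is in use.
    "cpu_utilization":    '100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100',
    "memory_utilization": '(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100',
    "disk_utilization":   'rate(node_disk_io_time_seconds_total[5m]) * 100',

    # Saturation: how much work is queued or waiting.
    "cpu_saturation":     'node_load1 / count(node_cpu_seconds_total{mode="idle"})',
    "disk_saturation":    'rate(node_disk_io_time_weighted_seconds_total[5m])',

    # Errors: counts of error events.
    "network_errors":     'rate(node_network_receive_errs_total[5m]) + rate(node_network_transmit_errs_total[5m])',
}
```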

Summary #

Today, I have explained the basic approach to system monitoring.

The core of system monitoring is resource usage, covering hardware resources such as CPU, memory, disk and file system, and network, as well as software resources such as file descriptors, connection counts, and connection-tracking entries. With the USE method, each of these resources can be described by three categories of key performance indicators.

The USE method simplifies the performance indicators of system resources into three categories: utilization, saturation, and errors. When any of these categories is too high, it indicates a potential performance bottleneck in the corresponding system resource.

After establishing the performance indicators based on the USE method, a complete monitoring system is needed to collect, store, query, process, and visualize these indicators, as well as send alerts. You can use various open source monitoring products such as Zabbix and Prometheus to build this monitoring system. This not only quickly exposes the bottlenecks in system resources, but also helps trace and locate issues using the historical data from monitoring.

Of course, besides system monitoring, application monitoring is also indispensable. I will continue to explain this in the next lesson.

Reflection #

Lastly, I would like to invite you to discuss how you monitor system performance. What performance metrics do you typically monitor, and how do you set up your monitoring system? How do you identify system resource bottlenecks from these metrics? Feel free to summarize your own thoughts based on this explanation.

Feel free to discuss with me in the comment section and also share this article with your colleagues and friends. Let’s practice in real scenarios and progress through communication.