30 Giving the System Eyes: How to Implement Server-Side Monitoring #

Hello, I’m Tang Yang.

In the lifecycle of a project, operation and maintenance account for a significant share of the work and are almost as important as development itself. Identifying and resolving issues promptly during operation is the responsibility of every team. So when you first built your vertical e-commerce system, the operations team set up basic monitoring of machine CPU, memory, disk, network, and so on, expecting to detect and address issues in time. You thought everything would go smoothly, but in reality the system kept drawing user complaints, for reasons such as:

  • The database master-slave delay has increased, resulting in problems with business functionality.
  • The API response time has increased, leading to blank pages on the product interface.
  • There have been a large number of errors in the system, affecting normal user usage.

You should have discovered and addressed these issues promptly. However, in reality, you can only scramble to fix the problems after they are reported by users. It is at this point that your team realizes that in order to quickly identify and locate problems in the business system, it is necessary to establish a comprehensive server monitoring system. As the saying goes, “There are countless paths, but monitoring comes first. If monitoring is not done well, there will be tears.” However, during the setup process, your team has encountered some difficulties:

  • First of all, how should the monitoring metrics be selected?
  • What methods and approaches can be used to collect these metrics?
  • After collecting the metrics, how should they be processed and displayed?

These questions are interconnected and crucial to the stability and availability of the system. In this lesson, I will guide you through solving these problems and building a server monitoring system.

How to Select Monitoring Metrics #

The first question you face when setting up a monitoring system is what kind of monitoring metrics to choose, or what to monitor. Some students may feel confused when setting monitoring metrics for a new system and don’t know where to start. However, there are some mature theories and approaches that you can directly use. For example, Google’s experience summary for monitoring distributed systems includes the “Four Golden Signals”. These signals refer to four metrics that generally need to be monitored at the service level: latency, traffic, errors, and saturation.

Latency refers to the response time of requests, for example the response time of an API, or of calls to databases and caches.

Traffic can be understood as throughput, which is the volume of requests within a unit of time. For example, the volume of requests to access third-party services or message queues.

Errors represent the number of errors that occur in the current system. It should be noted that the errors we need to monitor include both explicit errors, such as response codes 4xx and 5xx when monitoring web services, and implicit errors, such as when a web service returns a response code of 200 but encounters some business-related errors (such as array out of bounds exceptions or null pointer exceptions).

Saturation refers to the degree to which a service or resource reaches its limit (or the utilization of the service or resource). For example, CPU usage, memory usage, disk usage, connection numbers of cache databases, etc.

These four golden signals provide common monitoring metrics. In addition, you can refer to the RED (Rate, Errors, Duration) metric system. It is derived from the four golden signals: R stands for the request rate, E for errors, and D for duration (response time), with the saturation metric dropped. You can think of it as a simplified version of a general-purpose monitoring metric system.

Of course, some components or services have unique metrics that require special attention. For example, the database master-slave delay, message queue accumulation, cache hit rate, etc. I have compiled a table of monitoring metrics for common components in high-concurrency systems. It does not include basic monitoring metrics such as CPU, memory, network, and disk, but mainly focuses on business-related metrics. This table is intended to be used as a reference in your actual work.

[Image: table of monitoring metrics for common components in high-concurrency systems]

Once you have selected the monitoring metrics, the next thing you need to consider is how to collect these metrics from the components or services, which is the issue of metric data collection.

How to Collect Metric Data #

When it comes to collecting monitoring metrics, we generally choose different collection methods depending on the data source. Broadly speaking, there are several approaches:

First, the Agent is a common way to collect data metrics.

We deploy self-developed or open-source agents on the servers hosting the data sources; the agent collects data and sends it to the monitoring system. When collecting from a data source, the agent uses the interfaces that the data source provides. Let me give you two typical examples.

For example, if you want to collect performance data from a Memcached server, you can have the agent connect to the Memcached server and send it a "stats" command to obtain the server's statistics. You can then pick out the important monitoring metrics from the returned information and send them to the monitoring server to build a monitoring report for the Memcached service, and these statistics also help you spot potential problems with the Memcached server. Below are some important status items I recommend, which you can use as a reference:

```
STAT cmd_get 201809037423      // used to calculate the read QPS
STAT cmd_set 16174920166       // used to calculate the write QPS
STAT get_hits 175226700643     // used to calculate the hit rate: hit rate = get_hits / cmd_get
STAT curr_connections 1416     // current number of connections
STAT bytes 3738857307          // current memory usage in bytes
STAT evictions 11008640149     // number of items evicted by the Memcached server
```

If the eviction count is too high (like the example value above), it indicates that the current Memcached capacity is insufficient, or that there is a problem with the Memcached slab class allocation.
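To make the agent approach concrete, here is a minimal Java sketch, assuming a Memcached instance reachable at 127.0.0.1:11211, that sends the stats command over Memcached's text protocol and keeps only the status items listed above. A real agent would run this on a schedule and forward the selected metrics to the monitoring server.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;
import java.util.HashMap;
import java.util.Map;

// Sketch of an agent-side collector: connect to Memcached, issue "stats",
// and keep only the status items we want to report.
public class MemcachedStatsCollector {

    private static final String[] WANTED = {
            "cmd_get", "cmd_set", "get_hits", "curr_connections", "bytes", "evictions"};

    public static Map<String, String> collect(String host, int port) throws Exception {
        Map<String, String> all = new HashMap<>();
        try (Socket socket = new Socket(host, port);
             PrintWriter out = new PrintWriter(socket.getOutputStream());
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()))) {

            out.print("stats\r\n");          // Memcached text-protocol command
            out.flush();

            String line;
            while ((line = in.readLine()) != null && !"END".equals(line)) {
                String[] parts = line.split(" ");   // each line: STAT <name> <value>
                if (parts.length == 3 && "STAT".equals(parts[0])) {
                    all.put(parts[1], parts[2]);
                }
            }
        }
        Map<String, String> selected = new HashMap<>();
        for (String key : WANTED) {
            if (all.containsKey(key)) {
                selected.put(key, all.get(key));
            }
        }
        return selected;   // a real agent would now send this to the monitoring server
    }

    public static void main(String[] args) throws Exception {
        System.out.println(collect("127.0.0.1", 11211));
    }
}
```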

In addition, if you are a Java developer, most middleware or components developed in Java can obtain statistics or monitoring information through JMX. For example, in lecture 19, I mentioned using JMX to monitor the queue backlog of Kafka, and you can also use JMX to monitor JVM memory information and GC-related information.
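As a small illustration of the JMX route, the sketch below reads heap usage and GC statistics from the standard platform MXBeans inside the current JVM; monitoring a remote process works the same way once you attach to its MBean server over a JMX connector.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Sketch: read JVM memory and GC statistics through the java.lang.management
// MXBeans; the same data is exposed over JMX when remote JMX is enabled.
public class JvmMetricsCollector {

    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        System.out.printf("heap used=%d max=%d%n", heap.getUsed(), heap.getMax());

        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // collection count and accumulated collection time per collector
            System.out.printf("gc=%s count=%d timeMs=%d%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        // an agent would package these values and report them to the monitoring server
    }
}
```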

Another important way to obtain data is instrumentation in the code.

The difference between this method and the agent is that the agent mainly collects information on the server side of a component, while instrumentation describes the performance and availability of the component from the client's perspective. So how do you do the instrumentation?

You can use the aspect-oriented programming approach mentioned in lecture 25 on the distributed tracing component, or you can directly measure, inside the resource client, the elapsed time, call count, and slow-request count of each call to a resource or service, and send them to the monitoring server.

Here you need to pay attention to one point: since the number of requests to cache and databases is usually high and can reach tens of thousands per second, if all request durations are sent to the monitoring server without any optimization, the monitoring server will be overwhelmed. Therefore, when instrumenting, we generally do some data aggregation. For example, we can aggregate the total number of requests, response time percentiles, error counts, etc. for the same resource within the past 10 seconds, and then send them to the monitoring server. This greatly reduces the amount of requests sent to the monitoring server.
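The sketch below illustrates this kind of client-side aggregation; the 10-second window and the count/p99/error summary mirror the idea above, while the class and method names are just placeholders for illustration. Each call records its duration locally, and a scheduled task flushes one summary per resource instead of one event per request.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of instrumentation with local aggregation: record every call,
// then flush one summary (count, p99, errors) per resource every 10 seconds.
public class MetricAggregator {

    private final Map<String, List<Long>> latencies = new ConcurrentHashMap<>();
    private final Map<String, Integer> errors = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public MetricAggregator() {
        scheduler.scheduleAtFixedRate(this::flush, 10, 10, TimeUnit.SECONDS);
    }

    // called from the resource client around every cache/database call
    public void record(String resource, long costMs, boolean success) {
        latencies.computeIfAbsent(resource,
                k -> Collections.synchronizedList(new ArrayList<>())).add(costMs);
        if (!success) {
            errors.merge(resource, 1, Integer::sum);
        }
    }

    private void flush() {
        for (Map.Entry<String, List<Long>> entry : latencies.entrySet()) {
            List<Long> window = entry.getValue();
            List<Long> snapshot;
            synchronized (window) {          // copy and reset the 10-second window
                snapshot = new ArrayList<>(window);
                window.clear();
            }
            if (snapshot.isEmpty()) {
                continue;
            }
            Collections.sort(snapshot);
            long p99 = snapshot.get((int) Math.ceil(snapshot.size() * 0.99) - 1);
            int errorCount = errors.getOrDefault(entry.getKey(), 0);
            errors.remove(entry.getKey());
            // in practice this summary is sent to the monitoring server
            System.out.printf("resource=%s count=%d p99Ms=%d errors=%d%n",
                    entry.getKey(), snapshot.size(), p99, errorCount);
        }
    }
}
```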

Lastly, logs are also an important source of monitoring data.

The access logs of well-known servers like Tomcat and Nginx are important monitoring logs. You can use open-source log collection tools to send the data in these logs to the monitoring server. Currently, there are many commonly used log collection tools, such as Apache Flume, Fluentd, and Filebeat. You can choose a familiar one to use. For example, in my project, I would prefer to use Filebeat to collect monitoring log data.

Processing and Storage of Monitoring Data #

After collecting the monitoring data, you can process and store it. Before doing so, it is generally recommended to put a message queue in front of the processing pipeline. The main purpose is to buffer traffic peaks and prevent a sudden surge of monitoring writes from overwhelming the monitoring service.
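As one possible way to do this, agents and instrumented clients can publish each metric payload to a queue and let the downstream consumers read it at their own pace. The lecture does not prescribe a specific queue, so the use of Kafka and the topic name "monitoring-metrics" below are assumptions for illustration.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch: publish metric payloads to a message queue (Kafka here) so that
// downstream consumers can absorb write spikes at their own pace.
public class MetricPublisher {

    private final KafkaProducer<String, String> producer;

    public MetricPublisher(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }

    public void publish(String resource, String metricJson) {
        // key by resource so one resource's metrics land in the same partition
        producer.send(new ProducerRecord<>("monitoring-metrics", resource, metricJson));
    }

    public void close() {
        producer.close();
    }
}
```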

At the same time, it is common to deploy two queue processing programs to consume the data from the message queue.

One processing program receives the data and writes it to Elasticsearch. The data is then displayed through Kibana. This data is mainly used for querying raw data.
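A minimal sketch of this first consumer, assuming a local Elasticsearch node and a placeholder index name "raw-metrics", simply indexes each raw metric document through Elasticsearch's REST API so Kibana can query it later.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: index one raw metric document into Elasticsearch over its REST API.
public class RawMetricIndexer {

    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    public static void index(String metricJson) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/raw-metrics/_doc"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(metricJson))
                .build();
        HttpResponse<String> response =
                CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Elasticsearch responded with status " + response.statusCode());
    }
}
```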

The other processing program is a stream-processing job built on a framework such as Spark or Storm. It receives data from the message queue and performs various processing operations, including:

- Parsing data formats, especially log formats, to extract information such as request volume, response time, and request URL (see the parsing sketch after this list).

- Performing various aggregation operations on the data. For example, for Tomcat access logs, you can calculate the request volume, response time percentile values, and non-200 request volume for the same URL within a certain time period.

- Storing the data in a time-series database. These databases are designed to store time-stamped data efficiently. Since monitoring data carries timestamps and arrives in ascending time order, it is a natural fit for a time-series database. Popular choices in the industry include InfluxDB, OpenTSDB, and Graphite; choose one you are familiar with.

- Finally, you can use Grafana to connect to the time-series database and visualize the monitoring data as dashboards for developers and the operations team.
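The parsing step referenced above might look like the sketch below. The log layout (client, time, method, URL, status, bytes, response time) is an assumed Nginx-style access log with the request time appended; adjust the pattern to whatever your access log actually contains.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the parsing step in the stream-processing program: pull the URL,
// status code, and response time out of one access-log line.
public class AccessLogParser {

    private static final Pattern LINE = Pattern.compile(
            "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+) [^\"]*\" (\\d{3}) (\\d+) ([\\d.]+)$");

    public static void main(String[] args) {
        String line = "192.168.1.1 - - [10/Oct/2023:13:55:36 +0800] "
                + "\"GET /api/product/detail HTTP/1.1\" 200 2326 0.012";
        Matcher m = LINE.matcher(line);
        if (m.matches()) {
            String url = m.group(4);
            int status = Integer.parseInt(m.group(5));
            double costSeconds = Double.parseDouble(m.group(7));
            // these fields feed the per-URL aggregation (request volume,
            // response-time percentiles, non-200 counts) described above
            System.out.printf("url=%s status=%d costMs=%.1f%n",
                    url, status, costSeconds * 1000);
        }
    }
}
```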

[Image: architecture of the monitoring data processing and storage pipeline]

At this point, you and your team have completed the entire process of setting up the monitoring system for the vertical e-commerce system. I would like to add a few more points. We have collected a lot of metrics from different data sources, and these metrics generally form the following reports in the monitoring system, which you can refer to in your actual work:

1. Access Trend Report: This type of report integrates web server and application server access logs, showing the overall access volume, response time, error quantity, bandwidth, and other information about the service. It mainly reflects the overall operation of the service and helps you identify problems.

2. Performance Report: This type of report integrates the data collected from resources and dependent services, showing the access volume and response time of the monitored resources. It reflects the overall performance of the resources. When you find a problem in the access trend report, you can first identify which resource or service is causing the issue by looking at the performance report.

3. Resource Report: This type of report mainly integrates the runtime data of resources collected by an agent. When you find a problem with a specific resource from the performance report, you can further investigate the problem in this report to determine whether it is due to abnormal connection count or a decrease in cache hit rate. This allows you to analyze the root cause of the problem and find solutions.

Course Summary #

In this lesson, I introduced you to the process of setting up server monitoring. Here are a few key points you need to understand:

  • Response time, request volume, and error count are the three most common monitoring metrics. Different components may have some additional specific monitoring metrics that you can use directly when building your own monitoring system.

  • Agent, instrumentation, and logs are the three most common data collection methods.

  • The access trend report is used to show the overall operation of the service, the performance report is used to analyze whether there are any issues with resources or dependent services, and the resource report is used to trace the root cause of resource problems. These three reports together constitute your server monitoring system.

In summary, a monitoring system is an important tool for discovering and troubleshooting issues. You should take it seriously and invest enough effort to keep improving it; only then can you steadily strengthen your control over system operations and reduce the risk of failures.