09 How to design global and targeted monitoring strategies #

Hello, I’m Gao Lou.

Looking back at the development of the software performance industry: over a decade ago, when performance testing first appeared in China, we focused only on tools. Whether in training or at work, as long as you mastered a performance testing tool, you could get by anywhere in the industry. At that time, knowing how to use LoadRunner was taken as proof that you could do performance work.

However, even now, after years of development in the performance testing industry, we can still see that on many occasions people are still talking about performance testing theory and mindset, and still talking about how to use and implement performance testing tools. Although there is also some explanation of performance monitoring data, most of it merely lists the data and describes phenomena such as "CPU at 90%", "insufficient memory", or "100 MB of I/O".

As for why the CPU is at 90%, how to pinpoint the specific cause, and what the solution is, most performance engineers do not know, and many do not even have an idea of where to start. This is the current state of the industry.

Some time ago, I saw a discussion in a WeChat group. Someone went for an interview for a performance position and was asked, “One night, the CPU of the production database suddenly spiked. How do you locate the cause of the problem?” The group was in a lively debate, with some saying it was due to a fixed batch execution plan, and others saying to look at monitoring data and slow SQL, etc. In short, it was just a bunch of people guessing.

Finally, the interviewer gave the answer directly: Redis had been overwhelmed, which put high pressure on the database and drove CPU usage up. Seeing this answer, some people felt that it had no direct logical relationship to the phenomenon described in the question.

This incident shows the problem that insufficient performance monitoring data causes: a lack of analytical evidence. I have always emphasized that there must be a complete analysis chain from phenomenon to conclusion. Only then can it be considered true performance analysis; otherwise, it is just guessing while pretending to analyze performance.

Currently, many enterprises (whether internet giants or financial institutions) appear to have comprehensive monitoring, but in reality there is no detailed monitoring hierarchy. Because monitoring is not refined, we often have to rerun scenarios and temporarily add monitoring tools just to capture the data we need.

Therefore, given the situation above, in today's class I want to talk with you about how to design global and targeted monitoring strategies. I hope you will come to understand the importance of designing monitoring strategies in advance, moving from a global perspective to a targeted one.

When designing monitoring strategies, our first step is to analyze the architecture and determine which points need to be monitored.

Architecture Analysis #

Let's start by listing the machines in the example system used in this course.

In Lesson 4, we already drew the system architecture, as shown below:

From the above information, we need to list the components that need to be monitored, as shown in the following table. Please note that each of the layers above is currently configured with only one instance. That does not mean one instance is all we need; in the subsequent testing process, we will add more instances as needed.

Selection of Monitoring Tools #

Based on the component list above, we need to choose the corresponding monitoring tools. One thing to note is that these are our first-level monitoring tools, that is, global monitoring tools. We cannot yet determine the specific tools needed for targeted monitoring, because that depends on the issues identified during the performance analysis process.

After identifying the issues through global monitoring counters, we need to analyze more detailed monitoring data to pinpoint the specific causes of the problems. However, global monitoring counters cannot provide this level of detail, so we need to choose appropriate targeted monitoring tools to obtain more specific monitoring data, which I call targeted monitoring.

The difference between global monitoring and targeted monitoring is not hard to understand: global monitoring is the first level of monitoring, and it highlights the key performance of each module of a technical component. The operating system's CPU usage, for example, is a typical global monitoring counter.

Now let’s see how to choose a global monitoring tool.

  • Global Monitoring Strategy and Tool Selection

We mentioned earlier that global monitoring is used to determine where the bottleneck of the entire system is, but it cannot provide the specific cause. Based on this, there are several key points to consider when choosing a global monitoring tool:

  1. Accuracy of data: This is very important, because for performance counters, the accuracy of the data directly determines the next steps.

  2. Low cost: Cost includes both fees and labor costs. Whether it is a ready-made paid product, a free product, a self-developed product, or a combination product, the cost is easy to calculate, so I won’t go into too much detail. As for labor costs, we can simply calculate them based on the employees’ salaries and time. If it is a temporary project, I recommend choosing a popular and widely used monitoring tool.

  3. Wide coverage: The monitoring tool should cover all the counters listed in the performance analysis decision tree. If the tool's capabilities are limited and there is no time to extend it, then after choosing the tool we need to identify which counters it cannot monitor, and use commands to compensate for those gaps during performance analysis (a simple coverage check is sketched after this list).

  4. Historical data retention: Real-time viewing of performance data is necessary in performance projects, and the ability to retain historical monitoring data is also crucial. This is because we will use historical data when conducting performance analysis and generating performance reports after the scenario is completed.
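To make the coverage point concrete, here is a minimal sketch, not part of the original course material, that compares the counters (leaves) of a performance analysis decision tree against the counters the chosen tools can actually collect. All of the counter names and tool names below are hypothetical placeholders; the point is the set comparison, which tells us which counters we would have to collect with commands instead.

```python
# Minimal sketch: check whether the chosen global monitoring tools cover
# every counter (leaf) in the performance analysis decision tree.
# All counter and tool names are hypothetical placeholders.

decision_tree_counters = {
    "os.cpu.usage", "os.memory.available", "os.disk.io", "os.network.bandwidth",
    "jvm.heap.used", "jvm.gc.pause", "mysql.slow_queries", "redis.hit_ratio",
}

tool_counters = {
    "node_exporter":   {"os.cpu.usage", "os.memory.available", "os.disk.io", "os.network.bandwidth"},
    "jvm_exporter":    {"jvm.heap.used", "jvm.gc.pause"},
    "mysqld_exporter": {"mysql.slow_queries"},
}

covered = set().union(*tool_counters.values())
missing = decision_tree_counters - covered

print("Covered counters:", sorted(covered))
print("Counters to collect with commands or extra tools:", sorted(missing))
```

In this hypothetical example, redis.hit_ratio is not covered by any chosen tool, so we would note it down in advance and plan to collect it with a command (for example, redis-cli info) when the scenario runs.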

Based on these points, we can now select the corresponding global monitoring tools.

According to the architecture of this system, the selected tools should monitor the following levels: the first level, physical hosts; the second level, KVM virtual machines; the third level, the Kubernetes suite; the fourth level, various technical components required by applications.

Therefore, the corresponding monitoring tools are shown in the table below.

All of the above tools are free and open source, and they can fully meet our monitoring needs; we just need to deploy them. As for the operating system monitoring tools, we already discussed their limitations in the RESAR performance analysis logic in Lesson 4. If you have forgotten, you can go back and review it.

In our system, both the physical hosts and the KVM virtual machines run complete operating systems, so we can fully cover them with the node_exporter mentioned in Lesson 4. But when it comes to Kubernetes, how can we achieve comprehensive monitoring? This is where a Kubernetes monitoring suite comes in. Let's take a look at one such global monitoring suite for Kubernetes, as shown below:


There are many similar templates available, and I won't list them all. Although these tools present data in different ways, they all achieve the goal of globally monitoring Kubernetes, so we just need to choose a Kubernetes monitoring suite that suits our business system. In fact, choosing monitoring suites that meet the needs of every layer is the real challenge in global monitoring.

In global monitoring, I have always emphasized the word "layered", because in the projects I have been involved in, people often say, "Our monitoring is complete." But when I take a look myself, I only see operating-system-level data, with nothing about what is running on top of it.
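To show what layered global monitoring can look like in practice, here is a minimal sketch assuming a Prometheus-based setup; the Prometheus address and the exact label sets are assumptions on my part and will differ in your environment. It queries one OS-level counter from node_exporter and one container-level counter from the Kubernetes metrics pipeline, so that both layers are visible side by side instead of only the operating system layer.

```python
# Minimal sketch: query one counter per monitoring layer from Prometheus.
# Assumes a Prometheus server at localhost:9090 that scrapes node_exporter
# and the Kubernetes cAdvisor/kubelet metrics; adjust for your environment.
import requests

PROMETHEUS = "http://localhost:9090/api/v1/query"  # assumed address

QUERIES = {
    # OS layer (node_exporter): CPU busy ratio per node over the last 5 minutes
    "node_cpu_busy": '1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))',
    # Kubernetes layer (cAdvisor): CPU usage per pod over the last 5 minutes
    "pod_cpu_usage": 'sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))',
}

for name, promql in QUERIES.items():
    resp = requests.get(PROMETHEUS, params={"query": promql}, timeout=5)
    resp.raise_for_status()
    for sample in resp.json()["data"]["result"]:
        labels = sample["metric"]
        value = float(sample["value"][1])
        print(f"{name}: {labels} = {value:.3f}")
```

If a check like this only ever returns node-level data, that is exactly the "operating system data only" situation described above, and the Kubernetes layer of the monitoring design still has a gap.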

Here is an example from my own experience. I was giving training at a financial institution, and they told me there was a problem in production and asked me to analyze it. At the same time, they confidently told me, "Our monitoring data is very complete; we just don't know where the problem is."

However, when I looked at the data, I found that there was nothing at the Java thread level, and their monitoring platform did not support detailed monitoring at that level. Yet from the system-level data, it was clearly a thread problem, so they had to collect the data again. Once the data had been re-collected, it became clear where the problem was.

This is a typical example of missing global monitoring data, which leads to a broken analysis chain. Therefore, the integrity of global monitoring is a very important part of performance analysis.

  • Targeted Monitoring Strategy and Tool Selection

Once global monitoring is in place, performance scenarios can be executed. However, when we encounter problems, the global monitoring data only shows us the first-layer counters, such as high CPU usage, insufficient memory, high I/O, and heavy network bandwidth usage. From this information alone, we cannot know what to optimize in order to reduce CPU usage, memory consumption, I/O, or network usage.

Therefore, at this point we must perform targeted monitoring in order to find more detailed evidence. In RESAR performance engineering, I separate the data into global and targeted because performance analysis follows a logical chain. If we make no distinction and just look at everything, we end up with a mass of data and no idea which pieces are the key data.

Please note that in my analysis philosophy, global and targeted monitoring must be separated. Global monitoring data is something we keep collecting and saving over a period of time, and doing so does not have a significant impact on the overall performance of the system. If we continuously collected targeted data, however, it would affect the system's overall performance: collecting thread stack data, object memory consumption data, and the like has a real performance cost, no matter how perfect the tools claim to be. We have enough evidence from practice to prove this.

However, many monitoring tools on the market do not distinguish between global and targeted monitoring, so you can also see data required for targeted monitoring in the list of global monitoring tools mentioned earlier. For example, when using JVisualVM to monitor Java, we can see not only global information such as CPU, JVM, classes, and threads, but also targeted information such as stacks, methods, and objects.

For Java microservice applications, the tools listed in the table already let us see fairly detailed data. Tools like Spring Boot Admin, JVisualVM, and the other monitoring tools that ship with the JDK can provide method-level and object-level monitoring. If we find any gaps during use, we can also consider other targeted monitoring tools.
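As a concrete example of targeted data that we prepare but only collect on demand, here is a minimal sketch that uses the JDK's own command-line tools: jstack for thread stacks and jmap -histo for an object histogram, run against a live Java process. The PID and file naming are assumptions for illustration; the point is that this collection is triggered only when the global counters point at a thread or memory problem, not run continuously.

```python
# Minimal sketch: on-demand targeted collection for a Java process using
# JDK command-line tools. Run this only when global monitoring points at a
# thread or memory problem; continuously collecting this data is costly.
import subprocess
import time
from collections import Counter

PID = "12345"  # hypothetical PID of the target Java process (find it with `jps`)

# 1. Thread stacks: capture with jstack and summarize thread states.
stack_dump = subprocess.run(["jstack", PID], capture_output=True, text=True).stdout
states = Counter(
    line.split()[1]
    for line in stack_dump.splitlines()
    if line.strip().startswith("java.lang.Thread.State:")
)
print("Thread states:", dict(states))

# 2. Object memory consumption: class histogram via jmap -histo.
histo = subprocess.run(["jmap", "-histo", PID], capture_output=True, text=True).stdout
print("Top of the class histogram (instances / bytes / class name):")
print("\n".join(histo.splitlines()[:15]))

# Save the raw thread dump with a timestamp so captures can be compared later.
ts = time.strftime("%Y%m%d-%H%M%S")
with open(f"jstack-{PID}-{ts}.txt", "w") as f:
    f.write(stack_dump)
```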

For Spring Cloud microservices, what we also need to see while the services are handling requests is the business call chain across services, and the tools mentioned above, which monitor individual microservices, cannot provide this.

Therefore, here I use the APM tool SkyWalking to monitor the call chain.

In SkyWalking, we can see not only the chain diagram but also more detailed data. This diagram is the chain diagram in SkyWalking, and I classify it as global monitoring.

The following diagram shows more detailed data seen with the SkyWalking tool, and I define this type of data as targeted monitoring data.

This graph shows an intermediate stage of targeted analysis. From it, we can see the time consumed by each segment of a request, such as a call to another interface, a JDBC query, a Redis operation, and other downstream components. When we find a segment that takes a long time, we can move to the component that consumes the time and continue the analysis based on its targeted monitoring data.
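To illustrate the step of going from a slow request to the slow segment, here is a minimal sketch that works on a request's spans exported as plain data. The field names and values below are hypothetical and are not SkyWalking's own data model; the point is simply to rank the segments of one trace by the time they consume before deciding where to apply targeted monitoring next.

```python
# Minimal sketch: rank the segments of one request trace by consumed time.
# The span structure and values are hypothetical, not SkyWalking's data model.
# Note that a parent span's duration includes its children, so in a real
# analysis we would also look at each span's self (exclusive) time.

trace_spans = [
    {"operation": "GET /order/create",     "component": "gateway", "duration_ms": 812},
    {"operation": "OrderService.create",   "component": "http",    "duration_ms": 790},
    {"operation": "select * from t_order", "component": "jdbc",    "duration_ms": 650},
    {"operation": "GET stock:sku-1001",    "component": "redis",   "duration_ms": 12},
]

for span in sorted(trace_spans, key=lambda s: s["duration_ms"], reverse=True):
    print(f'{span["duration_ms"]:>6} ms  {span["component"]:<8} {span["operation"]}')
```

In this made-up trace, the JDBC segment accounts for most of the request time, so the database would be the next place to collect targeted monitoring data.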

From the above content, we know what data targeted monitoring needs. So after analyzing the system architecture, we also need to select the appropriate targeted monitoring tools and prepare them all in advance, to avoid having no tools at hand when problems occur. However, please remember that targeted monitoring is only prepared in advance; it is not used from the very beginning.

Here, I have listed the targeted analysis tools that may be used in our example system. I mainly consider covering system-level, code-level, database-level, and cache-level monitoring.

With this, we don’t have to scramble to find tools in subsequent performance analysis work.

Conclusion #

In my logic, global and targeted monitoring must be separated; I have emphasized this to you before. Not separating them wastes resources, and the data we need may still end up missing.

Furthermore, please note that the comprehensiveness of monitoring depends directly on the construction of the project-level performance analysis decision tree. In other words, the key is not what tools are used, but whether these monitoring tools cover all the leaves of the performance analysis decision tree.

When choosing monitoring tools, we mainly consider factors such as cost, coverage, layering, and continuity of use. Only with a reasonable monitoring strategy and the right monitoring tools can the performance analysis decision tree truly be implemented, and only then does searching for the evidence chain of a performance bottleneck become possible.

Finally, I would like to remind you not to assume that monitoring at the level of technical components is enough. You also need to cover the modules within each technical component and the counters corresponding to those modules, because when analyzing bottlenecks we need to find the correlations between counters. If even one counter is missing, the analysis chain will be interrupted.

Homework #

That’s all for today’s content. I have left you with two questions to reinforce what we have learned today. Please take some time to think about them:

  1. How can you determine whether the performance monitoring tool you have chosen covers the entire performance analysis decision tree?
  2. Why is it not recommended to choose more targeted monitoring and analysis tools?

Remember to discuss and exchange your thoughts with me in the comments section. Each reflection will help you make further progress.

If you have gained something from this class, feel free to share it with your friends and learn and improve together. See you in the next lecture!