05 Analysis How to Determine if a Problem is Caused by Page Cache #

Hello, I am Shaoyafang.

In the previous few lessons, we covered some basics of Page Cache and how to deal with issues it can cause. In this lesson, we will talk about how to determine whether a problem is caused by Page Cache in the first place.

We know that a problem often involves many modules of the operating system. For example, when the system encounters high load, it may be caused by Page Cache, by intense lock contention that starves the physical resources (CPU, memory, disk I/O, network I/O), by design defects in kernel features, and so on.

If we fail to determine the cause of the problem and take measures rashly, not only will we be unable to solve the problem, but it may also lead to other negative effects. For example, if the high load is actually caused by Page Cache, but you mistakenly believe it is caused by the network and then apply network throttling, it may seem like the problem is solved. However, over time, the high load will still occur, and this throttling behavior will reduce the system’s load capacity.

So, when a problem occurs, how do we determine if it is caused by Page Cache or not?

Typical Methods for Analyzing Linux Issues #

There are several typical methods for analyzing issues on Linux. By starting with these basic analysis methods, you can step by step determine the root cause of the problem. These analysis methods can be summarized as follows:

From this diagram, we can see that the Linux kernel mainly exposes system information to users through /proc and /sys. When you are unsure of the cause of a problem, you can read the information in these directories to see which indicators look abnormal. For example, if you are unsure whether the problem is caused by Page Cache, you can read /proc/vmstat and check which indicators change significantly over a unit of time. If the pgscan-related counters change a lot, the problem may well be caused by Page Cache: pgscan reflects page-reclaim scanning, and large changes usually mean the system is under heavy memory pressure and is busy reclaiming Page Cache.
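As a minimal sketch, you can sample these counters yourself. The loop below is only illustrative, but pgscan_kswapd and pgscan_direct are standard counters in /proc/vmstat:

# Print the pgscan-related counters once per second; if the deltas are
# consistently large, the system is busy scanning pages for reclaim
$ while true; do grep -E '^pgscan' /proc/vmstat; echo '----'; sleep 1; done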

The information in /proc and /sys can point the analysis in the right direction and tell us whether the problem is caused by Page Cache. However, if we want to dig deeper and understand how Page Cache causes the problem, we need to master more professional analysis methods and tools such as ftrace, eBPF, and perf.

Of course, the learning cost of these professional tools is relatively high, but you should not avoid learning them just because you find them difficult or costly. Once you have mastered these analysis tools, you will be more proficient at analyzing difficult problems.

To help you find the appropriate analysis tools more conveniently when encountering issues, I borrowed a diagram from Brendan Gregg's presentation and made some modifications based on my own experience to guide you on how to use these analysis tools:

In this diagram, tracing methods are divided into static tracing (predefined trace points) and dynamic tracing (which requires probes):

  • If there are predefined trace points for what you want to trace, you can use these predefined trace points directly.
  • If there are no predefined trace points, check whether you can achieve the same thing with probes (kprobe for kernel functions, uprobe for user-space functions); a minimal sketch of both approaches follows this list.
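Here is a minimal sketch of both approaches, assuming debugfs is mounted at /sys/kernel/debug. The kprobe below attaches to try_to_compact_pages purely as an illustration, so verify that the symbol exists in /proc/kallsyms on your kernel first:

# 1. Static tracing: check whether a predefined tracepoint already exists
$ grep compaction /sys/kernel/debug/tracing/available_events

# 2. Dynamic tracing: if there is no tracepoint, attach a kprobe to a
#    kernel function instead (the function name is only an example)
$ echo 'p:my_compact try_to_compact_pages' > /sys/kernel/debug/tracing/kprobe_events
$ echo 1 > /sys/kernel/debug/tracing/events/kprobes/my_compact/enable
$ cat /sys/kernel/debug/tracing/trace_pipe

# Clean up when done
$ echo 0 > /sys/kernel/debug/tracing/events/kprobes/my_compact/enable
$ echo > /sys/kernel/debug/tracing/kprobe_events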

Analysis tools themselves can also affect the running business (this is the observer effect behind so-called Heisenbugs): strace, for example, significantly slows down the traced process, and systemtap incurs compilation and loading overhead. Before using these tools, make sure you understand their side effects in detail so they do not introduce unexpected problems.

For example, many years ago, when I was using systemtap in guru mode, I did not account for the fact that if the systemtap process exited abnormally, its module might not be unloaded, which could lead to a system panic.

These are some typical analysis methods for Linux issues. By understanding these methods, you will know which tools to choose for analysis when encountering problems. As for Page Cache, we can first make a rough judgment by using /proc/vmstat, and then conduct a more in-depth analysis by combining Page Cache’s tracepoints.

Next, let’s analyze two specific problems together.

Is the high load caused by Page Cache? #

I believe you have encountered this scenario before: the business is running steadily but suddenly experiences significant performance fluctuations, or the system is running steadily but suddenly has a high load value. How do you determine if this problem is caused by Page Cache? Here, based on my years of experience, I have summarized some steps for analysis.

The first step is to get an overview of the system. For Page Cache-related issues, I recommend using sar to collect that overview: it ships with the sysstat package on most systems and is simple and convenient to use.

In Lecture 1 of the course, I mentioned some uses of sar, such as analyzing paging statistics with sar -B and memory utilization with sar -r. Here, I especially recommend the PSI (Pressure-Stall Information) statistics recorded by sar to check memory pressure, which is the pressure the business ultimately feels as higher load. Note that this feature requires kernel 4.20 or above and sysstat 12.3.3 or above. For example, the following output represents memory pressure in PSI:

some avg10=45.49 avg60=10.23 avg300=5.41 total=76464318
full avg10=40.87 avg60=9.05 avg300=4.29 total=58141082

Pay attention to the avg10 field, which is the average memory pressure over the last 10 seconds (the some line means at least one task was stalled on memory, the full line means all non-idle tasks were stalled). If it is high (for example, greater than 40), the high load is very likely caused by memory pressure, especially Page Cache pressure.
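These numbers come from the kernel's PSI interface, so on a kernel with PSI enabled you can also read them directly, even without a new enough sysstat:

# Requires kernel >= 4.20 with PSI enabled (some kernels need the psi=1
# boot parameter, depending on how they were configured)
$ cat /proc/pressure/memory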

After understanding the overview, we need to further investigate which behavior of Page Cache is causing system pressure.

sar only collects a set of common metrics and does not cover every Page Cache behavior. Some of the behaviors it misses, such as memory compaction and changes in the business working set, can easily drive the load up, so when we want to analyze the specific cause we need to collect these metrics ourselves. Usually, when Page Cache misbehaves, one or more of these metrics will be abnormal. Here are some common ones:

[Table: common Page Cache-related metrics in /proc/vmstat]

After collecting these metrics, we can analyze what caused the Page Cache exception. For example, when compact_fail rises sharply over a period of time, it usually means the system is badly fragmented and has trouble allocating contiguous physical memory. In that case, you can adjust the fragmentation index threshold (vm.extfrag_threshold) or manually trigger memory compaction to relieve the pressure caused by fragmentation.
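As a quick sketch of what that looks like in practice (these are standard kernel interfaces, but triggering compaction manually can itself cause a latency spike, so do it with care):

# Watch the compaction-related counters in /proc/vmstat
$ grep compact /proc/vmstat

# Check the current fragmentation index threshold
$ sysctl vm.extfrag_threshold

# Manually trigger memory compaction on all zones (avoid doing this at peak time)
$ echo 1 > /proc/sys/vm/compact_memory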

The metrics collected above can help us identify the specific problem point. Sometimes, however, we also need to know what is requesting the contiguous memory in order to make more targeted adjustments, and that requires further observation. For this more detailed analysis we can use the tracepoints predefined in the kernel, such as the following:

[Table: Page Cache-related tracepoints]

Let’s continue with memory compaction as an example to see how to observe it using tracepoints:

# First, enable some tracepoints related to compaction
$ echo 1 > /sys/kernel/debug/tracing/events/compaction/mm_compaction_begin/enable
$ echo 1 > /sys/kernel/debug/tracing/events/compaction/mm_compaction_end/enable 

# Then, read the information; information will be output when compaction events are triggered
$ cat /sys/kernel/debug/tracing/trace_pipe
<...>-49355 [037] .... 1578020.975159: mm_compaction_begin: 
zone_start=0x2080000 migrate_pfn=0x2080000 free_pfn=0x3fe5800 
zone_end=0x4080000, mode=async
<...>-49355 [037] .N.. 1578020.992136: mm_compaction_end: 
zone_start=0x2080000 migrate_pfn=0x208f420 free_pfn=0x3f4b720 
zone_end=0x4080000, mode=async status=contended

From this output, we can see that process 49355 triggered compaction. By subtracting the begin timestamp from the end timestamp (1578020.992136 - 1578020.975159 ≈ 0.017 s), we can see that this compaction run delayed the business by about 17 ms.
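If compaction triggers frequently, doing this subtraction by hand quickly becomes tedious. Here is a rough awk sketch that pairs begin/end events per task and prints the latency; in the raw trace_pipe output each record is on a single line, and the field positions assume the default format shown above, so they may need adjusting on your system:

$ cat /sys/kernel/debug/tracing/trace_pipe | awk '
    # $1 is the task ("<...>-49355"), $4 is the timestamp ("1578020.975159:")
    /mm_compaction_begin/ { begin[$1] = $4 + 0 }
    /mm_compaction_end/   { if ($1 in begin)
                                printf "%s compaction latency: %.3f ms\n", $1, ($4 - begin[$1]) * 1000 }'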

In many cases the amount of collected information is large, so it is more efficient to process it with automated analysis tools. For example, I once wrote a perf script to analyze the latency that direct memory reclaim causes for the business. You can also use the bcc (eBPF) based direct reclaim snoop tool, drsnoop, to observe the latency that direct reclaim imposes on processes.
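For reference, a typical invocation might look like the following; the install path and command name vary by distribution (for example, it is packaged as drsnoop-bpfcc on Ubuntu), and the tool needs a kernel with eBPF support:

# Trace direct reclaim events and the latency they add to each process
$ /usr/share/bcc/tools/drsnoop
# Or limit the output to a single process of interest
$ /usr/share/bcc/tools/drsnoop -p <pid>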

Was a past spike in load caused by Page Cache? #

The problem above was happening in real time, which makes analysis relatively easy because the scene is still there to be observed. Sometimes, however, we cannot collect on-site information in time. For example, if a problem occurs in the middle of the night and nobody captures the scene, we can only rely on historical records.

In that case, we can determine what happened at the time from the sar logs. I have encountered a similar issue before.

Once, a business reported that its response time (RT) fluctuated significantly and asked me to analyze the cause. I compared the RT fluctuations with the moments when pgscand in sar -B was non-zero and found that they usually lined up, so I inferred that the RT fluctuations were related to Page Cache reclamation. I then asked the business side to adjust vm.min_free_kbytes to verify the theory. After they raised it from the initial 90112 to 4 GB (4194304), the effect was immediate and the fluctuations almost disappeared.

Here, I would like to reiterate that adjusting vm.min_free_kbytes carries some risk. Raising it makes the kernel keep more memory free, so reclaim starts earlier and more aggressively; if free memory is already tight, setting it too high may trigger the OOM killer or even hang the system. Therefore, before changing the value, run some checks to make sure the adjustment is safe.
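As a minimal sketch of those checks, you could first look at how much memory is genuinely free before raising the watermark, and only then apply the change:

# How much memory is free/available right now, and what is the current watermark?
$ grep -E 'MemFree|MemAvailable' /proc/meminfo
$ cat /proc/sys/vm/min_free_kbytes

# The value is in KB, so 4 GB is 4194304; make it persistent in
# /etc/sysctl.conf only after it has proven safe in production
$ sysctl -w vm.min_free_kbytes=4194304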

Of course, if you have a newer sysstat and kernel, you can also check the PSI records in the sar logs to see whether they line up with the business fluctuations. Based on this information from sar, we can infer whether the issue is related to Page Cache.

Since we rely on sar logs for this kind of after-the-fact evaluation, the logs need to be rich enough. You should summarize the issues you run into often and make sure the indicators related to them are recorded in the logs, so that later analysis can be more complete, especially for problems that recur frequently.

For example, our business used to fluctuate frequently. After analyzing it with the methods above and finding that the fluctuations were caused by frequent compaction, we started recording the compaction-related indicators from /proc/vmstat (including the specific ones in the table above) in our log system. The next time the business fluctuates, we can tell from the logs whether compaction is involved.
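A very simple way to do this is a periodic snapshot; the crontab entry below is only a sketch, and the log path is hypothetical:

# Append a timestamped snapshot of the compaction counters every minute
* * * * * (date; grep compact /proc/vmstat) >> /var/log/vmstat-compact.log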

Class Review #

Okay, we have come to the end of this lesson. Let’s do a quick review. In this class, we discussed the analytical methodology for Page Cache issues. By following this methodology, we can analyze almost all Page Cache-related problems and understand the memory access patterns of the business, which helps us optimize the business better.

Of course, this analysis methodology is not only applicable to problems caused by Page Cache but also to problems caused by other aspects of the system. Let’s review these key points again:

  • When observing the behavior of Page Cache, start with simple and easy-to-use tools such as sar to get an overview, then use more specialized facilities such as tracepoints for detailed analysis. This way you can understand the detailed behavior of Page Cache and why it causes problems.
  • For many sporadic problems, it often requires collecting a lot of information to capture the problem scene. In such scenarios, it is best to use perf script to write some automated analysis tools to improve efficiency.
  • If you are concerned that the analysis tools will affect the performance of the production environment, you can collect the information and perform offline analysis or use eBPF for automatic filtering and analysis. Please note that eBPF requires support from a higher version of the kernel.

This is the methodology I have accumulated for problem diagnosis. I also hope that you will not evade problems when encountering them and that you will seek the root causes. I believe you will have your own problem analysis methodology and will be able to quickly and efficiently identify the causes when problems arise.

Homework #

Assuming that the memory is tight now and many processes are performing direct memory reclamation, how can we determine which processes are performing direct memory reclamation? Please share your thoughts in the comments section.

Thank you for reading. If you found this lesson helpful, please feel free to share it with your friends. See you in our next lecture.