04 Case Study How to Address Business Performance Issues Triggered by Easily Reclaimable Page Cache #

Hello, I’m Shaoyafang. In the previous lesson, we talked about the problem of high load caused by the difficulty of reclaiming Page Cache. This type of problem is very intuitive and I believe that many people have encountered it. In this lesson, we will discuss the opposite problem, which is the problem caused by the easy reclamation of Page Cache.

Because this type of problem is not intuitive, it hides many pitfalls, and application developers and operations engineers fall into them easily. Often, they only realize what the problem was after being bitten by it repeatedly.

I will summarize the common problems you often encounter, which can be roughly divided into two aspects:

  • Mistaken operations lead to the reclamation of Page Cache, resulting in a significant decrease in business performance.
  • Some mechanisms of the kernel cause the business Page Cache to be reclaimed, thereby causing performance degradation.

If your business is sensitive to Page Cache, for example, if your business data is latency-sensitive, or more specifically, if your business metrics have strict TP99 (99th percentile) requirements, then you have probably run into some of these performance issues. Of course, this does not mean that if your business is not latency-sensitive you can ignore them: paying attention to these issues will deepen your understanding of your business's behavior.

To get to the point, let’s take a look at some case studies that occurred in a production environment.

Performance degradation caused by improper handling of Page Cache operations #

Let’s start with a relatively simple case to analyze how improper operations can lead to the Page Cache being dropped.

As we know, the Page Cache can be cleared by writing to /proc/sys/vm/drop_caches. When people see a large amount of Page Cache in the system, many are tempted to clear it with drop_caches. However, this approach can have negative consequences, such as a noticeable performance drop after the Page Cache is cleared. Why is that?

This is actually related to inodes. What is an inode? An inode is the in-memory index of a disk file: when a process looks up or reads a file, it does so through the file's inode. The following diagram illustrates this relationship:

As shown in the diagram, the process uses the inode to find the file's address space, and the file offset (converted into a page index) then locates the specific page. If the page exists, the file content has already been read into memory; if it does not, the content must be read from disk. You can think of the inode as the host of the Page Cache pages: if the inode ceases to exist, its Page Cache pages cease to exist with it.

If you have used drop_caches to release inodes, you should be aware that it has several control options: by writing different values you can release different types of caches (user-data Page Cache, reclaimable kernel slab objects, or both). You can check the kernel documentation for a description of these options.
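For reference, the three values behave as follows. Note that these writes require root and will actually drop caches on the machine where you run them, so treat this as an illustration rather than something to run casually:

```shell
# Flush dirty pages first -- drop_caches only frees clean objects.
sync
echo 1 > /proc/sys/vm/drop_caches   # drop the (clean) Page Cache
echo 2 > /proc/sys/vm/drop_caches   # drop reclaimable slab objects,
                                    # including dentries and inodes
echo 3 > /proc/sys/vm/drop_caches   # drop both
```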

This introduces a problem that is easy to overlook: when we execute echo 2 to drop the slab, it also drops the Page Cache, because reclaimable slab includes dentries and inodes, and dropping an inode drops the Page Cache pages attached to it. Many operators overlook this.

When the system is low on memory, operators or developers may want to free up some memory by using drop_caches. However, because they are aware that releasing the Page Cache will affect the performance of the system, they only want to drop the slab without dropping the Page Cache. So at this point, many people run echo 2 > /proc/sys/vm/drop_caches, but the result is unexpected: the Page Cache is also released, leading to a noticeable performance degradation.

Many people have encountered this scenario: the system is running normally when, suddenly, the Page Cache is released. Unsure of the cause, many suspect that someone or some program executed drop_caches. Is there a way to observe the release of inodes and the Page Cache that goes with them? Yes, and we will cover it in the next lesson. For now, let's look at how to check whether someone or some program has executed drop_caches.

Since drop_caches is a memory event, the kernel records this event in /proc/vmstat, so we can use /proc/vmstat to determine whether drop_caches has been executed.

$ grep drop /proc/vmstat
drop_pagecache 3
drop_slab 2

As shown above, drop_pagecache 3 means the Page Cache has been dropped 3 times (via echo 1 or echo 3), and drop_slab 2 means the slab has been dropped 2 times (via echo 2 or echo 3). If these two values do not change between before and after the problem occurs, we can rule out drop_caches; otherwise, the Page Cache reclamation can be attributed to it.

In addition to thinking twice before executing drop_caches, there are other ways to address this type of problem. Before discussing the solutions, let's look at a slightly more complex case. Its solution is similar to the one for the case above; the difference is that it involves more complex kernel mechanisms.

Performance degradation caused by kernel mechanisms reclaiming the Page Cache #

As mentioned earlier, memory reclamation is triggered when memory becomes scarce. During memory reclamation, the system attempts to reclaim reclaimable memory, which includes both Page Cache and reclaimable kernel memory (such as slab). We can use the following diagram to illustrate this process:

Let me explain the diagram briefly. The "reclaimer" is the entity performing memory reclamation; it can be a kernel thread (such as kswapd) or a user thread entering direct reclaim. During reclamation, the reclaimer scans pagecache pages and slab pages to identify which ones can be reclaimed; reclaimable pages are reclaimed, and the rest are skipped. At the start, the reclaimer scans only a small proportion of the reclaimable pages, then gradually increases the proportion until enough memory has been freed or everything has been scanned. This is the general process of memory reclamation.

Now let’s discuss a case that occurs in “reclaim slab”. As we have learned from the previous case, if an inode is reclaimed, its corresponding Page Cache will also be reclaimed. Therefore, if the inode of a file being accessed by a business process is reclaimed, all Page Cache associated with that file will be released, which can lead to performance issues.

Is there a way to observe this behavior? Yes, you can observe it through /proc/vmstat, which is a powerful tool for monitoring various aspects of the system (that’s why I mentioned earlier that kernel developers are more accustomed to monitoring /proc/vmstat).

$ grep inodesteal /proc/vmstat
pginodesteal 114341
kswapd_inodesteal 1291853

The relevant events here are the two "inodesteal" counters shown above. kswapd_inodesteal counts pagecache pages freed because their inodes were reclaimed by kswapd, while pginodesteal counts pagecache pages freed because their inodes were reclaimed in direct reclaim by threads other than kswapd. So if you find that your business's Page Cache is being released, you can monitor these events to determine whether inode reclamation is the cause.

Now that we understand how Page Cache can be reclaimed and know how to observe it, let’s look at how to address these issues.

How to Avoid Performance Issues Caused by Page Cache Reclamation? #

When analyzing an issue, we often wonder whether the problem lies with our module or with someone else's, in other words, whether we should modify our own module to solve it or modify other modules. Similarly, there are two approaches to preventing important Page Cache data from being reclaimed:

  • Optimize at the application code level;
  • Adjust at the system level.

Solving it at the application level is the more thorough approach, because only the application knows which Page Cache is important and which is not, so it can treat the Page Cache generated during file reads and writes differently. For important data, we can protect it from being reclaimed or dropped with mlock(2); for less important data, such as logs, we can use madvise(2) to tell the kernel that the pages can be released promptly.

Let’s look at an example of protecting important data from being reclaimed or dropped using mlock(2):

#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <string.h>
#include <fcntl.h>


#define FILE_NAME "/home/yafang/test/mmap/data"
#define SIZE (1024*1000*1000)


int main()
{
    int fd; 
    char *p; 
    int ret;

    fd = open(FILE_NAME, O_CREAT|O_RDWR, S_IRUSR|S_IWUSR);
    if (fd < 0)
        return -1; 

    /* Set size of this file */
    ret = ftruncate(fd, SIZE);
    if (ret < 0)
        return -1; 

    /* The current offset is 0, so we don't need to reset the offset. */
    /* lseek(fd, 0, SEEK_CUR); */

    /* Mmap virtual memory */
    p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE, MAP_FILE|MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return -1;

    /* Alloc physical memory */
    memset(p, 1, SIZE);

    /* Lock this memory to prevent it from being reclaimed */
    if (mlock(p, SIZE) < 0)
        return -1;

    /* Wait until we kill it specifically */
    while (1) {
        sleep(10);
    }

    /*
     * Unmap the memory.
     * The kernel will unmap it automatically after the process
     * exits, whether or not we call munmap() explicitly.
     */
    munmap(p, SIZE);

    return 0;
}

In this example, we use mlock(2) to lock the Page Cache corresponding to the content of the file FILE_NAME. After running the program, we can confirm the behavior, namely whether these Page Cache pages are protected and how much memory is protected, through /proc/meminfo:

$ egrep "Unevictable|Mlocked" /proc/meminfo 
Unevictable:     1000000 kB
Mlocked:         1000000 kB

You will find that neither drop_caches nor memory reclamation can reclaim these pages, so our goal has been achieved.

In some cases, modifying the source code of the application can be inconvenient. If we can achieve the goal without modifying the source code, it would be better. The Linux kernel also provides a mechanism, called memory cgroup protection, to protect important data from the system level without modifying the application’s source code.

The basic idea is to use memory cgroup to protect the application that needs to be protected, so that the Page Cache generated during the file read/write process of the application will be protected and not be reclaimed or be reclaimed last. The principle of memory cgroup protection is roughly illustrated in the following diagram:

memory cgroup protection

As shown in the above diagram, memory cgroup provides several memory watermarks memory.{min, low, high, max}:

  • memory.max - This indicates the maximum memory that processes within the memory cgroup can allocate. If not set, there is no memory size limitation by default.

  • memory.high - If this is set, once the memory usage of processes within the memory cgroup exceeds this value, reclamation is triggered immediately. The purpose of this setting is to reclaim inactive Page Cache as early as possible.

  • memory.low - This setting protects important data. When the memory usage of processes within the memory cgroup is below this value and reclamation is triggered by memory shortage, Page Cache that does not belong to this cgroup is reclaimed first; only after that other Page Cache has been reclaimed will this cgroup's Page Cache be reclaimed.

  • memory.min - This setting also protects important data, but more strictly than memory.low: even after all Page Cache outside this cgroup has been reclaimed, this cgroup's Page Cache will still not be reclaimed. You can think of it as protecting the highest-priority data.

So, if you want to protect your Page Cache from being reclaimed, you can consider putting your application processes into a memory cgroup and setting memory.{min,low} to protect them. Conversely, if you want your Page Cache released as soon as possible, you can consider setting memory.high to release inactive Page Cache promptly.
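A minimal illustration with cgroup v2 (the paths and the 4G value are examples, $APP_PID stands for your application's PID, and the commands require root):

```shell
# Create a cgroup for the business application (cgroup v2 assumed
# to be mounted at /sys/fs/cgroup; adjust the path for your system).
mkdir /sys/fs/cgroup/myapp

# Protect up to 4 GiB of this cgroup's memory, including its Page
# Cache: during reclaim it is reclaimed only after everything else.
echo 4G > /sys/fs/cgroup/myapp/memory.low

# Move the application's process into the protected cgroup.
echo $APP_PID > /sys/fs/cgroup/myapp/cgroup.procs
```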

We won’t discuss further details of more specific settings here, but I suggest you try setting them yourself and observe the results. This way, you will have a deeper understanding.

Class Summary #

In the previous session, we discussed the issue of high load caused by the difficulty of reclaiming the Page Cache, a very intuitive problem. In this session, we talked about the opposite problem: Page Cache being reclaimed too easily, which causes problems of its own. This type of problem is not as intuitive.

Intuitive problems are relatively easy to observe and analyze, and because they are easy to observe, they tend to receive attention. Non-intuitive problems, by contrast, are harder to observe and analyze, and are more likely to be overlooked.

Just because the external characteristics are not obvious does not mean that the impact is not serious. It is like when the skin is injured and bleeding, we know that we need to stop the bleeding immediately, and this injury can be easily controlled. If it is an internal injury, such as problems with the heart, liver, spleen, lungs, or kidneys, it is easy to be ignored. However, once these problems accumulate for a long time, they often cause serious consequences. Therefore, for these non-intuitive problems, we still need to pay attention to them and try to prevent them in advance, such as:

  • If your business is sensitive to the latency caused by the Page Cache, it is better to protect it. For example, you can use mlock or memory cgroup to protect it.
  • If you do not understand the reason why the Page Cache is released, you can observe some indicators in /proc/vmstat to find the specific release reason and then optimize accordingly.

Homework #

The homework assigned to you in this lesson is related to mlock. Please think about it. When a process calls mlock() to protect memory, and then the process exits without calling munlock(), is this memory still protected after the process exits? Why? Feel free to share your thoughts in the comments section.

Thank you for reading. If you found this lesson helpful, please feel free to share it with your friends. See you in the next lesson.