
02 Basics How Page Cache is Generated and Released #

Hello, I am Shaoyafang.

In the previous lesson, we mainly discussed “What is Page Cache” (What) and “Why do we need Page Cache” (Why). In this lesson, we will continue to understand the “How”: that is, how Page Cache is generated and released.

In my opinion, only after understanding the “What-Why-How” of Page Cache, will you have a clearer understanding of the problems it can cause, such as high load caused by Page Cache or application RT jitter, and take preventive measures in advance.

In fact, how Page Cache is generated and released can be explained in simple terms as its “birth” (allocation) and “death” (release), that is, the lifecycle of Page Cache. So next, let’s take a look at how it is “born”.

How is Page Cache “born”? #

There are two different ways in which Page Cache is generated:

  • Buffered I/O (standard I/O);
  • Memory-Mapped I/O.

So how does each of these methods generate Page Cache? Let’s take a look at the following diagram:

From the diagram, you can see that although both methods generate Page Cache, there are still some differences:

With standard I/O, a write (write(2)) first puts the data into a user buffer (the userspace page in memory), and the kernel then copies it from there into the kernel buffer (the pagecache page in memory). A read (read(2)) goes the other way: the data is first copied from the kernel buffer into the user buffer, and the application then reads it from the user buffer. In other words, the user buffer has no direct mapping to the file contents.

On the other hand, for memory-mapped I/O, the pagecache page is directly mapped to the user address space, and the user can read and write the contents of the pagecache page.

Obviously, memory-mapped I/O is more efficient than standard I/O because it eliminates the process of copying data between user space and kernel space. This is the main reason why many developers find that memory-mapped I/O performs better than standard I/O.

Let’s demonstrate how Page Cache is “born” with a specific example, using standard I/O as an example since it is the most commonly used method. The following is a simple script:

#!/bin/bash

# File to parse
MEM_FILE="/proc/meminfo"

# New file to be created by this script
NEW_FILE="/home/yafang/dd.write.out"

# Page cache items to parse
active=0
inactive=0
pagecache=0

IFS=' '

# Read the file page cache size from /proc/meminfo
function get_filecache_size()
{
        items=0
        while read line
        do
                # Anchor the patterns with ^: an unanchored "Active:" would
                # also match the "Inactive:" line and clobber $active.
                if [[ "$line" =~ ^Active: ]]; then
                        read -ra ADDR <<<"$line"
                        active=${ADDR[1]}
                        let "items=$items+1"
                elif [[ "$line" =~ ^Inactive: ]]; then
                        read -ra ADDR <<<"$line"
                        inactive=${ADDR[1]}
                        let "items=$items+1"
                fi

                if [ $items -eq 2 ]; then
                        break
                fi
        done < $MEM_FILE
}

# Read the initial file page cache size
get_filecache_size
let filecache="$active + $inactive"

# Write a new file with a size of 1048576 KB
dd if=/dev/zero of=$NEW_FILE bs=1024 count=1048576 &> /dev/null

# Read the file page cache size again after writing the file
get_filecache_size

# The difference between the two sizes is approximately the file's corresponding file page cache
# The approximation is because there may be other page cache generated during the running process
let size_increased="$active + $inactive - $filecache"

# Output the result
echo "File size 1048576KB, File Cache increased ${size_increased}KB"

I want to remind you that before running this script, you should make sure your system has enough free memory; otherwise the write will trigger memory reclaim and skew the result. The final test result is as follows:

File size 1048576KB, File Cache increased 1048648KB

With this script, you can see that the “Active(file)” and “Inactive(file)” items in /proc/meminfo grow as the file is written, and that the increase matches the file size (the small difference comes from other programs running on the system at the same time). If you look carefully, you will also notice that the increase lands in the “Inactive(file)” item. Why is that? That will be our thinking question for this lesson.
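You can reproduce this observation with a couple of commands. The sketch below is a minimal demo under stated assumptions: the file name is made up, and it must live on a disk-backed filesystem (on tmpfs the pages would land on the anon LRU instead of the file LRU):

```shell
#!/bin/bash
# Print the two LRU lists that hold file-backed Page Cache.
show_filecache() {
        grep -E "^(Active|Inactive)\(file\):" /proc/meminfo
}

echo "--- before ---"
show_filecache

# Create and read a 100 MB file (hypothetical name, disk-backed fs).
TEST_FILE="./pagecache.demo"
dd if=/dev/zero of=$TEST_FILE bs=1M count=100 2> /dev/null
cat $TEST_FILE > /dev/null

echo "--- after: the growth shows up under Inactive(file) ---"
show_filecache

rm -f $TEST_FILE
```

Comparing the two snapshots, you should see Inactive(file) grow by roughly the file size while Active(file) barely moves.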

Of course, this process may seem simple, but it involves many kernel mechanisms. In other words, there are many potential issues that can arise. Let’s use a diagram to briefly describe this process:

This process can be roughly described as follows: data is initially written to the user buffer (which is the userspace page), then the data is copied from the buffer to the kernel buffer (which is the pagecache page). If the page does not exist in the kernel buffer, a Page Fault occurs and the kernel allocates a new page. After the copy is complete, the pagecache page becomes a Dirty Page. The content of the Dirty Page is then synchronized to the disk, and after synchronization, the pagecache page becomes a Clean Page and continues to exist in the system.
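You can watch that Dirty-to-Clean transition happen through /proc/vmstat. This is a minimal sketch, with a made-up temp file name:

```shell
#!/bin/bash
# A buffered write first dirties pagecache pages; sync(1) then writes
# them back, after which the same pages stay cached as Clean Pages.
dirty_pages() {
        awk '/^nr_dirty /{print $2}' /proc/vmstat
}

echo "nr_dirty before the write: $(dirty_pages)"

TEST_FILE="./dirty.demo"
dd if=/dev/zero of=$TEST_FILE bs=1M count=10 2> /dev/null
echo "nr_dirty right after the write: $(dirty_pages)"

sync    # force writeback: Dirty Pages become Clean Pages
echo "nr_dirty after sync: $(dirty_pages)"

rm -f $TEST_FILE
```

Typically nr_dirty jumps right after the dd and drops back after sync; the exact numbers depend on what else is writing at the time.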

I suggest you understand the “birth” of Page Cache as the allocation of a page and consider the Dirty Page as the infantile stage of Page Cache (the most susceptible to illness). On the other hand, consider the Clean Page as the adult stage of Page Cache (where it is less likely to become ill).

But please note that not everyone has a childhood. For example, Sun Wukong (the Monkey King) was born as an adult. The same applies to Page Cache. If it is generated by reading a file, its content is consistent with the disk content, so it starts as a Clean Page, unless the content is modified, in which case it becomes a Dirty Page (rejuvenated).
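You can catch such a read-born Clean Page in the act: evict a file's pages, read the file back, and watch “Cached” grow while nr_dirty stays put. The sketch below assumes GNU dd (coreutils 8.11+, whose nocache flag issues posix_fadvise(DONTNEED), so no root is needed); the file name is made up:

```shell
#!/bin/bash
# Pages created by a read start life Clean: their content already
# matches the disk, so reading only grows Cached, not Dirty.
cached_kb() {
        awk '/^Cached:/{print $2}' /proc/meminfo
}

TEST_FILE="./clean.demo"
dd if=/dev/zero of=$TEST_FILE bs=1M count=50 2> /dev/null
sync

# Ask the kernel to drop this file's cached pages
# (posix_fadvise(DONTNEED) via GNU dd's nocache flag).
dd of=$TEST_FILE oflag=nocache conv=notrunc,fdatasync count=0 2> /dev/null

before=$(cached_kb)
cat $TEST_FILE > /dev/null      # the read-side "birth" of Page Cache
after=$(cached_kb)

echo "Cached grew by about $((after - before)) KB after reading a 50 MB file"
rm -f $TEST_FILE
```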

Just like how we take care of infants to ensure healthy growth, we also need some means to observe Page Cache during its infantile stage to detect or prevent any issues. For example:

$ cat /proc/vmstat | egrep "dirty|writeback"
nr_dirty 40
nr_writeback 2

As shown above, “nr_dirty” is the number of dirty pages in the system, and “nr_writeback” is the number of dirty pages currently being written back to disk. Both values are counted in pages (typically 4 KB each).
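Note the unit difference from /proc/meminfo, whose Dirty and Writeback fields are reported in kB. A quick cross-check between the two files:

```shell
#!/bin/bash
# /proc/vmstat counts pages; /proc/meminfo reports kB. With the usual
# 4 KB pages, nr_dirty * 4 should closely match meminfo's Dirty field.
pages=$(awk '/^nr_dirty /{print $2}' /proc/vmstat)
kb=$(awk '/^Dirty:/{print $2}' /proc/meminfo)
page_kb=$(( $(getconf PAGESIZE) / 1024 ))

echo "nr_dirty = $pages pages = $((pages * page_kb)) KB (meminfo Dirty: $kb KB)"
```

The two snapshots are taken a moment apart, so a small discrepancy is normal on a busy system.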

Under normal circumstances, it is not a problem for dirty pages (infants) to gather together. But just as a crowd of infants becomes dangerous during an epidemic, an excessive accumulation of dirty pages can cause trouble in certain situations. We will discuss which situations those are, and what kinds of problems arise, in the specific case studies.
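How many dirty pages the kernel tolerates is governed by a handful of sysctl knobs. Reading them is harmless and needs no root; the values you see are simply whatever your system has configured:

```shell
#!/bin/bash
# Writeback tunables, read straight from /proc/sys:
#   dirty_background_ratio    - % of reclaimable memory at which background
#                               writeback kicks in
#   dirty_ratio               - % at which the writing process is throttled
#   dirty_expire_centisecs    - age after which a dirty page must be flushed
#   dirty_writeback_centisecs - how often the flusher threads wake up
for knob in dirty_background_ratio dirty_ratio \
            dirty_expire_centisecs dirty_writeback_centisecs
do
        echo "vm.$knob = $(cat /proc/sys/vm/$knob)"
done
```

We will come back to how to tune these in the case-study lessons.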

Now that we understand the “birth” of Page Cache, let’s take a look at how Page Cache is released, or in other words, how it “dies”.

How does Page Cache “die”? #

You can think of the recycling behavior of Page Cache (Page Reclaim) as the “natural death” of Page Cache.

To put it simply: as a server runs for a long time, the amount of free memory in the system becomes less and less, and when you check with the free command, most of the memory shows up as either used or buff/cache. Take, for example, the memory usage of the production server below:

$ free -g
       total  used  free  shared  buff/cache available
Mem:     125    41     6       0          79        82
Swap:      0     0     0

The memory in the buff/cache field of the free command is the “alive” Page Cache. When will they “die” (be reclaimed)? Let’s take a look at an image:

As you can see, when an application requests memory, even if there is no free memory, as long as there is enough recyclable Page Cache, memory can be allocated by reclaiming Page Cache. There are mainly two methods of reclamation: direct reclamation and background reclamation.
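The boundary between the two is set by per-zone watermarks, which scale with vm.min_free_kbytes: kswapd (background reclaim) wakes up when free memory falls below the low watermark, and allocations fall into direct reclaim as free memory approaches the min watermark. You can inspect both without root; the exact layout of /proc/zoneinfo varies a little across kernel versions:

```shell
#!/bin/bash
# min_free_kbytes sizes the zone watermarks that trigger reclaim.
echo "vm.min_free_kbytes = $(cat /proc/sys/vm/min_free_kbytes)"

# Per-zone min/low/high watermarks (in pages); kswapd starts below
# "low", direct reclaim kicks in around "min".
grep -E "^Node| min | low | high " /proc/zoneinfo | head -n 12
```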

So how are they specifically reclaimed? How can you observe it? In my opinion, the simplest and most convenient way to observe direct reclamation and background reclamation of Page Cache is to use sar:

$ sar -B 1
02:14:01 PM  pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
02:14:01 PM      0.14    841.53 106745.40      0.00  41936.13      0.00      0.00      0.00      0.00
02:15:01 PM      5.84    840.97  86713.56      0.00  43612.15    717.81      0.00    717.66     99.98
02:16:01 PM     95.02    816.53 100707.84      0.13  46525.81   3557.90      0.00   3556.14     99.95
02:17:01 PM     10.56    901.38 122726.31      0.27  54936.13   8791.40      0.00   8790.17     99.99
02:18:01 PM    108.14    306.69  96519.75      1.15  67410.50  14315.98     31.48  14319.38     99.80
02:19:01 PM      5.97    489.67  88026.03      0.18  48526.07   1061.53      0.00   1061.42     99.99

With the help of these metrics, you can observe the memory reclamation behavior more clearly. The specific meanings of these metrics are as follows:

  • pgscank/s: The number of pages scanned per second by kswapd, the background reclaim thread.
  • pgscand/s: The number of pages scanned directly per second by the allocating process itself (direct reclaim).
  • pgsteal/s: The number of scanned pages actually reclaimed per second.
  • %vmeff: pgsteal/(pgscank+pgscand), the reclaim efficiency. The closer it is to 100, the healthier the system; the closer to 0, the greater the memory pressure.

These metrics are also derived from parsing the data in /proc/vmstat, and the corresponding relationships are as follows:
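You can also list the raw counters behind sar's columns yourself. Exact counter names vary slightly across kernel versions, but broadly pgscan_kswapd* feeds pgscank/s, pgscan_direct* feeds pgscand/s, and the pgsteal_* counters feed pgsteal/s:

```shell
#!/bin/bash
# Cumulative reclaim counters since boot; sar -B prints their per-second
# deltas. pgscan_* are pages scanned, pgsteal_* are pages reclaimed.
grep -E "^(pgscan|pgsteal)" /proc/vmstat
```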

Speaking of these metrics, let me tell you a little story. Even the Linux kernel itself can be poorly designed in places, and that can cause you trouble. So if your application behaves differently from what you expect, the cause may also lie in the kernel's design; it is reasonable to keep a healthy suspicion of the kernel. Below is a case I encountered recently.

If you are familiar with Linus, you know that his first principle in Linux kernel development is “never break user space.” Yet many kernel developers ignore the impact of new features on applications when designing them. For example, some time ago a kernel developer from Google submitted a patch to change the meaning of these memory reclamation metrics, and in the end another kernel developer (from Facebook) and I vetoed the change. The details are not the focus of this lesson, so I won't go into them here; I suggest you read the discussion yourself: [PATCH] mm: vmscan: consistent update to pgsteal and pgscan. You can see how kernel developers think when designing kernel features, which will give you a more complete picture of the whole system and help your application integrate into it better.

Class Summary #

That’s all for today’s class. In this class, we mainly talked about how Page Cache is “born” and how it “dies”. I want to emphasize a few key points:

  • Page Cache is generated during the process of reading and writing files in an application, so you need to pay attention to whether there is enough memory to allocate Page Cache before reading and writing files.
  • Dirty pages in the Page Cache can easily cause problems, so you need to pay special attention to this area.
  • When the system’s available memory is insufficient, the Page Cache will be reclaimed to free up memory. I suggest you observe this behavior through sar or /proc/vmstat in order to better determine whether the issue is related to reclamation.

In general, the lifecycle of the Page Cache is relatively transparent to applications, as its allocation and reclamation are managed by the operating system. It is precisely because of this “transparent” characteristic that applications find it difficult to control the Page Cache, and it is easy for the Page Cache to cause many problems. In the upcoming case study section, we will see what kind of problems can be caused and what your correct analytical approach should be.

Homework after class #

Because everyone has different areas of focus, your understanding of the problem will also vary. If you are an application developer, you will care more about the performance and stability of the application; if you are an operations engineer, you will care more about the stability of the system; if you are a beginner in kernel development, you will be interested in how the kernel implements these mechanisms.

So I have prepared homework questions of different difficulty levels, all centered on the relationship between the “Inactive” and “Active” Page Cache:

  • If you are an application developer, I would like to ask you why the page cache is “Inactive” when you read or write a file for the first time. How do you make it become “Active”? Under what circumstances will an “Active” page cache become “Inactive” again? Understanding this question will give you a deeper understanding of application performance tuning.

  • If you are an operations engineer, I suggest you consider which control knobs in the system can affect the size of the “Inactive” and “Active” page caches, or the ratio between the two.

  • If you are a beginner in kernel development, here is a question for you: a newly generated anonymous page is initially placed on the “Active” list, while a newly generated file page is initially placed on the “Inactive” list. Why is this the case, and is it reasonable? Feel free to share your thoughts in the comments section.

Thank you for reading. If you think the content of this lesson is helpful, please feel free to share it with your friends. See you in the next lecture.