01 Basics: How to Observe Page Cache with Data #

Hello, I’m Shao Yafang. Today I’d like to talk with you about Page Cache.

You should be familiar with Page Cache. If you are an application developer or a Linux administrator, you may have encountered scenarios related to Page Cache in your work. For example:

  • The server’s load spikes.
  • The server’s I/O throughput spikes.
  • Large spikes in response latency for business operations.
  • Noticeable increase in average access latency for business operations.

These issues are likely caused by improper management of the Page Cache. Mismanaged Page Cache not only inflates the system’s I/O traffic but also causes performance fluctuations in business operations. I have dealt with many such issues in production environments.

From my observation, when these issues occur, both developers and administrators often feel helpless. The reason is that their understanding of Page Cache stops at the conceptual level: they are not clear on how Page Cache relates to their applications and to the system, so naturally they are at a loss when it causes problems. To avoid stumbling into Page Cache pitfalls, you need a clear understanding of it.

In my view, the simplest way to understand Page Cache is through data. By analyzing concrete data, you will gain a deeper understanding of the essence of Page Cache. In order to help you digest and comprehend this, I will take two lessons to analyze what Page Cache is, why it is needed, and how Page Cache is generated and evicted. In this way, you will understand it thoroughly from its essence to its manifestations, and gain a better understanding of the relationship between Page Cache and your application, thus being able to better comprehend the four issues mentioned above.

However, a heads-up before we start: for today’s content, it helps to have some Linux programming basics, such as how to open, read, write, and close a file. With that foundation, today’s material will be easier to follow. If you don’t have it, that’s fine too; whenever you hit something you can’t understand, you can consult the book “Advanced Programming in the UNIX Environment”, which is essential reading for every Linux developer and administrator.

Alright, without further ado, let’s begin today’s lesson.

What is Page Cache? #

I remember that application developers and operations personnel often ask me for help with problems caused by Page Cache, and they like to ask whether Page Cache belongs to the kernel or to user space. In response to such questions, I usually ask them to look at the following image first:

From this image, you can clearly see that the red area represents the Page Cache. Clearly, Page Cache is memory managed by the kernel: it belongs to the kernel, not to user space.

So how do we observe the Page Cache? There are actually many ways to view it directly on Linux: reading /proc/meminfo or /proc/vmstat, or running the free command. They all present the same underlying data.
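For example, /proc/vmstat reports the same information, only counted in pages rather than in kB (a page is typically 4 KB; getconf PAGESIZE will tell you). To glance at the file-backed counters there, something like the following works, though the exact field names can vary slightly between kernel versions:

$ grep -E '^nr_(active_file|inactive_file|shmem)' /proc/vmstat
$ getconf PAGESIZE

Multiplying each counter by the page size should land you close to the corresponding kB values in /proc/meminfo.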

Let’s take /proc/meminfo as an example. (If you want to understand what each field in /proc/meminfo means, refer to the meminfo section of the kernel documentation, which explains every item in detail. The kernel documentation is the simplest and most direct way for an application developer to learn about the kernel.)

$ cat /proc/meminfo
...
Buffers:            1224 kB
Cached:           111472 kB
SwapCached:        36364 kB
Active:          6224232 kB
Inactive:         979432 kB
Active(anon):    6173036 kB
Inactive(anon):   927932 kB
Active(file):      51196 kB
Inactive(file):    51500 kB
...
Shmem:             10000 kB
...
SReclaimable:      43532 kB
...

Based on the data above, you can derive the following equation (both sides sum to 149060 kB; since SwapCached appears on both sides, the remaining terms on each side sum to 112696 kB):

Buffers + Cached + SwapCached = Active(file) + Inactive(file) + Shmem + SwapCached

The contents on both sides of this equation are what we usually refer to as Page Cache. And no, you haven’t misread it: SwapCached really does appear on both sides. It is kept in the equation to emphasize that it, too, is part of the Page Cache.
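If you’d like to verify this equation on your own machine, a small awk sketch over /proc/meminfo does the job:

$ awk '/^(Buffers|Cached):/ {l += $2}
       /^SwapCached:/ {s = $2}
       /^(Active\(file\)|Inactive\(file\)|Shmem):/ {r += $2}
       END {print "left:", l + s, "kB   right:", r + s, "kB"}' /proc/meminfo

The two numbers should be equal, or very close; as we will see shortly, these counters move while the system runs.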

Next, let’s analyze what these items mean. We will work from the right side of the equation, because Buffers is tied to kernel implementation details and its meaning can vary across kernel versions, whereas the items on the right, Active(file), Inactive(file), and Shmem, relate much more directly to your application.

Within the Page Cache, Active(file) + Inactive(file) are the file-backed pages, that is, memory pages backed by files, and this is the part you should pay the most attention to. The memory consumed by mmap() file mappings and by buffered I/O, both of which you use every day, belongs here, and it is also the part most likely to cause problems in real production environments. We will analyze it closely in the case studies in upcoming lessons.

SwapCached, on the other hand, comes into play once a Swap partition is enabled: it counts anonymous pages (from Active(anon) + Inactive(anon)) that were swapped out to disk and later swapped back in, while their copy in the swap area is still valid. Because that on-disk copy still exists, such a page can be reclaimed again without being written out a second time, which reduces I/O. In this sense the swap area acts like a backing file, so SwapCached can also be regarded as file-backed and therefore part of the Page Cache. Does this process sound complicated? Let’s look at a simple diagram to understand how SwapCached is generated:

I hope this simple diagram helps you understand how SwapCached is generated. During this process, please note that SwapCached only exists when the Swap partition is enabled. I recommend disabling the Swap partition in a production environment because the I/O generated by the Swap process can easily cause performance fluctuations.
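If you want to check whether swap is active on a machine, and disable it in line with this recommendation, util-linux provides the standard commands (disabling swap forces swapped-out pages back into RAM, so be careful on a heavily loaded box):

$ swapon --show    # list active swap areas; empty output means swap is off
$ sudo swapoff -a  # disable all swap areas (until the next reboot)

To make the change permanent, also remove or comment out the swap entries in /etc/fstab.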

In addition to SwapCached, the Shmem in Page Cache refers to memory allocated through anonymous shared mapping (the “shared” item in the “free” command), such as tmpfs (a temporary file system). The issues related to this part are relatively rare in real production environments, so we won’t focus on them in today’s lesson. You just need to know that it exists.
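To watch Shmem move with your own eyes, you can write into a tmpfs mount and check the counter before and after. A minimal sketch, where /mnt/tmp is just an example mount point:

$ sudo mount -t tmpfs -o size=256M tmpfs /mnt/tmp
$ grep ^Shmem: /proc/meminfo                           # note the value
$ dd if=/dev/zero of=/mnt/tmp/testfile bs=1M count=100
$ grep ^Shmem: /proc/meminfo                           # roughly 102400 kB higher
$ sudo umount /mnt/tmp                                 # clean up; frees the memory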

Of course, many readers also like to use the “free” command to check how much Page Cache the system has, based on the “buff/cache” column. If you are familiar with “free”, you know that it, too, produces its statistics by parsing /proc/meminfo, as you can see in its source. The “free” command is open source; take a look at the “free.c” file in the procps-ng repository. Reading the source is the most direct way to understand the “free” command, and it will deepen your understanding of it.

However, have you ever been curious about what “buff/cache” in the “free” command actually means? Let’s take a look at it here:

$ free -k
              total        used        free      shared  buff/cache   available
Mem:        7926580     7277960      492392       10000      156228      430680
Swap:       8224764      380748     7844016

By examining the procps-ng source code, specifically the file proc/sysinfo.c, you can find that “buff/cache” is the sum of the following items:

buff/cache = Buffers + Cached + SReclaimable

Using the earlier /proc/meminfo data, we can verify this formula: 1224 + 111472 + 43532 = 156228, which matches the buff/cache column above.
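You can also check it live with a one-liner that sums the three fields from /proc/meminfo and sets the result against free’s buff/cache column:

$ awk '/^(Buffers|Cached|SReclaimable):/ {sum += $2} END {print sum, "kB"}' /proc/meminfo
$ free -k | awk '/^Mem:/ {print $6, "kB"}'   # the buff/cache column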

Also note that these values change constantly, and running the command itself consumes some memory, so the two sides may not match exactly; that does not cast doubt on the formula’s correctness.

From this formula, you can see that “buff/cache” in the “free” command is composed of “Buffers,” “Cached,” and “SReclaimable.” It emphasizes the reusability of memory, meaning that memory that can be reclaimed is included in this item.

Among them, “SReclaimable” refers to reclaimable kernel memory, including the dentry and inode caches. This is fairly deep kernel territory and can be hard for application developers and administrators to follow, so we won’t go into it here.

Once you understand the specific components of Page Cache, you will know what to observe when it causes problems. For example, if the application itself consumes little memory (RSS), but the overall system memory usage is still high, you may want to investigate whether Shmem (shared memory) has consumed too much memory.
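One rough way to run that kind of check: total up the RSS of all processes, set it against the system-wide numbers, and look at Shmem directly. Keep in mind that RSS double-counts pages shared between processes, so treat the sum as a ballpark figure only:

$ ps -eo rss= | awk '{sum += $1} END {print "total RSS:", sum, "kB"}'
$ grep -E '^(MemTotal|MemAvailable|Shmem):' /proc/meminfo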

By now, I think you should have a more intuitive understanding of Page Cache, right? Of course, some people may say, “The kernel’s Page Cache is so complicated, can’t we just get rid of it?”

I believe there are many people who think this way. If you don’t want to use the kernel-managed Page Cache, there are two approaches to consider:

  • The first approach is for the application to maintain its own cache for more fine-grained control. For example, MySQL does this, and you can refer to MySQL Buffer Pool. However, implementing your own cache can be quite complex and costly for most applications, so the kernel’s Page Cache is usually simpler and more efficient.
  • The second approach is to use Direct I/O to bypass the Page Cache entirely (a short demonstration follows this list). Is that feasible? Let’s continue with the data to see what issues this approach runs into.
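For reference, the simplest way to experiment with Direct I/O from the shell is dd’s iflag=direct and oflag=direct options, which open the file with O_DIRECT so that reads and writes bypass the Page Cache (O_DIRECT has alignment requirements, which dd meets here because bs is a multiple of the sector size). The path below is just a placeholder:

$ dd if=/path/to/datafile of=/dev/null bs=4096 iflag=direct              # read, bypassing the cache
$ dd if=/dev/zero of=/path/to/datafile bs=4096 count=1024 oflag=direct   # write, bypassing the cache

If you repeat the timing experiment in the next section with iflag=direct added, the second read is just as slow as the first, because every read goes to the disk.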

Why do we need Page Cache? #

From the first image, you can see that both standard I/O and memory mapping first write the data to the Page Cache. This reduces I/O operations and improves read and write efficiency. Let’s look at a specific example. First, let’s generate a new file of size 1GB and clear the Page Cache to ensure that the file content is not in memory. We compare the difference in time between the first read and the second read of the file. The specific process is as follows.

First, generate a 1GB file:

dd if=/dev/zero of=/home/yafang/test/dd.out bs=4096 count=$((1024*256))

Then, clear the Page Cache. First, execute the sync command to synchronize dirty pages (I will explain what dirty pages are in the next lesson) to the disk. Then, drop the cache.

sync && echo 3 > /proc/sys/vm/drop_caches

The time for the first file read is as follows:

time cat /home/yafang/test/dd.out &> /dev/null
real	0m5.733s
user	0m0.003s
sys	0m0.213s

The time for the second file read is as follows:

time cat /home/yafang/test/dd.out &> /dev/null 
real	0m0.132s
user	0m0.001s
sys	0m0.130s

Through this detailed process, you can see that the time for the second file read is much shorter than the time for the first read. This is because the first read is from the disk, which is time-consuming, while the second read directly accesses the file content from memory, which is much faster than reading from the disk. This is the purpose of the Page Cache: to reduce I/O and improve the I/O speed of applications.
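You can also watch the Page Cache being populated by checking the Cached counter around the first read; it should grow by roughly the file size (1 GB is 1048576 kB):

$ sync && echo 3 > /proc/sys/vm/drop_caches
$ grep ^Cached: /proc/meminfo                # before the read
$ cat /home/yafang/test/dd.out > /dev/null
$ grep ^Cached: /proc/meminfo                # roughly 1048576 kB higher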

Therefore, if you don’t want to increase the complexity of your application by managing memory in great detail, it is better to rely on the Page Cache managed by the kernel. It is a solution with a relatively high return on investment (ROI).

As you know, it is hard to find a perfect solution when making design decisions. In most cases, we weigh the trade-offs and choose a suitable one, because I have always believed that the most suitable solution is the best one.

The reason why I say that Page Cache is suitable, rather than saying it is the best, is because Page Cache also has its shortcomings. These shortcomings mainly manifest in the fact that Page Cache is too transparent to the application, making it difficult for the application to control it effectively.

Why do I say this? To find the answer, you need to understand the process of Page Cache generation. I will discuss this with you in the next lesson.

Class Summary #

In this class, we mainly talked about how to understand Page Cache well. In my opinion, the intuitive way to understand it is to start with the data. So, I started by showing you how to observe Page Cache to help you understand what Page Cache is. Then, I discussed why it is prone to problems and reviewed its significance. I hope that by doing this, I have helped you clarify a few key points:

  1. Page Cache belongs to the kernel, not to user space.
  2. Page Cache is a high ROI solution for improving I/O efficiency for applications, so its existence is necessary.

In my opinion, the most important thing for managing Page Cache well is to know how to observe it and its behaviors. With this data as support, you can better integrate it with your business. And, in my opinion, when you are unclear or confused about a concept, the best way to understand it is to first understand how to observe it, and then try observing it yourself to see how it changes. As the saying goes, “Reading ten thousand books is not as good as traveling ten thousand miles.”

This concludes today’s class. In the next class, we will use data to observe the generation and release of Page Cache. By doing so, you will be able to understand the entire lifecycle of Page Cache and have a rough judgment on some of the problems it can cause.

Homework #

Finally, I leave you with an exercise: write a program that populates the Page Cache, and observe how the values in /proc/meminfo and /proc/vmstat change as it runs. Feel free to share your findings in the comments.
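If you want a minimal starting point before writing a real program, the same pattern works from the shell: snapshot the counters, generate some Page Cache with a buffered write, and snapshot again (the file path is just an example):

$ grep -E '^(Cached|Dirty):' /proc/meminfo && grep -w nr_dirty /proc/vmstat
$ dd if=/dev/zero of=/tmp/pagecache.test bs=1M count=128   # buffered write goes through the Page Cache
$ grep -E '^(Cached|Dirty):' /proc/meminfo && grep -w nr_dirty /proc/vmstat

You should see Cached and Dirty rise right after the write, and Dirty fall back once the kernel writes the pages out to disk.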

Thank you for reading. If you find this lesson helpful, please consider sharing it with your friends. See you in the next lecture.