
03 Case Study How to Handle High Load Issues Caused by Difficult-to-Reclaim Page Cache #

Hello, I’m Shaoyafang. In today’s lesson, I want to talk to you about how to handle the problem of high system load caused by improper Page Cache management in a production environment.

I believe that in your day-to-day work, you have probably run into situations like these: the system becomes very slow and commands respond with extreme delay, or the RT (response time) of your application becomes very high or fluctuates severely. When these problems occur, the system load has very likely spiked as well.

So, what causes this? Based on my observations, there are mainly three situations:

  • High load caused by direct memory reclamation;
  • High load caused by excessive accumulation of dirty pages in the system;
  • High load caused by improper NUMA (Non-Uniform Memory Access) strategy configuration in the system.

These are the situations that application developers and system administrators consult me about most often. The problems may look simple, but without a deep understanding of their causes, they can be quite tricky to solve; moreover, if the configuration is not done properly, it can introduce negative side effects.

Therefore, in this lesson, we will analyze these three situations together. Once you understand them, you can solve the majority of high-load problems caused by Page Cache. If you are also interested in how to trace down the root cause of such problems, don’t worry: in Lesson 5, I will teach you the analysis methods for high-load issues.

Next, let’s analyze these situations one by one.

Direct Memory Reclamation Causing High Load or Business Latency Jitter #

Direct memory reclamation refers to the process of memory reclamation being performed synchronously in the process context. So how does it specifically cause high load?

During the process of direct memory reclamation, memory is reclaimed synchronously while the process is requesting memory. This reclamation process can consume a lot of time, causing subsequent actions of the process to wait, resulting in long delays and increased CPU utilization, ultimately leading to high load.

Let’s describe this process in detail. To avoid getting into too many technical details, I will use a diagram to make it easier for you to understand.

From the diagram, you can see that after memory reclamation starts, asynchronous background reclamation runs first (marked in blue in the diagram), which does not delay the process. If background reclamation cannot keep up with the speed at which the process allocates memory, synchronous blocking reclamation kicks in and introduces delays (marked in red and pink in the diagram; this is the part that causes high load).

So, to solve the problem of high load or business RT (response time) jitter caused by direct memory reclamation, one solution is to trigger background reclamation early to avoid direct memory reclamation by the application. How can we do that specifically?

First, let’s understand the principle of background reclamation, as shown in the following diagram:

When the amount of free memory drops below the low watermark, kswapd is woken up to perform background reclamation, and it keeps reclaiming until free memory rises back above the high watermark.

Therefore, we can increase the min_free_kbytes configuration option to trigger background reclamation earlier. This option ultimately controls the memory reclamation watermarks; however, the watermarks themselves are a fairly low-level kernel detail, so we won’t go into them here.

vm.min_free_kbytes = 4194304

For systems with 128 GB of memory or more, setting min_free_kbytes to 4 GB (4194304 KB) is reasonable. This is an empirical value we arrived at after handling many similar issues: it avoids the majority of direct memory reclamation without wasting too much memory.

The value does not strictly scale with total physical memory, and as mentioned earlier, improper configuration can have side effects. So before settling on a value, my suggestion is to increase it gradually: set it to 1 GB first, observe whether pgscand in sar -B stays at zero, and if it is still non-zero, increase it to 2 GB and observe again, and so on, to decide whether it needs to go higher.
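For reference, here is a minimal sketch of that trial-and-error loop. The 1 GB value is only illustrative, the change made with sysctl -w takes effect immediately but is not persistent (write it to /etc/sysctl.conf to keep it across reboots), and both commands need root privileges:

$ # step 1: raise the watermark to 1 GB first (value is in KB, illustrative)
$ sysctl -w vm.min_free_kbytes=1048576
$ # step 2: watch the pgscand/s column; if it stays non-zero, raise the value further
$ sar -B 1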

Here, you need to be aware that even after increasing the value significantly, pgscand may still be non-zero (this is a bit more involved, touching on memory fragmentation and contiguous memory allocation; knowing it exists is enough for now). In that case, you need to consider whether the business can tolerate it; if it can, there is no need to keep increasing the value. In other words, raising the watermark does not completely eliminate direct memory reclamation; it keeps direct reclamation within a range the business can tolerate.

This method works on kernel 3.10.0 and later (corresponding to CentOS-7 and newer operating systems).

Of course, there are some drawbacks to doing this: by raising the memory watermark, the amount of memory directly available to the application will be reduced, which, to some extent, wastes memory. So before adjusting this value, you need to first consider what the application cares more about. If it cares more about latency, you can increase the value appropriately. If it cares more about memory usage, you can reduce the value appropriately.

In addition to this, for CentOS-6 (corresponding to the 2.6.32 kernel version), there is another solution:

vm.extra_free_kbytes = 4194304

That is, setting extra_free_kbytes to 4 GB. extra_free_kbytes was dropped from the 3.10 kernel and later, but because many machines in production environments still run older kernel versions, you may be using one of them, so it is still worth mentioning here. The general principle is shown below:

The purpose of extra_free_kbytes is to solve the memory waste caused by min_free_kbytes. However, this approach has not been accepted by the mainline kernel because it is difficult to maintain and can cause some trouble. If you are interested, you can take a look at this discussion: add extra free kbytes tunable

In summary, by adjusting the memory watermark, the application’s memory allocation is guaranteed to some extent. However, it also brings some memory waste because the system needs to ensure a certain amount of free memory. This compresses the space for Page Cache. You can observe the effect of the adjustment from /proc/zoneinfo:

$ egrep "min|low|high" /proc/zoneinfo
...
        min      7019   
        low      8773
        high     10527
...

Here, min, low, and high correspond to the three memory watermarks in the diagram above; you can compare their values before and after the adjustment. Note that the watermarks are set per memory zone, so /proc/zoneinfo lists many zones, each with its own watermarks. You don’t need to worry about those details.
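Since each zone has its own watermarks, a quick way to compare before and after is to sum them up. This is just a rough sketch, and note that the watermark values in /proc/zoneinfo are in units of pages (typically 4 KB each), not KB:

$ # sum the min/low/high watermarks (in pages) across all zones
$ awk '$1=="min"{m+=$2} $1=="low"{l+=$2} $1=="high"{h+=$2} END{print "min="m, "low="l, "high="h, "(pages)"}' /proc/zoneinfo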

High load caused by excessive dirty pages in the system #

Next, let’s analyze the situation where high load is caused by excessive dirty pages in the system. As mentioned in the previous case, if there are a large number of dirty pages during direct reclamation, the reclamation process may have to write them back. This can cause significant delays, and because the writeback itself blocks, it may also increase the number of processes in the “D” (uninterruptible sleep) state, driving the load average up.

Let’s take a look at this figure, which is a typical scenario where dirty pages cause a high load value in the system:

As shown in the figure, if the system has both fast I/O devices and slow I/O devices (such as the ceph RBD device in the figure, or other slow storage devices such as HDD), encountering a page being written back to a slow I/O device during the direct memory reclamation process can lead to significant delays.

Let me add a little more here. This type of problem is actually difficult to trace. To make the jitter caused by slow I/O devices easier to trace, I submitted a patch to the Linux kernel: mm/page-writeback: introduce tracepoint for wait_on_page_writeback(). The tracepoint records the device being written back, so users can associate the writeback with a specific device and determine whether the problem is caused by a slow I/O device (I will cover the specific analysis method in Lesson 5 of the analysis section).

So how do we solve this type of problem? A relatively easy solution is to control how much dirty page data is allowed to accumulate in the system. Many people know that dirty pages need to be controlled, but they are often unclear about the right degree of control: if the limit is set too high, dirty pages can still pile up and trigger the problem; if it is set too low, overall I/O efficiency may suffer. So let’s look at how to measure this “degree” properly.

First, you can observe the number of dirty pages in the system using sar -r:

$ sar -r 1
07:30:01 PM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
09:20:01 PM   5681588   2137312     27.34         0   1807432    193016      2.47    534416   1310876         4
09:30:01 PM   5677564   2141336     27.39         0   1807500    204084      2.61    539192   1310884        20
09:40:01 PM   5679516   2139384     27.36         0   1807508    196696      2.52    536528   1310888        20
09:50:01 PM   5679548   2139352     27.36         0   1807516    196624      2.51    536152   1310892        24

kbdirty is the amount of dirty page data in the system; it corresponds to nr_dirty in /proc/vmstat. You can adjust the following settings to keep the number of dirty pages in the system within a reasonable range:

vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 20

Adjusting these configuration options has pros and cons. Increasing these values will lead to a backlog of dirty pages, but it may also reduce the number of I/O operations and improve the efficiency of flushing to disk in a single operation. Decreasing these values can reduce the backlog of dirty pages, but it may also increase the number of I/O operations and reduce the efficiency of I/O.
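For example, if the priority is to limit the dirty page backlog, a hypothetical adjustment might look like the following; the numbers are purely illustrative and should be tuned against your own workload:

$ # make writeback kick in earlier (illustrative values, requires root)
$ sysctl -w vm.dirty_background_ratio=5
$ sysctl -w vm.dirty_ratio=10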

As for how much these values should be adjusted, it varies between systems and workloads. My suggestion is to adjust while observing: bring the values to a level the business can tolerate, watch the business’s service quality (SLA) after the adjustment, and make sure it stays within an acceptable range. You can check the effect of the adjustment in /proc/vmstat:

$ grep "nr_dirty_" /proc/vmstat
nr_dirty_threshold 366998
nr_dirty_background_threshold 183275

You can observe the changes in these two items before and after the adjustment. Here, I also want to give you a tip to help you avoid a pitfall: if the settings in this solution are not chosen properly, they may trigger a kernel bug. I discovered this bug during performance tuning in 2017 and submitted a patch to the community to fix it; the specific commit is writeback: schedule periodic writeback with sysctl. The commit log describes the issue clearly, and I suggest you read it when you have time.

High load caused by improper NUMA strategy configuration #

In addition to the two scenarios I mentioned earlier that can cause high system load or business latency jitter, there is another scenario that can also cause high load, which is the improper configuration of the NUMA strategy.

For example, we encountered such a problem in the production environment: there was still about half of the free memory in the system, but direct reclaim was frequently triggered, causing severe business jitter. After investigation, it was found that this was caused by the setting of zone_reclaim_mode, which is one of the NUMA strategies.

The purpose of zone_reclaim_mode is to increase the NUMA affinity of the business, but in real production environments there are very few workloads that are particularly sensitive to NUMA. That is why the kernel changed the default value of zone_reclaim_mode from 1 to 0: mm: disable zone_reclaim_mode by default. With it set to 0, the kernel no longer reclaims the Page Cache on the current node while leaving free memory on other nodes unused; in other words, it reduces the chance of memory reclamation and thus avoids the business latency reclamation would cause.

So how can we effectively measure whether a business latency issue is caused by zone reclaim, and how large the latency it causes is? The measurement and observation method is also something I contributed to the Linux kernel: mm/vmscan: add tracepoints for node reclaim. The basic idea is to use Linux tracepoints for this quantitative analysis, which is a relatively low-overhead approach.
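As a rough sketch of how such tracepoints can be used, assuming your kernel is new enough to include that patch and that the events are named mm_vmscan_node_reclaim_begin and mm_vmscan_node_reclaim_end as in the patch (the exact names and the tracefs path may differ on your system):

$ cd /sys/kernel/debug/tracing
$ # enable the node reclaim tracepoints (assumed names, from the patch above)
$ echo 1 > events/vmscan/mm_vmscan_node_reclaim_begin/enable
$ echo 1 > events/vmscan/mm_vmscan_node_reclaim_end/enable
$ # read the trace output; the time between begin and end is the reclaim latency
$ cat trace_pipe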

We can use numactl to view the NUMA information of the server. The following is an example of a server with two nodes:

$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
node 0 size: 130950 MB
node 0 free: 108256 MB
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 131072 MB
node 1 free: 122995 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 

The local node for CPUs 0-11 and 24-35 is node 0, while the local node for CPUs 12-23 and 36-47 is node 1. The following diagram shows the configuration:


It is recommended to configure zone_reclaim_mode as 0.

vm.zone_reclaim_mode = 0

Compared with the harm caused by memory reclamation, the performance improvement brought by NUMA locality is almost negligible, so setting this option to 0 does far more good than harm.
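A minimal sketch of applying and verifying this at runtime (add the line above to /etc/sysctl.conf to make it persistent across reboots):

$ sysctl -w vm.zone_reclaim_mode=0
$ cat /proc/sys/vm/zone_reclaim_mode
0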

Alright, we have analyzed the issues of high system load and business latency jitter caused by improper Page Cache management. I hope that through this study, you will no longer be helpless when encountering high load issues caused by direct memory reclamation.

Overall, these problems all stem from Page Cache being difficult to release. You may wonder: if Page Cache were easy to release, would there still be problems? The answer may surprise you: Page Cache that is released too easily has problems of its own. In the next lesson, we will analyze cases of that kind.

Class Summary #

In this class, we discussed several cases of high load caused by memory reclaim. To make a vivid analogy, we can compare memory to an important resource in our daily life - money. If you have enough money (memory), you can buy whatever you want without worrying about running out of money (running out of memory) and being in a financial dilemma (causing high load).

However, in reality, the amount of money (memory) each of us has for shopping on Singles’ Day is limited. When buying things (running programs), we need to budget carefully; if the budget is about to be exceeded (memory is running low), we have to remove some unimportant items from the shopping cart (reclaim inactive pages) in order to free up funds (free memory) to buy the things we really want (run the programs that matter).

The cases we discussed can all be resolved by adjusting system parameters. Adjusting system parameters is a change that application developers and operations engineers can make themselves when they run into kernel issues. For example, when high load is caused by direct memory reclamation, the memory watermarks need to be adjusted; when it is caused by dirty page buildup, the dirty page thresholds need to be adjusted; and when it is caused by an improper NUMA strategy, you need to check whether that strategy should be disabled. While making these adjustments, we must keep observing the business’s quality of service to make sure the SLA remains acceptable.

If you want your system to be more stable and your business performance to be better, you may want to study the configurable options in the system and see which configurations can help your business.

Homework #

The homework for this lesson is related to direct memory reclamation. Now that you know direct memory reclamation causes problems and should be avoided as much as possible, my question is: please write or run a small program to construct a direct memory reclamation scenario (hint: you can use pgscand in sar -B to determine whether direct memory reclamation has occurred). Feel free to share your approach in the comments section.
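If you need a starting point, one possible way to create memory pressure is simply to write and then read a file larger than the free memory on your machine while watching sar -B in another terminal. This is only a hypothetical sketch; whether it actually triggers direct reclaim depends on your memory size and watermark settings:

$ # terminal 1: watch the pgscand/s column for direct reclaim activity
$ sar -B 1
$ # terminal 2: generate Page Cache pressure with a file larger than free memory
$ dd if=/dev/zero of=testfile bs=1M count=20480   # ~20 GB, adjust count to your system
$ cat testfile > /dev/null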

Thank you for reading. If you found this lesson helpful, please consider sharing it with your friends. See you in the next lecture.