19 Case Study Why System's Swap Increased Part One

Hello, I’m Ni Pengfei.

In the previous section, I used a Fibonacci sequence as an example to walk you through memory leak analysis. If your program directly or indirectly allocates dynamic memory, you must remember to release it; otherwise it will leak memory and, in severe cases, may even exhaust the system’s memory.

However, when there is a memory leak or when a large memory-intensive application is running, how does the system respond?

In the Memory Basics section, we learned that this can result in two possible outcomes: memory reclamation and OOM process killing.

Let’s first look at the latter outcome, OOM (Out Of Memory) caused by memory shortage. This is relatively easy to understand and refers to the system killing processes that consume a large amount of memory in order to free up that memory and allocate it to other processes that need it more.

We have discussed this in detail before, so we won’t repeat it here.

Next, let’s look at the first possible outcome, memory reclamation, which refers to the system releasing reclaimable memory. For example, caches and buffers that I mentioned earlier are considered reclaimable memory in memory management. They are usually referred to as file-backed pages.

Most file-backed pages can be directly reclaimed and can be reloaded from disk when needed. However, those pages that have been modified by the application and have not yet been written to disk (i.e., dirty pages) must first be written to disk before memory can be released.

These dirty pages can be written to disk in two ways:

Application programmers can use the system call fsync in their programs to synchronize dirty pages to disk.
They can also be left to the kernel, whose flusher threads (pdflush in older kernels) are responsible for writing these dirty pages back to disk.
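If you want to see how much dirty data is currently waiting for either path, the Dirty and Writeback fields of /proc/meminfo are a convenient place to look. Here is a minimal sketch; the function name dirty_summary is only an illustrative helper, not a standard tool:

```shell
#!/bin/sh
# Summarize dirty data that has not yet reached the disk.
# Reads /proc/meminfo-formatted text from stdin.
# dirty_summary is a hypothetical helper name.
dirty_summary() {
    awk '$1 == "Dirty:"     { dirty = $2 }
         $1 == "Writeback:" { wb = $2 }
         END { printf "dirty: %d kB, under writeback: %d kB\n", dirty, wb }'
}

# On a Linux system, feed it the live values.
if [ -r /proc/meminfo ]; then
    dirty_summary < /proc/meminfo
fi
```

Running it before and after a large write should show the Dirty value rise and then fall as the kernel flushes the pages.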

In addition to caches and buffers, pages obtained through memory mapping are also a common type of file-backed pages. They can also be released and reloaded from the file when accessed again.

Apart from file-backed pages, are there any other memory areas that can be reclaimed? For example, can the heap memory dynamically allocated by the application program, also known as anonymous pages in memory management, be reclaimed as well?

You might say, these pages are likely to be accessed again, so they cannot be directly reclaimed. That’s correct; these memory areas cannot be released directly.

However, if these memory areas are rarely accessed after allocation, it seems like a waste of resources. Can we temporarily store them on the disk and free up memory for other processes that need it more?

This is exactly what Linux’s Swap mechanism does. Swap writes these infrequently accessed memory areas to disk and then releases the memory for other processes that need it more. When these memory areas are accessed again, they can be reloaded from the disk into memory.

In the previous case studies, we have already learned about the principles and analysis of caches and OOM. But how does Swap work? Because there is a lot to cover, I will spend the next two lessons on how Swap works and how to analyze a rise in Swap usage.

Today, let’s take a look at how Swap works.

Principle of Swap #

As mentioned earlier, Swap essentially treats a block of disk space or a local file (in this explanation, we will use disk as the example) as if it were memory. It involves two operations: swapping out and swapping in.

Swapping out means writing memory data that a process is not currently using to disk and freeing the memory that data occupied.
Swapping in, on the other hand, means reading that data from disk back into memory when the process accesses it again.
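One way to observe both operations is through the pswpin and pswpout counters in /proc/vmstat, which count the pages swapped in and out since boot. The sketch below is only an illustration, and swap_counters is a helper name I made up for it:

```shell
#!/bin/sh
# Print the cumulative swap-in and swap-out page counters.
# Reads /proc/vmstat-formatted text ("name value" per line) from stdin.
# swap_counters is a hypothetical helper name.
swap_counters() {
    awk '$1 == "pswpin"  { in_pages  = $2 }
         $1 == "pswpout" { out_pages = $2 }
         END { printf "swapped in: %d pages, swapped out: %d pages\n", in_pages, out_pages }'
}

# On a Linux system, feed it the live counters.
if [ -r /proc/vmstat ]; then
    swap_counters < /proc/vmstat
fi
```

Because the counters are cumulative, it is the difference between two readings that tells you whether swapping is happening right now.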

So you see, Swap actually increases the available memory of the system. This way, even if the server’s memory is insufficient, it can still run applications that require a large amount of memory.

I still remember that when I first learned the Linux operating system, memory was very expensive, and an ordinary student could not afford much of it. Back then, I relied on Swap to run the Linux desktop. Of course, memory is much cheaper now, and servers are generally equipped with plenty of it. Does that mean Swap is no longer useful?

Of course not. In fact, no matter how large the memory is, there are still times when it is not enough for applications.

A very typical scenario is that even when memory is insufficient, some applications do not want to be killed by the OOM killer; instead, they would rather wait for a while, either for human intervention or for the system to automatically release memory from other processes and allocate it to them.

In addition, the hibernation and fast boot functions of our common laptops are also based on Swap. When the system hibernates, the contents of memory are saved to disk, so when the system boots again, they can be loaded straight back from disk. This skips the initialization of many applications and speeds up the boot process.

Now, coming back to Swap being used to reclaim memory, when does Linux actually need to reclaim memory? We have been talking about memory resource shortage, so how do we measure whether the memory is tight or not?

One scenario that is easy to think of is when there is a new request for a large block of memory allocation, but the remaining memory is insufficient. In this case, the system needs to reclaim a portion of the memory (such as the cache mentioned earlier) to satisfy the new memory request as much as possible. This process is usually referred to as direct memory reclaim.

In addition to direct memory reclaim, there is also a dedicated kernel thread for periodic memory reclamation, which is called kswapd0. To measure the memory usage, kswapd0 defines three memory thresholds (watermarks), namely the minimum threshold (pages_min), the low threshold (pages_low), and the high threshold (pages_high). The remaining memory is represented by pages_free.

Here, I have drawn a diagram to represent their relationship.

kswapd0 scans the usage of memory periodically and performs memory reclamation operations based on where the remaining memory falls within these three thresholds.

If the remaining memory is less than the minimum threshold, it means that the available memory for processes is exhausted, and only the kernel can allocate memory.
If the remaining memory falls between the minimum threshold and the low threshold, it indicates that there is significant memory pressure and the remaining memory is not much. In this case, kswapd0 will perform memory reclamation until the remaining memory exceeds the high threshold.
If the remaining memory falls between the low threshold and the high threshold, it implies that there is some memory pressure, but new memory requests can still be satisfied.
If the remaining memory is greater than the high threshold, it means that there is plenty of remaining memory and there is no memory pressure.
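The four cases above can be written down as a small decision function. This is just a sketch of the rules as described here, not kernel code; memory_pressure is a hypothetical helper, and the sample numbers are the page counts from the /proc/zoneinfo output shown in this article:

```shell
#!/bin/sh
# Classify memory pressure from the free page count and the three thresholds.
# Usage: memory_pressure FREE MIN LOW HIGH   (all values in pages)
# memory_pressure is a hypothetical helper, not a kernel interface.
memory_pressure() {
    free=$1 min=$2 low=$3 high=$4
    if [ "$free" -lt "$min" ]; then
        echo "exhausted: only the kernel can allocate"
    elif [ "$free" -lt "$low" ]; then
        echo "high pressure: kswapd0 reclaims until free exceeds high"
    elif [ "$free" -lt "$high" ]; then
        echo "some pressure: new requests can still be satisfied"
    else
        echo "no pressure"
    fi
}

memory_pressure 227894 14896 18620 22344
```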

As we can see, once the remaining memory is below the low threshold, memory reclamation will be triggered. This low threshold can actually be indirectly set through the kernel option /proc/sys/vm/min_free_kbytes. min_free_kbytes sets the minimum threshold, and the other two thresholds are calculated based on the minimum threshold as follows:

pages_low = pages_min * 5/4
pages_high = pages_min * 3/2
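As a quick check of these two formulas, here is a tiny sketch that derives the low and high thresholds from a given pages_min. Keep in mind that min_free_kbytes is expressed in kilobytes for the whole system, while the thresholds in /proc/zoneinfo are page counts per zone, so the input here is a per-zone page count. The function name watermarks is my own:

```shell
#!/bin/sh
# Derive pages_low and pages_high from pages_min using the formulas above.
# Usage: watermarks PAGES_MIN   -> prints "min low high" in pages
# watermarks is a hypothetical helper name.
watermarks() {
    min=$1
    echo "$min $((min * 5 / 4)) $((min * 3 / 2))"
}

watermarks 14896
```

With pages_min set to 14896, the result is 18620 and 22344, which matches the low and high values in the /proc/zoneinfo sample in this article.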

NUMA and Swap #

In many cases, you may notice that your Swap is increasing, but when analyzing the memory usage of the system, you may find that there is still a lot of remaining memory. Why does Swap occur even when there is plenty of remaining memory?

As you can probably guess from the title above, this is caused by the NUMA (Non-Uniform Memory Access) architecture of the processor.

I briefly mentioned NUMA in the CPU module. In a NUMA architecture, multiple processors are divided into different Nodes, each with its own local memory space.

And the memory space within the same Node can be further divided into different memory domains (Zones), such as the Direct Memory Access Zone (DMA), the Normal Memory Zone (NORMAL), and the Movable Memory Zone (MOVABLE), as shown in the figure below:

Don’t worry too much about the specific meanings of these memory domains. Knowing how to check the threshold configuration, as well as the actual usage of caches and anonymous pages, is enough.

Since each Node in the NUMA architecture has its own local memory space, when analyzing memory usage, we should also analyze each Node separately.

You can use the numactl command to view the distribution of processors in each Node and the memory usage of each Node. For example, here is sample output from numactl:

$ numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1
node 0 size: 7977 MB
node 0 free: 4416 MB
...

This output shows that there is only one Node in my system, Node 0, and that CPUs 0 and 1 are both located in it. In addition, Node 0 has 7977 MB of memory, of which 4416 MB is free.
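On a machine with several Nodes, it is easier to compare them if each Node is reduced to one line. The sketch below does this for text in the numactl --hardware format shown above; node_free_percent is a hypothetical helper name, and it assumes the size line of a Node appears before its free line, as in the sample:

```shell
#!/bin/sh
# Summarize free memory per NUMA node.
# Reads `numactl --hardware`-formatted text from stdin.
# node_free_percent is a hypothetical helper name.
node_free_percent() {
    awk '$1 == "node" && $3 == "size:" { size[$2] = $4 }
         $1 == "node" && $3 == "free:" && size[$2] > 0 {
             printf "node %s: %d MB free of %d MB (%d%%)\n", $2, $4, size[$2], $4 * 100 / size[$2]
         }'
}

# Only call numactl if it is installed.
if command -v numactl >/dev/null 2>&1; then
    numactl --hardware | node_free_percent
fi
```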

Now that you understand the architecture of NUMA and how to view NUMA memory, you may wonder how this relates to Swap.

In fact, the three memory thresholds mentioned earlier (pages_min, pages_low, and pages_high) can all be viewed per memory domain through the /proc/zoneinfo interface.

For example, here is a sample of the /proc/zoneinfo file:

$ cat /proc/zoneinfo
...
Node 0, zone   Normal
 pages free     227894
       min      14896
       low      18620
       high     22344
...
     nr_free_pages 227894
     nr_zone_inactive_anon 11082
     nr_zone_active_anon 14024
     nr_zone_inactive_file 539024
     nr_zone_active_file 923986
...

This output contains a large number of metrics, and I will explain some of the more important ones.

min, low, and high under pages are the three memory thresholds mentioned earlier, while free is the number of remaining memory pages, which is the same as nr_free_pages.
nr_zone_active_anon and nr_zone_inactive_anon are the number of active and inactive anonymous pages, respectively.
nr_zone_active_file and nr_zone_inactive_file are the number of active and inactive file pages, respectively.

From this output, you can see that the remaining memory (227894 pages) is much higher than the high threshold (22344 pages), so kswapd0 does not reclaim memory at this time.
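The same comparison can be automated for every zone. The following sketch parses text in the /proc/zoneinfo format shown above and prints one line per zone; zone_watermarks is a helper name I chose for illustration, and the "ok" / "below-high" labels are my own shorthand:

```shell
#!/bin/sh
# Print free pages and the three thresholds for each zone, and flag
# zones whose free pages are not above the high threshold.
# Reads /proc/zoneinfo-formatted text from stdin.
# zone_watermarks is a hypothetical helper name.
zone_watermarks() {
    awk '$1 == "Node" && $3 == "zone"   { zone = "node" $2 " " $4; sub(",", "", zone) }
         $1 == "pages" && $2 == "free"  { free = $3 }
         $1 == "min"                    { min = $2 }
         $1 == "low"                    { low = $2 }
         $1 == "high" {
             state = (free + 0 > $2 + 0) ? "ok" : "below-high"
             printf "%s free=%d min=%d low=%d high=%d %s\n", zone, free, min, low, $2, state
         }'
}

# On a Linux system, feed it the live data.
if [ -r /proc/zoneinfo ]; then
    zone_watermarks < /proc/zoneinfo
fi
```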

Of course, when the memory of a specific Node is insufficient, the system can search for free memory from other Nodes or reclaim memory from local memory. You can adjust the mode by modifying /proc/sys/vm/zone_reclaim_mode. It supports the following options:

The default mode 0, as mentioned earlier, means that free memory can be searched from other Nodes or reclaimed locally.
1, 2, and 4 all mean reclaiming local memory only, where 2 indicates that dirty data may be written back during reclamation, and 4 indicates that pages may be swapped out during reclamation.
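The kernel treats zone_reclaim_mode as a bitmask, so these values can also be combined (for example, 6 is 2 plus 4). As a rough sketch, the function below turns a value into a readable description; describe_zone_reclaim is a hypothetical helper, and the wording of the descriptions is mine:

```shell
#!/bin/sh
# Describe a zone_reclaim_mode value. The value is a bitmask,
# so the bits 1, 2, and 4 may be combined.
# describe_zone_reclaim is a hypothetical helper name.
describe_zone_reclaim() {
    mode=$1
    if [ "$mode" -eq 0 ]; then
        echo "0: may use other nodes or reclaim locally"
        return 0
    fi
    desc="local reclaim only"
    if [ $((mode & 2)) -ne 0 ]; then
        desc="$desc, writes back dirty pages"
    fi
    if [ $((mode & 4)) -ne 0 ]; then
        desc="$desc, swaps out pages"
    fi
    echo "$mode: $desc"
}

# The file only exists on kernels built with NUMA support.
if [ -r /proc/sys/vm/zone_reclaim_mode ]; then
    describe_zone_reclaim "$(cat /proc/sys/vm/zone_reclaim_mode)"
fi
```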

swappiness #

By now, we understand the mechanism of memory reclamation. This reclamation covers both file pages and anonymous pages.

For file pages, the reclamation involves either directly reclaiming the cache or writing dirty pages back to the disk before reclaiming.
For anonymous pages, the reclamation is actually done by writing them to the disk through the Swap mechanism and then releasing the memory.

However, you may still have a question. Since there are two different kinds of memory that can be reclaimed, which kind is reclaimed first in practice?

In fact, Linux provides an option called /proc/sys/vm/swappiness to adjust the degree of actively using Swap.

The range of swappiness is from 0 to 100. The larger the value, the more actively Swap is used, which means it tends to prefer reclaiming anonymous pages; the smaller the value, the more passively Swap is used, which means it tends to prefer reclaiming file pages.

Although the range of swappiness is 0-100, it is important to note that this is not a percentage of memory, but a weight that adjusts how aggressively Swap is used. Even if you set it to 0, Swap will still occur when the remaining memory plus file pages is less than the high threshold (pages_high).
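To make the last point concrete, here is a rough model of that rule. It is only a sketch of the condition as stated above, not the kernel’s actual reclaim logic; will_swap_at_zero is a hypothetical helper, and the sample numbers are the free, file (inactive plus active), and high page counts from the /proc/zoneinfo sample in this article:

```shell
#!/bin/sh
# Rough model of the rule above: with swappiness set to 0, anonymous pages
# can still be swapped out once free pages plus file pages fall below
# the high threshold.
# Usage: will_swap_at_zero FREE FILE HIGH   (all values in pages)
# will_swap_at_zero is a hypothetical helper name.
will_swap_at_zero() {
    free=$1 file=$2 high=$3
    if [ $((free + file)) -lt "$high" ]; then
        echo "swap still possible"
    else
        echo "file pages reclaimed first"
    fi
}

# Show the current setting on a Linux system.
if [ -r /proc/sys/vm/swappiness ]; then
    echo "vm.swappiness = $(cat /proc/sys/vm/swappiness)"
fi

will_swap_at_zero 227894 1463010 22344
```

To change the value temporarily, you can write to the file as root, for example with sysctl -w vm.swappiness=10.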

Now that we understand the principle of Swap, how can we locate and analyze when Swap usage becomes higher? Don’t worry, in the next section, we will explore this through a case study.

Summary #

When memory resources are scarce, Linux releases file pages and anonymous pages by means of direct memory reclamation and periodic scanning, so that memory can be allocated to processes that need it more.

The reclamation of file pages is relatively easy to understand: they are either cleared directly or their dirty data is written back to disk before being released.
For the reclamation of anonymous pages, they are swapped out to disk and swapped in from disk when accessed again.

You can adjust the system’s threshold for periodic memory reclamation (i.e., page low threshold) by setting /proc/sys/vm/min_free_kbytes, and you can adjust the tendency to reclaim file pages and anonymous pages by setting /proc/sys/vm/swappiness.

In a NUMA architecture, each node has its own local memory space. When local memory is insufficient, the default behavior is to either look for free memory on other nodes or reclaim local memory.

You can adjust the reclamation strategy for NUMA local memory by setting /proc/sys/vm/zone_reclaim_mode.

Reflection #

Finally, I would like to hear about your understanding of Swap. I assume you may have run into performance issues caused by Swap before. How did you analyze them? You can combine the Swap principles we discussed today with a record of your own steps, and summarize your approach to solving the problem.

Feel free to discuss with me in the comments area and also feel free to share this article with your colleagues and friends. Let’s practice in real-world scenarios and make progress through communication.