08 Case Study: Shmem, Where Did the Memory Go If the Process Did Not Consume It? #

Hello, I am Shaoyafang.

In the previous lesson, we talked about memory leaks in the process heap and the dangers of OOM caused by memory leaks. In this lesson, we will continue to discuss other types of memory leaks, so that when you notice that the system memory is decreasing, you can think about what might be consuming the memory.

Some memory leaks show up in a process's own memory usage, which makes them relatively easy to observe. Others cannot be found by looking at the memory any process consumes, so they are easily overlooked. Shmem memory leaks belong to this second category, and they are the focus of this lesson.

Where did the memory go when the process did not consume it? #

I encountered a real case of this in our production environment. Our operations staff found that used memory on certain machines kept increasing, but tools such as top could not tell them which process was consuming it. As available memory shrank, the OOM killer eventually killed business processes, which had a significant impact on our services. They asked me to help find the cause.

As I mentioned in a previous lesson, when facing a memory shortage, the first thing to do is check /proc/meminfo to see which memory types consume the most, and then analyze those types specifically. However, if you are not familiar with the meaning of each item in /proc/meminfo, then even after spotting an abnormal value you will not know how to continue the analysis. So it is best to remember what each item in /proc/meminfo means.

Returning to our case, by checking the /proc/meminfo of these machines, we found that the size of Shmem was abnormal:

$ cat /proc/meminfo
...
Shmem:          16777216 kB
...

So, what exactly does Shmem mean? How can we further analyze who is using Shmem?

As we mentioned in the earlier fundamentals lessons, Shmem refers to anonymous shared memory, that is, memory a process requests with mmap(MAP_ANON|MAP_SHARED). You may wonder: shouldn't memory requested this way be counted in the process's RES (resident) memory? Consider the following simple example:

#include <sys/mman.h>
#include <string.h>
#include <unistd.h>

#define SIZE (1024*1024*1024)

int main(void)
{
    char *p;

    /* 1 GB of anonymous shared memory; this is accounted as Shmem */
    p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE, MAP_ANON|MAP_SHARED, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    /* touch every page so the memory is actually allocated */
    memset(p, 1, SIZE);

    while (1) {
        sleep(1);
    }

    return 0;
}

After running this program, the top command shows the memory consumption reflected in both the process's RES and its SHR columns. In other words, when a process requests memory with mmap in this way, we can observe the consumption through the process's own memory usage.
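If you prefer reading /proc directly instead of top, here is a minimal sketch that prints the RSS breakdown of a process from /proc/[pid]/status. It assumes a kernel of 4.5 or later, where the RssShmem line is reported; that line counts resident shared-anonymous and tmpfs pages, so the mapping above shows up in both VmRSS and RssShmem.

/* Minimal sketch: print the RSS breakdown of a process from /proc/[pid]/status.
 * RssShmem (kernels >= 4.5) counts resident shared-anonymous and tmpfs pages. */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char path[64], line[256];
    FILE *fp;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }

    snprintf(path, sizeof(path), "/proc/%s/status", argv[1]);
    fp = fopen(path, "r");
    if (!fp) {
        perror("fopen");
        return 1;
    }

    while (fgets(line, sizeof(line), fp)) {
        if (!strncmp(line, "VmRSS:", 6) ||
            !strncmp(line, "RssAnon:", 8) ||
            !strncmp(line, "RssShmem:", 9))
            fputs(line, stdout);
    }

    fclose(fp);
    return 0;
}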

But in the problem we hit in production, no process had a large RES, and nothing seemed to correlate with the large Shmem value in /proc/meminfo. Why is that?

Let's start with the answer: this is related to a special kind of Shmem. Disk is much slower than memory, so some applications improve performance by not writing data that does not need to be persisted to disk at all; instead they write this temporary data into memory and then clear it, either periodically or once the data is no longer needed, to release the memory. This need is exactly what a special kind of Shmem, tmpfs, was created for. The diagram below illustrates tmpfs:

[Figure: tmpfs]

tmpfs is a file system that lives entirely in memory. An application does not need to request or release memory for tmpfs itself; the operating system allocates the memory automatically, and the application simply writes files into it, which is very convenient. You can see where tmpfs is mounted on the system with the mount or df commands:

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
...
tmpfs            16G  15G   1G   94% /run
...

This is similar to a process writing files to disk: once the process finishes writing and closes a file, that file is no longer associated with the process, so the size of such disk files is not reflected in the process's memory consumption. The same holds for files in tmpfs; they are not reflected in any process's memory usage either. By now you can probably guess why Shmem was so large: files in tmpfs were occupying the memory.
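To make this concrete, here is a small sketch that writes temporary data into a tmpfs file and then exits. It assumes /dev/shm is a tmpfs mount (it is on most distributions; any tmpfs mount point would do), and the file name tmpfs_demo is arbitrary.

/* Sketch: write temporary data into a tmpfs file and exit. After the process
 * exits, the data stays in memory as Shmem, but no process accounts for it. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK   (1024 * 1024)          /* write in 1 MB chunks     */
#define TOTAL   (512L * 1024 * 1024)   /* 512 MB of temporary data */

int main(void)
{
    char *buf;
    long written;
    int fd;

    fd = open("/dev/shm/tmpfs_demo", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    buf = malloc(CHUNK);
    if (!buf) {
        close(fd);
        return 1;
    }
    memset(buf, 1, CHUNK);

    for (written = 0; written < TOTAL; written += CHUNK) {
        if (write(fd, buf, CHUNK) != CHUNK) {
            perror("write");
            break;
        }
    }

    free(buf);
    close(fd);      /* the data outlives the process until the file is removed */
    return 0;
}

After the program exits, the roughly 512 MB stays counted as Shmem in /proc/meminfo, yet no running process accounts for it; removing the file (rm /dev/shm/tmpfs_demo) releases the memory.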

tmpfs belongs to the file system category, so we can use df to check its usage and see whether it is consuming a lot of memory. In our case it indeed was. The problem then becomes clear: all we need to do is analyze the files stored in tmpfs.
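From the shell, du or find can tell you which files under a tmpfs mount are large. If you want to do the same thing programmatically, the following sketch walks a mount point given on the command line and reports files above a size threshold; the 10 MB threshold is just an arbitrary choice for illustration:

/* Sketch: walk a tmpfs mount point (e.g. /run or /dev/shm) and report
 * files larger than a threshold, to see what is occupying the memory. */
#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdio.h>
#include <sys/stat.h>

static long long threshold = 10LL * 1024 * 1024;   /* report files > 10 MB */

static int report(const char *path, const struct stat *st,
                  int type, struct FTW *ftw)
{
    (void)ftw;   /* unused */

    if (type == FTW_F && st->st_size > threshold)
        printf("%10lld KB  %s\n", (long long)st->st_size / 1024, path);
    return 0;   /* keep walking */
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <tmpfs mount point>\n", argv[0]);
        return 1;
    }

    /* FTW_PHYS: do not follow symlinks; 16: max open directory fds */
    if (nftw(argv[1], report, 16, FTW_PHYS) == -1) {
        perror("nftw");
        return 1;
    }
    return 0;
}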

We also hit a concrete instance of this in production: systemd kept writing logs into tmpfs without cleaning them up in time, and the tmpfs had been sized too generously to begin with. The logs kept growing and available memory kept shrinking.

To solve this, we limit the size of the tmpfs that systemd writes into and, when the logs approach that limit, clean up the temporary logs automatically or delete them periodically; this can be configured through systemd's configuration files. The size of a tmpfs mount itself can also be adjusted on the fly, for example to 2G:

$ mount -o remount,size=2G /run

As a special kind of Shmem, tmpfs does not show its memory consumption in any process's memory, which often makes troubleshooting harder. To analyze this kind of problem efficiently, you must be familiar with the memory types in the system. Besides tmpfs, other memory is also invisible in process memory, notably memory consumed by the kernel: Slab (kernel caches), KernelStack (kernel stacks), and VmallocUsed (memory the kernel allocates through vmalloc). These are the types to investigate when you cannot tell which process is consuming the memory.

If the memory consumed by tmpfs keeps growing and is never cleaned up, the end result is the same: available memory runs out and the OOM killer starts killing processes, quite possibly important ones, or ones you believe should never be killed.

The Dangers of OOM Process Killing #

The logic the OOM killer follows when picking a process to kill is roughly as follows:

When the OOM killer is triggered, it scans the killable processes in the system, computes a final score (oom_score) for each based on how much memory it occupies and on its oom_score_adj setting, and then kills the process with the highest score. If several processes share the highest score, it kills the first one it scanned.

A process's oom_score can be read from /proc/[pid]/oom_score. If you scan the oom_score of every process on your system, the one with the highest score is the one that will be killed first when an OOM occurs. Note, however, that oom_score depends on the process's memory consumption, which changes over time, so the value changes dynamically as well.
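As a rough illustration of that scan, the sketch below walks /proc, reads each process's oom_score, and prints the current most likely victim; since the scores change constantly, the answer is only a snapshot:

/* Sketch: scan /proc/[pid]/oom_score for every process and print the pid
 * with the highest score, i.e. the most likely first victim of the OOM
 * killer at this moment. */
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    struct dirent *de;
    DIR *proc;
    int max_score = -1, max_pid = -1;

    proc = opendir("/proc");
    if (!proc) {
        perror("opendir");
        return 1;
    }

    while ((de = readdir(proc)) != NULL) {
        char path[288];
        FILE *fp;
        int score;

        if (!isdigit((unsigned char)de->d_name[0]))
            continue;   /* only numeric entries are processes */

        snprintf(path, sizeof(path), "/proc/%s/oom_score", de->d_name);
        fp = fopen(path, "r");
        if (!fp)
            continue;   /* the process may already have exited */

        if (fscanf(fp, "%d", &score) == 1 && score > max_score) {
            max_score = score;
            max_pid = atoi(de->d_name);
        }
        fclose(fp);
    }
    closedir(proc);

    printf("highest oom_score: %d (pid %d)\n", max_score, max_pid);
    return 0;
}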

If you do not want a specific process to be killed first, you can adjust the oom_score_adj of that process to change its oom_score. If you want a process to never be killed, you can configure its oom_score_adj as -1000.

Usually we configure important system services, such as sshd, with an oom_score_adj of -1000, because once such services are killed it becomes difficult to log in to the machine again.
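Adjusting the value is simply a write into /proc/[pid]/oom_score_adj (lowering it requires root privileges). The small sketch below, with the pid and value taken from the command line, is one way to do it:

/* Sketch: write an oom_score_adj value (for example -1000) into
 * /proc/[pid]/oom_score_adj. Lowering the value requires root. */
#include <stdio.h>

int main(int argc, char **argv)
{
    char path[64];
    FILE *fp;

    if (argc != 3) {
        fprintf(stderr, "usage: %s <pid> <oom_score_adj>\n", argv[0]);
        return 1;
    }

    snprintf(path, sizeof(path), "/proc/%s/oom_score_adj", argv[1]);
    fp = fopen(path, "w");
    if (!fp) {
        perror("fopen");
        return 1;
    }

    fprintf(fp, "%s\n", argv[2]);
    fclose(fp);
    return 0;
}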

Apart from system services, however, do not configure your business program with -1000, no matter how important it is. If that program leaks memory and can never be killed, then as its consumption grows the OOM killer will be invoked again and again, killing the other processes one by one. We have seen cases like this in production.

One of the roles of the OOM killer is to find processes in the system that continuously leak memory and kill them. If it fails to find the correct process, it will mistakenly kill other processes, including more important business processes.

Beyond killing innocent processes, the OOM killer's strategy for choosing its victim is not always correct either. It is time to find fault with the kernel again. The purpose of this course is to teach you how to study the Linux kernel, but also to remind you to stay skeptical of it. The following case is a kernel bug.

On one of our servers, we found that the OOM killer always killed the process it scanned first. Because that process used very little memory, killing it freed almost nothing, and another OOM followed soon after.

The problem was triggered in a Kubernetes environment. Kubernetes configures important containers as Guaranteed (their oom_score_adj is -998) so that the system does not kill them during an OOM. But when an OOM occurred inside a container, this kernel bug caused the first process scanned to be the one killed, every time.

To fix this kernel bug, I contributed a patch to the community, "mm, oom: make the calculation of oom badness more accurate", which corrects the selection of an inappropriate victim. The commit log of that patch describes the problem in detail; take a look if you are interested.

Class Summary #

In this lesson we looked at memory leaks in tmpfs and how to observe them. The biggest difference from ordinary process memory leaks is that you cannot find the source by looking at the memory consumed by processes, because this memory is not reflected in any process's RES. Still, if you follow the standard approach to analyzing memory issues, you can find the problem quickly.

  • When you are not sure which process is consuming the memory, use /proc/meminfo to find the memory type with an unusually large footprint, and then analyze that type specifically.
  • Configure an appropriate OOM policy (oom_score_adj) so that important business processes are not killed too early, for example by giving them a negative oom_score_adj, while also weighing the risk that other processes get killed in their place. You can compare /proc/[pid]/oom_score across processes to see the order in which they would be killed.
  • Once again: study the kernel, but keep a skeptical attitude toward it.

In summary, the more you understand the characteristics of different types of memory, the more efficient you will be in analyzing memory issues (such as memory leaks). By mastering these different memory types, you can also choose the appropriate memory type when your business needs to allocate memory.

Homework #

Please run several programs with different oom_score_adj values, record their oom_scores, and then trigger an OOM by consuming system memory. Observe how oom_score relates to the order in which the processes are killed. I welcome your discussion in the comments.
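If you want a starting point for the experiment, here is one possible helper, nothing more than a sketch: it sets its own oom_score_adj from the first argument (a negative value requires root), then allocates and touches the number of megabytes given as the second argument and waits to be killed.

/* Homework helper sketch: set this process's own oom_score_adj, allocate and
 * touch the requested amount of memory, then wait so its oom_score can be
 * observed. Run several copies with different adjustments and sizes. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MB (1024 * 1024)

int main(int argc, char **argv)
{
    FILE *fp;
    long i, mb;

    if (argc != 3) {
        fprintf(stderr, "usage: %s <oom_score_adj> <megabytes>\n", argv[0]);
        return 1;
    }
    mb = atol(argv[2]);

    /* write the adjustment for this process; negative values need root */
    fp = fopen("/proc/self/oom_score_adj", "w");
    if (!fp) {
        perror("fopen");
        return 1;
    }
    fprintf(fp, "%s\n", argv[1]);
    fclose(fp);

    for (i = 0; i < mb; i++) {
        char *p = malloc(MB);

        if (!p)
            break;
        memset(p, 1, MB);   /* touch the pages so they are really allocated */
    }

    printf("pid %d holding %ld MB, check /proc/%d/oom_score\n",
           (int)getpid(), i, (int)getpid());
    pause();                /* keep the memory until killed or interrupted */
    return 0;
}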

Thank you for reading. If you found this lesson helpful, please feel free to share it with your friends. See you in the next lecture.