15 Fundamentals: How Does Linux Memory Work? #

Hello, I’m Ni Pengfei.

In the previous sections, we covered the performance principles and optimization methods of the CPU. Now we move on to another topic: memory.

Similar to CPU management, memory management is also one of the core functions of the operating system. Memory is primarily used to store instructions, data, cache, and other components of the system and applications.

So, how does Linux manage its memory? Today, I will take you through this question.

Memory Mapping #

When it comes to memory, do you know how much memory your computer has? I guess you remember it clearly because it is an important parameter we consider when purchasing a computer. For example, my laptop has 8 GB of memory.

The memory capacity we usually talk about, such as the 8 GB above, actually refers to physical memory. Physical memory, also known as main memory, is in most computers dynamic random access memory (DRAM). Only the kernel can access physical memory directly. So, what happens when a process needs to access memory?

The Linux kernel gives each process its own independent and continuous virtual address space, through which the process can conveniently access memory, or more precisely, virtual memory.

The virtual address space is divided into two parts: the kernel space and the user space. The range of the address space varies for processors with different word lengths (the maximum length of data that can be processed by a single CPU instruction). For example, in the most common 32-bit and 64-bit systems, I’ve drawn two diagrams to represent their virtual address spaces respectively:

From here, we can see that the kernel space of a 32-bit system occupies 1 GB, located at the highest end, while the remaining 3 GB is the user space. In a 64-bit system, both the kernel space and user space are 128 TB, occupying the highest and lowest parts of the entire memory space, respectively, leaving the middle part undefined.

Do you still remember the user mode and kernel mode of a process? When a process is in user mode, it can only access user space memory. Only when it enters kernel mode can it access kernel space memory. Although the address space of each process includes the kernel space, these kernel spaces are actually associated with the same physical memory. This way, when a process switches to kernel mode, it can conveniently access kernel space memory.

Since each process has such a large address space, the total virtual memory of all processes is naturally much larger than the actual physical memory. Therefore, not all virtual memory is backed by physical memory: only the virtual memory that is actually used gets physical memory, and that physical memory is assigned through memory mapping.

Memory mapping is essentially mapping virtual memory addresses to physical memory addresses. To achieve memory mapping, the kernel maintains a page table for each process, which records the mapping relationship between virtual addresses and physical addresses, as shown in the following diagram:

The page table is managed by the memory management unit (MMU) in the CPU, so under normal circumstances the processor can resolve the memory address to be accessed directly in hardware.

When a process accesses a virtual address that cannot be found in the page table, the system raises a page fault exception, traps into kernel space to allocate physical memory and update the process’s page table, and then returns to user space to resume the process.
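If you want to see page faults for yourself, a simple check is the fault counters that ps exposes. A minimal sketch, assuming the procps version of ps found on most Linux distributions:

# Show the minor and major page fault counts of the current shell
# (minor faults are resolved in memory; major faults require disk I/O)
$ ps -o pid,min_flt,maj_flt,comm -p $$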

In addition, I mentioned in my article on CPU context switching that the Translation Lookaside Buffer (TLB) affects the CPU’s memory access performance, which can actually be explained here.

The TLB is actually a cache of the page table inside the MMU. Since the virtual address space of each process is independent, and a TLB lookup is much faster than walking the page table in memory, reducing process context switches and TLB flushes raises the TLB hit rate and improves the CPU’s memory access performance.

However, note that the MMU does not manage memory byte by byte: the smallest unit of a memory mapping is a page, usually 4 KB in size. Every memory mapping therefore covers 4 KB, or a multiple of 4 KB, of memory.
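You can confirm the page size on your own machine. For example, with getconf, which is available on essentially all Linux systems:

# Query the memory page size in bytes; this typically prints 4096
$ getconf PAGE_SIZE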

Such small 4 KB pages lead to another problem: the page table itself becomes very large. For example, a 32-bit system alone needs more than a million page table entries (4 GB / 4 KB) to map its entire address space. To solve the problem of too many page table entries, Linux provides two mechanisms: multi-level page tables and huge pages.

Multi-level page tables divide memory into blocks for management, replacing the original mapping relationship with block indexes and offsets within the blocks. Since only a small part of the virtual address space is usually used, multi-level page tables only keep these used blocks, which greatly reduces the number of page table entries.

Linux uses a four-level page table to manage memory pages, as shown in the following diagram. The virtual address is divided into five parts: the first four parts index the successive page table levels to select a page, and the last part is the offset within that page.

Next, let’s look at huge pages. As the name suggests, they are memory blocks larger than regular pages, commonly 2 MB or 1 GB in size. Huge pages are typically used by processes that need large amounts of memory, such as Oracle and DPDK.
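To check whether huge pages are configured on a system, you can read /proc/meminfo; the fields below are present on kernels built with huge page support:

# Hugepagesize shows the huge page size; HugePages_Total shows how many
# huge pages are reserved (0 means none are configured)
$ grep Huge /proc/meminfo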

With these mechanisms and the mapping of the page table, processes can access physical memory through virtual addresses. Now, specifically within a Linux process, how is this memory used?

Distribution of Virtual Memory Space #

First, we need a closer look at how the virtual memory space is laid out. The kernel space at the top needs little explanation, but the user space below it is actually divided into several different segments. Taking a 32-bit system as an example, I have drawn a diagram of their relationship.

Through this diagram, you can see that the user space memory is divided into five different memory segments from low to high.

  1. The read-only segment, including code and constants, etc.

  2. The data segment, including global variables, etc.

  3. The heap, including dynamically allocated memory, which grows from low addresses to high addresses.

  4. The file mapping segment, including dynamic libraries, shared memory, etc., which grows from high addresses to low addresses.

  5. The stack, including local variables and the context of function calls. The size of the stack is fixed, generally 8 MB.

Among these five memory segments, the memory in the heap and the file mapping segment is dynamically allocated. For example, by using the C standard library’s malloc() or mmap(), you can dynamically allocate memory in the heap and the file mapping segment respectively.
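You can see these segments directly in a process’s memory map. As a quick example (the addresses will differ on every run because of address space layout randomization):

# Print the memory map of the current shell; look for the [heap] and
# [stack] markers, the read-only code segment, and mapped libraries
$ cat /proc/$$/maps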

In fact, the memory distribution in a 64-bit system is similar, except that the memory space is much larger. So, the more important question is: how is the memory actually allocated?

Memory Allocation and Deallocation #

malloc() is the memory allocation function provided by the C standard library. Under the hood, it allocates memory through one of two system calls: brk() or mmap().

For small memory blocks (less than 128 KB by default), the C standard library allocates with brk(), that is, by moving the top of the heap. Memory freed this way is not returned to the system immediately; it is cached so that it can be reused.

For large memory blocks (128 KB or more), mmap() is used instead, which allocates a block of free memory in the file mapping segment.
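If strace is installed, you can watch these two system calls being issued by any command. A small sketch; note that the 128 KB threshold is glibc’s default M_MMAP_THRESHOLD and can be changed with mallopt():

# Trace only the memory-related system calls of a command:
# heap growth shows up as brk(); large allocations and library
# mappings show up as mmap()
$ strace -e trace=brk,mmap,munmap ls > /dev/null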

These two methods naturally have their own advantages and disadvantages.

The caching in the brk() approach can reduce page faults and improve memory access efficiency. However, because freed blocks are not returned to the system, frequent allocation and deallocation can leave the heap fragmented.

In contrast, memory allocated with mmap() is returned to the system as soon as it is freed, so every new mmap() allocation triggers page faults again. Under heavy memory activity, frequent mmap() allocations cause a flood of page faults and increase the kernel’s management burden. This is also why malloc() uses mmap() only for large blocks.

One more thing to understand about both calls: when they return, physical memory has not actually been allocated yet. Physical pages are allocated only on first access, when a page fault traps into the kernel and the kernel allocates the memory.

Overall, Linux uses the buddy system to manage physical memory allocation. As mentioned earlier, memory is managed in units of pages, and the buddy system likewise allocates in pages, reducing fragmentation by merging adjacent pages (such as the fragmentation caused by brk()).
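The state of the buddy system is exposed through /proc/buddyinfo, where each column counts the free blocks of a given order, i.e., of 2^n contiguous pages:

# Free blocks per memory zone, grouped by size: the first column is
# free 4 KB blocks, the second free 8 KB blocks, and so on
$ cat /proc/buddyinfo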

You may wonder: what about objects smaller than a page, say, less than 1 KB?

In actual system operation, there are indeed many objects that are smaller than a page. If individual pages are allocated for them, a lot of memory will be wasted.

Therefore, in user space, malloc() caches the memory obtained through brk() and reuses it rather than returning it to the system immediately. In kernel space, Linux uses the slab allocator to manage small chunks of memory. You can think of slab as a cache built on top of the buddy system, used mainly to allocate and free small objects inside the kernel.
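You can observe the slab caches with slabtop, which ships with procps, or by reading /proc/slabinfo directly (the latter usually requires root):

# Display kernel slab cache usage in real time
$ slabtop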

If memory is allocated but never deallocated, it causes memory leaks and can even exhaust system memory. Therefore, after an application finishes using memory, it must call free() or munmap() to release it.

Of course, the system will not allow a process to use up all the memory. When the system detects memory shortage, it will reclaim memory through a series of mechanisms, including the following three:

  • Reclaiming caches, for example using the Least Recently Used (LRU) algorithm to reclaim the least recently used memory pages;
  • Reclaiming infrequently accessed memory, writing it out to the swap partition;
  • Killing processes: when memory runs critically low, the system uses the Out of Memory (OOM) mechanism to kill processes that occupy large amounts of memory.

In the second method mentioned, when reclaiming infrequently accessed memory, the swap partition is used. Swap essentially uses a portion of the disk space as memory. It stores data that a process temporarily does not use on the disk (this process is called swapping out), and when the process accesses this memory again, it reads the data from the disk to the memory (this process is called swapping in).

Therefore, you can see that Swap increases the available memory of the system. However, it is worth noting that swap only occurs when memory is not sufficient. Moreover, due to the much slower speed of disk read and write compared to memory, swap can cause severe memory performance issues.
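To check whether swap is configured and how much of it is in use, you can run swapon, which is part of util-linux:

# List active swap areas with their size and current usage
$ swapon --show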

The third method mentioned, OOM (Out of Memory), is actually a protection mechanism of the kernel. It monitors the memory usage of processes and uses oom_score to score the memory usage of each process:

  • The larger the memory consumed by a process, the larger the oom_score;
  • The more CPU resources consumed by a process during execution, the smaller the oom_score.

In this way, the larger the oom_score of a process, the more memory it consumes, and the easier it is to be killed by OOM, which can better protect the system.

Of course, for practical needs, administrators can manually set the oom_adj of a process through the /proc file system to adjust the oom_score of the process.

The range of oom_adj is [-17, 15]. The larger the value, the easier the process is to be killed by OOM; the smaller the value, the less likely the process is to be killed by OOM. In this range, -17 means OOM is disabled.

For example, with the following command, you can set the oom_adj of the sshd process to a smaller value of -16, so that the sshd process is less likely to be killed by OOM.

echo -16 > /proc/$(pidof sshd)/oom_adj
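To confirm the adjustment, you can read the values back from /proc. Note that on recent kernels oom_adj is a legacy interface that the kernel maps onto the newer oom_score_adj:

# Read back the adjustment and the resulting OOM score
$ cat /proc/$(pidof sshd)/oom_adj
$ cat /proc/$(pidof sshd)/oom_score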

How to View Memory Usage #

Having covered the layout of memory space and memory allocation and reclamation, you should now have a general picture of how memory works. Of course, the actual operation of the system is more complicated and involves other mechanisms; here I have discussed only the main principles. With this knowledge, you can understand how memory works rather than just memorizing terminology.

So after understanding the working principles of memory, how can we check the system’s memory usage?

Actually, we already mentioned some relevant tools in the earlier sections on the CPU. The first that should come to mind here is free. Below is sample output from free:

# Note that the output of 'free' may vary in different versions
$ free
              total        used        free      shared  buff/cache   available
Mem:        8169348      263524     6875352         668     1030472     7611064
Swap:             0           0           0

You can see that the output of free is a table of two rows and six columns, with values shown in kibibytes (KiB) by default. The two rows show the usage of physical memory (Mem) and of the swap partition (Swap). The six columns are:

  • The first column, total, represents the total memory size;

  • The second column, used, represents the size of used memory, including shared memory;

  • The third column, free, represents the size of unused memory;

  • The fourth column, shared, represents the size of shared memory;

  • The fifth column, buff/cache, represents the size of buffers and caches;

  • The last column, available, represents the size of available memory for new processes.

Pay special attention to the last column, available. It includes not only unused memory but also reclaimable caches, so it is generally larger than free. Note, however, that not all caches can be reclaimed, because some of them may still be in use.
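The available figure is the kernel’s own estimate, which you can also read directly; the MemAvailable field has existed in /proc/meminfo since Linux 3.14:

# The kernel's estimate of memory available to new workloads,
# which is what free reports in its available column
$ grep MemAvailable /proc/meminfo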

free, however, only shows system-wide memory usage. To check the memory usage of individual processes, you can use tools such as top or ps. For example, here is sample output from top:

# Press M to switch to sorting by memory
$ top
...
KiB Mem :  8169348 total,  6871440 free,   267096 used,  1030812 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  7607492 avail Mem


  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  430 root      19  -1  122360  35588  23748 S   0.0  0.4   0:32.17 systemd-journal
 1075 root      20   0  771860  22744  11368 S   0.0  0.3   0:38.89 snapd
 1048 root      20   0  170904  17292   9488 S   0.0  0.2   0:00.24 networkd-dispat
    1 root      20   0   78020   9156   6644 S   0.0  0.1   0:22.92 systemd
12376 azure     20   0   76632   7456   6420 S   0.0  0.1   0:00.01 systemd
12374 root      20   0  107984   7312   6304 S   0.0  0.1   0:00.00 sshd
...

At the top of its output, top also shows the system’s overall memory usage, much like free, so I won’t explain it again. Instead, let’s look at the rows below, specifically the memory-related columns VIRT, RES, SHR, and %MEM.

These columns carry several important pieces of memory information for each process; let’s go through them one by one.

  • VIRT is the virtual memory size of the process, including memory that the process has requested but that has not yet been backed by physical memory.

  • RES is the resident memory size, that is, the physical memory the process actually occupies; it does not include memory that has been swapped out to disk.

  • SHR is the shared memory size, including memory shared with other processes, loaded dynamic libraries, and the program’s code segment.

  • %MEM represents the percentage of physical memory used by the process out of the total system memory.

In addition to understanding this basic information, when viewing the top output, there are two more points to note.

First, a process’s virtual memory is usually not fully backed by physical memory. In the output above, you can see that each process’s virtual memory is much larger than its resident memory.

Second, shared memory (SHR) is not necessarily all shared. For example, the program’s code segment and non-shared dynamic libraries are also counted in SHR. Of course, SHR does include memory that is truly shared between processes. So when estimating the memory usage of multiple processes, do not simply add up the SHR values of all the processes.
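As a complement to top, here is a quick ps example for a single process; vsz, rss, and pmem are standard procps output fields (in KiB and percent) that correspond to VIRT, RES, and %MEM in top:

# Virtual size, resident size, and memory percentage of sshd
$ ps -o pid,user,vsz,rss,pmem,comm -p "$(pidof sshd)"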

Summary #

Today we walked through how Linux memory works. An ordinary process sees only the virtual memory provided by the kernel, and this virtual memory is mapped onto physical memory by the system through page tables.

When a process requests memory through malloc(), the memory is not allocated immediately. Instead, it is allocated by the kernel when the memory is accessed for the first time through a page fault exception.

Since the virtual address space of a process is much larger than physical memory, Linux provides a series of mechanisms to deal with memory shortage, such as cache recycling, swap partition, and OOM.

When you need to understand the memory usage of the system or of a process, you can use performance tools such as free, top, and ps. They are the most commonly used tools for analyzing memory issues; I hope you can use them proficiently and truly understand what each metric means.

Reflection #

Finally, I would like to hear your understanding of Linux memory. What memory-related performance bottlenecks have you encountered? How do you analyze them? You can combine the knowledge and principles of memory that you have learned today to present your own views.

Feel free to discuss with me in the comments, and also feel free to share this article with your colleagues and friends. Let’s practice in real scenarios and progress through communication.