21 How-tos: How to Quickly and Accurately Find System Memory Issues #

Hello, I’m Ni Pengfei.

In the previous sections, we used several cases to analyze a range of common memory performance issues. I believe that, through these cases, you have gained a basic feel for memory performance analysis and become familiar with many of the analysis tools. You are probably wondering: is there a method to quickly pinpoint memory issues? And once a memory bottleneck is identified, what are the approaches for optimizing it?

Today, I will help you sort out a process for quickly analyzing system memory issues and summarize the corresponding approaches to solving them.

Memory Performance Metrics #

To analyze memory performance bottlenecks, you first need to know how to measure memory performance, that is, you need to understand the performance metrics. Let’s review the memory performance metrics we have learned in the previous sections.

You can first find a piece of paper and write them down from memory, or open the previous articles and summarize them yourself.

First, the metrics that come to mind most easily are those for system memory usage, such as used memory, free memory, available memory, shared memory, cache usage, and buffer usage (see the example after this list).

  • Used memory and free memory are easy to understand: they are the memory that has been used and the memory that has not yet been used.

  • Shared memory is implemented through tmpfs, so its size is the amount of memory used by tmpfs. In fact, tmpfs is itself a special kind of cache.

  • Available memory is the maximum memory that a new process can use, including free memory and reclaimable cache.

  • Cache includes two parts. One part is the page cache for files read from disk, which caches data read from disk so that future accesses are faster. The other part is the reclaimable memory in the Slab allocator.

  • Buffers are temporary storage for raw disk blocks, used to cache data that is about to be written to disk, so that the kernel can consolidate scattered writes and optimize disk I/O.
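If you want to see these system-level metrics on a Linux machine, here is a minimal sketch using free and /proc/meminfo (it assumes the procps free and a reasonably recent kernel; field names may differ slightly on older systems):

```bash
# Overall memory usage in human-readable units;
# the total/used/free/shared/buff/cache/available columns map to the metrics above
free -h

# The same numbers come from /proc/meminfo, for example:
grep -E '^(MemTotal|MemFree|MemAvailable|Buffers|Cached|SReclaimable|Shmem)' /proc/meminfo
```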

The second category you are likely to think of is process memory usage, such as a process’s virtual memory, resident memory, shared memory, and swap memory.

  • Virtual memory includes the process’s code segment, data segment, shared memory, allocated heap memory, and memory that has been swapped out. Note that memory that has been allocated but not yet backed by physical memory also counts as virtual memory.

  • Resident memory is the actual physical memory used by the process. However, it does not include swap memory and shared memory.

  • Shared memory includes memory that is genuinely shared with other processes, as well as loaded dynamic libraries and the program’s code segment.

  • Swap memory refers to memory that has been swapped out to disk through swap.

Of course, among these metrics, resident memory is usually expressed as a percentage of total system memory, that is, the process’s memory usage percentage.
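As a quick illustration, the sketch below shows where these per-process metrics appear in top and ps (assuming procps-ng; column names may vary on other versions):

```bash
# In top, VIRT/RES/SHR/%MEM correspond to virtual memory, resident memory,
# shared memory, and the memory usage percentage; -o sorts by a column in batch mode
top -b -n 1 -o %MEM | head -n 20

# ps can report the same per-process metrics directly:
# vsz = virtual memory (KB), rss = resident memory (KB), %mem = usage percentage
ps -eo pid,comm,vsz,rss,%mem --sort=-rss | head
```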

In addition to these metrics that come to mind easily, I also want to emphasize page faults.

When covering the principles of memory allocation, I mentioned that after a process requests memory through system calls, physical memory is not allocated right away; it is allocated through page faults when that memory is first accessed. Page faults can be divided into the following two scenarios.

  • When the fault can be served directly from physical memory, it is called a minor page fault.

  • When disk I/O is required (for example, with Swap), it is called a major page fault.

Obviously, an increase in major page faults means that disk I/O is required, and memory access will be much slower.
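To observe page faults in practice, you can use sar and ps, roughly as sketched below (this assumes the sysstat and procps packages are installed):

```bash
# System-wide paging statistics, one sample per second;
# fault/s is total page faults, majflt/s is major page faults that need disk I/O
sar -B 1

# Accumulated minor/major page faults per process, sorted by major faults
ps -eo pid,comm,min_flt,maj_flt --sort=-maj_flt | head
```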

In addition to system memory and process memory, the third category of important metrics is Swap usage, such as used space, free space, swap-in rate, and swap-out rate.

  • Used space and free space are self-explanatory: they are the swap space that has been used and the swap space that has not yet been used.

  • The swap-in and swap-out rates represent the amount of memory swapped in and out per second; the sketch after this list shows how to observe them.
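Here is a minimal sketch for checking these Swap metrics (swapon --show requires a recent util-linux, and sar comes from sysstat):

```bash
# Configured swap areas and how much of each is used
swapon --show

# Swap-in (pswpin/s) and swap-out (pswpout/s) rates, one sample per second
sar -W 1
```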

You need to be familiar with these memory performance metrics and know how to obtain them. I have summarized them in a mind map, which you can save and print out, or you can make your own summary.

Memory Performance Tools #

After understanding the memory performance metrics, we need to know how to obtain them, which means knowing how to use the performance tools. Here, as before, we will review the memory performance tools used in the earlier cases. I encourage you to recall and summarize them yourself first.

First, you should have noticed that free was used in every case. It is the most commonly used memory tool and shows the overall memory and Swap usage of the system. Correspondingly, you can use top or ps to view the memory usage of individual processes.

Then, in the chapter on the principles of caches and buffers, we traced the memory metrics to their source in the /proc file system and used vmstat to observe memory changes over time. Compared with free, vmstat not only shows how memory changes dynamically, but also distinguishes the sizes of the cache and buffer, as well as how much memory is being swapped in and out.
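For reference, a typical way to watch these dynamics with vmstat is sketched below; the column names are those of the procps vmstat:

```bash
# One-second samples; watch buff and cache for cache/buffer trends,
# swpd for swap usage, and si/so for swap-in/swap-out activity (KB/s)
vmstat 1
```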

Next, in the chapter on cache and buffer cases, in order to understand the cache hit rate, we used cachestat to view the read and write hit rates of the entire system cache, and used cachetop to observe the read and write hit rates of each process’s cache.
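Both of these tools come from the bcc package; on many distributions they are installed under /usr/share/bcc/tools, though the path may differ on yours. A minimal sketch:

```bash
# System-wide page cache hit ratio, one sample per second
/usr/share/bcc/tools/cachestat 1

# Per-process page cache read/write hit ratios, refreshed interactively
/usr/share/bcc/tools/cachetop
```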

Furthermore, in the memory leak case, we used vmstat to discover that memory usage was continuously increasing, and then used memleak to confirm the occurrence of memory leaks. Through the memory allocation stack given by memleak, we found the suspicious location of memory leaks.

Finally, in the Swap case, we used sar to discover that both buffer usage and Swap usage were growing. With cachetop, we found the cause of the buffer growth; and by comparing the remaining memory with the memory watermarks in /proc/zoneinfo, we found that the Swap growth was caused by memory reclamation. At the end of that case, we also used the /proc file system to identify the processes affected by Swap.
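If you want to reproduce that kind of analysis, the commands below are a rough sketch; here <pid> is a placeholder for the process you are inspecting:

```bash
# Memory (-r) and swap (-S) utilization, one sample per second
sar -r -S 1

# Free pages versus the min/low/high watermarks for each memory zone
grep -E 'Node|pages free|min|low|high' /proc/zoneinfo | head -n 20

# How much of a specific process has been swapped out
grep VmSwap /proc/<pid>/status
```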

At this point, do you once again feel the “malice” of the performance world? Why are there so many performance tools? In fact, as I mentioned before, once you understand how memory works and connect those principles to the performance metrics, mastering the tools themselves is not difficult.

Relationship between Performance Metrics and Tools #

Similar to CPU performance analysis, my experience is to organize and memorize them from two different dimensions:

  • Starting from memory metrics, it is easier to associate tools with the working principles of memory.

  • Starting from performance tools, we can quickly utilize the tools to identify the performance metrics we want to observe. Especially when the number of tools is limited, it is essential to make full use of each tool at hand and uncover more issues.

Similarly, based on the correspondence between memory performance metrics and performance tools, I have created two tables to help you understand and memorize their relationship. Of course, you can also use them as “metric to tool” and “tool to metric” reference guides and simply look things up when needed.

The first table, starting from memory metrics, lists the performance tools that can provide these metrics. This way, when investigating performance issues, you will know exactly which tools to use for analyzing and obtaining the desired metrics.

The second table, starting from performance tools, organizes the memory metrics that these common tools can provide. By mastering this table, you can maximize the use of existing tools and find as many desired metrics as possible.

You don’t have to memorize the specific usage of these tools. Just knowing which tools are available and the basic metrics they provide is enough. When needed, refer to their manuals for usage instructions.

How to Quickly Analyze Memory Performance Bottlenecks #

I believe that at this point, you are already very familiar with memory performance metrics and know which tools to use to obtain each metric.

Does that mean that every time you encounter a memory performance issue, you have to run all these tools and analyze all memory performance metrics?

Of course not. As I mentioned in the earlier CPU performance section, the brute-force method, although it will likely turn up some potential bottlenecks, is inefficient and laborious, so we rejected it.

In a real production environment, what we want is to locate system bottlenecks quickly and accurately, and then optimize performance.

So is there a method that can quickly and accurately analyze memory problems in the system?

Of course, there is. The trick is to find the correlations between metrics. Although there are many memory performance metrics, they all describe the same underlying memory mechanisms, so they are naturally not isolated from one another; there are usually correlations between them. Conversely, these correlations stem from how the system manages memory, which is why I keep emphasizing the importance of fundamental principles and weave explanations of them into these articles.

Let me give you a simple example. When you see that the system’s remaining memory is very low, does it mean that processes cannot allocate new memory? Of course not, because the memory that processes can use includes not only the remaining memory but also reclaimable caches and buffers.

Therefore, to quickly identify memory issues, I usually start by running several performance tools that cover a large scope, such as free, top, vmstat, and pidstat.

The specific analysis steps are as follows (see the command sketch after the list):

  1. First, use free and top to check the overall memory usage of the system.

  2. Then, use vmstat and pidstat to observe the trend over a period of time to determine the type of memory problem.

  3. Finally, conduct a detailed analysis, such as memory allocation analysis, cache/buffer analysis, and specific process memory usage analysis.
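As a concrete starting point, the three steps above might look something like the sketch below on the command line, assuming the procps, sysstat, and bcc tools are available:

```bash
# Step 1: overall memory and Swap usage
free -h
top -b -n 1 -o %MEM | head -n 20

# Step 2: trends over a short period, to classify the problem
vmstat 1 5        # buff/cache growth, si/so swap activity
pidstat -r 1 5    # per-process page faults and memory usage

# Step 3: drill down with the detailed tools as needed,
# e.g. cachetop/slabtop for caches, pmap for address space, memleak for leaks
```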

At the same time, I have illustrated this analysis process as a flowchart, which you can save and print for reference.

Memory Performance Analysis Flowchart

The flowchart lists several commonly used memory tools and their related analysis processes. The arrows represent the direction of analysis. Let me give you a few examples that may help you better understand.

For the first example, when free shows that most of the memory is occupied by caches, you can use vmstat or sar to observe the cache trend and confirm whether cache usage is still growing.

If the usage continues to increase, it means that the process causing the cache increase is still running, and you can use cache/buffer analysis tools like cachetop and slabtop to analyze where these caches are being used.

For the second example, when you check free and find that the available memory is insufficient, you need to first confirm whether the memory is being occupied by caches/buffers. After excluding caches/buffers, you can continue to use pidstat or top to locate the process consuming the most memory.

Once you identify the process, you can then analyze the memory usage in the process address space using process memory space analysis tools like pmap.
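For example, a minimal pmap sketch might look like this, where <pid> is a placeholder for the process you found:

```bash
# Extended report of the process address space; the RSS and Dirty columns
# show where the resident memory actually lives
pmap -x <pid>

# /proc/<pid>/smaps provides the same information per mapping, in more detail
grep -E '^(Rss|Swap):' /proc/<pid>/smaps | head
```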

For the third example, if vmstat or sar shows that memory usage keeps growing, you can investigate whether there is a memory leak.

For example, you can use the memory allocation analysis tool memleak to check for memory leaks. If there is a memory leak problem, memleak will output the processes and call stacks that have memory leaks.
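A typical memleak invocation looks roughly like the sketch below; the tool ships with bcc, its install path varies by distribution, and app is a placeholder for your process name:

```bash
# Trace outstanding (not yet freed) allocations of the target process,
# printing a report with call stacks every 5 seconds
/usr/share/bcc/tools/memleak -a -p $(pidof app) 5
```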

Note that I didn’t list all performance tools in this diagram, but only the most essential ones. One reason is that I don’t want to overwhelm you with a long list of tools.

On the other hand, I hope you can focus on the core tools first, master their usage and analysis methods through the examples I provided, and learn from real-world practice in actual environments. After all, by mastering these tools, you can solve most memory problems.

Summary #

In today’s article, I reviewed common memory performance metrics, discussed common memory performance analysis tools, and finally summarized the approach to quickly analyze memory issues.

Although there are many memory performance metrics and tools, once you understand the basic principles of memory management, you will find that they are actually somewhat related. By understanding their relationships, mastering the techniques of memory analysis is not difficult.

Once you have identified the source of the memory issue, the next step is to optimize accordingly. In my opinion, the most important aspect of memory optimization is ensuring that the application’s hot data is kept in memory while minimizing paging and swapping.

Here are some common optimization approaches:

  1. It is best to disable swap. If swap is necessary, reducing the value of swappiness can reduce the tendency to use swap during memory reclamation.

  2. Reduce dynamic memory allocation. For example, you can use memory pools, huge pages, etc.

  3. Use caches and buffers to access data whenever possible. For example, you can use the stack to explicitly declare memory space for data that needs to be cached, or use an external caching component like Redis to optimize data access.

  4. Limit the memory usage of processes using methods like cgroups. This way, you can ensure that system memory will not be exhausted by abnormal processes.

  5. Adjust the oom_score of core applications through /proc/pid/oom_adj. This can ensure that even in memory pressure situations, core applications will not be killed by the OOM (out-of-memory) killer.
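To make a few of these approaches concrete, here is a rough sketch (run as root; <pid> is a placeholder, the myapp cgroup name is arbitrary, and the paths shown are for the cgroup v1 memory controller, which differ under cgroup v2):

```bash
# Approach 1: reduce the tendency to swap during memory reclamation
sysctl -w vm.swappiness=10

# Approach 4: limit a group of processes to 512 MB with cgroups
mkdir /sys/fs/cgroup/memory/myapp
echo 512M  > /sys/fs/cgroup/memory/myapp/memory.limit_in_bytes
echo <pid> > /sys/fs/cgroup/memory/myapp/cgroup.procs

# Approach 5: protect a core process from the OOM killer
# (-17 disables OOM killing; newer kernels prefer /proc/<pid>/oom_score_adj)
echo -17 > /proc/<pid>/oom_adj
```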

Reflections #

Due to space limitations, I have only listed a few of the memory metrics and analysis approaches that I consider most important. I’m sure you have also encountered plenty of memory performance issues, so I’d like to hear about the problems you have dealt with and how you analyzed and resolved them. Along the way, did you run into any pitfalls or gain any important insights?

Feel free to discuss with me in the comments section, and also feel free to share this article with your colleagues and friends. Let’s practice in real-world scenarios and improve through communication.