22 Q&A The Difference Between File System and Hard Disk #

Hello, I’m Ni Pengfei.

With this installment, we have completed the second of the column’s four basic modules: memory performance. I’m delighted that you’re still actively learning and practicing, and enthusiastically leaving comments and joining the discussion.

In these comments, I’m very happy to see that many students have used the thinking process learned from the examples to solve performance problems in their actual work. I also greatly appreciate the active thinking of students such as espzest, DaTianCai, and Smile, who pointed out certain inappropriate or inaccurate parts of the article. Additionally, students like LaiYe, JohnT3e, and BaiHua have actively discussed and shared valuable experience in the comments section about learning and practice. I am very grateful to all of you.

Today is the third installment of performance optimization. As usual, I have selected some typical questions from the comments on the memory module as the content for today’s Q&A, and I will reply to them collectively. In order to facilitate your understanding, they are not arranged strictly in the order of the articles.

For each question, I have attached a screenshot of the comment. If you need to review the original content, you can scan the QR code at the lower right corner of each question.

Question 1: Memory Reclamation and OOM #

This question actually includes four sub-questions, namely:

  • How to understand LRU memory reclamation?

  • Where does the reclaimed memory go?

  • Is OOM scored based on virtual memory or physical memory?

  • How to estimate the minimum memory for an application?

In fact, in the column’s earlier articles “Understanding Memory in Linux” and “Understanding Swap”, I mentioned that once the system detects memory pressure, it reclaims memory in three ways. Let’s review them:

  • Reclaiming cache based on the LRU (Least Recently Used) algorithm.

  • Reclaiming infrequently accessed anonymous pages based on the Swap mechanism.

  • Killing processes that consume a large amount of memory based on the OOM (Out of Memory) mechanism.

The first two ways, cache reclamation and Swap reclamation, are actually based on the LRU algorithm, which prioritizes reclaiming infrequently accessed memory. The LRU reclaim algorithm maintains two doubly linked lists: active and inactive, where:

  • Active records the active memory pages.

  • Inactive records the inactive memory pages.

The closer a memory page is to the tail of the linked list, the less frequently it is accessed. In this way, when reclaiming memory, the system can prioritize reclaiming inactive memory based on its activity level.

Active and inactive memory pages are further divided into file pages and anonymous pages according to their types, corresponding to cache reclamation and Swap reclamation, respectively.

Of course, you can check their sizes from /proc/meminfo. For example:

# grep keeps the metrics that contain "active" (case-insensitive)
# sort sorts them alphabetically
$ grep -i active /proc/meminfo | sort
Active(anon):     167976 kB
Active(file):     971488 kB
Active:          1139464 kB
Inactive(anon):      720 kB
Inactive(file):  2109536 kB
Inactive:        2110256 kB

The third way, the OOM mechanism, ranks processes based on their oom_score. The larger the oom_score, the easier it is for the process to be killed by the system.
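If you want to see these scores for yourself, the kernel exposes them in /proc. Here is a quick sketch, using the current shell’s PID `$$` purely as an example; substitute any PID you are interested in:

```shell
# Read the OOM score of the current shell; a higher value means the
# process is more likely to be chosen by the OOM killer.
cat /proc/$$/oom_score
```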

When the system detects insufficient memory to allocate new memory requests, it will attempt to reclaim memory directly. In this case, if there is enough memory after reclaiming file pages and anonymous pages, the reclaimed memory can be allocated to processes. However, if there is still insufficient memory, OOM will come into play.

When OOM occurs, you can see the “Out of memory” information in dmesg, which tells you which processes have been killed by OOM. For example, you can execute the following command to query the OOM log:

$ dmesg | grep -i "Out of memory"
Out of memory: Kill process 9329 (java) score 321 or sacrifice child

Of course, if you don’t want an application to be killed by OOM, you can lower the process’s oom_score_adj to reduce its oom_score and thus the probability of it being killed. Alternatively, you can enable memory overcommit, allowing processes to request more virtual memory than there is physical memory (on the assumption that they will not actually use it all).
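As a sketch of adjusting oom_score_adj (the sshd example below is only illustrative; raising the value needs no privileges, while writing a negative value requires root):

```shell
# Make the current shell MORE likely to be picked by the OOM killer
# (positive adjustments are allowed without privileges)
echo 500 > /proc/$$/oom_score_adj
cat /proc/$$/oom_score_adj

# Protect a critical process (requires root); -1000 exempts it entirely
# echo -16 > /proc/$(pidof sshd)/oom_score_adj
```

The valid range of oom_score_adj is -1000 to 1000, and it is added to the process’s base score when the OOM killer ranks candidates.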

These are the three ways we have reviewed. Now, let’s go back to the four original questions, and I believe you already have the answers.

  1. The principle of the LRU algorithm has been mentioned above, so it will not be repeated here.

  2. After memory reclamation, the reclaimed memory is put back into the pool of free memory, where new processes can request and use it.

  3. The triggering of OOM is based on virtual memory. In other words, if the sum of a process’s requested virtual memory and the actual memory used by the server is larger than the total physical memory, OOM will be triggered.

  4. To determine the minimum memory for a process or container, the simplest way is to let it run and then check its memory usage with ps or /proc/<pid>/smaps. Note, however, that a process that has just started may not yet be handling real business logic; once it does, its memory consumption will grow. So remember to leave some headroom.
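A minimal sketch of such a check, reading the kernel’s own accounting from /proc (again using the current shell as the example process; smaps_rollup is only available on Linux 4.14 and later):

```shell
# Virtual size (VmSize) and resident set size (VmRSS) of the current shell
grep -E '^Vm(Size|RSS):' /proc/$$/status

# The same accounting summed across all mappings, including Pss
# (requires Linux 4.14 or later)
cat /proc/$$/smaps_rollup
```

`ps -o pid,vsz,rss -p <pid>` reports the same two numbers in a more compact form.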

Question 2: Difference between File System and Disk #

I will explain the principles of the file system and the disk in the next module, as they are closely related to memory. However, when I talked about the principles of Buffer and Cache, I mentioned that Buffer is used for the disk, while Cache is used for files. This has confused many students, such as the two questions raised by JJ:

  • When reading and writing files, are we ultimately reading and writing the disk? How do we distinguish between reading and writing files or reading and writing the disk?

  • Can we read and write the disk without going through the file system?

If you also have the same questions, it is mainly because you haven’t understood the difference between the disk and the file. I briefly replied to this in the comments section of the article “Understanding Buffer and Cache in Memory”, but I’m afraid some students might have missed it, so I will explain it again here.

The disk is a storage device (more precisely, a block device) that can be divided into different disk partitions. On the disk or disk partition, a file system can be created and mounted to a directory in the system. This way, the system can read and write files through this mounted directory.

In other words, the disk is a block device that stores data and acts as a carrier for the file system. Therefore, the file system does need to use the disk to ensure persistent storage of data.

You will see this statement in many places: “In Linux, everything is a file.” In other words, you can access the disk and the file through the same file interface (such as open, read, write, close, etc.).

  • When we say “file,” we usually mean a regular file.

  • The disk or partition refers to a block device file.

You can use `ls -l` to see the difference between them. If you don’t understand the output of ls, don’t forget to check the manual: run `man ls` or `info '(coreutils) ls invocation'` to find the information.
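For instance, `stat` can print the file type directly. Here /etc/passwd and /dev/sda are just examples; the device name on your machine may differ:

```shell
# A regular file
stat -c '%F' /etc/passwd      # prints "regular file"

# A whole disk is a block device file (run on a machine that has /dev/sda)
# stat -c '%F' /dev/sda       # prints "block special file"

# The same distinction appears in the first character of ls -l output:
# '-' for regular files, 'b' for block devices
ls -l /etc/passwd
```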

When reading and writing regular files, the I/O requests first go through the file system, which then interacts with the disk. However, when reading and writing block device files, the file system is skipped, and the interaction with the disk is direct, which is known as “raw I/O.”

These two methods of reading and writing use different caches: the cache managed by the file system is part of the Cache, while the cache used for raw disk access is the Buffer.
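You can watch the two grow separately in /proc/meminfo:

```shell
# Buffers: cache for raw block-device (disk) access
# Cached:  page cache for file reads and writes
grep -E '^(Buffers|Cached):' /proc/meminfo
```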

Don’t worry too much about the principles of the file system, disk, and I/O. We will cover them later on.

Question 3: How to calculate the physical memory usage of all processes #

This is actually a post-exercise question from the article “How to Understand Buffer and Cache in Memory”. A few students, such as Anonymous, Griffin, and JohnT3e, provided some ideas.

For example, Anonymous’s method is to add up the RSS of all processes:

This method actually double-counts memory. RSS is resident memory and includes the shared memory a process uses, so adding the RSS of all processes together counts shared memory multiple times and cannot give an accurate answer.

Several students’ answers in the comments have a similar problem. You can re-check your own method against the definition and principle of each metric to avoid double-counting.

Of course, some students had the right idea. As JohnT3e mentioned, the key to this question is understanding the meaning of PSS.

You can find the answer on Stack Exchange through the link. However, I still recommend reading the documentation of the proc filesystem directly:

The “proportional set size” (PSS) of a process is the count of pages it has in memory, where each page is divided by the number of processes sharing it. So if a process has 1000 pages all to itself, and 1000 shared with one other process, its PSS will be 1500.

Here’s a simple explanation: the PSS of each process is the result of evenly distributing shared memory among all processes, plus the size of the process’s own non-shared memory.

Take the example mentioned in the documentation: one process has 1000 pages of non-shared memory, and it shares another 1000 pages with another process. Then its PSS will be 1000/2 + 1000 = 1500 pages.
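You can verify the arithmetic right in the shell:

```shell
# 1000 private pages, plus 1000 shared pages split between 2 processes
echo $(( 1000 + 1000 / 2 ))   # prints 1500
```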

In this way, you can directly add up the PSS without worrying about duplicate calculations of shared memory.

For example, you can run the following command to calculate it:

# Use grep to select the Pss lines, then sum them up with awk
# (the ':Pss:' pattern avoids also matching Pss_Anon/Pss_File/SwapPss on newer kernels)
$ grep ':Pss:' /proc/[1-9]*/smaps | awk '{total+=$2}; END {printf "%d kB\n", total }'
391266 kB

Question 4: How to install bcc-tools on CentOS #

Many students mentioned that they are using CentOS system. Although I provided a reference document in the article, the installation of bcc-tools can still be a bit difficult.

For example, Baihua mentioned that the tutorials online are not very complete and the steps are a bit confusing:

However, Baihua and Duduniao_linux shared their experiences after exploring and practicing. Thank you both for sharing.

Here, I will provide a unified reply on how to install bcc-tools in CentOS. Taking CentOS 7 as an example, the installation can be divided into two steps.

Step 1: Upgrade the kernel. You can run the following commands to perform the operation:

# Update the system
yum update -y

# Install ELRepo
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
rpm -Uvh https://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm

# Install the new kernel
yum remove -y kernel-headers kernel-tools kernel-tools-libs
yum --enablerepo="elrepo-kernel" install -y kernel-ml kernel-ml-devel kernel-ml-headers kernel-ml-tools kernel-ml-tools-libs kernel-ml-tools-libs-devel

# Update Grub and restart
grub2-mkconfig -o /boot/grub2/grub.cfg
grub2-set-default 0
reboot

# After the restart, confirm that the kernel version has been upgraded to 4.20.0-1.el7.elrepo.x86_64
uname -r

Step 2: Install bcc-tools:

# Install bcc-tools
yum install -y bcc-tools

# Configure the PATH
export PATH=$PATH:/usr/share/bcc/tools

# Verify the successful installation
cachestat

Question 5: Optimization methods for memory leaks #

This is a question I posed in “I Have a Memory Leak, How Do I Locate and Handle It”. The question goes like this:

At the end of the memory leak case, we fixed the memory leak problem by adding a free() call to release the memory allocated by the fibonacci() function. In the case of this example, are there any other better ways to fix it?

Many students left comments with their own thoughts, which were all very good. Here, I want to give special praise to Guo Jiangwei, whose method is very good:

His idea is to use an array to hold the intermediate results instead of allocating memory dynamically. Because the array lives on the stack, the system manages the memory automatically, and there is no leak.

This approach of reducing dynamic memory allocation is not only a solution for memory leaks, but also a common memory optimization method. For example, in scenarios that require a large amount of memory, you can consider using stack memory, memory pools, HugePages, and other methods to optimize memory allocation and management.

In addition to these five questions, there is one more point I want to mention. Many students mentioned the issue of tool versions. Indeed, the Linux versions in production environments are often lower, which means many new tools cannot be directly used in production.

However, this does not mean we are powerless. After all, the underlying principles of these systems are largely the same. This is a view I have always emphasized:

  • When learning, it is best to use the latest systems and tools. They can provide you with simpler and more intuitive results, helping you better understand the principles of the system.

  • After you have mastered these principles, you can then go back and understand the tools and principles in older versions of the system. You will find that even though many tools in older versions may not be as user-friendly, the principles and metrics are similar, and you can still easily learn how to use them.

Finally, feel free to continue writing your questions in the comments, and I will continue to answer them. My goal remains the same, hoping to turn the knowledge in the article into your ability. We not only practice in real-world scenarios, but also progress through communication.