
23 Fundamentals: How the Linux File System Works #

Hello, I’m Ni Pengfei.

After the previous lessons on CPU and memory, I believe you have already mastered the performance analysis and optimization strategies for those two areas. Starting with this section, we will move on to the next important module: file system and disk I/O performance.

Like CPU and memory management, disk and file system management is also a core function of the operating system.

  • Disks provide the most basic form of persistent storage for systems.

  • On top of disks, file systems provide a hierarchical structure for managing files.

So, how do disks and file systems work? And what metrics can be used to measure their performance?

Today, I will show you the working principles of the Linux file system. We will learn about disk operation principles in the next section.

Index Nodes and Directory Entries #

A file system is a mechanism for organizing and managing files on a storage device. Different organization methods result in different file systems.

You must remember the most important point: in Linux, everything is a file. Not only ordinary files and directories, but also block devices, sockets, pipes, etc., are all managed through a unified file system.

For convenience of management, Linux file systems allocate two data structures for each file: the index node (inode) and the directory entry. They are mainly used to record file metadata and directory structure.

  • The index node, or inode for short, records file metadata such as the inode number, file size, access permissions, modification time, and the location of the data on disk. Inodes and files are in one-to-one correspondence, and like file contents, inodes are persistently stored on disk. So remember: index nodes also occupy disk space.

  • The directory entry, or dentry for short, records the file name, a pointer to the index node, and its relationships with other directory entries. Multiple associated directory entries form the directory structure of the file system. Unlike the index node, however, the directory entry is an in-memory data structure maintained by the kernel, so it is usually called the directory entry cache.

In other words, the index node is the unique identifier of each file, while directory entries maintain the tree structure of the file system. The relationship between directory entries and index nodes is many-to-one; you can simply think of it as a file being allowed to have multiple aliases.

For example, the aliases created for a file through hard links correspond to different directory entries, but these directory entries ultimately link to the same file, so they share the same index node.
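
To see this in practice, here is a minimal C sketch using the standard POSIX stat() call. If you create a hard link with ln and pass both paths to the program, they report the same inode number (the file names used are just placeholders):

#include <stdio.h>
#include <sys/stat.h>

/* Print the inode number and hard-link count of each path given on the
 * command line. Two hard-linked names (e.g. created with "ln a b") are
 * different directory entries pointing at the same index node, so they
 * print the same inode number. */
int main(int argc, char *argv[]) {
    struct stat st;
    for (int i = 1; i < argc; i++) {
        if (stat(argv[i], &st) == 0)
            printf("%s: inode=%lu links=%lu\n", argv[i],
                   (unsigned long)st.st_ino, (unsigned long)st.st_nlink);
    }
    return 0;
}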

Index nodes and directory entries record file metadata and directory relationships. So specifically, how is file data stored? Can it be directly written to the disk?

In reality, the minimum unit of disk reads and writes is the sector, but a sector is only 512 bytes. Reading and writing such a small unit each time would be very inefficient. Therefore, the file system groups consecutive sectors into logical blocks and uses the logical block as the minimum unit for managing data. A common logical block size is 4 KB, that is, 8 consecutive sectors.

To help you understand the relationship between directory entries, index nodes, and file data, I have drawn an illustrative diagram. You can refer to this diagram to review the content just discussed and connect the knowledge and details together.

File System Structure

However, there are two points here that you need to pay attention to.

First, the directory entry itself is an in-memory cache, while the index node is data stored on disk. In the earlier explanation of the Buffer and Cache principles, I mentioned that to bridge the performance gap between slow disks and fast CPUs, file contents are cached in the page cache.

Therefore, you should realize that these index nodes will also naturally be cached in memory to accelerate file access.

Second, when a disk is formatted with a file system, it is divided into three storage areas: the superblock, the index node area, and the data block area. Among them:

  • The superblock stores the state of the entire file system.

  • The index node area is used to store index nodes.

  • The data block area is used to store file data.
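
A few of these superblock-level figures can be seen from user space with the statvfs() call, for example the logical block size and the counts of free data blocks and free inodes. Here is a minimal sketch that queries the file system containing the root directory:

#include <stdio.h>
#include <sys/statvfs.h>

/* Query the file system that contains "/" and print a few figures that
 * come from its superblock: the block size, data block counts, and
 * inode counts. */
int main(void) {
    struct statvfs vfs;
    if (statvfs("/", &vfs) != 0) return 1;

    printf("block size  : %lu bytes\n", vfs.f_bsize);
    printf("data blocks : %llu total, %llu free\n",
           (unsigned long long)vfs.f_blocks, (unsigned long long)vfs.f_bfree);
    printf("index nodes : %llu total, %llu free\n",
           (unsigned long long)vfs.f_files, (unsigned long long)vfs.f_ffree);
    return 0;
}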

Virtual File System #

Directory entries, index nodes, logical blocks, and the superblock make up the four basic elements of the Linux file system. However, to support many different file systems, the Linux kernel introduces an abstraction layer between user processes and file systems, called the Virtual File System (VFS).

VFS defines a set of data structures and standard interfaces that are supported by all file systems. This way, user processes and other subsystems in the kernel only need to interact with the unified interface provided by VFS, without having to worry about the implementation details of underlying file systems.

Here, I have drawn an architecture diagram of the Linux file system to help you better understand the relationship between system calls, VFS, cache, file systems, and block storage.

Linux File System Architecture

From this diagram, you can see that below VFS, Linux supports various file systems such as Ext4, XFS, NFS, and so on. These file systems can be categorized into three types based on their storage locations.

  • The first type is disk-based file systems, which store data directly on locally mounted disks. Common examples of this type of file systems include Ext4, XFS, OverlayFS, etc.

  • The second type is memory-based file systems, also known as virtual file systems. These file systems do not need to allocate any disk space; instead, they occupy memory. The commonly used /proc file system is a typical virtual file system. The /sys file system also belongs to this type; it primarily exports hierarchical kernel objects to user space.

  • The third type is network file systems, which are used to access data on other computers, such as NFS, SMB, iSCSI, etc.

These file systems must be mounted to a subdirectory of the VFS directory tree (called a mount point) before their files can be accessed. Taking disk-based file systems as an example, during system installation a root file system (/) is mounted first, and then the other file systems (such as other disk partitions, the /proc file system, the /sys file system, NFS, and so on) are mounted under the root directory.
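
Mounting itself is just a system call. As a small illustration, the sketch below mounts a memory-based tmpfs at a hypothetical mount point /mnt/demo (assumed to already exist) and then unmounts it; it needs root privileges to run:

#include <stdio.h>
#include <sys/mount.h>

int main(void) {
    /* Mount a 16 MB tmpfs (a memory-based file system) at /mnt/demo */
    if (mount("tmpfs", "/mnt/demo", "tmpfs", 0, "size=16m") != 0) {
        perror("mount");
        return 1;
    }

    /* ... files created under /mnt/demo now live in memory ... */

    if (umount("/mnt/demo") != 0) {
        perror("umount");
        return 1;
    }
    return 0;
}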

File System I/O #

After a file system is mounted to a mount point, you can access the files it manages through that mount point. VFS (Virtual File System) provides a set of standard file access interfaces, which are exposed to applications as system calls.

Take the cat command as an example. It first calls open() to open the file, then calls read() to read the file's contents, and finally calls write() to output the contents to the console's standard output:

int open(const char *pathname, int flags, mode_t mode);  // open a file and return a file descriptor
ssize_t read(int fd, void *buf, size_t count);            // read up to count bytes into buf
ssize_t write(int fd, const void *buf, size_t count);     // write count bytes from buf
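
To make the flow concrete, here is a minimal, stripped-down cat in C that chains these three calls together (error handling is kept to a bare minimum):

#include <fcntl.h>
#include <unistd.h>

/* A stripped-down cat: open the file, read it in chunks, and write each
 * chunk to standard output. */
int main(int argc, char *argv[]) {
    if (argc < 2) return 1;

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) return 1;

    char buf[4096];
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        write(STDOUT_FILENO, buf, n);

    close(fd);
    return 0;
}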

Files can be read and written in different ways, which leads to several classifications of I/O. The most common ones are buffered and unbuffered I/O, direct and non-direct I/O, blocking and non-blocking I/O, and synchronous and asynchronous I/O. Let's take a closer look at these four classifications.

Firstly, based on whether or not the standard library cache is used, file I/O can be divided into buffered I/O and unbuffered I/O.

  • Buffered I/O means accessing the file through the standard library's cache, and the standard library in turn accesses the file through system calls.

  • Unbuffered I/O refers to directly accessing the file through system calls without going through the standard library cache.

Note that the "buffer" here refers to the cache implemented inside the standard library. For example, you may have seen programs that only produce output when a newline is encountered: the content before the newline is temporarily held in the standard library's buffer.
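
The following small sketch contrasts the two: printf() goes through the stdio buffer (buffered I/O), while write() is a direct system call (unbuffered I/O). When stdout is a terminal, stdio is line-buffered, so the text without a trailing newline usually stays in the user-space buffer until it is flushed:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    /* Buffered I/O: stdio keeps this text in a user-space buffer first */
    printf("buffered: no newline yet");
    sleep(2);                  /* on a terminal, nothing appears during this pause */
    fflush(stdout);            /* force the stdio buffer out via the write() system call */

    /* Unbuffered I/O: write() goes straight to the kernel, no stdio buffer */
    const char *msg = "\nunbuffered: written directly with write()\n";
    write(STDOUT_FILENO, msg, strlen(msg));
    return 0;
}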

Whether it is buffered I/O or unbuffered I/O, they ultimately still need to access the file through system calls. And as we learned in the previous section, after the system call, page caching is used to reduce disk I/O operations.

Secondly, based on whether or not the operating system’s page cache is used, file I/O can be divided into direct I/O and non-direct I/O.

  • Direct I/O refers to bypassing the operating system’s page cache and directly interacting with the file system to access the file.

  • Non-direct I/O, on the other hand, means that reads and writes first go through the system's page cache, and the kernel or additional system calls then actually write the data to disk.

To implement direct I/O, you need to specify the O_DIRECT flag in the system call. If it is not set, non-direct I/O is the default.
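
For illustration, here is a minimal sketch of direct I/O with a hypothetical file path. Note that O_DIRECT usually requires the buffer address, file offset, and transfer size to be aligned to the logical block size:

#define _GNU_SOURCE            /* O_DIRECT is a Linux-specific flag */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("/tmp/direct-io-demo", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) return 1;

    /* Allocate a 4 KB buffer aligned to 4 KB, as required by O_DIRECT */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) return 1;
    memset(buf, 'a', 4096);

    /* This write bypasses the page cache and goes to the file system directly */
    ssize_t n = write(fd, buf, 4096);
    (void)n;

    free(buf);
    close(fd);
    return 0;
}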

However, it is worth noting that both direct and non-direct I/O still interact with the file system. In scenarios such as databases, you will also see cases where the file system is bypassed and the disk is read or written directly, which is commonly referred to as raw I/O.

Thirdly, based on whether or not the application blocks its own execution, file I/O can be divided into blocking I/O and non-blocking I/O:

  • Blocking I/O refers to the situation where the application program blocks the current thread if it does not receive a response after performing an I/O operation, and thus cannot execute other tasks.

  • Non-blocking I/O refers to the situation where the application program does not block the current thread after performing an I/O operation, and can continue to execute other tasks. The result of the call is then obtained through polling or event notification.

For example, when accessing a pipe or network socket, setting the O_NONBLOCK flag indicates non-blocking access. If no settings are made, the default is blocking access.
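
As a concrete sketch, the following program puts the read end of a pipe into non-blocking mode with fcntl(). Since nothing has been written to the pipe yet, read() returns immediately with EAGAIN instead of blocking:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fds[2];
    if (pipe(fds) != 0) return 1;

    /* Switch the read end of the pipe to non-blocking mode */
    int flags = fcntl(fds[0], F_GETFL, 0);
    fcntl(fds[0], F_SETFL, flags | O_NONBLOCK);

    char buf[64];
    ssize_t n = read(fds[0], buf, sizeof(buf));
    if (n < 0 && errno == EAGAIN)
        /* Nothing to read yet: instead of blocking, read() returns at once,
         * so the caller can do other work and retry via polling or events */
        printf("no data yet, will retry later\n");

    close(fds[0]);
    close(fds[1]);
    return 0;
}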

Fourthly, based on whether or not to wait for the response result, file I/O can be divided into synchronous and asynchronous I/O:

  • Synchronous I/O means that the application program has to wait until the entire I/O operation is completed before it can obtain the I/O response.

  • Asynchronous I/O means that the application program does not need to wait for the completion or response of the I/O operation, but can continue to execute. The response will be notified to the application program through event notification when the I/O is completed.

For example, when operating on a file, setting the O_SYNC or O_DSYNC flag means synchronous I/O. With O_DSYNC, the call only returns after the file data has been written to disk; O_SYNC builds on O_DSYNC and additionally requires the file metadata to be written to disk before returning.
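
Here is a minimal sketch of a synchronous write using O_SYNC with a hypothetical file path; with this flag, write() only returns after both the data and the related metadata have reached the disk:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    /* O_SYNC: every write waits for data and metadata to reach the disk
     * (O_DSYNC would wait for the data only) */
    int fd = open("/tmp/sync-io-demo", O_WRONLY | O_CREAT | O_SYNC, 0644);
    if (fd < 0) return 1;

    const char *msg = "durable write\n";
    ssize_t n = write(fd, msg, strlen(msg));
    (void)n;

    close(fd);
    return 0;
}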

Another example is when accessing a pipe or network socket, if the O_ASYNC option is set, the corresponding I/O is asynchronous I/O. In this case, the kernel will notify the process whether the file is readable or writable through SIGIO or SIGPOLL.

You may have noticed that many of these concepts frequently appear in network programming as well. For example, non-blocking I/O is usually used with select or poll for I/O operations on network sockets.

By now, you should also appreciate the deeper meaning of "everything is a file" in Linux: whether it is a regular file, a block device, a network socket, or a pipe, it is accessed through the unified VFS interface.

Performance Observation #

After learning so many principles of file systems, you must be eager to get started and observe the performance of file systems.

Next, open a terminal, SSH into the server, and let me show you how to observe the performance of file systems.

Capacity #

For file systems, one of the most common issues is insufficient space. Of course, you may already know that you can use the df command to check the disk space usage of a file system. For example:

$ df /dev/sda1
Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/sda1       30308240 3167020  27124836  11% /

You can see that my root file system is only using 11% of the space. Here, it is worth noting that the total space is represented by the number of 1K-blocks. You can add the -h option to df to get a more readable format:

$ df -h /dev/sda1
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        29G  3.1G   26G  11% /

However, sometimes you may encounter an insufficient space problem, but when you check the disk space with df, you find that there is still a lot of available space. What’s the reason?

Do you remember the detail I mentioned earlier? In addition to file data, index nodes also occupy disk space. You can add the -i parameter to the df command to check inode usage, as shown below:

$ df -i /dev/sda1
Filesystem      Inodes  IUsed   IFree IUse% Mounted on
/dev/sda1      3870720 157460 3713260    5% /

The number of index nodes (that is, the inode capacity) is set when the disk is formatted and is usually generated automatically by the formatting tool. So when you find that inodes are running out while there is still plenty of disk space, the cause is most likely too many small files.

Therefore, generally speaking, deleting these small files, or moving them to another disk that has enough index nodes, usually solves the problem.

Cache #

In the previous Cache example, I have already explained that you can use free or vmstat to observe the size of page cache. As a review, the Cache output from free represents the sum of page cache and reclaimable Slab cache. You can directly get their sizes from /proc/meminfo, for example:

$ cat /proc/meminfo | grep -E "SReclaimable|Cached"
Cached:           748316 kB
SwapCached:            0 kB
SReclaimable:     179508 kB

Now, let’s talk about how to observe the directory entry and index node cache in the file system.

In fact, the kernel uses the Slab mechanism to manage the cache of directory entries and index nodes. /proc/meminfo only provides the overall size of Slab caches, and to see the specific information for each Slab cache, you can refer to the /proc/slabinfo file.

For example, running the following command will give you the cache status of all directory entries and various file system index nodes:

$ cat /proc/slabinfo | grep -E '^#|dentry|inode'
# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
xfs_inode              0      0    960   17    4 : tunables    0    0    0 : slabdata      0      0      0
...
ext4_inode_cache   32104  34590   1088   15    4 : tunables    0    0    0 : slabdata   2306   2306      0
hugetlbfs_inode_cache     13     13    624   13    2 : tunables    0    0    0 : slabdata      1      1      0
sock_inode_cache    1190   1242    704   23    4 : tunables    0    0    0 : slabdata     54     54      0
shmem_inode_cache   1622   2139    712   23    4 : tunables    0    0    0 : slabdata     93     93      0
proc_inode_cache    3560   4080    680   12    2 : tunables    0    0    0 : slabdata    340    340      0
inode_cache        25172  25818    608   13    2 : tunables    0    0    0 : slabdata   1986   1986      0
dentry             76050 121296    192   21    1 : tunables    0    0    0 : slabdata   5776   5776      0

In this output, the dentry line represents the cache of directory entries, and the inode_cache line represents the cache of VFS index nodes. The rest are cache entries for various file systems.

/proc/slabinfo has many columns; you can run man slabinfo for their exact meanings. In actual performance analysis, we more often use slabtop to find the cache types that consume the most memory.

For example, the following is the result I obtained by running slabtop:

# Press 'c' to sort by cache size, press 'a' to sort by number of active objects
$ slabtop
Active / Total Objects (% used)    : 277970 / 358914 (77.4%)
Active / Total Slabs (% used)      : 12414 / 12414 (100.0%)
Active / Total Caches (% used)     : 83 / 135 (61.5%)
Active / Total Size (% used)       : 57816.88K / 73307.70K (78.9%)
Minimum / Average / Maximum Object : 0.01K / 0.20K / 22.88K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
69804  23094   0%    0.19K   3324       21     13296K dentry
16380  15854   0%    0.59K   1260       13     10080K inode_cache
58260  55397   0%    0.13K   1942       30      7768K kernfs_node_cache
   485    413   0%    5.69K     97        5      3104K task_struct
  1472   1397   0%    2.00K     92       16      2944K kmalloc-2048

From this result, you can see that in my system, directory entries and index nodes occupy the largest Slab cache. However, they actually occupy a relatively small amount of memory, totaling only about 23MB.

Summary #

Today, I have described the working principles of the Linux file system.

A file system is a mechanism for organizing and managing files on storage devices. To support many different types of file systems, Linux abstracts a Virtual File System (VFS) layer on top of the various file system implementations.

VFS defines a set of data structures and standard interfaces that are supported by all file systems. This way, user processes and other subsystems in the kernel only need to interact with the unified interface provided by VFS.

To reduce the impact of slow disks, the file system uses the page cache, the directory entry cache, and the inode cache to cushion applications from disk latency.

In terms of performance monitoring, today we mainly discussed capacity and caching metrics. In the next section, we will learn about the working principles of Linux disk I/O and master the performance monitoring methods for disk I/O.

Reflection #

Finally, I have a question for you to ponder. In practical work, we often search for the path of a file based on its name, like this:

$ find / -name file-name

The question for today is, does this command increase the system cache? If so, what type of cache is affected? You can try it out and analyze the results yourself, and see if they match your analysis.

Feel free to discuss with me in the comments, and feel free to share this article with your colleagues and friends. Let’s practice and improve together through communication.