24 Fundamentals How Linux Disk IO Works Part One #

Hello, I’m Ni Pengfei.

In the previous section, we learned about the working principles of the Linux file system. Let’s do a quick review. A file system is a mechanism for organizing and managing files on a storage device. On top of the various file system implementations in Linux, an additional abstraction layer, the Virtual File System (VFS), is introduced. It defines a set of data structures and standard interfaces that all file systems support.

This way, an application only needs to interact with the unified interface provided by VFS, without worrying about the specific file system implementation underneath. Conversely, any file system that conforms to the VFS standard can seamlessly serve all kinds of applications.

Internally, VFS manages files through data structures such as directory entries, index nodes, logical blocks, and superblocks.

  • Directory entries record the name of a file and the directory relationship between it and other directory entries.

  • Index nodes record the metadata of a file.

  • Logical blocks are the smallest read/write units composed of consecutive disk sectors and are used to store file data.

  • Superblocks are used to record the overall status of the file system, such as the usage of index nodes and logical blocks.

Among them, directory entries are an in-memory cache, while superblocks, index nodes, and logical blocks are persistent data stored on the disk.
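The inode and superblock metadata described above can be inspected directly from the shell. A minimal sketch, using only paths that exist on any Linux system:

```shell
# Show a file's index node (inode) metadata: inode number, size,
# block usage, and timestamps.
stat /

# A superblock-level view: inode usage (total, used, free) for the
# file system that "/" lives on.
df -i /
```

`stat` reads the inode, while `df -i` summarizes the inode accounting the file system keeps in its superblock.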

So, let’s delve further into how the disk works. What are the performance metrics that can be used to measure its performance?

Next, I will show you the working principles of Linux disk I/O.

Disk #

A disk is a device that can persistently store data. According to the different storage media, common disks can be divided into two categories: mechanical disks and solid-state disks.

The first category, the mechanical disk, is also known as a hard disk drive (HDD). A mechanical disk consists mainly of platters and read/write heads, and data is stored in circular tracks on the platters. Before reading or writing data, the heads must first be moved to the track where the data is located.

Obviously, if the I/O requests are continuous, there is no need for track seeking, and optimal performance can be achieved. This is what we are familiar with as the working principle of sequential I/O. On the other hand, random I/O needs to constantly move the heads to locate the data, resulting in slower read/write speeds.

The second category, solid-state disk (SSD), is composed of solid-state electronic components. Since solid-state disks do not require track seeking, they have much better performance in both sequential and random I/O compared to mechanical disks.

In fact, on both mechanical disks and solid-state disks, random I/O is much slower than sequential I/O, for the following reasons:

  • For mechanical disks, as mentioned earlier, random I/O requires more head seeking and disk rotation, which naturally results in slower performance than sequential I/O.

  • For solid-state disks, although their random performance is much better than that of mechanical hard disks, they still have the limitation of “erase before write”. Random read/write operations will cause a large amount of garbage collection, so the performance of random I/O is still much worse compared to sequential I/O.

  • In addition, sequential I/O can reduce the number of I/O requests through prefetching, which is also one of the reasons for its excellent performance. Many performance optimization schemes are also based on this aspect to optimize I/O performance.
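The sequential-versus-random gap described above can be measured directly. As an illustration, here is a fio job file that compares the two access patterns on the same file (this assumes fio is installed; the file path, size, and runtime are hypothetical values chosen for the sketch):

```ini
; seq-vs-rand.fio -- compare sequential and random reads on one file
[global]
filename=/tmp/fio.test
size=256M
bs=4k
direct=1          ; bypass the page cache so the disk itself is measured
runtime=30

[seq-read]
rw=read           ; sequential reads

[rand-read]
stonewall         ; wait for seq-read to finish before starting
rw=randread       ; random reads
```

Run with `fio seq-vs-rand.fio`. On a mechanical disk the rand-read job typically reports only a small fraction of the sequential throughput; on an SSD the gap is narrower but usually still visible.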

Furthermore, both mechanical disks and solid-state disks have a minimum unit for reading and writing data:

  • The minimum unit for mechanical disks is a sector, typically with a size of 512 bytes.

  • The minimum unit for solid-state disks is a page, usually with a size of 4KB, 8KB, etc.

As mentioned in the previous section, if we read and write such small units as 512 bytes each time, the efficiency would be very low. Therefore, the file system combines contiguous sectors or pages into logical blocks, and manages data based on these logical blocks as the minimum unit. The common size of logical blocks is 4KB, which means that eight contiguous sectors or a single page can form a logical block.
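Both sizes can be queried on a running system. A small sketch using coreutils `stat` (the 4096-byte value is typical, not guaranteed):

```shell
# Logical block size of the file system holding "/" -- commonly 4096.
stat -f -c 'fs block size: %s bytes' /

# stat reports per-file disk usage in 512-byte units, a convention
# inherited from the traditional sector size.
stat -c 'allocated: %b units of %B bytes' /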

In addition to classifying disks based on storage media, another common classification method is based on interfaces, such as IDE (Integrated Drive Electronics), SCSI (Small Computer System Interface), SAS (Serial Attached SCSI), SATA (Serial ATA), FC (Fibre Channel), etc.

Different interfaces often have different device names assigned to them. For example, IDE devices are assigned device names with the “hd” prefix, SCSI and SATA devices are assigned device names with the “sd” prefix. If there are multiple disks of the same type, they will be numbered in alphabetical order, such as a, b, c, etc.

In addition to the classification of disks themselves, when you connect disks to a server and use them in different ways, they can be divided into various architectures.

The simplest architecture is to use disks as independent disk devices. These disks are often divided into different logical partitions according to needs, and each partition is then assigned a number. For example, the /dev/sda that we have used multiple times before can also be divided into two partitions, /dev/sda1 and /dev/sda2.
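This naming scheme is easy to see on a live system. A quick sketch (the device names shown depend entirely on the hardware present):

```shell
# Partitions known to the kernel; the "name" column uses the
# prefix-plus-letter-plus-number scheme, e.g. sda, sda1, sda2.
cat /proc/partitions
```

Where available, `lsblk` presents the same information as a tree of disks and their partitions.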

Another commonly used architecture is to combine multiple disks into a logical disk to form a Redundant Array of Independent Disks (RAID), which can improve data access performance and enhance data storage reliability.

Depending on the capacity, performance, and reliability requirements, RAID can generally be divided into multiple levels, such as RAID0, RAID1, RAID5, RAID10, etc.

  • RAID0 has the best read/write performance but does not provide data redundancy.

  • Other levels of RAID provide data redundancy and also optimize read/write performance to a certain extent.

The last architecture is to combine these disks into a network storage cluster and expose them to servers using network storage protocols such as NFS, SMB, iSCSI, etc.

In Linux, disks are actually managed as block devices, which means that data is read and written in blocks and supports random access. Each block device is assigned two device numbers: a major device number and a minor device number. The major device number is used in the driver program to distinguish device types, while the minor device number is used to number multiple devices of the same type.
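The two device numbers are visible with `ls -l`, which prints them in place of the file size. A sketch using `/dev/null` (present on every Linux system; the disk in the comment is hypothetical):

```shell
# A leading 'c' marks a character device, 'b' a block device; the
# "1, 3" pair is the major and minor device number.
ls -l /dev/null

# On a machine with a SATA/SCSI disk (hypothetical device name):
# ls -l /dev/sda    -> brw-rw---- 1 root disk 8, 0 ... /dev/sda
```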

Generic Block Layer #

Similar to the Virtual File System (VFS) we discussed in the previous section, Linux uses a unified generic block layer to manage various block devices in order to reduce the impact of differences between them.

The generic block layer is an abstraction layer between the file system and disk drivers. It has two main functionalities.

  • The first functionality is similar to that of the VFS. It provides a standard interface for access to block devices for file systems and applications. It abstracts different heterogeneous disk devices into a unified block device and provides a unified framework for managing the drivers for these devices.

  • The second functionality is to queue the I/O requests received from the file system and applications and improve disk read and write efficiency through processes such as reordering and request merging.

This process of reordering I/O requests is what we call I/O scheduling. In fact, the Linux kernel supports four I/O scheduling algorithms: NONE, NOOP, CFQ, and Deadline. Let me introduce them one by one.

The first one, NONE, cannot be considered an I/O scheduling algorithm per se. This is because it does not use any I/O scheduler and does not handle I/O from file systems and applications. It is commonly used in virtual machines where disk I/O scheduling is handled by the physical machine.

The second one, NOOP, is the simplest I/O scheduling algorithm. It is actually a first-in, first-out queue with some basic request merging. It is often used for SSD disks.

The third one, CFQ (Completely Fair Queueing), is the default I/O scheduler in many Linux distributions. It maintains an I/O scheduling queue for each process and distributes I/O requests across processes fairly, based on time slices.

Similar to CPU scheduling for processes, CFQ also supports priority scheduling for process I/O. Therefore, it is suitable for systems with a large number of processes, such as desktop environments and multimedia applications.

The last one, the Deadline scheduling algorithm, creates separate I/O queues for read and write requests. It improves the throughput of mechanical disks while guaranteeing that requests nearing their deadline are served first, so that no request starves. The Deadline algorithm is often used in scenarios with heavy I/O pressure, such as databases.
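The scheduler in use can be read, and changed, through sysfs. A sketch (device names vary from machine to machine, and writing requires root):

```shell
# The bracketed entry is the active scheduler, e.g. "noop [cfq] deadline".
for f in /sys/block/*/queue/scheduler; do
    [ -e "$f" ] || continue      # skip if no block devices are visible
    printf '%s: ' "$f"
    cat "$f"
done

# Switching at run time (root required; sdb is a hypothetical disk):
# echo deadline > /sys/block/sdb/queue/scheduler
```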

I/O Stack #

After understanding the principles of disks and the generic block layer, combined with what we learned last time about file system principles, we can now take a holistic view of the I/O principles of the Linux storage system.

We can divide the I/O stack of the Linux storage system into three levels, from top to bottom: the file system layer, the generic block layer, and the device layer. The relationship between these three I/O layers is shown in the following diagram, which is actually a panoramic view of the Linux storage system’s I/O stack.

Linux Storage Stack Diagram

Based on this panoramic view of the I/O stack, we can have a clearer understanding of the working principles of the storage system’s I/O.

  • The file system layer includes the virtual file system and the specific implementations of various file systems. It provides standard file access interfaces to the upper-layer applications; and to the lower layer, it stores and manages disk data through the generic block layer.

  • The generic block layer includes the block device I/O queue and the I/O scheduler. It queues the I/O requests from the file system, reorders and merges them, and then sends them to the next level, the device layer.

  • The device layer includes the storage devices and their corresponding drivers, responsible for the final physical device I/O operations.

The I/O of a storage system is usually the slowest component of the entire system. Therefore, Linux optimizes I/O efficiency through various caching mechanisms.

For example, to optimize file access performance, Linux uses multiple caching mechanisms such as page cache, inode cache, and directory entry cache to reduce direct calls to the underlying block devices.

Similarly, to optimize the access efficiency of block devices, Linux uses buffers to cache the data of block devices.
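These two caches show up as separate lines in `/proc/meminfo`. A quick sketch:

```shell
# "Buffers" caches raw block-device data; "Cached" is the page cache
# holding file contents (it also counts tmpfs and shared memory).
grep -E '^(Buffers|Cached):' /proc/meminfo
```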

Now that we have covered the abstract principles, how should we measure the I/O performance of a disk in practice? I’ll leave that as a teaser: in the next lesson, we will explore the most commonly used disk I/O performance metrics and tools together.

Summary #

In today’s article, we have outlined the working principle of Linux disk I/O and understood the Linux storage system I/O stack consisting of the file system layer, the generic block layer, and the device layer.

The generic block layer is the core of Linux disk I/O. On the upper side, it provides a standard interface for the file system and applications to access block devices. On the lower side, it abstracts various heterogeneous disk devices into a unified block device. It also reorders and merges I/O requests from file systems and applications to improve disk access efficiency.

Reflection #

Finally, I would like to invite you to discuss your understanding of disk I/O. I believe you may have encountered issues with the performance of file or disk I/O. How did you analyze these problems? You can combine today’s principles of disk I/O and the previous section’s principles of file systems, record your steps, and summarize your thought process.

Feel free to discuss with me in the comments section, and also feel free to share this article with your colleagues and friends. Let’s practice in real scenarios and improve through communication.