
31 How-tos: Several Thoughts on Disk I/O Performance Optimization #

Hello, I’m Ni Pengfei.

In the previous section, we reviewed common file systems and disk I/O performance metrics, organized the core I/O performance observation tools, and summarized the approach to quickly analyze I/O performance issues.

Although there are many performance metrics for I/O and several corresponding analysis tools, once you understand the meaning of various metrics, you will find that they are actually related to each other.

By understanding these relationships, you will discover that mastering these commonly used bottleneck analysis approaches is not difficult.

After identifying an I/O performance bottleneck, the next step is optimization, that is, completing I/O operations as quickly as possible or, better still, reducing or even avoiding disk I/O altogether.

Today, I will talk about the approaches and considerations for optimizing I/O performance issues.

I/O Benchmark Testing #

As is my usual practice, before optimization, I would first ask myself, what are the goals of I/O performance optimization? In other words, what levels of I/O performance indicators (such as IOPS, throughput, latency, etc.) are considered appropriate?

In fact, the specific standards for I/O performance indicators may vary for each person, as our application scenarios, file systems, and physical disks may differ.

In order to objectively and reasonably evaluate the optimization effect, we should first conduct benchmark testing on the disk and file system to obtain the maximum performance of the file system or disk I/O.

fio (Flexible I/O Tester) is the most commonly used benchmark testing tool for file systems and disk I/O. It provides a wide range of customizable options that can be used to test the I/O performance of raw disks or file systems in various scenarios, including different block sizes, I/O engines, and whether or not to use caches.

Installing fio is relatively simple. You can execute the following command to install it:

# Ubuntu
apt-get install -y fio

# CentOS
yum install -y fio

After installation, you can execute man fio to query its usage methods.

fio has many options. I will introduce some of the most commonly used ones by testing several typical scenarios: random read, random write, sequential read, and sequential write. You can run the following commands to test them:

# Random read
fio -name=randread -direct=1 -iodepth=64 -rw=randread -ioengine=libaio -bs=4k -size=1G -numjobs=1 -runtime=1000 -group_reporting -filename=/dev/sdb

# Random write
fio -name=randwrite -direct=1 -iodepth=64 -rw=randwrite -ioengine=libaio -bs=4k -size=1G -numjobs=1 -runtime=1000 -group_reporting -filename=/dev/sdb

# Sequential read
fio -name=read -direct=1 -iodepth=64 -rw=read -ioengine=libaio -bs=4k -size=1G -numjobs=1 -runtime=1000 -group_reporting -filename=/dev/sdb

# Sequential write
fio -name=write -direct=1 -iodepth=64 -rw=write -ioengine=libaio -bs=4k -size=1G -numjobs=1 -runtime=1000 -group_reporting -filename=/dev/sdb

Among them, there are several parameters that require your attention:

  • direct: Whether to bypass the system cache. In the examples above, I set it to 1 to bypass the system cache.
  • iodepth: The maximum number of I/O requests issued simultaneously when using asynchronous I/O (AIO). In the examples above, I set it to 64.
  • rw: The I/O mode. In my examples, read/write represents sequential read/write, while randread/randwrite represents random read/write.
  • ioengine: The I/O engine, which supports various I/O engines such as sync, libaio (asynchronous I/O), mmap (memory mapping), and net (network). In the examples above, I set it to libaio to use asynchronous I/O.
  • bs: The size of the I/O. In the examples, I set it to 4K (which is also the default).
  • filename: The path to test, which can be either a disk path (to test raw disk performance) or a file path (to test file system performance). In the examples, I set it to the disk /dev/sdb. However, please note that write tests against a disk path will destroy the file system on that disk, so you must back up your data before using it. (See the example after this list.)
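
For instance, if you want to benchmark the file system rather than the raw device, you can point filename at a regular file on a mounted file system. Below is a minimal sketch; the path /data/fio-testfile is only an illustration, so substitute a path on the file system you actually want to test:

# Random read against a regular file (fio creates the file if it does not exist)
fio -name=fsrandread -direct=1 -iodepth=64 -rw=randread -ioengine=libaio -bs=4k -size=1G -numjobs=1 -runtime=60 -group_reporting -filename=/data/fio-testfile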

Here is an example of a report from my sequential read test using fio:

read: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.1
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=16.7MiB/s,w=0KiB/s][r=4280,w=0 IOPS][eta 00m:00s]
read: (groupid=0, jobs=1): err= 0: pid=17966: Sun Dec 30 08:31:48 2018
   read: IOPS=4257, BW=16.6MiB/s (17.4MB/s)(1024MiB/61568msec)
    slat (usec): min=2, max=2566, avg= 4.29, stdev=21.76
    clat (usec): min=228, max=407360, avg=15024.30, stdev=20524.39
     lat (usec): min=243, max=407363, avg=15029.12, stdev=20524.26

In this report, the key metrics to focus on are slat, clat, lat, bw, and iops.

Let’s first look at slat, clat, and lat. All three represent I/O latency, and they differ as follows:

  • slat refers to the duration from I/O submission to the actual execution of the I/O (Submission latency).
  • clat refers to the duration from I/O submission to I/O completion (Completion latency).
  • lat refers to the total duration from fio creating an I/O to I/O completion.

It is important to note that, for synchronous I/O, since I/O submission and completion are simultaneous, slat is effectively the I/O completion time, and clat is 0. In the example provided, for asynchronous I/O (libaio), lat is approximately equal to the sum of slat and clat.

Next, let’s consider bw, which represents throughput. In the example above, you can see that the average throughput is about 16.6 MiB/s (17.4 MB/s).

Lastly, iops is the number of I/O operations per second; in the example above, it is 4257.
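
As a quick consistency check, throughput, IOPS, and block size are tied together: 4257 IOPS × 4 KiB ≈ 16.6 MiB/s, which matches the bw value shown in the report above.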

Typically, application I/O is read-write parallel, and the size of each I/O may vary. Therefore, the scenarios mentioned earlier cannot accurately simulate the I/O patterns of an application. So how can we accurately simulate the I/O pattern of an application?

Fortunately, fio supports I/O replay. With the help of blktrace, mentioned earlier, together with fio, we can benchmark the I/O pattern of an application. You first use blktrace to record the I/O accesses on the disk device that the application uses, and then use fio to replay that recording.

For example, you can run the following commands:

# Trace disk I/O using blktrace, make sure to specify the disk device that the application is accessing
$ blktrace /dev/sdb

# Look at the recorded results of blktrace
$ ls
sdb.blktrace.0  sdb.blktrace.1

# Convert the results to binary format
$ blkparse sdb -d sdb.bin

# Replay the log using fio
$ fio --name=replay --filename=/dev/sdb --direct=1 --read_iolog=sdb.bin

By using the combination of blktrace and fio, we can obtain a benchmark report for the I/O pattern of an application.

I/O Performance Optimization #

After obtaining the I/O benchmark report, finding and optimizing the I/O performance bottleneck with the performance analysis techniques summarized in the previous section becomes straightforward. Of course, optimizing I/O performance cannot be separated from the Linux system’s I/O stack; you can refer back to the I/O stack diagram from the previous sections as a guide.

Next, I will take you through the basic approaches to optimizing I/O performance from the perspectives of application programs, file systems, and disks.

Application Program Optimization #

First, let’s take a look at the approaches to optimizing I/O from the perspective of application programs.

The application program is at the top of the entire I/O stack. It can adjust the I/O mode (such as sequential or random access, synchronous or asynchronous) through system calls, and it is also the ultimate source of I/O data. In my opinion, there are several ways to optimize the I/O performance of application programs.

Firstly, random writes can be replaced with append writes to reduce addressing overhead and speed up I/O writes.

Secondly, you can use buffered I/O to make full use of the system cache and reduce the number of actual disk I/O operations.

Thirdly, you can build your own cache within the application program or use external cache systems like Redis. This way, on the one hand, you can control the data and lifecycle of the cache within the application program, and on the other hand, you can reduce the impact of other applications’ use of the cache on the application program itself.

For example, in the previous MySQL case, we saw that an interfering application dropping the system cache caused MySQL query performance to differ by more than a hundred times (0.1s vs. 15s).

Similarly, library functions provided by the C standard library, such as fopen and fread, will use the standard library’s cache to reduce disk operations. When you directly use system calls like open and read, you can only use the page cache and buffers provided by the operating system without the cache of library functions.

Fourthly, when frequent read and write operations are needed on the same block of disk space, you can use mmap instead of read/write to reduce the number of memory copies.

Fifthly, in scenarios that require synchronous writes, try to merge write requests instead of making each request write to the disk synchronously. In other words, you can use fsync() instead of O_SYNC.

Sixthly, when multiple applications share the same disk, to ensure that I/O is not completely occupied by a certain application, it is recommended to use cgroups I/O subsystem to limit the IOPS and throughput of processes/process groups.
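
As a minimal sketch of this idea, the cgroup-v1 blkio controller can throttle a process’s IOPS and throughput. The cgroup name app1, the device numbers 8:16 (typically /dev/sdb), and the limits below are all illustrative; adjust them for your system, and note that cgroup v2 uses the io controller with different file names:

# Create a cgroup and limit reads on device 8:16 to 1000 IOPS and 10 MB/s
mkdir /sys/fs/cgroup/blkio/app1
echo "8:16 1000" > /sys/fs/cgroup/blkio/app1/blkio.throttle.read_iops_device
echo "8:16 10485760" > /sys/fs/cgroup/blkio/app1/blkio.throttle.read_bps_device

# Move the target process into the cgroup (replace $PID with the real process ID)
echo $PID > /sys/fs/cgroup/blkio/app1/cgroup.procs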

Finally, when using the CFQ scheduler, you can use ionice to adjust the I/O scheduling priority of processes, especially to increase the I/O priority of core applications. ionice supports three priority classes: Idle, Best-effort, and Realtime. Among them, Best-effort and Realtime also support levels from 0 to 7, with smaller values indicating higher priority levels.
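
For example, assuming a MySQL process is the core application on the machine, a minimal sketch of raising its I/O priority with ionice might look like this (the process name is only an illustration):

# Put the process into the best-effort class (-c 2) at the highest level (-n 0)
ionice -c 2 -n 0 -p $(pidof mysqld)

# Verify the current I/O scheduling class and priority
ionice -p $(pidof mysqld)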

File System Optimization #

When application programs access ordinary files, the actual read and write of files are indirectly handled by the file system. Therefore, there are many ways related to the file system that can optimize I/O performance.

Firstly, you can choose the most suitable file system based on different actual load scenarios. For example, Ubuntu uses the ext4 file system by default, while CentOS 7 uses the xfs file system by default.

Compared with ext4, xfs supports larger disk partitions and a larger number of files; for example, it supports disks larger than 16 TB. The drawback of xfs, however, is that it cannot be shrunk, while ext4 can.

Secondly, after choosing the file system, you can further optimize its configuration options, including the file system’s features (such as ext_attr and dir_index), the journal mode (such as journal, ordered, and writeback), and the mount options (such as noatime).

For example, using the tune2fs tool, you can adjust the features of the file system (tune2fs is also commonly used to view the contents of the file system’s superblock). And through /etc/fstab or the mount command line parameters, we can adjust the journal mode and mount options of the file system, etc.
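
For instance, here is a minimal sketch of viewing and adjusting an ext4 file system; the device /dev/sdb1 and the mount point /data are illustrative:

# View the superblock contents, including enabled features and default mount options
tune2fs -l /dev/sdb1

# Enable the dir_index feature to speed up lookups in large directories
tune2fs -O dir_index /dev/sdb1

# Remount with noatime to stop updating access times on every read
mount -o remount,noatime /data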

Thirdly, you can optimize the file system cache.

For example, you can optimize the flushing frequency of pdflush dirty pages (such as setting dirty_expire_centisecs and dirty_writeback_centisecs) as well as the limit of dirty pages (such as adjusting dirty_background_ratio and dirty_ratio).
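
A quick sketch of adjusting these knobs with sysctl follows; the values are illustrative starting points, not recommendations:

# Expire dirty pages after 15 seconds and wake the flusher threads every 5 seconds
sysctl -w vm.dirty_expire_centisecs=1500
sysctl -w vm.dirty_writeback_centisecs=500

# Start background writeback at 5% dirty memory and block writers at 20%
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=20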

You can also optimize the kernel’s tendency to reclaim directory entry cache and inode cache, that is, adjust vfs_cache_pressure (/proc/sys/vm/vfs_cache_pressure, default value 100). The larger the value, the easier it is to reclaim.
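
For example, to make the kernel keep the dentry and inode caches longer, you can lower this value (50 below is only an illustration):

# Values below 100 make the kernel less eager to reclaim dentry and inode caches
sysctl -w vm.vfs_cache_pressure=50
cat /proc/sys/vm/vfs_cache_pressure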

Finally, when persistence is not needed, you can use the tmpfs memory file system to achieve better I/O performance. tmpfs stores data directly in memory instead of on disk. For example, /dev/shm/ is a memory file system that is commonly configured as half of the total memory in most Linux systems.
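
As a minimal sketch, you can also mount a dedicated tmpfs for temporary data; the 1 GB size and the /mnt/tmpdata mount point are illustrative:

# Mount a 1 GB memory-backed file system (its contents are lost on unmount or reboot)
mkdir -p /mnt/tmpdata
mount -t tmpfs -o size=1G tmpfs /mnt/tmpdata
df -h /mnt/tmpdata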

Disk Optimization #

Data is ultimately persisted on physical disks, and the disk is also the lowest layer of the entire I/O stack. From the perspective of the disk, there are naturally many effective performance optimization methods as well.

Firstly, the simplest and most effective optimization method is to use a higher-performance disk, such as SSD instead of HDD.

Secondly, we can use RAID to combine multiple disks into one logical disk to form a redundant array of independent disks. This can improve both data reliability and access performance.

Thirdly, based on the characteristics of disk and application I/O patterns, we can choose the most suitable I/O scheduling algorithm. For example, SSD and disks in virtual machines typically use the noop scheduling algorithm. For database applications, I recommend using the deadline algorithm.
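
For example, you can check and switch the scheduler of /dev/sdb through sysfs; the available schedulers depend on your kernel, and on blk-mq kernels the equivalents are named none and mq-deadline:

# The scheduler shown in square brackets is the one currently in use
cat /sys/block/sdb/queue/scheduler

# Switch to the deadline scheduler
echo deadline > /sys/block/sdb/queue/scheduler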

Fourthly, we can perform disk-level isolation for application data. For example, we can configure separate disks for applications with heavy I/O pressure, such as logs and databases.

Fifthly, in scenarios where sequential reads are predominant, we can increase the disk’s read-ahead data. For example, you can adjust the read-ahead size of /dev/sdb through the following two methods:

  • Adjust the kernel option /sys/block/sdb/queue/read_ahead_kb. The default size is 128 KB, and the unit is KB.

  • Use the blockdev tool to set it, for example, blockdev --setra 8192 /dev/sdb. Note that the unit here is 512 B (0.5 KB), so its value is always twice the value of read_ahead_kb. (See the example after this list.)
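
Here is a minimal sketch showing that the two methods are equivalent (4096 KB of read-ahead in both cases):

# Method 1: set the read-ahead size directly, in KB
echo 4096 > /sys/block/sdb/queue/read_ahead_kb

# Method 2: blockdev uses 512-byte sectors, so 8192 sectors = 4096 KB
blockdev --setra 8192 /dev/sdb

# Verify the setting (also reported in 512-byte sectors)
blockdev --getra /dev/sdb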

Sixthly, we can optimize the options for kernel block device I/O. For example, we can adjust the length of the disk queue /sys/block/sdb/queue/nr_requests. Increasing the queue length appropriately can improve disk throughput (although it will also increase I/O latency).
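
For example, 128 is a typical default, and the value 256 below is just an illustration:

# Check the current queue length, then double it
cat /sys/block/sdb/queue/nr_requests
echo 256 > /sys/block/sdb/queue/nr_requests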

Finally, it is important to note that hardware errors in the disk itself can also lead to a sharp decline in I/O performance. Therefore, when you find a drastic decline in disk performance, you also need to confirm whether there is a hardware error in the disk itself.

For example, you can check if there are logs of hardware I/O failures in dmesg. You can also use tools like badblocks, smartctl to detect hardware problems with the disk, or use e2fsck to check for file system errors. If problems are found, you can use tools like fsck to repair them.
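
A rough sketch of these checks follows; /dev/sdb and /dev/sdb1 are illustrative, and e2fsck must only be run on an unmounted partition:

# Look for I/O errors reported by the kernel
dmesg | grep -i error

# Query the overall SMART health status of the disk
smartctl -H /dev/sdb

# Read-only scan for bad blocks, showing progress
badblocks -sv /dev/sdb

# Check (and, with -y, automatically repair) an ext4 file system
e2fsck -y /dev/sdb1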

Summary #

Today, we have reviewed the performance optimization ideas and methods for common file systems and disk I/O. When encountering I/O performance issues, instead of rushing to optimize, it is important to first identify the most critical problems that can significantly improve performance. Then, we can start from different layers of the I/O stack and consider specific optimization methods.

Remember, disk and file system I/O is usually the slowest module in the entire system. Therefore, when optimizing I/O issues, besides optimizing the execution flow of I/O, we can also utilize faster memory, network, CPU, etc. to reduce I/O calls.

For example, you can fully utilize the buffering and caching provided by the system, or use internal caching within the application, or even external caching systems like Redis.

Reflection #

In this section, I have only listed a few common ideas for I/O performance optimization. Beyond these, there are many more optimization methods spanning the application, the system, and the disk hardware. I would like to discuss with you: do you know of any other optimization methods?

Please feel free to discuss with me in the comments section, and also feel free to share this article with your colleagues and friends. Let’s practice in real-world scenarios and improve through communication.