
09 I/O Optimization Part 1: Essential I/O Optimization Knowledge for Developers #

A capacity of 250GB, a 512MB DDR4 cache, a maximum sequential read speed of 550MB/s, and a maximum sequential write speed of 520MB/s.

During Tmall’s “Double 11” sale, I saw a solid-state drive advertised with the specifications above. What do these numbers actually represent?

In the previous sections on lag and startup optimization, I often mentioned I/O optimization. Many students may think that I/O optimization simply means avoiding reading and writing large files on the main thread. But is it really that simple? Have you considered what happens when an application calls the read() method, and what processing the kernel and the hardware perform? In today’s article, let’s explore the knowledge required for I/O optimization and answer these questions.

Basic Knowledge of I/O #

In my work, I have found that many engineers have a vague understanding of I/O. They think that I/O is just the read() and write() operations performed by applications, without understanding the entire process behind these operations.

I have drawn a diagram that shows the whole file I/O process, which is completed jointly by the application, the file system, and the disk. First, the application sends an I/O command to the file system, and the file system then issues the I/O operation to the disk at the appropriate time.

This is like a relay race run by three buddies - the CPU, memory, and disk - and the finishing time largely depends on the slowest runner. Compared to the disk, the CPU and memory are high-speed devices, so the bottleneck of the entire process lies in disk I/O performance. This is why, in many cases, the file system matters even more than the disk itself: to reduce the disk’s impact on applications, the file system must optimize through various means. So let’s take a look at the file system first.

1. File System

The file system is, simply put, a way of storing and organizing data. For example, starting with iOS 10.3, Apple replaced the old HFS+ file system with APFS (Apple File System). Android commonly uses ext4, the file system widely used in Linux.

A few more words about file systems. In EMUI 5.0 and later, Huawei replaced ext4 with F2FS, and Google also adopted F2FS in its latest flagship phone, the Pixel 3. The Flash-Friendly File System (F2FS) was developed by Samsung specifically for NAND flash chips, and it includes many optimizations for flash memory. According to Huawei’s test data, F2FS is faster than ext4 at random reads and writes of small files; random writes, for example, improve by up to 60%. However, F2FS has had some reliability issues. What I want to say is that, with the investment and large-scale adoption by Google and Huawei, F2FS is likely to become the mainstream file system for Android in the future.

Back to file system I/O. When an application calls the read() method, execution traps from user space into the kernel, and the request then passes through the Virtual File System (VFS), the specific file system, and the page cache. Here is the generic I/O architecture model of Linux.

  • Virtual File System (VFS). It is mainly used to abstract specific file systems and provide a unified interface for application operations. This ensures that even if manufacturers switch file systems from ext4 to F2FS, applications do not need to be modified.

  • File System. ext4 and F2FS are concrete file system implementations. How to organize file metadata, how to design directory and index structures, and how to allocate and reclaim data are all questions a file system design must answer. Each file system has the application scenarios it suits best, so we cannot say that F2FS is necessarily better than ext4: F2FS has no advantage in sequential reads of large files and may occupy more space. But for typical applications, random I/O is far more frequent, especially in startup scenarios. You can see the list of all file systems recognized by the system in /proc/filesystems.

  • Page Cache. I have already mentioned the concept of Page Cache in the startup optimization. When reading a file, it first checks if it is already in the Page Cache. If it hits, it does not need to read from the disk. Before Linux 2.4.10, there was a separate Buffer Cache, but later it was merged into the Buffer Page in the Page Cache.

More concretely, the Page Cache is like the data caches we often use: it is the file system’s cache of file data, aimed at improving the read hit rate. The Buffer Cache is like the BufferedInputStream we often use: it is a cache of raw disk blocks, aimed at aggregating file system I/O requests and reducing the number of disk I/Os. It is important to note that both are used for read and write requests.

You can check the memory usage of these caches in the /proc/meminfo file, as the short sketch below shows. When the phone runs low on memory, the system reclaims their memory, and overall I/O performance drops as a result.
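To make this concrete, here is a minimal Java sketch (a hypothetical helper, not part of any framework) that reads the Buffers and Cached fields from /proc/meminfo, which is world-readable on most Linux and Android systems:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class MemInfoCache {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader("/proc/meminfo"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // "Buffers" is the buffer cache; "Cached" is the page cache
                // (excluding swap-backed pages).
                if (line.startsWith("Buffers:") || line.startsWith("Cached:")) {
                    System.out.println(line.trim());
                }
            }
        }
    }
}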

2. Disk

A disk refers to a system’s storage device, such as the CDs we knew in childhood, the mechanical hard drives used in computers, and today’s popular SSDs.

As I mentioned earlier, if an application program finds that the data it wants to read() is not in the page cache, it needs to initiate a real I/O request to the disk. This process goes through the generic block layer of the kernel, the I/O scheduling layer, the device driver layer, and finally, it is handed over to the specific hardware device for processing.

  • Generic block layer: Devices in the system that can randomly access fixed-size data blocks are called block devices, such as CDs, hard drives, and SSDs. The main function of the generic block layer is to receive disk requests from the upper layers and eventually issue the I/O requests. Like the VFS, it frees the upper layers from having to care about the specific implementation of the underlying hardware.

  • I/O scheduling layer: Disk I/O is slow, so to reduce real disk I/O, requests cannot be handed to the driver layer the moment they arrive. The kernel therefore adds an I/O scheduling layer, which merges and sorts requests according to the configured scheduling algorithm. Two parameters matter here: the queue length and the specific scheduling algorithm. You can check both for a given block device through the following files.

/sys/block/[disk]/queue/nr_requests      // Queue length, usually set to 128.
/sys/block/[disk]/queue/scheduler        // Scheduling algorithm
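For reference, the scheduler file lists the available algorithms with the active one in brackets. The output below is illustrative; device names vary (e.g., sda or mmcblk0).

cat /sys/block/sda/queue/scheduler
noop deadline [cfq]                      // Active scheduler shown in brackets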
  • Block device driver layer: The block device driver layer selects the corresponding driver program based on the specific physical device and manipulates the hardware device to complete the final I/O request. For example, CDs rely on lasers to burn data onto the surface, while flash memory uses electronic erasure to store data.

Android I/O #

In the previous section, we discussed some knowledge related to Linux I/O. Now let’s talk about some knowledge related to Android I/O.

1. Android Flash Memory

Let’s start by briefly discussing the storage device used in mobile phones, which is flash memory, or what we commonly refer to as ROM.

Considering size and power consumption, we obviously cannot transplant the SSD solutions of PCs directly into mobile phones. Android phones generally used the eMMC standard a few years ago, but in recent years they have mostly adopted the faster UFS 2.0/2.1 standard. The so-called “flashgate” incident, in which a certain manufacturer quietly swapped UFS flash for eMMC, caused quite a stir. Apple, on the other hand, has followed its own path, introducing the NVMe protocol on the iPhone 6s in 2015, a protocol that had already won praise on the MacBook.

In recent years, the development of mobile hardware has been astonishing, and mobile storage has been heading towards smaller size, lower power consumption, faster speed, and larger capacity. The capacity of the iPhone XS has reached 512GB, with a continuous read speed of over 1GB/s, which is faster than many SSD solid state drives, and it has also greatly reduced the speed gap with memory. However, these are test data provided by manufacturers, and the performance in terms of random read and write is still much worse compared to memory.

The above numbers may seem a bit abstract, but to put it simply, the performance of flash memory will affect the speed of opening WeChat, loading games, and taking continuous selfies. Of course, flash memory performance is not only determined by hardware, but also has a lot to do with the standard used and the implementation of the file system.

2. Two Questions

You may wonder, now that we know the process of file reading and writing, the file system, and the disk, what is their relevance to actual development? Here, I will give two simple examples. You may have thought about them before, but if you are not familiar with the internal mechanisms of I/O, you probably only have a vague understanding.

Question 1: Why do files get corrupted?

Let’s first present two pieces of real-world data. The SQLite database that stores WeChat chat records has a corruption rate estimated at a few per ten thousand, and SharedPreferences that are frequently read and written across processes may also show a corruption rate of a few per ten thousand.

Before answering why files get corrupted, we need to define what file corruption means: if the format or content of a file no longer matches what the application wrote, the file is considered corrupted. This is not only about format errors; lost file content is actually the most common case, and it occurs easily when SharedPreferences are read and written across processes.

Now, let’s explore why files get corrupted. We can examine this problem from three perspectives: the application, the file system, and the disk.

  • Application: Most I/O methods are not atomic operations. Writing to the same file concurrently from multiple processes or threads, or operating on a file through a file descriptor (fd) that has already been closed, can overwrite or destroy data. In fact, most file corruption is caused by poorly designed application code rather than by the file system or the disk.

  • File system: While kernel crashes or sudden power loss can indeed corrupt a file system, file systems also have many protective measures. For example, the system partition is mounted read-only, and there are additional consistency-check and recovery mechanisms, such as ext4’s fsck, f2fs’s fsck.f2fs, and the checkpoint mechanism.

At the file system level, data loss caused by power outages is the more common case. To improve I/O performance, the file system first writes data to the page cache and flushes it to the disk at an appropriate time later. Of course, we can also force data to disk through interfaces such as fsync and msync, as the sketch below shows; I will explain direct I/O and buffered (cached) I/O in more detail later.
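As an application-level illustration, here is a minimal Java sketch of the classic “write to a temporary file, fsync, then rename” pattern, the same idea behind Android’s AtomicFile. The class and file names are hypothetical, and this is a sketch under simplifying assumptions rather than a complete implementation:

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class SafeWriter {
    // Writes data so that a crash mid-write leaves the old file intact:
    // write to a temporary file, force it to disk, then atomically rename.
    public static void writeAtomically(File target, byte[] data) throws IOException {
        File tmp = new File(target.getPath() + ".tmp");
        try (FileOutputStream out = new FileOutputStream(tmp)) {
            out.write(data);
            // fsync: force this file's pages out of the page cache onto disk.
            out.getFD().sync();
        }
        // rename() within the same partition is atomic on POSIX file systems.
        if (!tmp.renameTo(target)) {
            throw new IOException("rename failed: " + tmp + " -> " + target);
        }
    }
}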

  • Disk. The flash memory used in smartphones is an electronic storage device, so data errors may occur during data transmission due to electronic losses and other phenomena. However, flash memory also increases data reliability through methods such as ECC and multi-level encoding. Generally speaking, the likelihood of such situations occurring is relatively small.

Flash memory lifespan can also lead to data errors. Because of the internal structure and characteristics of flash memory, an address that has been written must be erased before it can be written again, and each block supports only a limited number of erase cycles, from roughly a hundred thousand down to a few thousand depending on the type of storage cell used (SLC > MLC > TLC).

The following diagram shows the structure of flash memory; the Flash Translation Layer (FTL) is particularly important. It is responsible for allocating and managing physical addresses: it must track the erase life of each block and balance erase counts across all blocks (wear leveling), and when free space in a block runs low, it must migrate data through garbage collection algorithms. The FTL determines the lifespan, performance, and reliability of flash memory and is one of the most important core technologies in flash storage.

Flash Memory Structure

For smartphones, let’s assume our storage size is 128GB. Even if the flash memory tolerates only 1,000 erase cycles, we could still write 128TB in total, as the quick calculation below shows; in practice it is hard to ever reach that level.
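Assuming perfectly even wear leveling across all blocks (an idealization), the arithmetic is simply:

total writable data ≈ capacity × erase cycles = 128GB × 1,000 = 128TB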

Question 2: Why does I/O sometimes suddenly become slow?

The performance figures quoted by manufacturers are usually measured on factory-fresh devices. In daily use, however, we may find that an Android phone that was “smooth as silk” when first bought becomes noticeably laggy after a year.

Why is this? In lag analysis I have found many I/O-related stalls on low-end devices. There are several possible reasons why I/O suddenly becomes slow:

  • Insufficient memory. When a phone runs low on memory, the system reclaims the memory of the page cache and buffer cache, so most write operations go directly to disk and performance degrades.

  • Write amplification. As mentioned earlier, flash memory must be erased before it can be rewritten, and the basic unit of an erase operation is a block, so writing a single page can trigger the migration of an entire block - a typical write amplification phenomenon. Low-end devices, and devices that have been used for a long time, are especially prone to it because of fragmented storage and limited free space. In concrete numbers, flash reads are the fastest, around 20us; writes are slower, around 200us; and erases are very expensive, on the order of 1ms. When write amplification occurs, the extra data movement makes writes take even longer.

  • High load on weak hardware. Because the CPUs and flash memory in low-end devices are relatively weak, bottlenecks appear more easily under heavy load. For example, eMMC flash does not support concurrent reads and writes, so when write amplification occurs, read operations are affected as well.

To alleviate fragmentation, the system triggers the fstrim/TRIM mechanism at idle moments such as screen lock or charging, telling the flash controller which blocks are no longer in use so they can be erased in advance.

Performance Evaluation of I/O #

As you can see from the figure below, the entire I/O process involves a very long chain. By instrumenting our application we may find that a file read takes 300ms, but every layer underneath may have its own strategies and scheduling algorithms, so it is difficult to obtain the true time spent in each layer.

In the previous section on startup optimization, I mentioned that Facebook and Alipay evaluate I/O performance by building custom ROMs. This is a complex but effective approach: by modifying the source code and logging exactly what we care about, we can trace I/O performance precisely.

1. I/O Performance Metrics

The most important metrics for evaluating I/O performance are throughput and IOPS. The figures at the beginning of today’s article - a maximum sequential read speed of 550MB/s and a maximum sequential write speed of 520MB/s - refer to I/O throughput.

Another important metric is IOPS, the number of read and write operations that can be performed per second. For applications with frequent random reads and writes, such as storage of large numbers of small files, IOPS is the key measurement; the two metrics are related as shown below.
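Throughput and IOPS are linked by the size of each request. The numbers below are purely illustrative:

throughput = IOPS × I/O request size
e.g., 25,000 random 4KB reads per second ≈ 25,000 × 4KB ≈ 100MB/s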

2. I/O Measurement

If we don’t use the method of customizing the source code, what other methods can we use to measure I/O performance?

The first method: using proc.

In general, I/O performance is affected by many factors, such as whether it’s a read or write operation, whether it’s continuous, and the size of the I/O operation. Another factor that has a significant impact on I/O performance is the load - I/O performance decreases as the load increases. We can measure this by observing the I/O wait time and count.

/proc/self/sched:
  se.statistics.iowait_count:   number of I/O waits
  se.statistics.iowait_sum:     total I/O wait time
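As an illustration, here is a minimal Java sketch (a hypothetical helper) that pulls these two fields out of /proc/self/sched. It assumes a kernel built with scheduler statistics enabled; on many production devices the fields are absent, and nothing is printed:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class IoWaitStats {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader("/proc/self/sched"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Match the scheduler statistics lines for I/O waits.
                if (line.contains("iowait_count") || line.contains("iowait_sum")) {
                    System.out.println(line.trim());
                }
            }
        }
    }
}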

If the device is rooted, we can enable the kernel’s I/O monitoring and have all block reads and writes dumped into the kernel log, which can then be viewed with the dmesg command.

echo 1 > /proc/sys/vm/block_dump
dmesg -c | grep pid

.sample.io.test(7540): READ block 29262592 on dm-1 (256 sectors)
.sample.io.test(7540): READ block 29262848 on dm-1 (256 sectors)

The second method: using strace.

Linux provides related commands such as iostat and iotop, but most Android devices do not include them. Instead, we can use strace to trace the I/O-related system calls and their durations.

strace -ttT -f -p [pid]

read(53, "*****************"..., 1024) = 1024       <0.000447>
read(53, "*****************"..., 1024) = 1024       <0.000084>
read(53, "*****************"..., 1024) = 1024       <0.000059>

From the log above, you can see that the application is reading from file descriptor 53, 1024 bytes at a time. The first read took 447 microseconds, while the next two took less than 100 microseconds each. This ties back to the “data rearrangement” mentioned in the startup-optimization section: the file system reads in units of blocks, usually 4KB, so the first read pulls in a whole block and the two later reads are served from the page cache.

We can also use strace to measure the overall time consumption of all system calls within a certain period of time. However, strace itself consumes a considerable amount of resources and will affect the execution time.

strace -c -f -p [pid]

% time     seconds  usecs/call     calls    errors  syscall
------ ----------- ----------- --------- --------- ----------------
 97.56    0.041002          21      1987             read
  1.44    0.000605          55        11             write

From the above information, you can see that reading accounts for 97.56% of the time, with a total of 1987 calls, taking 0.04s, and an average of 21 microseconds per system call. Similarly, we can calculate the percentage of I/O time in a specific task of the application. For example, if a task takes 10s to execute and 9s are spent on I/O, then the I/O time consumption percentage is 90%. In this case, I/O is a major bottleneck in our task, and further optimization is needed.

The third method: using vmstat.

The explanation of each field in vmstat can be found in Monitoring Memory Usage with vmstat. The values of fields like buff and cache in Memory, bi and bo in I/O, cs in System, and sy and wa in CPU are all related to I/O behavior.

We can use dd command in conjunction with vmstat to observe changes in output data. However, please note that the dd command in Android seems not to support the conv and flag parameters.

// Drop clean caches: the page cache plus reclaimable slab objects (dentries and inodes)
echo 3 > /proc/sys/vm/drop_caches
// Output 1 group of vmstat data every 1 second
vmstat 1

// Test the write speed and write to the file /data/data/test, with a buffer size of 4K and 1000 iterations
dd if=/dev/zero of=/data/data/test bs=4k count=1000
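For reference, vmstat output has the shape below. The numbers are illustrative, not measured; during the dd run you would expect the bo (blocks written out) and wa (I/O wait) columns to jump:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 452340  12040 183488    0    0     0  4096  320  650  5  3 80 12  0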

Summary #

In the process of performance optimization, we pay the most attention to CPU and memory, while I/O is also an important part of performance optimization.

Today, we learned about the entire process of I/O handling, which includes the application program, file system, and disk. However, the topic of I/O is really vast, and it requires more time to study the reference materials in the exercises after class.

LPDDR5 and UFS 3.0 will soon be introduced in 2019. Some students may think that we don’t need to optimize anymore as the hardware becomes more powerful. However, on the one hand, considering the issue of cost, the hardware of devices in embedded systems, IoT, and other scenarios may not be so good; on the other hand, we have higher requirements for application experience, and new features such as immersive experience (VR) and artificial intelligence (AI) also have higher hardware requirements. Therefore, application optimization is timeless, but there are different requirements in different scenarios.

After-class Exercises #

After studying today’s content, many students may feel it is a bit unfamiliar and confusing. That’s okay - we can fill in the basics after class, and the links below are the recommended reference materials. Today’s homework is to share in the comments section your understanding of I/O after today’s study, together with any I/O-related problems you have encountered.

  1. All About Disk I/O

  2. Introduction to Linux Kernel File Cache Management Mechanism

  3. The Linux Kernel/Storage

  4. eMMC, UFS, or NVMe? Analysis of Mobile ROM Storage Transfer Protocols

  5. Chatting about Linux IO

  6. What are the Challenges in Designing Storage Devices Using NAND Flash?

“Practice makes perfect.” You can also try using strace and block_dump to observe the I/O situation of your own application, but some experiments may require a machine with root access.

Feel free to click on “Share with friends” to share today’s content with your friends and invite them to study together. Finally, don’t forget to submit today’s homework in the comments section. I have also prepared a generous “Study and Cheer Up Gift Pack” for students who complete the homework seriously. Looking forward to discussing and making progress together with you.