10 I/O Optimization (Part 2): Different I/O Usage Scenarios #

Today is the first day of 2019. Before we start today’s study, I would like to wish you a happy new year and successful work.

I/O is a very large topic, and it is difficult to cover every detail at once. Server developers can choose the appropriate file system and disk type for their needs and tune kernel parameters accordingly. But as mobile developers, it may seem there is not much we can do about I/O optimization.

In fact, it is not the case. “Data reordering” in startup optimization is an example. If we have a clear understanding of the working mechanism of file systems and disks, we can avoid some detours and reduce the problems caused by application I/O.

In the previous article, I mentioned the Page Cache mechanism multiple times, which greatly improves the performance of disk I/O, but it may also result in data loss during write operations. So, what are the I/O methods available and how should we apply them in our actual work? Today, let’s take a look at the use cases for different I/O methods.

Three Ways of I/O #

First, let’s recall the Linux general I/O architecture model mentioned in the previous article, which includes several components such as application programs, file systems, Page Cache, and disks. Observant students may have noticed that there are two other I/O methods in the leftmost and upper right parts of the diagram, namely Direct I/O and mmap.

The diagram seems a bit complicated, so I have redrawn a simplified version for you below. The diagram shows the differences in the process of the three I/O methods: Standard I/O, mmap, and Direct I/O. Next, I will explain the key points of each I/O method and the things to pay attention to in practical applications.

1. Standard I/O

The read/write operations commonly used by our application programs belong to Standard I/O, also known as Buffered I/O. Its key features include:

  • For read operations, when an application program reads a certain block of data, if that block of data is already stored in the Page Cache, the data can be immediately returned to the application program without the need for an actual physical disk read operation.

  • For write operations, the application program also writes the data to the Page Cache first. Whether the data is immediately written to the disk depends on the write mechanism adopted by the application program. By default, the system uses a delayed write mechanism, so the application program only needs to write the data to the Page Cache and does not need to wait for all the data to be written back to the disk. The system will periodically flush the data stored in the Page Cache to the disk.

From this, we can see that Buffered I/O can greatly reduce the actual read/write operations on the disk, thereby improving performance. However, I mentioned in the previous article that the delayed write mechanism may lead to data loss. So when will the system actually write the data in the Page Cache to the disk?

The memory that has been modified in the Page Cache is called “dirty pages,” and the kernel writes the data to the disk periodically through a flush thread. The specific conditions for writing can be obtained through the /proc/sys/vm file or the sysctl -a | grep vm command.

// Flush every 5 seconds
vm.dirty_writeback_centisecs = 500  
// Dirty data in memory that has resided for more than 30 seconds will be written to the disk in the next flush execution
vm.dirty_expire_centisecs = 3000 
// If dirty pages account for more than 10% of the total physical memory, trigger a flush to write the dirty data back to the disk
vm.dirty_background_ratio = 10
// If dirty pages exceed 20% of total physical memory, processes doing writes are blocked and must write the dirty data back to disk themselves
vm.dirty_ratio = 20

In practical applications, if we consider certain data to be very important and cannot afford the risk of loss, we should use synchronous write mechanisms. When using system calls such as sync, fsync, and msync in an application program, the kernel will immediately write the corresponding data back to the disk.
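For example, here is a minimal Java sketch of a "write then force to disk" path on Android; the method name is illustrative, FileDescriptor.sync() corresponds to fsync(), and FileChannel.force() would be the NIO equivalent:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class SyncWriteExample {
    // Write critical data and force it out of the Page Cache onto disk.
    // FileDescriptor.sync() maps to the fsync() system call.
    public static void writeCritical(File file, byte[] data) throws IOException {
        FileOutputStream fos = new FileOutputStream(file);
        try {
            fos.write(data);        // the data lands in the Page Cache first
            fos.getFD().sync();     // block until the kernel has written it to disk
        } finally {
            fos.close();
        }
    }
}
```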

In the above diagram, I used read() operation as an example. It causes the data to be copied from the disk to the Page Cache first and then from the Page Cache to the user space of the application program, resulting in an additional memory copy. The system is designed like this primarily because memory is a high-speed device compared to the disk, and even with an additional 100 copies, memory is still faster than reading the disk once.

2. Direct I/O

Many databases implement their own cache management for data and indexes and do not rely heavily on the Page Cache. They prefer to bypass the Page Cache to avoid the extra data copy and to keep their data from polluting the Page Cache.

From the diagram, you can see that Direct I/O accessing files reduces one data copy and the time spent on certain system calls, greatly reducing CPU usage and memory consumption.

However, Direct I/O sometimes has a negative impact on performance.

  • For read operations, reading data from disk will cause synchronous reads, resulting in longer execution time for the process.

  • For write operations, using direct I/O also requires synchronous execution, which will cause the application to wait.

Java on Android does not expose a Direct I/O interface; to use it, the file needs to be opened with the O_DIRECT flag, typically from native code. For more information, please refer to “Introduction to Direct I/O Mechanism in Linux”. Before using Direct I/O, an application must understand exactly what it is doing: only when the cost of Buffered I/O is clearly too high should Direct I/O be considered.

3. mmap

When Android starts and loads the Dex file, it does not read the entire file into memory at once, but uses mmap instead. WeChat’s high-performance log system, xlog, also uses mmap to ensure performance and reliability.

What exactly is mmap? Can it really achieve data loss prevention and excellent performance? In fact, mmap maps the file into the process’s address space. Many articles on the Internet claim that mmap completely bypasses the page cache mechanism, which is not entirely correct. The final mapped physical memory is still in the page cache, and it provides the following benefits:

  • Reduces system calls. We only need to make one mmap() system call, and all subsequent calls are like operating on memory, without the need for many read/write system calls.

  • Reduces data copying. For regular read() calls, data needs to be copied twice, while mmap only needs to copy from disk once, and because of the memory mapping, there is no need to copy back to the user space.

  • High reliability. After mmap writes data to the page cache, similar to the delayed write mechanism of buffered I/O, it can rely on kernel threads to periodically write back to disk. However, it should be noted that mmap can also cause data loss in case of kernel crashes or sudden power outages. Of course, we can use msync to force synchronization writes.

From the above diagram, it can be seen that we only need one data copy when using mmap. It seems that mmap can indeed outperform regular file read and write operations, so why don’t we use mmap for everything? In fact, there are also some drawbacks:

  • Increased virtual memory consumption. Our APK, Dex, and .so files are all read via mmap. However, most applications do not yet support 64-bit, and excluding the address space reserved for the kernel, we generally only have around 3GB of virtual address space available. If we mmap a 1GB file, the application can easily run out of virtual memory and trigger an OOM.

  • Disk latency. Mmap initiates real disk I/O through page faults. Therefore, if our current problem is high disk I/O latency, using mmap() to eliminate the minor system call overhead is not effective. The class reordering technique mentioned in startup optimization is to rearrange the classes in Dex according to their startup order, mainly to reduce the disk I/O latency caused by page faults.

In Android, files can be mapped into memory using MemoryFile or MappedByteBuffer and then read and written directly. This approach has clear advantages for small files and for files that are read and written frequently. I ran a simple code test to compare the approaches.
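The test code itself is not part of the article; as a minimal sketch of the MappedByteBuffer write path (the region size, file handling, and method name below are illustrative assumptions):

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapWriteSketch {
    // Append a small record to a file through a memory mapping.
    public static void appendRecord(String path, byte[] record) throws Exception {
        RandomAccessFile raf = new RandomAccessFile(path, "rw");
        try {
            FileChannel channel = raf.getChannel();
            // Map a fixed 4KB region into the process address space; after this
            // call, writes are plain memory writes with no write() system calls.
            // (Assumes the record fits inside the mapped region.)
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_WRITE, 0, 4 * 1024);
            buffer.put(record);
            // Optional: force dirty pages back to disk, similar to msync().
            // Without this we rely on the kernel flush threads, so a kernel
            // crash or power failure can still lose data.
            buffer.force();
        } finally {
            raf.close();
        }
    }
}
```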

From the test data, mmap’s write time looks close to writing directly to memory, but the comparison is not entirely fair, because it does not include the time the file system later spends asynchronously flushing the data to disk. On low-end devices or systems under severe resource pressure, mmap can still cause frequent disk writes and a sharp drop in performance.

mmap is best suited to frequent reads and writes of the same region of a file, and it is still recommended to do this work on a worker thread. User logging and data reporting fit this pattern well. mmap is also a good choice when cross-process synchronization is needed: Android’s own cross-process communication mechanism, Binder, uses mmap internally.

With mmap, Binder only needs one data copy for cross-process communication, which reduces one data copying process compared to traditional cross-process communication methods such as sockets and pipes.

Multithreaded Blocking I/O and NIO #

In the previous article, I mentioned that I/O operations can be very slow, especially on low-end machines, due to the phenomenon of write amplification.

Therefore, I/O operations should be placed in threads whenever possible. However, many people may have the question: if we want to read 10 files at the same time, should we use a single thread or 10 threads for concurrent reading?

1. Multithreaded Blocking I/O

Let’s conduct an experiment. We use a Nexus 6P to read 30 files, each with a size of 40MB, and test with different numbers of threads.
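The measurement code is not shown in the article; a minimal sketch of how such a test could be written is below (the buffer size, timeout, and class name are arbitrary choices):

```java
import java.io.File;
import java.io.FileInputStream;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class MultiThreadReadTest {
    // Read every file fully on a pool of `threadCount` threads and return the total time in ms.
    public static long readAll(List<File> files, int threadCount) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threadCount);
        long start = System.nanoTime();
        for (final File file : files) {
            pool.execute(new Runnable() {
                @Override
                public void run() {
                    byte[] buffer = new byte[64 * 1024];
                    try (FileInputStream in = new FileInputStream(file)) {
                        // Discard the data; we only care about the I/O time.
                        while (in.read(buffer) != -1) { /* keep reading */ }
                    } catch (Exception ignored) {
                        // A real test should report the failure instead.
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }
}
```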

You can see that multithreading does not bring the improvement we might expect for I/O: the total time only drops from 3.6 seconds to 1.1 seconds. This is because, compared with the disk, the CPU is extremely fast; the bottleneck is mainly disk bandwidth, so 30 threads do not yield a 30-fold speedup. In fact, too many threads can even increase the total time, and in the test 30 threads take longer than 10 threads. However, when the CPU is busy, more threads give us a better chance of being granted CPU time slices, so multithreading can still beat a single thread in that situation.

In general, file read and write throughput is bounded by the I/O bottleneck; once that limit is reached, adding more threads no longer helps, and too many threads can actually cause a noticeable decline in application performance.

Case 1:
CPU: 0.3% user, 3.1% kernel, 60.2% iowait, 36% idle...
Case 2:
CPU: 60.3% user, 20.1% kernel, 14.2% iowait, 4.6% idle...

Let’s take a look at the two cases above.

Case 1: When the system is idle (36% idle) and there are no other threads waiting to be scheduled, I/O waits (60.2% iowait) will occur.

Case 2: If our system becomes busy, the CPU will not be “idle” and will go check if there are other threads that need to be scheduled, causing the I/O wait time to decrease (14.2% iowait). However, having too many threads blocking can lead to frequent thread context switching, increasing the overhead of context switching.

In short, if iowait is high, there is definitely an issue with I/O. But if iowait is low, it doesn’t necessarily mean there is no issue with I/O. In that case, we need to look at the CPU’s idle ratio as well. In the figure below, we can see the synchronous I/O working mode:

For an application, what matters most is the total time that disk I/O blocks its threads; whether the CPU spends that time truly waiting or doing other work is less important. In actual development, most of the time we are reading relatively small files, so whether we use a dedicated I/O thread or simply spin up a new thread makes little difference.

2. NIO

Multithreaded blocking I/O adds system overhead, so can we use asynchronous I/O instead? With asynchronous I/O, when a thread hits an I/O operation it does not block waiting for completion; it submits the I/O request to the system and continues executing. You can refer to the diagram below for this process.

Non-blocking NIO notifies I/O to happen in an event-driven manner, which indeed reduces the overhead of thread context switching. The Chrome network library is a good example of using NIO to improve performance, especially in extremely busy systems. However, the downsides of NIO are also very obvious - application implementation becomes more complex, and sometimes asynchronous refactoring is not easy.

Now let’s take a look at using FileChannel in NIO to read and write files. FileChannel reads and writes through ByteBuffer: you can allocate space with ByteBuffer.allocate(int size) or wrap an existing byte array with ByteBuffer.wrap(byte[]). I measured the time taken with and without NIO, with the CPU both idle and busy.
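As a minimal sketch of that FileChannel/ByteBuffer usage (the helper name is made up, and it assumes the file is small enough to fit in a single in-memory buffer):

```java
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class FileChannelReadSketch {
    // Read an entire file through FileChannel into a byte array.
    public static byte[] readAll(String path) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r");
             FileChannel channel = raf.getChannel()) {
            ByteBuffer buffer = ByteBuffer.allocate((int) channel.size());
            // Keep reading until the buffer is full or EOF is reached.
            while (buffer.hasRemaining() && channel.read(buffer) != -1) {
                // loop
            }
            buffer.flip();
            byte[] result = new byte[buffer.remaining()];
            buffer.get(result);
            return result;
        }
    }
}
```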

From the test data, you can see that overall performance with NIO is not significantly different from non-NIO. This is understandable: when the CPU is idle, whether or not our thread goes on to do other work, the bottleneck is still the disk, so the total time does not differ much. When the CPU is busy, with or without NIO, the limited CPU time slices a single thread can grab remain the same.

So does NIO have any effect at all? The biggest advantage of NIO is not reducing the time spent reading a file but maximizing the application’s overall CPU utilization: when the CPU is busy, we can use the time a thread would otherwise spend waiting for disk I/O to do CPU work instead. I highly recommend Square’s Okio, which supports both synchronous and asynchronous I/O and contains many optimizations; you can try it out.
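As a small usage sketch, assuming Okio’s Java-facing 1.x-style static helpers (Okio.buffer/Okio.source/Okio.sink; newer Kotlin releases expose the same operations as extension functions):

```java
import java.io.File;
import java.io.IOException;

import okio.BufferedSink;
import okio.BufferedSource;
import okio.Okio;

public class OkioSketch {
    // Write a small text file with a buffered sink, then read it back with a buffered source.
    public static String roundTrip(File file, String text) throws IOException {
        BufferedSink sink = Okio.buffer(Okio.sink(file));
        try {
            sink.writeUtf8(text);
        } finally {
            sink.close();
        }
        BufferedSource source = Okio.buffer(Okio.source(file));
        try {
            return source.readUtf8();
        } finally {
            source.close();
        }
    }
}
```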

Small File System #

For a file system, the performance of directory lookup is very important. For example, there may be tens of thousands of pictures in the WeChat Moments. If each picture is a separate file, there will be tens of thousands of small files in the directory. Just think about what impact this will have on I/O performance.

To read a file, we first need to locate where it is stored. In the file system this is managed through inodes: a directory entry maps a file name to its inode. The time it takes to read a file can therefore be divided into the following two parts:

File read time = Time to find the inode of the file + Time to read the file data based on the inode

If we need to frequently read and write tens of thousands of small files, the time to find inodes will become considerable. This time depends on the file system implementation.

  • FAT32 is an old file system that is still used on some low-end external SD cards. When a directory contains a large number of files, lookups are linear, so even a simple exists() call can easily cause an Application Not Responding (ANR) issue.

  • For the ext4 file system, it uses a directory hash index for lookup, which greatly reduces the directory lookup time. However, if frequent operations on a large number of small files are needed, the time to find and open files cannot be ignored either.

If we merge a large number of small files into one large file, we can also store small files that tend to be accessed together adjacently, turning random access across many small files into sequential access within the large file, which greatly improves performance. Merged storage also effectively reduces the disk fragmentation caused by storing many small files, improving disk utilization.
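This is not how a production small-file system actually works, only a toy sketch of the merge-and-index idea under heavy simplifying assumptions (in-memory index, append-only data file, no checksums or space reclamation):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

// Many "small files" stored inside one big file: an index maps a key to an
// (offset, length) pair, so a read is one seek into the big file instead of a
// directory lookup plus a separate file open.
public class PackedFileStore {
    private final RandomAccessFile data;
    private final Map<String, long[]> index = new HashMap<>(); // key -> {offset, length}

    public PackedFileStore(File dataFile) throws IOException {
        this.data = new RandomAccessFile(dataFile, "rw");
    }

    public synchronized void put(String key, byte[] value) throws IOException {
        long offset = data.length();
        data.seek(offset);          // append to the end of the big file
        data.write(value);
        index.put(key, new long[]{offset, value.length});
    }

    public synchronized byte[] get(String key) throws IOException {
        long[] entry = index.get(key);
        if (entry == null) return null;
        byte[] result = new byte[(int) entry[1]];
        data.seek(entry[0]);
        data.readFully(result);
        return result;
    }
}
```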

In the industry, Google’s GFS, Taobao’s open-source TFS, and Facebook’s Haystack are all file systems designed specifically for the storage and retrieval of massive small files. WeChat has also developed a small file management system called SFS, which is mainly used for managing pictures in Moments and to solve the performance issues when using FAT32 in external SD cards at the time.

Of course, designing a small file system is not that simple. It needs to support the VFS interface so that the I/O code in the upper layers does not have to change. File indexing and verification mechanisms also need to be considered, for example how to quickly locate the right piece of data inside a large file. Splitting the large file into shards needs thought as well: we found that if a single file grows too large, it is easily deleted by cleanup apps like “Phone Manager”.

Summary #

In the process of performance optimization, we usually pay the most attention to the CPU and memory, but I/O is also an important part of performance optimization.

Today, we first learned about the entire process of I/O, which includes the application, file system, and disk. Then I introduced the differences between multithreaded synchronous I/O, asynchronous I/O, and mmap, as well as the scenarios in which they are applicable in actual work.

Both the file system and the disk involve many details, and as technology evolves some old designs have become outdated; for example, when FAT32 was designed, it was assumed that a single file would never exceed 4GB. If one day disk performance catches up with memory, the file system will no longer need all these layers of caching.

Exercise #

In today’s class, we discussed several different use cases for I/O methods. Have you used any other I/O methods besides standard I/O in your daily work? Feel free to leave a comment and discuss with me and other classmates.

I also conducted some basic performance tests on different I/O methods in the article. Today’s exercise is to write some test cases for different scenarios. This will help you better understand the use cases of different I/O methods.

Please click “Share with friends” to share today’s content with your friends and invite them to study together. Finally, don’t forget to submit today’s homework in the comment section. I have prepared a generous “learning encouragement package” for students who complete the homework diligently. I look forward to learning and progressing together with you!