11 I/O Optimization Part 3: How to Monitor Online I/O Operations #

Based on our previous learning, I believe you have gained some understanding of the basics of I/O and methods for measuring I/O performance.

However, in real-world applications, do you know what constitutes unreasonable I/O operations? How can we identify inappropriate I/O operations in our code? Furthermore, is it possible to continuously monitor the usage of I/O in our applications online? Today, let’s explore how to address these questions.

I/O Tracking #

Before monitoring I/O operations, you need to know which I/O operations are present in the application.

As mentioned in the previous section on lag optimization, Facebook’s Profilo uses PLT Hook technology to listen for writes to the “atrace_marker_fd” file in order to obtain ftrace information. So, what other methods can be used for I/O tracking, and what information should we track?

1. Java Hook

For compatibility reasons, the first method you might think of is instrumentation. However, instrumentation cannot monitor all I/O operations because there is a large amount of system code that also performs I/O operations.

For stability reasons, as a second option, you can try the Java Hook approach. Taking the Android 6.0 source code as an example, the entire call flow of FileInputStream is as follows:

java: FileInputStream -> IoBridge.open -> Libcore.os.open 
-> BlockGuardOs.open -> Posix.open

In Libcore.java, you can find a good hook point: the static variable os, which holds the BlockGuardOs instance. How can we quickly find the appropriate hook point? It depends on experience to some extent, but patiently examining and analyzing the source code is an essential task.

// The hook point in Libcore.java:
public static Os os = new BlockGuardOs(new Posix());

// Reflectively obtain the static variable
Class<?> clibcore = Class.forName("libcore.io.Libcore");
Field fos = clibcore.getDeclaredField("os");
fos.setAccessible(true);
Object mPosixOs = fos.get(null); // the original BlockGuardOs instance

We can use dynamic proxy to add instrumentation code before and after all I/O related methods in order to track information related to I/O operations. In fact, there are also some Socket-related methods in BlockGuardOs that we can use to track network-related requests.

// Create a dynamic proxy implementing the same interfaces as Posix
// (cPosix is the libcore.io.Posix class; "this" is our InvocationHandler)
Proxy.newProxyInstance(cPosix.getClassLoader(), getAllInterfaces(cPosix), this);

// Inside InvocationHandler.invoke(): instrument before and after every call
beforeInvoke(method, args);
result = method.invoke(mPosixOs, args);
afterInvoke(method, args, result);
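
Putting these fragments together, here is a minimal end-to-end sketch of what such a Java Hook could look like. It assumes the Android 6.0 class layout described above; the class name, the timing logic, and where the data is recorded are illustrative and are not Matrix's actual implementation.

import android.os.SystemClock;
import java.lang.reflect.Field;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

// Sketch: replace Libcore.os with a dynamic proxy that times every call.
public class LibcoreIOHook implements InvocationHandler {
  private final Object originalOs; // the original BlockGuardOs instance

  private LibcoreIOHook(Object originalOs) {
    this.originalOs = originalOs;
  }

  public static void install() throws Exception {
    Class<?> clibcore = Class.forName("libcore.io.Libcore");
    Class<?> cOs = Class.forName("libcore.io.Os");
    Field fos = clibcore.getDeclaredField("os");
    fos.setAccessible(true);
    Object original = fos.get(null);
    Object proxy = Proxy.newProxyInstance(clibcore.getClassLoader(),
        new Class<?>[]{cOs}, new LibcoreIOHook(original));
    fos.set(null, proxy);
  }

  @Override
  public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
    long start = SystemClock.uptimeMillis();
    try {
      return method.invoke(originalOs, args);
    } catch (InvocationTargetException e) {
      throw e.getCause(); // rethrow the original ErrnoException and the like
    } finally {
      long cost = SystemClock.uptimeMillis() - start;
      // Record method.getName(), args and cost here for the calls we care
      // about (open/read/write/close), e.g. into a per-thread buffer.
    }
  }
}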

This approach seems good, but in actual use, it quickly becomes apparent that this method has several drawbacks:

  • Very poor performance. I/O operations are called extremely frequently, and the combination of dynamic proxies and a large number of Java string operations makes the overhead too high for production use.

  • Unable to monitor native code. For example, many of WeChat's I/O operations are performed in native code, which the Java Hook approach cannot monitor.

  • Poor compatibility. Java Hook needs to be compatible with each Android version, especially with the addition of restrictions on non-public APIs in Android P.

2. Native Hook

If the Java Hook cannot fulfill requirements, the next consideration would naturally be the Native Hook approach. Profilo uses the PLT Hook approach, which has slightly better performance than the GOT Hook, but the GOT Hook has better compatibility.

I will introduce the various Native Hook implementation methods and their differences in a separate section later, so I won't go into detail today. In the end, the hook targets chosen are the following functions in libc.so:

int open(const char *pathname, int flags, mode_t mode);
ssize_t read(int fd, void *buf, size_t size);
ssize_t write(int fd, const void *buf, size_t size);
int close(int fd);

Because we are using the GOT Hook, we need to choose libraries that call the above functions. In WeChat's Matrix, libjavacore.so, libopenjdkjvm.so, and libopenjdk.so are chosen, which cover all Java-layer I/O calls. Specific details can be found in io_canary_jni.cc.

However, I would recommend the approach used in atrace.cpp in Profilo, which directly traverses all loaded libraries and hooks them all at once.

void hookLoadedLibs() {
  auto& functionHooks = getFunctionHooks();
  auto& seenLibs = getSeenLibs();
  facebook::profilo::hooks::hookLoadedLibs(functionHooks, seenLibs);
}

Different versions of the Android system have different implementations. After Android 7.0, we also need to replace these three methods:

open64
__read_chk
__write_chk

3. Monitoring Content

After implementing I/O tracking, we need to further consider what I/O information should be monitored. For example, when reading a file, we want to know the file name, its original size, the call stack of the open() call, and which thread performed it.

Then, we also want to know how long this operation took, what size buffer was used, whether it was a continuous read or random access. Using the above four hooked interfaces, we can easily collect this information.
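
As an illustration, the record kept for a single operation might look like the sketch below. The class and field names are hypothetical, chosen to mirror the list above rather than Matrix's actual data structures.

// Illustrative record of one monitored I/O operation.
public class IORecord {
  public String path;          // file name, e.g. "test.db"
  public long fileSize;        // original file size in bytes
  public int bufferSize;       // size of the buffer passed to read()/write()
  public int readWriteCount;   // how many times read()/write() was called
  public long readWriteCostMs; // accumulated time spent in read()/write()
  public long totalCostMs;     // time from open() to close()
  public String threadName;    // which thread performed the operation
  public String javaStack;     // Java stack captured when the file was opened
}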

Below is the basic information for an I/O operation on the main thread, in which a 600KB file named "test.db" was read.

A 4KB buffer was used and read() was called 150 times, reading the entire file in one go. The overall duration was 10ms; because the accumulated read/write time is almost the same as the total time the file was held open, we can conclude that this read() operation was performed continuously, without interruption.

Since I/O operations are so frequent, how much impact does collecting all this information have on application performance? Let's look at the timing data for the Native Hook approach.

You can see that the performance overhead of monitoring through Native Hook is negligible, so this solution can be used in production.

Online Monitoring #

Through the Native Hook method, we can capture all I/O-related information, but the amount of information collected is too large to be reported to the backend for analysis.

For online monitoring of I/O, we need to further abstract the rules and define which situations can be identified as adverse conditions and need to be reported to the backend, thereby driving the development team to resolve them.

1. Main Thread I/O

I have said more than once that the time taken by an I/O write can suddenly spike, so even for a few hundred KB of data it is better not to operate on the main thread. In production, we often find cases where I/O operations on fairly small amounts of data still lead to ANRs.

Of course, if we collected all main thread I/O, the amount of data would be very large. Therefore, I add the condition "continuous read/write time exceeds 100 milliseconds". The reason for using continuous read/write time is that we found many cases where the file handle is opened but the file is not read or written in one go.

When reporting issues to the backend, in order to better locate and solve the problems, I usually also report CPU usage, information of other threads, and memory information to assist in problem analysis.
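
As a sketch of how this rule could be expressed on top of the collected data, reusing the hypothetical IORecord from earlier (the 100-millisecond threshold comes from the condition above):

// Sketch: report an operation as main-thread I/O only when the continuous
// read/write time exceeds the threshold.
private static final long MAIN_THREAD_RW_THRESHOLD_MS = 100;

boolean shouldReportMainThreadIO(IORecord record) {
  boolean onMainThread = "main".equals(record.threadName);
  return onMainThread && record.readWriteCostMs > MAIN_THREAD_RW_THRESHOLD_MS;
}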

2. Insufficient Buffer Size for Read/Write

We know that the file system reads and writes in units of blocks, while the disk reads and writes in units of pages. It seems that even if we use a small buffer in the application, the difference should not be significant at the lower level. Is that really the case?

read(53, "*****************"..., 1024) = 1024       <0.000447>
read(53, "*****************"..., 1024) = 1024       <0.000084>
read(53, "*****************"..., 1024) = 1024       <0.000059>

Although the system call time for the latter two calls is indeed shorter, there is still a certain amount of time consumed. If our buffer is too small, it will result in multiple unnecessary system calls and memory copies, increasing the number of read/write operations, thereby affecting performance.

So how large should the buffer be? We can decide the buffer size based on the block size of the partition where the file is stored; the page size of a database is chosen in the same way.

new StatFs("/data").getBlockSize()

Therefore, the final judgment criteria we choose are:

  • Buffer size is smaller than the block size, which is generally 4KB.
  • Number of read/write operations exceeds a certain threshold, such as 5 times, mainly to reduce the amount of reporting.

The buffer should not be smaller than 4KB, but is bigger always better? You can do a simple test with the following command, which reads the test application's 40MB iotest file. In the command, bs is the buffer size; we try different values of bs and observe the time taken.

# Manually release the page cache before each test
echo 3 > /proc/sys/vm/drop_caches
time dd if=/data/data/com.sample.io/files/iotest of=/dev/null bs=4096

From the above data, we can roughly see that the buffer size has a significant impact on the time taken for file read/write. The reduction in time consumed is mainly due to optimization of system calls and memory copying. I generally recommend using a buffer size of at least 4KB.

In practical applications, ObjectOutputStream and ZipOutputStream are very classic examples. ObjectOutputStream uses a very small buffer. ZipOutputStream is slightly more complicated: if an entry is stored using the STORED method, it uses the buffer size passed from the upper layer; if it uses the DEFLATED method, it uses DeflaterOutputStream's buffer, which is 512 bytes by default.

You can see that using BufferedInputStream or ByteArrayOutputStream can significantly improve overall performance.
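
For instance, here is a small sketch of wrapping a file read so that the effective buffer is at least the block size of the partition; the "/data" path and the 4KB fallback are illustrative.

import android.os.StatFs;
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch: size the buffer from the partition's block size so that small
// upper-layer reads no longer turn into many small system calls.
InputStream openBuffered(String path) throws IOException {
  int blockSize = (int) new StatFs("/data").getBlockSizeLong(); // usually 4KB
  int bufferSize = Math.max(blockSize, 4 * 1024);
  return new BufferedInputStream(new FileInputStream(path), bufferSize);
}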

As I mentioned in the previous issue, it is difficult to accurately estimate the actual number of disk reads and writes. The disk also applies many strategies of its own, such as prefetching: it may read ahead more data than you actually requested. When there are a large number of sequential disk reads, readahead can significantly improve performance. However, when reading a large number of small, fragmented files, it may cause waste.

You can check the readahead size in the following file; it is generally 128KB.

    /sys/block/[disk]/queue/read_ahead_kb

Generally, we can use the information from /proc/sys/vm/block_dump or [/proc/diskstats](https://www.kernel.org/doc/Documentation/iostats.txt) to count the actual number of disk reads and writes.

/proc/diskstats
Block device name | Number of read requests | Number of read sectors | Total time for reads...
dm-0 23525 0 1901752 45366 0 0 0 0 0 33160 57393
dm-1 212077 0 6618604 430813 1123292 0 55006889 3373820 0 921023 3805823
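
If you want to compare these counters before and after a scenario programmatically, a rough sketch of summing read/write requests from /proc/diskstats is shown below. It assumes the standard layout documented in iostats.txt, where each line starts with the major and minor device numbers (the excerpt above omits them), and that the file is readable by the process.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Sketch: sum completed read/write requests across all block devices.
// Per line: major minor name reads-completed reads-merged sectors-read ms-reading
//           writes-completed writes-merged sectors-written ms-writing ...
long[] readDiskstats() throws IOException {
  long reads = 0, writes = 0;
  try (BufferedReader reader = new BufferedReader(new FileReader("/proc/diskstats"))) {
    String line;
    while ((line = reader.readLine()) != null) {
      String[] fields = line.trim().split("\\s+");
      if (fields.length < 11) continue;
      reads += Long.parseLong(fields[3]);  // reads completed
      writes += Long.parseLong(fields[7]); // writes completed
    }
  }
  return new long[]{reads, writes};
}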

3. Repeated Reads

When WeChat was undergoing modularization, due to the complete decoupling between modules, many modules would read common configuration files separately.

Some of you may say that when there are repeated reads, the data is obtained from the Page Cache and there are no real disk operations. However, it still consumes time for system calls and memory copying, and the memory of the Page Cache may also be replaced or released.

You can also use the following command to simulate the release of the Page Cache.

echo 3 > /proc/sys/vm/drop_caches

If a file is frequently read and this file has not been updated, we can use caching to improve performance. However, in order to reduce the amount of reporting, I will add the following conditions:

  • The number of repeated reads exceeds 3, and the content read is the same.

  • The file content has not been updated during the read, which means there has been no write operation.
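
Here is a sketch of how these two conditions could be checked on top of the hooked open/read/write calls. The counters and the reporting callback are hypothetical, and verifying that the content read is identical would additionally require hashing the data returned by read(), which is omitted here.

import java.util.HashMap;
import java.util.Map;

// Sketch: count how many times each file is read without a write in between.
private final Map<String, Integer> readCountSinceLastWrite = new HashMap<>();

void onFileRead(String path) {
  int count = readCountSinceLastWrite.merge(path, 1, Integer::sum);
  if (count > 3) {
    reportRepeatedRead(path, count); // hypothetical reporting callback
  }
}

void onFileWrite(String path) {
  // A write invalidates the repeat counter for this file.
  readCountSinceLastWrite.remove(path);
}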

Adding a layer of memory cache is the most effective way, and a typical scenario is loading data modules such as configuration files. If there is no memory cache, the performance impact will be relatively large.

private String cache;

public String readConfig() {
  if (cache != null) {
     return cache;
  }
  cache = read("configFile");
  return cache;
}

4. Resource Leaks

In crash analysis, I mentioned that some OOM issues are caused by file handle leaks. Resource leaks refer to not closing opened resources such as files and Cursors in a timely manner, leading to leaks. This is a very basic coding error, but it is very common.

How can we monitor resource leaks effectively? Here, I use StrictMode from the Android framework. StrictMode relies on the CloseGuard.java class, which is already instrumented in many places in the system source code.

Now, let’s look at the source code to find a suitable hook point. This process is very simple. The REPORTER object in CloseGuard is a hook point that we can use. The specific steps are as follows:

  • Use reflection to set the ENABLED value in CloseGuard to true.

  • Use dynamic proxy to replace REPORTER with our defined proxy.
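
A minimal sketch of these two steps is shown below. CloseGuard and its Reporter are hidden APIs, so this relies on reflection (the ENABLED and REPORTER field names are as they appear in the AOSP source) and is subject to the non-SDK interface restrictions mentioned earlier; the onResourceLeak callback is a hypothetical helper.

import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

// Sketch: enable CloseGuard and swap its REPORTER for a dynamic proxy.
public class CloseGuardHooker {

  public static void hook() throws Exception {
    Class<?> closeGuardCls = Class.forName("dalvik.system.CloseGuard");
    Class<?> reporterCls = Class.forName("dalvik.system.CloseGuard$Reporter");

    // Step 1: set ENABLED to true so CloseGuard records allocation stacks.
    Field enabled = closeGuardCls.getDeclaredField("ENABLED");
    enabled.setAccessible(true);
    enabled.set(null, true);

    // Step 2: replace REPORTER with our own dynamic proxy.
    Field reporter = closeGuardCls.getDeclaredField("REPORTER");
    reporter.setAccessible(true);
    Object proxy = Proxy.newProxyInstance(
        closeGuardCls.getClassLoader(),
        new Class<?>[]{reporterCls},
        (Object p, Method method, Object[] args) -> {
          if ("report".equals(method.getName()) && args != null && args.length >= 2) {
            onResourceLeak((String) args[0], (Throwable) args[1]);
          }
          return null; // Reporter.report() returns void
        });
    reporter.set(null, proxy);
  }

  // Hypothetical callback: record the leaked resource and its allocation stack.
  static void onResourceLeak(String message, Throwable allocationSite) {
    // e.g. log locally or report to the monitoring backend
  }
}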

Although StrictMode in the Android source code has already pre-instrumented many resource hook points, there are certainly still some hook points that are missing, such as MediaPlayer and some internal resource modules. Therefore, in the program, I also wrote a MyCloseGuard class, which allows you to manually add instrumentation code for resources that you want to monitor.

I/O and Boot Optimization #

Through I/O tracing, we can obtain a detailed list of all I/O operations during the entire boot process. We need to rigorously examine each I/O call to determine if each one is necessary, particularly the write() operation.

Of course, main thread I/O, read/write buffers, repeated reads, and resource leaks need to be addressed first, especially repeated reads. Information such as cpuinfo and phone memory should be cached.

For essential I/O operations, we need to consider if there are other ways to optimize further.

  • Use mmap or NIO for large files. MappedByteBuffer is a wrapper for mmap in Java NIO. As mentioned in the previous issue, frequent reads and writes on large files can be greatly optimized (see the sketch after this list).

  • Do not compress the installation package. For files needed during the boot process, we can specify that they should not be compressed in the installation package. This will speed up the boot process, but will increase the size of the installation package. In fact, Google Play strongly advises against compressing native libraries, resources, and resources.arsc, as leaving them uncompressed greatly helps memory usage and speed during boot. Moreover, leaving files uncompressed only increases the size of the installation package; from the user's perspective, the download size is not affected.

  • Reuse buffers. We can utilize the open-source library Okio, which, through techniques such as buffer and ByteString reusing, significantly reduces CPU and memory consumption.

  • Optimize storage structures and algorithms. Can we optimize algorithms or data structures to minimize I/O or eliminate it altogether? For example, instead of fully parsing certain configuration files during startup, we can parse the corresponding entries only when reading them. We can also replace XML or JSON, which have redundant and relatively poor performance structures, with more efficient data structures. I will delve more into data storage in the upcoming issues.
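
To illustrate the first point in the list above, here is a small sketch of reading a section of a large file through MappedByteBuffer. The method name and parameters are illustrative, and it assumes the file fits within a single mapping.

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Sketch: map the file into memory instead of issuing many read() system calls.
byte[] readWithMmap(String path, int offset, int length) throws Exception {
  try (RandomAccessFile file = new RandomAccessFile(path, "r");
       FileChannel channel = file.getChannel()) {
    MappedByteBuffer buffer =
        channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
    byte[] result = new byte[length];
    buffer.position(offset);
    buffer.get(result);
    return result;
  }
}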

In 2013, when I was optimizing Multidex, I discovered that the code first extracted classes2.dex from the APK file and then compressed it into classes2.zip. This extra round of extraction and compression of classes2.dex was completely unnecessary.

At that time, by studying the ZIP format, I found that as long as we construct a file that conforms to the ZIP format, we can move the compressed stream of classes2.dex directly into classes2.zip. The entire process involves no extraction or compression. This technique is also applied in the resource synthesis in Tinker.

Summary #

Today we learned how to monitor the usage of I/O at the application layer. We tried implementing two solutions: Java Hook and Native Hook. Considering performance and compatibility, we ultimately chose the Native Hook solution.

When choosing a Hook solution, I would prioritize the Java Hook solution under equal conditions. However, regardless of the chosen Hook solution, we need to patiently examine the source code and analyze the calling process in order to find places where we can make use of it.

The requirements for a monitoring solution are different when it is used only for laboratory automation testing compared to being directly used by users online. The latter requires 99.9% stability and high performance that does not affect the user experience in order to go live. From the laboratory to the online environment, extensive gray testing and repeated optimization iterations are necessary.

Exercise #

The performance monitoring and analysis tool for WeChat, Matrix, has finally been open sourced. Most of the content in this article is based on the analysis of matrix-io-canary. Today’s homework is to try integrating I/O Canary and check if your application has any I/O-related issues. Please share your experiences with your classmates in the comments section.

Do you find that too simple? Then I have an advanced exercise for you. Looking at io_canary_jni.cc, you will find that Matrix currently only monitors the I/O behavior of the main thread, mainly to avoid dealing with multithread synchronization.

//todo Solve the problem of non-main thread opening and main thread operation
int ProxyOpen(const char *pathname, int flags, mode_t mode) {

In fact, improper I/O usage by other threads can also affect the performance of the application. “todo=never do”. Today, I invite you to try to solve this problem. However, considering the impact on performance, we cannot simply add locks. Can we achieve completely lock-free thread safety for this case, or can we minimize the granularity of locks as much as possible? I invite you to study this problem together, submit a Pull Request to Matrix, and participate in open source projects.

Feel free to “Share with a friend” to share today’s content with your friends and invite them to learn together. Finally, don’t forget to submit today’s homework in the comments section. I have prepared a generous “study encouragement package” for students who complete the homework diligently. Looking forward to learning and progressing together with you.