
08 Launch Optimization Part 2: Advanced Methods for Optimizing Launch Speed #

In the previous article we walked through the entire application startup process and its common problems, and shared several startup optimization methods. That covered the hardest part of startup optimization; to go further at the business level, unnecessary tasks can simply be removed or deferred. Having learned those tools and methods, you surely felt their effectiveness and proudly reported your achievements to your boss: "Startup speed improved by 30%, surpassing all competitors."

"But is there any room for further optimization? How can you prove that we have surpassed all competitors? How can the effect of startup optimization be measured in production? How can we ensure, and monitor, that startup never gets slower again?" Your boss fired back with four questions.

Faced with these four questions, you cannot afford to be at a loss. Have we really pushed application startup as far as it can go? How do we make the gains from startup optimization last? Let's answer these questions today.

Advanced Startup Methods #

In addition to the conventional optimization methods covered in the previous article, I have a few "hidden" methods, unrelated to business logic, that can further speed up application startup. Some of them qualify as "dark magic": a double-edged sword that requires thorough evaluation and testing.

1. I/O Optimization

Under heavy load, I/O performance degrades quickly, and not linearly. On low-end devices in particular, the same I/O operation can take dozens of times longer than on a high-end device. Network I/O should be avoided entirely during startup; disk I/O, by contrast, is a key focus of startup optimization. First we need to know which files are read during startup, how many bytes are read, with what buffer size, how long each read takes, and on which thread, among other details.

So how do we monitor I/O? Today, I’ll keep you guessing and discuss I/O in detail in the next article.

Looking at the collected data, we can see that chat.db alone is as large as 500MB. Startup often feels fast in local testing, so why is it so slow for some users in production? This may be because some users have accumulated a large amount of data locally. We also found that some heavy WeChat users have DB files exceeding 1GB. Heavy users are therefore a group that startup optimization must cover, and they need special optimization strategies.

Another issue is the choice of data structure. During startup we only need to read a handful of items from Setting.sp, yet SharedPreferences parses all of its data at initialization. If a SharedPreferences file holds more than 1000 items, parsing it at startup can take over 100 milliseconds. Parsing only the items actually used during startup would cut this dramatically; data structures that support random access are a better fit for the startup path.

ArrayMap can be adapted into a storage format that supports random access and lazy parsing. I won't go into the details today; these topics will be explored further in the sections on storage optimization.
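To make the lazy-parsing idea concrete, here is a minimal sketch (not WeChat's actual implementation; the class and method names are hypothetical). Raw "key=value" lines are indexed cheaply at startup, and an individual value is only parsed the first time its key is read, unlike SharedPreferences, which parses everything up front.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: index raw entries at load time, parse values lazily.
class LazySettings {
    private final Map<String, String> rawValues = new HashMap<>();
    private final Map<String, Object> parsed = new HashMap<>();

    // Index raw "key=value" lines without parsing the values (cheap at startup).
    public void load(String[] lines) {
        for (String line : lines) {
            int eq = line.indexOf('=');
            if (eq > 0) {
                rawValues.put(line.substring(0, eq), line.substring(eq + 1));
            }
        }
    }

    // Parse a single value on first access only, then cache the result.
    public int getInt(String key, int defaultValue) {
        Object cached = parsed.get(key);
        if (cached instanceof Integer) return (Integer) cached;
        String raw = rawValues.get(key);
        if (raw == null) return defaultValue;
        int value = Integer.parseInt(raw);
        parsed.put(key, value);
        return value;
    }
}
```

With this shape, a startup path that touches three keys pays only three small parses instead of parsing a thousand entries.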

2. Data Rearrangement

In the table above, we read 1KB of data from the test.io file, but the buffer was mistakenly set to 1 byte, so we have to read it 1000 times. Does the system really read from the disk 1000 times?

In fact, the number of read operations is just the number of times we initiated the operations and does not represent the actual number of disk I/Os. You can refer to the Linux file I/O process below.

When the Linux file system reads a file from the disk, it reads it in blocks. Generally, the block size is 4KB. This means that each disk I/O operation reads or writes at least 4KB of data and puts the data into the page cache. If the data to be read from the file is already in the page cache, there will be no actual disk I/O operation, and the data will be read directly from the cache, greatly improving the read speed. In the example above, although we read 1000 times, in reality, only one disk I/O operation occurs because all the other data is obtained from the page cache.
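The same effect can be demonstrated at user level with plain Java (an illustrative sketch, not Android-specific; class names are mine): a BufferedInputStream plays the role of the page cache, servicing 1024 one-byte read() calls from a single larger underlying read.

```java
import java.io.BufferedInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch: the number of read() calls the app issues is not the
// number of reads that actually reach the file.
class ReadCounting {
    // Wrapper that counts how many reads reach the underlying file.
    static class CountingStream extends FilterInputStream {
        int underlyingReads = 0;
        CountingStream(InputStream in) { super(in); }
        @Override public int read(byte[] b, int off, int len) throws IOException {
            underlyingReads++;
            return super.read(b, off, len);
        }
    }

    // Returns how many reads reached the file while the caller issued
    // data.length one-byte reads through a 4KB buffer; -1 on I/O error.
    public static int countUnderlyingReads(byte[] data) {
        try {
            Path f = Files.createTempFile("test", ".io");
            Files.write(f, data);
            CountingStream counting = new CountingStream(Files.newInputStream(f));
            try (InputStream in = new BufferedInputStream(counting, 4096)) {
                for (int i = 0; i < data.length; i++) in.read(); // many tiny reads
            }
            Files.delete(f);
            return counting.underlyingReads; // but only one read hit the file
        } catch (IOException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(countUnderlyingReads(new byte[1024])); // prints 1
    }
}
```

The kernel's page cache does the same thing one level lower, with 4KB blocks, which is exactly why rearranging small, frequently read data to be contiguous reduces real disk I/O.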

Classes used in the Dex file and various resource files in the installed APK are generally small but frequently read. We can use this system mechanism to rearrange them in the order they are read, reducing the number of actual disk I/O operations.

Class Rearrangement

The order of class loading during the startup process can be obtained by overriding the ClassLoader.

class GetClassLoader extends PathClassLoader {
    public GetClassLoader(String dexPath, ClassLoader parent) {
        super(dexPath, parent);
    }

    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        // Record each class name as it is loaded to capture the cold-start order
        writeToFile(name, "coldstart_classes.txt");
        return super.findClass(name);
    }
}

Then we can adjust the order of classes in the Dex file with the Interdex pass of Facebook's ReDex, and finally inspect the result with 010 Editor.

I have mentioned ReDex many times. It is a Dex optimization tool open-sourced by Facebook, which includes many useful features. We will have a more detailed introduction to it later on.

Resource File Rearrangement

Facebook used a “hierarchical heatmap” to implement resource file rearrangement early on. Recently, Alipay also detailed the principles and implementation methods of resource rearrangement in “Optimizing Android App Startup Performance by Reorganizing the Installation Package”.

In practice, both companies modified the kernel source code and built a custom ROM, for three purposes:

  • Statistics: record which resource files in the installation package (assets, drawable, layout, and so on) are loaded during application startup. As with class rearrangement, this yields an ordered list of resource loads.

  • Measurement: after rearranging the resources, we need to verify that the change actually took effect: which resource files were loaded, and whether each load caused real disk I/O or hit the page cache.

  • Automation: Any code submission may change the loading order of classes and resources during startup. It would be difficult to rely entirely on manual processing. By customizing some tracking points in the ROM and using related tools, we can include them in the automated process.

Similar to Nanoscope, which was mentioned earlier for analyzing time-consuming paths: when the system cannot meet our optimization requirements, modifying the ROM directly becomes necessary. Facebook's "hierarchical heatmap" is relatively comprehensive and comes with supporting dashboard tools, so I hope it will be open-sourced in the future.

In fact, if we only need statistics, we can also use the hook method. The following method uses Frida to obtain the order of Android resource loading. However, Frida is still relatively niche, and we will replace it with a more mature hook framework later.

resourceImpl.loadXmlResourceParser.implementation = function(a, b, c, d) {
  send('file:' + a)
  return this.loadXmlResourceParser(a, b, c, d)
}

resourceImpl.loadDrawableForCookie.implementation = function(a, b, c, d, e) {
  send("file:" + a)
  return this.loadDrawableForCookie(a, b, c, d, e)
}

To rearrange the order of files inside the installation package, we need to modify the 7zip source code so that it accepts a file list in a specified order. Again, the result can be inspected with 010 Editor.

Together, these two optimizations can shave off 100-200 milliseconds, and just as importantly they greatly reduce the variance of startup I/O time. On low-end and mid-range devices in particular, startup time often fluctuates significantly; this fluctuation is related to CPU scheduling, but even more often to I/O.

You may wonder where these optimization ideas came from. In fact, optimizations based on file-system and disk-read mechanisms are nothing new on server and Windows platforms. Innovation does not necessarily mean creating something completely new; transferring an existing solution to a new platform and adapting it well to that platform's characteristics is itself a great innovation.

3. Class Loading

In "Practical Evolution of WeChat Android Hotfix," published on the WeMobileDev account, I mentioned that class loading includes a "verify class" step, which checks every instruction of each method and is a time-consuming operation.

We can skip the "verify" step via hooking, which saves tens of milliseconds of startup time. Note, however, that the biggest gains come on first launch and on update (overwrite) installs. On the Dalvik platform, for example, a 2MB Dex file normally takes 350 milliseconds to load; with classVerifyMode set to VERIFY_MODE_NONE it takes only 150 milliseconds, a saving of more than 50%.

The ART platform, however, is much more complex: the hook must be compatible with several versions, and since most Dex files are already optimized at install time, skipping verify on ART only benefits dynamically loaded Dex files. The dalvik_hack-3.0.0.5.jar in Atlas can disable "verify" with the following code, but it does not yet support ART.

AndroidRuntime runtime = AndroidRuntime.getInstance();
runtime.init(context);
runtime.setVerificationEnabled(false);

This trick can noticeably reduce startup time, but it has some impact on subsequent execution, and compatibility must be weighed as well; for now I do not recommend using it on ART.

4. Tricks

First, Keeping Alive

When it comes to tricks, the first that comes to mind is probably keeping the app alive. Staying resident skips the creation and initialization of the Application, turning a cold start into a warm start. However, since apps began targeting API level 26 (Android 8.0) and its background restrictions, keeping an app alive has become harder and harder.

Big companies may instead seek cooperation with device manufacturers, such as WeChat's Hardcoder solution and OPPO's Hyper Boost solution. According to OPPO's data, optimizing the startup scenarios of QQ, Taobao, and WeChat yielded improvements of over 20%.

Sometimes you may wonder why WeChat stays alive and runs so smoothly. It is not purely a technical matter: once an app becomes important enough, it can push manufacturers to optimize specifically for it.

Second, Pluginization and Hot Patching

Starting from 2012, Taobao and WeChat began to explore pluginization. By 2015, various hot patching technologies such as Dexposed by Taobao, AndFix by Alipay, and Tinker by WeChat emerged.

Are they really as good as claimed? In fact, most of these frameworks rely heavily on hooks and private API calls in their designs, which brings two main disadvantages:

  • Stability. Although they claim 100% device compatibility, vendor ROM differences, installation failures, dex2oat failures, and other issues still cause code- and resource-related exceptions. The non-SDK-interface restrictions introduced in Android P will make future adaptation increasingly difficult and costly.

  • Performance. The Android Runtime gains many optimizations with every release, but the tricks and hacks used by pluginization and hot patching prevent us from benefiting from those underlying Runtime optimizations. After loading a patch, the Tinker framework can slow app startup by 5%-10%.

App hardening (packing) is a disaster for startup time, so sometimes we have to weigh trade-offs. To improve startup time, Alipay has also proposed a GC-suppression solution; however, the share of devices below Android 5.0 is already small, and the approach brings compatibility issues of its own. Ideally we truly optimize the time-consuming paths themselves rather than resorting to shortcuts.

Overall, we need to be cautious about using tricks. After fully understanding the internal mechanism of the tricks, we can selectively use them.

Launch Monitoring #

After all the hard work of optimization, we need a reasonable and accurate way to measure the results, and comprehensive monitoring to keep later changes from undoing them.

1. Lab Monitoring

If we want startup time measured objectively, video recording is an excellent choice. It is especially suitable for comparative testing against competitors, whose online data we cannot obtain.

The difficulty lies in how to accurately identify the end of the startup process in the lab system, which can be achieved in two ways:

  • 80% rendering. When more than 80% of the page has rendered, the startup is considered finished. The drawback is that the splash screen may be mistaken for the end of startup, which is usually not what we want.

  • Image recognition. Manually supply an image representing the fully launched state. When the lab system finds that the current screenshot is more than 80% similar to that image, startup is considered finished. This method is more flexible and controllable, though slightly harder to implement.
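The comparison step above can be sketched in a few lines of Java (a toy illustration with hypothetical names; a real lab system would use perceptual similarity rather than exact pixel equality):

```java
// Hypothetical sketch of the image-recognition end-of-launch check: compare a
// captured frame against a reference frame of the fully launched screen, and
// treat the launch as finished once pixel similarity exceeds 80%.
class LaunchEndDetector {
    // Fraction of pixels that match exactly between two same-sized frames.
    public static double similarity(int[] frame, int[] reference) {
        if (frame.length != reference.length) return 0.0;
        int matching = 0;
        for (int i = 0; i < frame.length; i++) {
            if (frame[i] == reference[i]) matching++;
        }
        return (double) matching / frame.length;
    }

    public static boolean launchFinished(int[] frame, int[] reference) {
        return similarity(frame, reference) > 0.8;
    }
}
```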

Lab startup monitoring can run automatically on a schedule; make sure it covers high-, mid-, and low-end devices. Screen recording does have one flaw, however: when a regression appears, we still have to locate the offending code manually.

2. Online Monitoring

The lab covers only limited scenarios and device models, so we still need to release the product and verify its performance with real users, where startup monitoring becomes more complex. Android Vitals can monitor an application's cold and warm startup times.

In reality, each application's startup flow is complex, and that definition does not truly reflect its startup time. Calculating startup time correctly requires attention to many details, such as:

  • Timing of the startup end. Should the moment when users can actually interact with the app be treated as the end of startup?

  • Deduction logic for the startup time. The time used for splash screens, advertisements, and new user guides should be deducted from the startup time.

  • Exclusion logic for the startup. Launches triggered by broadcasts or services, and launches during which the app enters the background, need to be excluded from the statistics.
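The deduction and exclusion rules above can be sketched as follows (a minimal illustration; the method and parameter names are hypothetical, not any company's actual reporting code):

```java
// Hypothetical sketch: subtract splash, ad, and new-user-guide time from the
// raw launch duration, and discard samples where the process was started by a
// broadcast/service or went to the background mid-launch.
class LaunchMetric {
    /** Returns the reportable launch time in ms, or -1 if the sample is excluded. */
    public static long reportableLaunchTime(long rawMs, long splashMs, long adMs,
                                            long guideMs, boolean startedByUser,
                                            boolean enteredBackground) {
        if (!startedByUser || enteredBackground) {
            return -1; // excluded: broadcast/service start, or backgrounded mid-launch
        }
        return Math.max(0, rawMs - splashMs - adMs - guideMs);
    }
}
```

For example, a 5-second raw launch that showed a 1-second splash and a 0.5-second ad reports 3.5 seconds.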

With precise deduction and exclusion logic, we can finally obtain realistic startup times for real users. As I mentioned in the previous article, accurate startup statistics are very important: after optimizations are validated in the lab, we still verify their effect with staged (gray) releases online, and that verification is only meaningful if the underlying statistics are accurate.

So, what metrics do we generally use to measure the speed of startup?

Many applications use the average startup time, but this metric is unreliable because it can be dragged up badly by a small number of very slow launches. I suggest metrics like the following:

  • Fast-open and slow-open ratios. For example, the proportion of launches faster than 2 seconds (the fast-open ratio) and slower than 5 seconds (the slow-open ratio). These show what share of users get a very good experience and what share get a relatively poor one.

  • 90th-percentile startup time. If 90% of users start in under 5 seconds, then the 90th-percentile startup time is 5 seconds.
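Both metrics above are straightforward to compute from a batch of launch samples (a minimal sketch with hypothetical names; the percentile uses the simple nearest-rank method):

```java
import java.util.Arrays;

// Hypothetical sketch of the suggested metrics: share of launches under a
// "fast" threshold, share over a "slow" threshold, and the 90th percentile.
class LaunchStats {
    public static double ratioUnder(long[] timesMs, long thresholdMs) {
        long count = Arrays.stream(timesMs).filter(t -> t < thresholdMs).count();
        return (double) count / timesMs.length;
    }

    public static double ratioOver(long[] timesMs, long thresholdMs) {
        long count = Arrays.stream(timesMs).filter(t -> t > thresholdMs).count();
        return (double) count / timesMs.length;
    }

    // Nearest-rank percentile: the smallest sample such that at least p of
    // all samples are less than or equal to it.
    public static long percentile(long[] timesMs, double p) {
        long[] sorted = timesMs.clone();
        Arrays.sort(sorted);
        int index = (int) Math.ceil(p * sorted.length) - 1;
        return sorted[Math.max(0, index)];
    }
}
```

Unlike a plain average, the slow-open ratio and the 90th percentile make a regression among the worst-served users immediately visible.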

We also need to distinguish between startup types: first-install startup, update-install startup, cold startup, and warm startup should each be tracked. Cold startup time is generally the headline metric, while the proportion of warm startups also reflects how active the app is and how well it stays alive.

Beyond these metrics, monitoring startup call stacks online is even harder. Facebook uses its Profilo tool to capture the time consumption of the entire startup flow and automatically compares versions in the background, catching new versions that introduce additional time-consuming functions.

Summary #

Today we learned some startup optimization methods that are unrelated to business, which can further reduce startup time, especially reducing the fluctuations that disk I/O may bring. Then we discussed the impact of some “black technology” on startup. When it comes to “black technology,” we need to consider both the pros and cons and make careful choices. Finally, we explored how to better measure and monitor startup speed in the laboratory and online.

Optimizing startup requires patience and thorough understanding of the entire process. We need to carefully chip away the time, especially for low-end devices and busy systems. The optimization of data reordering has greatly inspired me and opened up a new direction. It has also made me realize that when we are familiar enough with the underlying knowledge, we can use the characteristics of the system to achieve deeper optimization.

Regardless, you should always remember: when it comes to startup optimization, we need to be cautious about focusing too much on KPIs. What we need to solve is not just a number, but the true user experience problem.

After reading the startup optimization methods I shared, I believe you must have many good ideas and methods. The homework for today is to share your “hidden” startup optimization “cheats” in the comment section, and share your learnings, exercises, and experiences from today’s class.

Exercise after class #

Today, our Sample is about how to remove the verify step in Dalvik. You can follow this idea to analyze the process of the Dalvik virtual machine loading Dex and classes.

Feel free to click “Ask friend to read” and share today’s content with your friends, inviting them to study together. Finally, don’t forget to submit today’s homework in the comments section. I have also prepared a generous “learning encouragement package” for students who complete the homework seriously. Looking forward to making progress together with you.