07 Launch Optimization Part 1: Looking at Launch Speed Optimization Through the Launch Process #

Waiting in line at the supermarket checkout, your payment app takes more than ten seconds just to bring up the scan-to-pay screen. Would you switch to another payment tool?

You want to buy a book to enrich yourself, but after the page opens you can't do anything for more than ten seconds. Would you switch to another app to make the purchase?

When a user wants to open an app, they must go through the "startup" process. Startup time is not just a matter of user experience: for apps like Taobao and JD.com, it directly affects key metrics such as retention and conversion. For developers, startup speed is our "face". It is plainly visible to everyone, and we all hope our app's startup speed can outperform all competitors.

So, what problems can occur during the startup process? How can we optimize and monitor the startup speed of our applications? Today, let’s take a look at how to solve these issues.

Launch Analysis #

Before we start optimizing, we should first understand the key stages of the entire launch process, as well as the user experience issues that may arise from the moment the user clicks on the icon.

1. Launch Process Analysis

Taking WeChat as an example, there are 4 key stages that the user goes through from clicking on the icon.

  • T1 Preview window display. Before launching the WeChat process, the system creates a preview window based on WeChat’s Theme property. If we disable the preview window or set it to transparent, the user will still see the desktop during this time.

  • T2 Splash screen display. After the WeChat process and splash screen window page are created, and a series of preparation work such as inflating views, measuring, and layout are completed, the user can finally see the familiar “little earth”.

  • T3 Main page display. After the main window is created and the page is prepared for display, the user can see the main interface of WeChat.

  • T4 Page is interactive. After the launch is completed, WeChat needs to continue with several tasks, such as preloading the chat and Moments interface, preparing the mini-program framework and processes, etc. Only after these tasks are completed can the user start chatting.

2. Launch Issues Analysis

From the 4 key stages of the launch process, we can identify 3 issues that users may encounter during the launch process. These 3 issues are actually common problems that most applications may encounter during startup.

  • Issue 1: No response after clicking on the icon for a long time

If we disable the preview window or specify a transparent theme, there is a noticeable delay between the user tapping the icon and the splash screen actually appearing. From a user experience perspective, tapping the icon and then staring at the desktop for a few seconds makes it look as if the tap did not register. This is even more noticeable on mid- to low-end devices.

  • Issue 2: Slow display of the main page

Nowadays, application startup processes are becoming ever more complex: splash screen advertisements, hotfix frameworks, plugin frameworks, and large front-end frameworks all need to complete their preparation work during the startup phase. The T3 time mentioned earlier is a nightmare on mid- to low-end devices, often exceeding ten seconds.

  • Issue 3: Unable to operate after the main page is displayed

Since the main page displays so slowly, can we postpone as much work as possible and execute it asynchronously? Many applications do just that, but it can lead to one of two outcomes: either the main page appears blank, or the user cannot interact with the page after it appears.

Many applications stop the launch timer the moment the main page appears, which is irresponsible to the user. Seeing a main page that cannot even be scrolled for more than ten seconds is meaningless to the user. Launch optimization should not be driven purely by KPIs; it should cover the entire journey from tapping the icon to the user actually being able to interact.
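
As a hedged illustration of measuring "click to interactive" rather than "click to first frame", the sketch below separates the two metrics. On Android, the interactive point is roughly where you would call Activity.reportFullyDrawn(); the class and method names here are hypothetical, not part of any framework.

```java
// Hypothetical helper: record launch phases so that "launch complete"
// means "page is interactive", not merely "main page appeared".
class LaunchTimer {
    private long clickTimeMs = -1;
    private long firstFrameMs = -1;
    private long interactiveMs = -1;

    public void onIconClicked(long nowMs) { clickTimeMs = nowMs; }

    public void onMainPageShown(long nowMs) { firstFrameMs = nowMs; }

    // Call this when the page can actually respond to the user,
    // e.g. where you would call Activity.reportFullyDrawn() on Android.
    public void onPageInteractive(long nowMs) { interactiveMs = nowMs; }

    // The honest launch metric: click -> interactive.
    public long launchDurationMs() {
        if (clickTimeMs < 0 || interactiveMs < 0) return -1;
        return interactiveMs - clickTimeMs;
    }

    // The misleading metric: click -> first frame of the main page.
    public long firstFrameDurationMs() {
        if (clickTimeMs < 0 || firstFrameMs < 0) return -1;
        return firstFrameMs - clickTimeMs;
    }
}
```

Reporting launchDurationMs() instead of firstFrameDurationMs() keeps the metric honest about what the user actually experiences.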

Startup Optimization #

The methods for optimizing startup speed are basically the same as those for optimizing stuttering, but because startup is so important, we need to be even more careful. We hope that every feature and business loaded during startup is necessary and that their implementation has been meticulously refined, especially for low-end devices.

1. Optimization Tools

“To do a good job, one must first sharpen one’s tools.” We need to find a tool that is suitable for analyzing startup optimization.

Recall the tools covered in the stuttering optimization chapter. Traceview has too much performance overhead, so its results do not reflect reality. Nanoscope is very accurate, but it currently supports only the Nexus 6P and x86 emulators, so it cannot be used for testing on low-end devices. The flame graphs in Simpleperf are not well suited to analyzing the startup process. Systrace conveniently traces the time spent in key system calls, but it cannot analyze time spent in application code.

Overall, the "systrace + function instrumentation" approach from the stuttering optimization chapter seems ideal, and it also shows key system events such as GC, System Server activity, and CPU scheduling.

By running the following command, we can see the systrace categories supported by the phone. Different systems support different categories; among them, dalvik, sched, ss, and app are the ones we care about most.

python systrace.py --list-categories

Through function instrumentation, we can see the function call flow of the main thread and other threads in the application. The principle is simple: the two functions below are inserted at the entry and exit of every method.

class Trace {
  public static void i(String tag) {
    // Delegate to the system trace API so the section appears in systrace
    android.os.Trace.beginSection(tag);
  }

  public static void o() {
    android.os.Trace.endSection();
  }
}

Of course, there are many details to get right, such as how to reduce the performance impact of instrumentation and which functions should be excluded. The improved systrace keeps the performance overhead under 2x, so it can basically reflect the real startup process. The effect after instrumentation looks like the code below; you can also refer to the Sample in the homework exercises.

class Test {
  public void test() {
    Trace.i("Test.test()");
    // Original work
    Trace.o();
  }
}

Accurate data evaluation is essential to guide the direction of optimization. This step is extremely important: I have seen many students who evaluated incompletely or with the wrong methods, headed in the wrong direction, and only after one or two months of hard work discovered that the expected results could never be achieved.

2. Optimization Methods

After obtaining the panorama of the entire startup process, we can clearly see the running status of the system, application processes, and threads during this period, and now we can really start “working”.

Specific optimization methods can be categorized into splash screen optimization, business analysis, business optimization, thread optimization, GC optimization, and system call optimization.

  • Splash Screen Optimization

Toutiao (Today's Headlines) turns the preview window itself into a splash screen, so users see this "preview splash" almost immediately. The resulting instant-response feeling is great on high-end devices, but on mid- to low-end devices it actually lengthens the total splash time.

If tapping the icon produces no response, users subjectively blame the phone system for being slow. Therefore, the approach I recommend is to enable the "preview splash" only on Android 6.0 and above, giving users with capable devices a better experience.
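
A minimal sketch of that decision, assuming the only criterion is OS version as the text recommends. The helper name is hypothetical; on Android the version would come from Build.VERSION.SDK_INT, and API level 23 corresponds to Android 6.0.

```java
// Hypothetical policy helper: enable the "preview splash" only on
// Android 6.0 (API 23) and above, per the recommendation in the text.
class SplashPolicy {
    static final int ANDROID_6_0 = 23; // Build.VERSION_CODES.M

    // On a real device, pass in Build.VERSION.SDK_INT.
    public static boolean shouldEnablePreviewSplash(int sdkInt) {
        return sdkInt >= ANDROID_6_0;
    }
}
```

Based on the result, the app would either keep a windowBackground in the launch Theme (preview splash on) or set the preview window transparent/disabled for weaker devices.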

WeChat applies another optimization: merging the splash screen Activity and the main Activity. Online data showed that removing one Activity saves roughly 100 milliseconds. However, doing this makes lifecycle management much more complex, especially with the many third-party entry paths such as PWA and QR-code scanning.

  • Business Analysis

First, we need to account for every module that runs during the current startup process: which are definitely needed, which can be removed, and which can be lazily loaded. We can also choose different startup modes for different business scenarios; for example, launching via a QR-code scan can load only the few modules that scenario needs. For mid- to low-end devices, we need to learn to degrade gracefully and push product managers to make some feature trade-offs. Note, however, that lazy loading must be staggered: if every deferred module initializes the moment the home page appears, the user once again cannot interact.
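
The classification above can be sketched as a small registry: required modules run during launch, and lazy modules are deferred and triggered one at a time rather than in a burst. This is a hedged illustration only; the registry API and the idea of driving runNextLazy() from idle callbacks are hypothetical, not a specific framework's design.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical module registry: separates must-run startup work from
// lazily loaded work, and drains the lazy queue one task at a time so
// deferred initialization never blocks the user all at once.
class ModuleRegistry {
    private final List<Runnable> required = new ArrayList<>();
    private final List<Runnable> lazy = new ArrayList<>();

    void addRequired(Runnable init) { required.add(init); }

    void addLazy(Runnable init) { lazy.add(init); }

    // Run only what launch truly needs.
    void runRequired() {
        for (Runnable r : required) r.run();
    }

    // Run one deferred module, e.g. per idle callback; returns false
    // when nothing is left to initialize.
    boolean runNextLazy() {
        if (lazy.isEmpty()) return false;
        lazy.remove(0).run();
        return true;
    }
}
```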

  • Business Optimization

After this sorting pass, what remains are the modules that truly must run during startup. At this point we can only grit our teeth and push the optimization further. In the early stages we "trim the fat": first identify the main thread's bottlenecks. Ideally we optimize algorithmically; for example, a data-decryption step that takes 1 second might be reduced to 10 milliseconds with a better algorithm. Alternatively, we can preload these tasks asynchronously on separate threads, keeping in mind that excessive thread preloading makes our logic more complex.

As we optimize the business, we will find that architectural and historical baggage hinders our progress. One common issue is that a single event is listened to by multiple business modules, and the flood of callbacks causes many tasks to execute at once. Some frameworks have "heavy" initialization; some plugin frameworks, for instance, perform all sorts of reflection and hooking at startup, which takes at least several hundred milliseconds. There are also heavy historical burdens whose impact is large and whose changes carry high risk; still, when the timing is right, we need to be brave enough to pay down these "historical debts".

  • Thread Optimization

Thread optimization is like solving a fill-in-the-blanks puzzle: we want to use every time slice, keeping both the main thread and the other threads fully loaded. Of course, we also want each thread to run at full speed rather than passing work around like a relay baton. The main goal of thread optimization is therefore to reduce the fluctuation caused by CPU scheduling and make the application's startup time more stable.

Concretely, thread optimization starts with controlling the number of threads. Too many threads compete for CPU resources, so we need a unified thread pool whose size is tuned to the machine's performance.
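
A minimal sketch of such a unified pool, sized from the core count. The sizing formula (clamped between 2 and 4) is an illustrative assumption, not a prescription; real apps tune it per device grade.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical single startup thread pool, sized from the machine's
// core count so low-end devices are not oversubscribed.
class StartupExecutor {
    static ThreadPoolExecutor create() {
        int cores = Runtime.getRuntime().availableProcessors();
        // Illustrative clamp: at least 2 workers, at most 4,
        // and leave one core free for the main thread.
        int poolSize = Math.max(2, Math.min(cores - 1, 4));
        return new ThreadPoolExecutor(
                poolSize, poolSize,
                30L, TimeUnit.SECONDS,
                new LinkedBlockingQueue<Runnable>());
    }
}
```

All startup modules would submit work to this one pool instead of spawning their own threads.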

We can use the sched file introduced in the stuttering optimization chapter to check thread context-switch data. Pay special attention to the number of involuntary switches (nr_involuntary_switches).

/proc/[pid]/sched:
  nr_voluntary_switches:
    Number of voluntary context switches, which occur when a thread cannot obtain a resource it needs and yields the CPU. The most common cause is I/O.
  nr_involuntary_switches:
    Number of involuntary context switches, which occur when the system forcibly deschedules a thread, for example when many threads are contending for the CPU.
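
As a hedged sketch, the counters can be pulled out of the sched file's text like this. The parsing is deliberately simplified, and the exact field layout of /proc/[pid]/sched can vary between kernel versions.

```java
// Hypothetical parser for the "key : value" counter lines found in
// /proc/[pid]/sched text; returns -1 when the key is absent.
class SchedStats {
    static long parseCounter(String schedText, String key) {
        for (String line : schedText.split("\n")) {
            if (line.trim().startsWith(key)) {
                String[] parts = line.split(":");
                if (parts.length == 2) {
                    return Long.parseLong(parts[1].trim());
                }
            }
        }
        return -1;
    }
}
```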

Another aspect is checking locks between threads. To speed up tasks executed during startup, we once moved a time-consuming task off the main thread to run concurrently, yet saw no improvement. Careful inspection revealed that the task held a lock internally, and other work on the main thread was waiting for that lock. Lock-wait events are visible in systrace; we need to investigate whether such waits can be optimized away, and above all prevent the main thread from idling for long stretches.

In particular, many startup frameworks now use a Pipeline mechanism that establishes dependencies between tasks according to business priority. For example, mmkernel (used internally by WeChat) and Alpha (a startup framework recently open-sourced by Alibaba) build directed acyclic dependency graphs and run independent tasks in parallel on a thread pool to maximize startup speed. If task dependencies are misconfigured, situations like the one in the figure below can easily arise, where the main thread waits for task C to finish and idles for 2950 milliseconds.
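
The Pipeline idea can be sketched with CompletableFuture: tasks with no dependencies run in parallel, and a dependent task starts only when all of its parents finish. This is an illustrative miniature, not the actual design of mmkernel or Alpha, and the task names A/B/C are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical three-task startup DAG: A and B are independent,
// C depends on both. Independent tasks run in parallel on the pool.
class StartupDag {
    static String run() {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        StringBuffer order = new StringBuffer(); // thread-safe append log

        // Tasks A and B have no dependencies and may run in parallel.
        CompletableFuture<Void> a =
                CompletableFuture.runAsync(() -> order.append("A"), pool);
        CompletableFuture<Void> b =
                CompletableFuture.runAsync(() -> order.append("B"), pool);

        // Task C starts only after both A and B have finished.
        CompletableFuture<Void> c = CompletableFuture.allOf(a, b)
                .thenRunAsync(() -> order.append("C"), pool);

        c.join();
        pool.shutdown();
        return order.toString(); // A and B in either order; C always last
    }
}
```

Misdeclaring C's dependencies (e.g. serializing A, B, and C) is exactly what produces the long main-thread idle the text warns about.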

  • GC Optimization

During the startup process, we should minimize the number of GCs to avoid long main-thread pauses. This matters especially on Dalvik, and we can use systrace to examine specifically how much time GC consumes across the entire startup.

python systrace.py dalvik -b 90960 -a com.sample.gc

For the specific meanings of various GC events, you can refer to Investigate RAM Usage.

I don’t know if you still remember when I mentioned Debug.startAllocCounting in “Memory Optimization”. We can use it to monitor the total time spent on GC during the entire startup process, especially the number and duration of blocking synchronized GCs.

// Total time spent on GC, in milliseconds
Debug.getRuntimeStat("art.gc.gc-time");
// Total time spent on blocking GC, in milliseconds
Debug.getRuntimeStat("art.gc.blocking-gc-time");

If we find that the main thread spends significant time waiting on GC, we must analyze further with the Allocation Tracker. During startup, avoid heavy string manipulation, especially serialization and deserialization. Frequently created objects, such as the byte arrays and buffers in network and image libraries, should be reused. If certain modules must create objects at high frequency, consider moving them to a native implementation.
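
Buffer reuse can be sketched as a tiny pool, assuming fixed-size buffers as network and image libraries commonly use. The pool API and the exposed allocation counter are illustrative simplifications.

```java
import java.util.ArrayDeque;

// Hypothetical byte[] pool: reusing buffers keeps the startup path from
// churning out short-lived allocations that trigger GC.
class BufferPool {
    private final ArrayDeque<byte[]> free = new ArrayDeque<>();
    private final int bufferSize;
    int allocations = 0; // exposed for illustration only

    BufferPool(int bufferSize) { this.bufferSize = bufferSize; }

    synchronized byte[] acquire() {
        byte[] buf = free.poll();
        if (buf == null) {
            allocations++;
            buf = new byte[bufferSize]; // allocate only on pool miss
        }
        return buf;
    }

    synchronized void release(byte[] buf) {
        if (buf.length == bufferSize) free.push(buf);
    }
}
```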

Escaped Java objects can easily cause GC problems, and we tend to overlook this when writing code. Keep object lifetimes as short as possible so that, where the runtime supports it, they can be allocated and reclaimed on the stack.

  • System Call Optimization

Through the System Service category in systrace, we can see the System Server's CPU usage during startup. We should avoid making blocking system calls during startup, such as PackageManagerService operations and other Binder invocations.

It is also unwise to start the application's other processes too early during startup, because both the System Server and the new process will compete with us for CPU. This is especially true when the system is low on memory: starting a new process may be "the straw that breaks the camel's back", triggering the system's low-memory killer, after which the system kills (and, via keep-alive mechanisms, restarts) a large number of processes, stealing CPU from the foreground process.

Let me give you a practical case. One of our apps used to start its download and video-playback processes during startup. After switching to on-demand startup, online startup time improved by 3%, and on low-end devices with less than 1GB of memory the total startup time improved by 5% to 8%, a very noticeable effect.

Summary #

Today we first learned about the entire startup process, in which there are four key stages. In these four stages, users may encounter three issues: “clicking the icon without response for a long time,” “homepage loading too slowly,” and “unable to operate after the homepage is displayed.”

Then we learned some common methods for startup optimization and monitoring. Different strategies are needed for different business scenarios and for devices of different performance levels. Some of these points are only sketched here, and I hope you will flesh them out through study and practice.

Most of the content I covered is business-related, and streamlining the business is also the fastest way to get results. In this process, however, we need to learn to make trade-offs. You may have seen product managers pressuring developers to add all kinds of preloading just to improve the metrics of their own modules. But when everyone wants to be fast, the result is usually a tangle of code that cannot be fast at all.

For example, a feature used by only 1% of users forces preloading onto all users. Faced with this, we need to be ruthless: keep only the business that truly cannot be cut, or reach that 1% of users directly through scenario-based loading. Arguing with product managers is never easy, and the key is data. We must prove that the overall retention and conversion gains from startup optimization outweigh the negative impact of removing a given feature's preloading.

Startup optimization is an important part of performance optimization work. Today’s homework is to share in the comment section what optimization you have done for startup in your past work and what the final effect was. Please share your gains and experiences from today’s learning and practice.

Exercise #

"To do a good job, one must first sharpen one's tools." I have said many times that "systrace + function instrumentation" is an excellent tool for diagnosing stutter. Today's Sample shows how it is implemented. Note that the Sample uses ASM for instrumentation; interested students can study its usage after class, and a dedicated lesson on instrumentation is coming later.

We can apply the Sample to our own applications. Although it already filters out most functions, we still need to maintain the whitelist configuration: frequently called low-level functions, such as logging and encryption/decryption, should be filtered out via the whitelist. Otherwise the trace may contain a large number of spikes, as shown below.

You are welcome to click "Share with Friends" to share today's content and invite your friends to learn together. Finally, don't forget to submit today's homework in the comments section. I have prepared a rich "Study Boost Gift" for students who complete the homework seriously. I look forward to learning and progressing together with you.