
05 Stutter Optimization Part 1: Stutter Analysis Methods You Need to Master #

“Why can’t my Hou Yi move?” When playing “King of Glory,” I always dread running into stutter during team fights, when the game turns into a slideshow. The same goes for applications: we often hear users complain, “Why does this app start up so slowly?” “Why does it stutter so badly when I scroll?”

Users may not readily notice high memory, battery, or data usage, but they are particularly sensitive to stutter and perceive it immediately. For developers, on the other hand, stutter is very difficult to troubleshoot and pin down: the underlying causes are complex, involving the CPU, memory, and disk I/O, and they are heavily influenced by the state of the user’s system at the time.

So how do we define stutter? What tools are available locally to help us detect and troubleshoot it? What are the differences between these tools? Today, I will help you resolve these doubts.

Basic Knowledge #

Before discussing the tools for troubleshooting stutter, you need to understand some CPU-related basics. There can be countless causes of stutter, but they all ultimately show up in CPU time. CPU time is divided into two parts: user time and system time. User time is the time spent executing user-space application code; system time is the time spent executing kernel code, including I/O, locks, interrupts, and other system calls.

1. CPU Performance

Let’s begin with a brief look at CPU performance. Because of constraints such as power consumption and size, mobile CPUs differ from PC CPUs, but in recent years their performance has been closing in on PCs rapidly. For example, the Huawei Mate 20’s “Kirin 980” and the iPhone XS’s “A12” already use a leading 7-nanometer process, ahead of the desktop CPUs of their time.

Evaluating CPU performance involves considering parameters such as clock speed, number of cores, and cache. The specific performance is manifested in terms of computational capability and instruction execution capability, i.e., floating-point calculations executed per second and instructions executed per second.

Of course, architecture also needs to be taken into account. The “Kirin 980” adopts a three-tiered energy efficiency architecture, consisting of two 2.6GHz high-performance A76 cores, two 1.92GHz large A76 cores, and four 1.8GHz small A55 cores. In comparison, the “A12” utilizes a design with two performance cores and four efficiency cores. This design is mainly for conserving power when operating under low loads. During development, we can obtain CPU information for a device using the following methods:

# Get the range of possible CPU cores
cat /sys/devices/system/cpu/possible

# Get the maximum frequency of a specific core (here cpu0)
cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq

As machine learning has taken off, modern mobile chips ship not only with powerful GPUs but also with NPUs (Neural Processing Units) designed specifically for neural-network computation. The “A12” includes an eight-core NPU capable of five trillion operations per second. From CPUs to GPUs and now to AI chips, as overall mobile computing power rises, AI applications such as medical diagnosis and image enhancement become practical on mobile devices. Edge computing has also been getting more attention recently, because it lets us make better use of the computing power on the device and reduce expensive server costs.

Therefore, during the development process, we need to base our decisions on the device’s CPU performance. For example, the number of threads used in a thread pool should be based on the number of CPU cores, and advanced AI features should only be enabled on devices with high clock frequencies or NPUs.
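
For the thread-pool example above, here is a minimal sketch of sizing the pool from the core count reported by the runtime (the class and method names are my own, purely for illustration):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolSizing {
    // Returns a pool sized for CPU-bound work: roughly one thread per core.
    public static ExecutorService newCpuBoundPool() {
        // availableProcessors() counts the cores visible to the runtime;
        // on big.LITTLE chips this includes both large and small cores.
        int cores = Runtime.getRuntime().availableProcessors();
        return Executors.newFixedThreadPool(Math.max(2, cores));
    }
}

I/O-bound pools are usually sized larger than this, since their threads spend most of their time blocked rather than using the CPU.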

Having covered all of this, let’s return to the CPU time I mentioned earlier, which is made up of user time and system time. When we hit a stutter issue, how can we tell whether it is caused by our own code or by the system? What clues do user time and system time give us? Two very important indicators come into play here to help us make that judgment.

2. Indicators for Analyzing Stutter Issues

When facing a stutter issue, the first step is to check CPU usage. How do we do that? We can read the overall CPU usage of the system from /proc/stat and the CPU usage of a specific process from /proc/[pid]/stat.

The meanings of the different attributes in the stat file and the calculation of CPU usage can be found in Linux CPU Performance and the Linux documentation. Some of the important fields include:

proc/self/stat:
  utime:       User time, reflecting the time taken by the user code's execution
  stime:       System time, reflecting the time taken by system call execution
  majorFaults: Number of page faults that require disk copying
  minorFaults: Number of page faults that do not require disk copying

If the CPU usage rate is consistently greater than 60%, it indicates that the system is in a busy state, and further analysis of the user time-to-system time ratio is needed. For normal applications, the system time will not exceed 30% for an extended period of time. If it exceeds this value, we should further investigate whether there is excessive I/O or other system call issues.
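
To make the calculation concrete, here is a minimal sketch (the helper names are my own) that pulls utime and stime out of /proc/[pid]/stat; sampling it twice and dividing the delta by the elapsed clock ticks gives the process CPU usage over that window:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ProcStatReader {
    // Returns { utime, stime } in clock ticks (usually 1 tick = 10 ms).
    public static long[] readCpuTicks(int pid) throws IOException {
        String stat = new String(
                Files.readAllBytes(Paths.get("/proc/" + pid + "/stat")),
                StandardCharsets.UTF_8);
        // The comm field is wrapped in parentheses and may contain spaces,
        // so only split the part after the closing ')'.
        String[] fields = stat.substring(stat.lastIndexOf(')') + 2).split(" ");
        long utime = Long.parseLong(fields[11]);  // 14th field overall
        long stime = Long.parseLong(fields[12]);  // 15th field overall
        return new long[] { utime, stime };
    }
    // usage% over a window = 100 * (delta utime + delta stime)
    //                        / (ticksPerSecond * windowSeconds)
}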

Android stands on the shoulders of the Linux giant. Although it has made many modifications and removed some tools, it still keeps plenty of useful ones that make troubleshooting easier. Here are several commonly used commands: top identifies which processes are consuming CPU; vmstat monitors the operating system’s virtual memory and CPU activity in real time; strace traces all system calls made by a specific process.

In addition to CPU usage, we also need to examine CPU saturation. CPU saturation reflects the situation where threads are waiting in a queue for CPU, in other words, the load on the CPU.

CPU saturation is related first of all to the number of threads in the application. If too many threads are started, the CPU constantly switches between them, wasting a lot of time on context switching. Remember that every context switch has to save and restore registers and the program counter, which takes at least tens of nanoseconds.

We can use the vmstat command or the /proc/[pid]/sched file to view the number of CPU context switches, paying particular attention to nr_involuntary_switches, the number of involuntary context switches.

proc/self/sched:
  nr_voluntary_switches:      Number of voluntary context switches, where the thread gave up the CPU because it could not get a required resource, most commonly I/O.
  nr_involuntary_switches:    Number of involuntary context switches, where the system forcibly rescheduled the thread, for example when too many threads are competing for the CPU.
  se.statistics.iowait_count: Number of times the thread waited for I/O.
  se.statistics.iowait_sum:   Total time spent waiting for I/O.
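
A similar sketch (again, a hypothetical helper of my own) for pulling these counters out of /proc/[pid]/sched, whose lines have the form “name : value”:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SchedReader {
    // Reads one integer counter, e.g. "nr_involuntary_switches", from /proc/[pid]/sched.
    // (Integer counters only; fields such as iowait_sum carry a decimal part.)
    public static long readSchedField(int pid, String key) throws IOException {
        for (String line : Files.readAllLines(Paths.get("/proc/" + pid + "/sched"))) {
            if (line.startsWith(key)) {
                return Long.parseLong(line.substring(line.indexOf(':') + 1).trim());
            }
        }
        return -1;  // field not exposed by this kernel version
    }
}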

Additionally, the uptime command shows the CPU’s average load over the last 1, 5, and 15 minutes. For example, on a 4-core CPU, a current load average of 8 means that, on average, each core has one thread running and one thread waiting. It is generally recommended to keep the average load within “0.7 × the number of cores”.

00:02:39 up 7 days, 46 min,  0 users,  
load average: 13.91, 14.70, 14.32

Another factor that affects CPU saturation is thread priority. Thread priority affects the scheduling strategy of the Android system and is mainly determined by the nice value and cgroup type. The lower the nice value, the greater the ability of the thread to preempt CPU time slices. When the CPU is idle, the impact of thread priority on execution efficiency may not be particularly noticeable, but when the CPU is busy, thread scheduling can have a significant impact on execution efficiency.
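
For reference, this is how an application thread typically lowers its own nice value on Android (a minimal sketch; the class name is mine):

import android.os.Process;

public class BackgroundWorker extends Thread {
    @Override
    public void run() {
        // THREAD_PRIORITY_BACKGROUND corresponds to nice 10, so this thread
        // will yield CPU time slices to the default-priority main thread.
        Process.setThreadPriority(Process.THREAD_PRIORITY_BACKGROUND);
        // ... actual work ...
    }
}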

Regarding thread priority, you need to pay attention to whether there are high-priority threads waiting for low-priority threads, for example, the main thread waiting for a lock held by a background thread. From the perspective of the application, whether it is user time, system time, or waiting for CPU scheduling, they all represent the time spent during program execution.

Android Performance Profiling Tools #

You may find it troublesome to troubleshoot with assorted Linux commands. Is there a simpler, graphical way? Traceview and systrace are the tools we are most familiar with for troubleshooting performance issues. Based on how they are implemented, these tools fall into two categories.

The first category is “instrument”. It captures the invocation process of all functions during a specific period of time. By analyzing the function call flow during this period of time, we can further analyze the points that need optimization.

The second category is “sample”. It selectively or randomly observes the invocation process of certain functions. Through this limited information, suspicious points in the flow can be inferred, and then further analysis can be conducted.

What are the differences between these two categories? When should we choose the appropriate tool for a specific scenario? Are there any other useful tools to use? Let’s take a look one by one.

1. Traceview

Traceview was the first performance analysis tool I used, and also the one I complain about the most. It hooks into the Android Runtime’s method trace events for function calls and writes each function’s execution time and call relationships into a trace file.

Traceview belongs to the “instrument” category. It can be used to view which functions are called during the entire process. However, the tool itself brings significant performance overhead and sometimes cannot reflect the real situation. For example, the runtime of a function itself is 1 second, but it may become 5 seconds when Traceview is enabled. Moreover, the changes in the runtime of these functions are not proportional.

Starting from Android 5.0, the startMethodTracingSampling method was added, which uses a sampling approach instead to reduce the impact on the running app. With this “sample” mode we have to strike a balance between overhead and the richness of the collected information.
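
Both modes are driven from the android.os.Debug API; here is a minimal sketch (the trace names and buffer values are arbitrary examples):

import android.os.Debug;

public class TracingExample {
    public static void traceColdStart(Runnable workload) {
        // Sample mode (Android 5.0+): take a stack sample every 1000 µs
        // into an 8 MB buffer, far cheaper than full instrumentation.
        Debug.startMethodTracingSampling("cold_start_sampling", 8 * 1024 * 1024, 1000);
        // Instrument mode would instead be:
        // Debug.startMethodTracing("cold_start_full");
        try {
            workload.run();   // the code path being measured
        } finally {
            Debug.stopMethodTracing();
        }
    }
}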

Neither mode of Traceview handles release builds well; for example, they cannot deal with obfuscation. In fact, the trace file format is very simple, and I previously wrote a small tool that deobfuscates traces using the mapping file.
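
That deobfuscation tool itself is not shown in this article, but the core idea is simple: build a reverse lookup from the ProGuard/R8 mapping file and rewrite the symbols in the trace. A minimal sketch of the first half, handling class names only (the helper names are my own):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class MappingReader {
    // Builds a reverse map (obfuscated class name -> original class name)
    // from a ProGuard/R8 mapping.txt. Member mappings are skipped here.
    public static Map<String, String> readClassMap(String mappingPath) throws IOException {
        Map<String, String> classMap = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get(mappingPath))) {
            // Class lines are unindented and look like "original.Name -> obfuscated:".
            if (!line.startsWith(" ") && line.endsWith(":")) {
                String[] parts = line.substring(0, line.length() - 1).split(" -> ");
                if (parts.length == 2) {
                    classMap.put(parts[1], parts[0]);
                }
            }
        }
        return classMap;
    }
}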

2. Nanoscope

Are there any performance analysis tools in the “instrument” category with less performance overhead?

The answer is yes. Nanoscope, an open-source project by Uber, achieves this effect. Its implementation principle is to directly modify the source code of the Android virtual machine, add instrumentation code at the entry and exit positions of ArtMethod execution, and write all the information to memory first. The result file is generated only when the tracing ends.

During use, you can clearly feel that the app does not become sluggish when Nanoscope is enabled, but it takes longer to generate the result file after the tracing ends. On the other hand, it can be used to analyze any app and is useful for competitor analysis.

However, Nanoscope also has some limitations:

  • You need to flash a custom ROM, and currently, it only supports Nexus 6P or x86 architecture emulators provided by Nanoscope.

  • By default, it only supports collecting traces on the main thread. Other threads need to be manually configured in the code. Considering the limit of memory size, each thread’s memory array can only support a time span of about 20 seconds.

Uber has developed a series of automated scripts to assist the entire process, making it relatively easy to use. As a low-overhead instrument tool, Nanoscope is particularly suitable for automated analysis of startup time.

Nanoscope generates HTML files that comply with the Chrome tracing specification. We can use scripts to achieve two functions:

The first is deobfuscation. The result file can be automatically deobfuscated using the mapping file.

The second is automated analysis. By inputting the same start and end points, the differences between two result files can be automatically analyzed using the diff tool.

This way, we can run automated startup tests regularly to check for any newly added time-consuming points. Sometimes, in order to achieve more customized functionality or obtain more detailed information, we have to resort to a custom ROM, and Nanoscope is a great example of that approach. I will mention more such cases in the later articles on startup and I/O optimization.

3. systrace

systrace is a performance analysis tool introduced in Android 4.1. I usually use systrace to track I/O operations, CPU loads, Surface rendering, GC, and other events in the system.

systrace is built on ftrace, the Linux kernel’s tracing facility, which places performance probes at critical points in the kernel code. On top of ftrace, Android adds the atrace wrapper along with more Android-specific probes, such as Graphics, Activity Manager, Dalvik VM, and System Server.

The systrace tool can only monitor the time consumed by specific system calls, so it belongs to the sample type and has a very low performance overhead. However, it does not support the analysis of application code execution time, so there are limitations when using it.

Since the system reserves the Trace.beginSection interface to monitor the execution time of application calls, is there a way to automatically add application performance analysis to systrace?

Here is the key point: we can instrument every function at compile time, inserting Trace.beginSection at the entry of important functions and Trace.endSection at the exit. For performance reasons, most functions with only a few instructions are filtered out, which lets us monitor application execution time on top of systrace (a sketch of the injected pattern follows the list below). This approach has several advantages:

  • We can see the entire system and application execution flow. This includes function calls from critical system threads, such as rendering time, thread locks, GC time, etc.

  • The performance overhead is acceptable. Since most short functions are filtered out and there is no amplification of I/O, the overall runtime is less than twice the original, which basically reflects the real situation.
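
For reference, the pattern that the compile-time instrumentation effectively injects looks like this (a hand-written sketch rather than real plugin output):

import android.os.Trace;

public class TracedTask {
    public void process() {
        Trace.beginSection("TracedTask.process");  // section names are capped at 127 characters
        try {
            // ... original method body ...
        } finally {
            Trace.endSection();   // guaranteed to run even if the body throws
        }
    }
}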

systrace generates HTML-formatted results, and we can implement support for deobfuscation similar to Nanoscope.

4. Simpleperf

If we want to analyze the calls of Native functions, none of the three tools mentioned above can meet this requirement.

Android 5.0 introduced the Simpleperf performance analysis tool, which utilizes the hardware perf events provided by CPU’s Performance Monitoring Units (PMU). With Simpleperf, we can see the execution time of all Native code. Sometimes, calls to Android system libraries are helpful for analyzing issues, such as time spent loading dex or verifying classes.

Simpleperf also includes the monitoring feature of systrace. With several optimizations in different Android versions, Simpleperf now supports Java code performance analysis quite well. It can be divided into several stages:

First stage: Before Android M, Simpleperf does not support Java code analysis.

Second stage: Between Android M and Android O, the OAT files must be specified manually.

Third stage: From Android P onwards, no action is needed, and Simpleperf supports Java code analysis out of the box.

From this process, we can see that Google attaches importance to this feature. Android Studio 3.2 also directly supports Simpleperf in its Profiler.

As the name suggests, Simpleperf belongs to the sample type, and its performance overhead is very low. It uses flame graphs to display analysis results.

Currently, except for Nanoscope, the other three tools only support debuggable applications. If you want to test release packages, you need to root the testing device. To overcome this limitation, we usually create debuggable test packages for practical purposes and implement our own deobfuscation functionality based on the mapping file. Implementing deobfuscation for Simpleperf is relatively difficult because the parameters are discarded after function aggregation, making it difficult to directly modify the generated HTML file. Of course, we can also create our own automated testing tools that support non-debuggable applications based on the implementation ideas of various tools.

The choice of tool depends on the specific scenario. Let me summarize: If you need to analyze the execution time of Native code, you can choose Simpleperf; if you want to analyze system calls, you can choose systrace; if you want to analyze the execution time of the entire program flow, you can choose either Traceview or the instrumented version of systrace.

Visualization Methods #

With the evolution of Android versions, Google not only provides more performance analysis tools but also gradually improves the user experience of existing tools, making them more powerful and easy to use. Android Studio, on the other hand, has the responsibility of making it easier for developers to use and providing a more intuitive graphical interface.

In Android Studio 3.2, several performance analysis tools are directly integrated into the Profiler, including:

  • Sample Java Methods, which is similar to the “sample” type in Traceview.

  • Trace Java Methods, which is similar to the “instrument” type in Traceview.

  • Trace System Calls, which is similar to systrace.

  • Sample Native (API Level 26+), which is similar to Simpleperf.

To be honest, the Profiler interface is not as good as the interfaces of these tools in certain aspects, and the support for configurable parameters is not as good as the command line. However, the Profiler greatly reduces the learning curve for developers.

Another significant change is the way the analysis results are presented. These analysis tools all support two types of visualizations: Call Chart and Flame Chart. Let me explain in what scenarios each of these visualization methods is suitable.

1. Call Chart

Call Chart is the default visualization used by Traceview and systrace. It shows the order in which the application’s functions execute and is well suited to analyzing the entire call flow. For example, suppose function A calls function B, function B calls function C, and this sequence repeats three times; we then get the following Call Chart.
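
In code, the call pattern behind that chart would look roughly like this (purely illustrative):

public class CallChartDemo {
    void A() { B(); }   // A calls B
    void B() { C(); }   // B calls C
    void C() { /* leaf work */ }

    void run() {
        for (int i = 0; i < 3; i++) {
            A();        // the A -> B -> C chain appears three times in the chart
        }
    }
}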

The Call Chart is like an electrocardiogram for the application, where we can see the specific work of each thread during a certain period of time, such as whether there are thread locks, whether the main thread has long I/O operations, or whether there is idle time, etc.

2. Flame Chart

Flame Chart is also known as the famous Flame Graph. Unlike the Call Chart, the Flame Chart takes a global view of the call distribution over a period of time. It’s like taking an X-ray of the application, naturally fusing information in the time and space dimensions onto one chart. Using the example of function calls, here is the result displayed in a Flame Chart.

When we are not interested in the entire call flow of an application and only want to visually identify which code paths consume a lot of CPU time, the Flame Chart is an excellent choice. For example, I had a deserialization implementation that was very time-consuming. Through the Flame Chart, I found that the most time-consuming operation was the creation and copying of a large number of Java strings. By converting the core implementation to Native, the performance was ultimately improved many times over.

The Flame Chart can also be used for various other dimensions of analysis, such as memory or I/O. Some memory leaks may occur very slowly, but with a memory Flame Chart, we can see which paths allocate the most memory without the need to analyze source code or the entire flow.

In conclusion, each tool can generate different types of visualizations, and we need to choose the appropriate method based on different usage scenarios.

Conclusion #

While writing today’s article, which covers the basics of stutter and four Android tools for troubleshooting it, I was reminded again of how important fundamental knowledge is. Android is built on the Linux kernel, and tools such as systrace and Simpleperf rely on mechanisms that Linux provides. Learning some Linux basics therefore helps a great deal in understanding how these tools work and in troubleshooting performance issues.

On the other hand, although many large companies have dedicated performance optimization teams, I believe it is more important to encourage and cultivate everyone in the team to pay attention to performance issues. While using performance tools, we should learn to think and understand their principles and limitations. Furthermore, you can also try to optimize these tools in order to achieve more complete solutions.

Homework #

When an ANR occurs, the Android system prints CPU-related information to the log using ProcessCpuTracker.java. However, an ordinary application does not seem to have permission to read the CPU information of other application processes. Can we change our approach?

When we find that the CPU usage of a specific process in the application is relatively high, we can check the CPU usage of each thread under this process through the following files, and then calculate the time percentage of each thread in the process.

/proc/[pid]/stat             // Process CPU usage
/proc/[pid]/task/[tid]/stat  // CPU usage of each thread under the process
/proc/[pid]/sched            // Process CPU scheduling related
/proc/loadavg                // System load average, corresponding to the uptime command file

If a thread is destroyed, its CPU statistics are deleted with it, so we generally only calculate CPU usage over a certain period of time. Below is an example of the CPU usage of a sample process over a 5-second interval. Sometimes this approach fails to reveal the time-consuming threads, which usually means there are many short-lived threads; in that case, try reducing the sampling interval.

usage: CPU usage 5000ms(from 23:23:33.000 to 23:23:38.000):
System TOTAL: 2.1% user + 16% kernel + 9.2% iowait + 0.2% irq + 0.1% softirq + 72% idle
CPU Core: 8
Load Average: 8.74 / 7.74 / 7.36

Process:com.sample.app 
  50% 23468/com.sample.app(S): 11% user + 38% kernel faults:4965

Threads:
  43% 23493/singleThread(R): 6.5% user + 36% kernel faults:3094
  3.2% 23485/RenderThread(S): 2.1% user + 1% kernel faults:329
  0.3% 23468/.sample.app(S): 0.3% user + 0% kernel faults:6
  0.3% 23479/HeapTaskDaemon(S): 0.3% user + 0% kernel faults:982
  ...

Your homework for today is to interpret the information above in the comment section and share your thoughts on where you think the bottleneck of this example is. Afterwards, can you go further and write a tool yourself to obtain these statistics for a certain period of time? Likewise, the final implementation can be sent as a Pull Request to Sample.

Feel free to click “请朋友读” (Invite a Friend to Read) to share today’s content with your friends and invite them to study together. Don’t forget to submit today’s homework in the comments section. I have also prepared a generous “study encouragement gift” for students who complete the homework seriously. Looking forward to learning and progressing with you.