
When monitoring stutter issues in production, we need to preserve enough on-site information to help us investigate and resolve the problems. Here are some common approaches to monitoring stutter:

  1. Stutter monitoring frameworks: frameworks such as BlockCanary and StrictMode can detect stutter in the application, record the relevant information, and help us locate where the stutter occurs.

  2. UI stutter monitoring: in key scenarios, UI monitoring tools can record how long the UI thread takes to respond, helping us discover and pinpoint stutter problems.

  3. Crash logs: stutter can sometimes lead to crashes; in that case we can analyze the cause from the collected crash logs, which contain the application's stack traces at the time of the crash.

  4. Performance data collection: collecting performance metrics such as CPU usage, memory usage, and network request time also helps us monitor stutter.

  5. User feedback: user reports are another important source of information. When users run into stutter, asking for details such as their system version and network environment helps us locate and resolve the problem.

These are some commonly used methods for monitoring stutter in production. By collecting and analyzing enough on-site information, we can locate and fix problems more effectively and improve the user experience.

06 Stutter Optimization Part2 How to Monitor Application Stutter #

Previously, I talked about methods of monitoring ANR, but I also mentioned two problems: first, on newer system versions we no longer have permission to read the system's ANR logs; second, ANR relies too heavily on the system's implementation, and we cannot flexibly control its parameters. For example, I think users can hardly tolerate a 3-second screen freeze, yet the default parameters can only catch freezes that last at least 5 seconds.

Therefore, in reality, we need to adopt other methods to monitor whether there are any stuttering issues and also monitor other specific indicators for specific scenarios.

1. Message Queue

The first stutter monitoring solution I designed is based on the message queue and works by replacing the Looper's Printer. In 2013, I wrote a performance monitoring tool called WxPerformanceTool that used this method to monitor time-consuming tasks. The tool was later open-sourced internally among Tencent's public components and was named one of the top ten components of the year in 2013.

Before I could celebrate, a colleague complained about a problem: when this monitoring module was enabled online, the average frame rate dropped by at least 5 frames during fast scrolling. When I looked at the Traceview output, I found that the large amount of string concatenation involved was causing serious performance loss.
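
For reference, below is a minimal sketch of this Printer-based approach; the class name and the 3-second threshold are illustrative, not part of any particular framework. Looper.loop() builds a ">>>>> Dispatching to …" string before dispatching each message and a "<<<<< Finished to …" string afterwards, which is exactly where the string concatenation cost comes from.

```java
import android.os.Looper;
import android.os.SystemClock;
import android.util.Printer;

// Minimal sketch of the Printer-based approach (names and threshold are illustrative).
// Looper prints ">>>>> Dispatching to ..." before and "<<<<< Finished to ..." after
// every message, so the time between the two callbacks is the cost of one message.
public class LooperMonitor {
    private static final long THRESHOLD_MS = 3000; // treat 3s as a stutter (example value)
    private long dispatchStart;

    public void install() {
        Looper.getMainLooper().setMessageLogging(new Printer() {
            @Override
            public void println(String x) {
                // Note: each callback receives a freshly concatenated String,
                // which is the overhead complained about above.
                if (x.startsWith(">>>>> Dispatching to ")) {
                    dispatchStart = SystemClock.uptimeMillis();
                } else if (x.startsWith("<<<<< Finished to ")) {
                    long cost = SystemClock.uptimeMillis() - dispatchStart;
                    if (cost > THRESHOLD_MS) {
                        // Dump the main thread stack / report the stutter here.
                    }
                }
            }
        });
    }
}
```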

Later, I quickly thought of another solution: use a monitoring thread to insert an empty message at the head of the main thread's message queue every second. If this message has not been consumed by the main thread within 1 second, it means the message currently blocking the main thread had already been running for somewhere between 0 and 1 second when our empty message was posted. In other words, if we want to monitor 3-second freezes and the head message still has not been consumed by the 4th polling round, we can conclude that the main thread has been stuck for at least 3 seconds.

This approach also has a margin of error equal to the interval at which the empty messages are sent. However, the interval should not be too short either, because both the monitoring thread and the main thread consuming the empty messages introduce some performance overhead, although the impact is small.
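
Here is a rough sketch of the empty-message polling idea, assuming a 1-second interval and a 3-second stutter threshold; the class and field names are illustrative.

```java
import android.os.Handler;
import android.os.Looper;
import java.util.concurrent.atomic.AtomicInteger;

// Rough sketch of the empty-message polling idea (interval and threshold are illustrative).
// A watchdog thread posts a lightweight message to the front of the main thread's queue
// every second; if several of these ticks are still pending, the main thread has been
// blocked for at least that many seconds.
public class QueueWatchdog {
    private static final long INTERVAL_MS = 1000;
    private static final int STUCK_POLLS = 3;   // roughly 3 seconds of blockage

    private final Handler mainHandler = new Handler(Looper.getMainLooper());
    private final AtomicInteger pendingTicks = new AtomicInteger(0);

    public void start() {
        new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                int pending = pendingTicks.incrementAndGet();
                // postAtFrontOfQueue so a long queue does not skew the measurement.
                mainHandler.postAtFrontOfQueue(pendingTicks::decrementAndGet);
                if (pending > STUCK_POLLS) {
                    // The main thread has not consumed our messages for ~3 polls:
                    // capture the main thread stack and report it here.
                }
                try {
                    Thread.sleep(INTERVAL_MS);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "queue-watchdog").start();
    }
}
```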

2. Instrumentation

However, after using it for a while, I still felt somewhat unsatisfied. Stutter monitoring based on the message queue is not precise, because the function running at the moment we capture the stack may not be the one that actually took the most time. Why is that?

I can explain this more clearly with a diagram. Assume that within one message loop, functions A, B, and C are executed in sequence. By the time the whole message exceeds 3 seconds, functions A and B have already finished, so we can only capture the stack of function C, even though C may not be time-consuming at all.

However, with large-scale data in production, since functions A and B are relatively time-consuming, the probability that the capture lands on them is higher. Therefore, after aggregation in the backend, freeze logs pointing at functions A and B will show up more often.

That is why we still use the message queue-based method in production, but we would certainly like to get the same result as Traceview: the execution time of every function that ran during the entire stutter. With a result like the one shown in the image below, we could clearly see that functions A and B are the main causes of the stutter.

In this case, can we directly use the callback events of Android Runtime function calls to create a custom Traceview++?

The answer is yes, but it requires Inline Hook technology. We could implement a solution similar to Nanoscope, which writes the data to memory first, but given the compatibility issues, this approach is not used in production. For large-scale applications, stability is the top consideration. If instead we add instrumentation during compilation, compatibility is no longer a problem. In the previous lecture we saw that systrace can automatically generate Trace Tags through instrumentation; similarly, we can insert code at the entry and exit of each function to measure its execution time, but there are many details to consider (a rough sketch of the injected calls follows the list below).

  • Avoid a surge in method count. The same helper function should be inserted at both the entry and exit of each method, and at compile time every method in the code should be assigned a unique ID that is passed in as a parameter.

  • Filter out simple functions. Skip trivial functions such as direct returns and i++, and support a blacklist configuration: functions that are called very frequently need to be blacklisted to reduce the performance impact of the whole solution.
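
To make the instrumentation concrete, here is a hypothetical sketch of what the injected calls and a minimal recorder might look like. The name MethodBeat and its methods are illustrative; WeChat's Matrix follows a similar idea, but this is not its actual API.

```java
// Hypothetical sketch of the injected entry/exit calls and a minimal recorder.
// A compile-time tool (for example an ASM transform) rewrites every method so that it
// starts with MethodBeat.i(id) and ends with MethodBeat.o(id), where `id` is a unique
// integer assigned to the method at build time.
//
// What a method looks like after instrumentation:
//   void onBindViewHolder(...) {
//       MethodBeat.i(1024);
//       ...original body...
//       MethodBeat.o(1024);
//   }
public final class MethodBeat {
    // Record only the main thread to keep the overhead low.
    private static final long MAIN_THREAD_ID =
            android.os.Looper.getMainLooper().getThread().getId();

    public static void i(int methodId) {
        if (Thread.currentThread().getId() != MAIN_THREAD_ID) return;
        record(methodId, /* isEnter */ true, android.os.SystemClock.uptimeMillis());
    }

    public static void o(int methodId) {
        if (Thread.currentThread().getId() != MAIN_THREAD_ID) return;
        record(methodId, /* isEnter */ false, android.os.SystemClock.uptimeMillis());
    }

    private static void record(int methodId, boolean isEnter, long timeMs) {
        // A real implementation would append a compact entry (method id, enter/exit
        // flag, timestamp) to a ring buffer, to be dumped when a stutter is detected.
    }
}
```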

For performance reasons, only the main thread is monitored online. WeChat's Matrix uses this solution; after a lot of optimization, the final installation package size increases by only 1-2%, and the average frame rate drops by no more than 2 frames. Although the overall performance impact of the instrumentation solution is acceptable, it is only enabled in gray-release (beta) builds.

The instrumentation solution looks good, but it has its own shortcoming: it can only measure the execution time of functions inside the application itself and cannot monitor system function calls, so the resulting stack always seems to be "missing" a part.

3. Profilo

In March 2018, Facebook open-sourced a library called Profilo, which combines the strengths of various solutions and impressed me. Specifically, it has the following advantages:

First, integration with atrace. All of the atrace performance probe data is written to the kernel buffer through the trace_marker file; Profilo intercepts the write operation with a PLT hook and analyzes only the events of interest. This way we can obtain all of systrace's probes, such as component lifecycle, lock wait time, class verification, GC time, and so on.

However, most atrace events are fairly generic. From the event "B|pid|activityStart" we do not know which specific activity is being created; similarly, we can measure the time spent on GC-related events, but we do not know why the GC occurred.

Second, fast retrieval of the Java stack. Many people have the misconception that repeatedly capturing the main thread's stack from another thread is cheap. In fact, capturing a stack is costly, because it requires pausing the execution of the main thread.

Profilo’s implementation is very ingenious. It quickly captures the Java stack in a similar way to native crash capturing by periodically sending SIGPROF signals. The whole process is shown in the following figure.

When the signal handler catches the signal, it obtains the currently executing thread and, through the Thread object, gets that thread's ManagedStack. ManagedStack is a singly linked list that holds the ShadowFrame or QuickFrame stack pointers of the current thread. The unwinder first walks the ManagedStack list and then iterates through its ShadowFrames or QuickFrames to reconstruct a readable call stack, thereby unwinding the current Java stack. This way the thread can keep running while we inspect it, and the overhead is negligible. For the code, see Profilo::unwind and StackVisitor::WalkStack.

No instrumentation, almost no performance impact, and complete information: doesn't that make Profilo the perfect solution? Of course, because it relies heavily on cutting-edge techniques, compatibility is a concern. It makes extensive use of function hooking internally, and its unwinding depends heavily on Android Runtime internals. Facebook has already deployed Profilo in production, but since its fast stack retrieval still does not support Android 8.0 and Android 9.0, out of stability concerns it is recommended to enable this feature only for a sampled subset of users.

To summarize, regardless of the method we use for monitoring freezes, we can ultimately obtain the stack and some CPU running information at the time of the freeze. Most freeze issues are relatively easy to locate, such as the main thread executing a time-consuming task, reading a very large file, or performing a network request, etc.

Other Monitoring #

In addition to the problem of the main thread taking too long, what other lagging issues should we pay attention to?

Android Vitals is Google Play's official performance monitoring service, and it covers three stutter-related areas: ANR, startup, and frame rate. We should check its ANR data in particular, because Google itself has the permission to monitor and report ANRs accurately.

As for startup and frame rate, Android Vitals only shows how the application's values are distributed across intervals; it cannot tell us the exact cause of a problem. This is one of the frustrating aspects of performance optimization: even if we find that the overall frame rate has dropped by 5 frames, we still do not know where the problem comes from, so a lot of effort goes into secondary troubleshooting.

Can we get a stack trace directly, just as we do for crashes and ANRs, and be told that the 5-frame drop is caused by bad code written right here? Or, if a stack trace is not possible, can we at least be told that the frame rate drop comes from the chat page, so that we can narrow down the scope of the secondary troubleshooting?

1. Frame Rate

The industry generally uses Choreographer to monitor an application's frame rate. Unlike stutter monitoring, here we should exclude the periods when the page is idle and only collect statistics while the interface is actually being drawn.

So how do we determine whether the interface is being drawn? We can register an OnDrawListener:

getWindow().getDecorView().getViewTreeObserver().addOnDrawListener
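
Putting the two together, here is a minimal sketch that uses Choreographer.FrameCallback to measure frame intervals and the OnDrawListener above to exclude idle pages; the class name and frame interval are illustrative.

```java
import android.app.Activity;
import android.view.Choreographer;

// Minimal sketch combining Choreographer with an OnDrawListener.
// The draw listener tells us the interface is actually rendering, so idle pages are
// excluded from the statistics; the frame callback measures the interval between frames.
public class FrameMonitor implements Choreographer.FrameCallback {
    private static final long FRAME_INTERVAL_NANOS = 16_666_667L; // ~60 fps
    private long lastFrameTimeNanos;
    private volatile boolean drawing;

    public void attach(Activity activity) {
        activity.getWindow().getDecorView().getViewTreeObserver()
                .addOnDrawListener(() -> drawing = true);
        Choreographer.getInstance().postFrameCallback(this);
    }

    @Override
    public void doFrame(long frameTimeNanos) {
        if (drawing && lastFrameTimeNanos != 0) {
            long dropped = (frameTimeNanos - lastFrameTimeNanos) / FRAME_INTERVAL_NANOS - 1;
            if (dropped > 0) {
                // Accumulate dropped frames; attach page / view / operation info here.
            }
        }
        drawing = false;
        lastFrameTimeNanos = frameTimeNanos;
        Choreographer.getInstance().postFrameCallback(this);
    }
}
```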

We often use the average frame rate to measure how smooth an interface is. In reality, however, a movie runs at only 24 frames per second, and users can hardly tell the difference between an application averaging 40 fps and one averaging 50 fps. What users notice most is consecutive dropped frames. Android Vitals defines a frozen frame as a frame that takes longer than 700 milliseconds to render, which is equivalent to dropping more than 42 frames in a row.

Therefore, we can compute a more valuable metric, the frozen frame rate: the percentage of total time spent in frozen frames. When a frame drop happens, we can capture the current page, view, and operation path and report them to the server, which makes the secondary troubleshooting much easier.
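
The metric itself is simple to compute; the following illustrative snippet accumulates frame durations and reports the share of time spent in frames above the 700-millisecond threshold.

```java
// Illustrative aggregation of the frozen-frame-rate metric described above:
// the share of rendering time spent in frames that took longer than 700 ms.
public class FrozenFrameStats {
    private static final long FROZEN_THRESHOLD_MS = 700;
    private long totalDurationMs;
    private long frozenDurationMs;

    public void onFrame(long frameDurationMs) {
        totalDurationMs += frameDurationMs;
        if (frameDurationMs >= FROZEN_THRESHOLD_MS) {
            frozenDurationMs += frameDurationMs;
        }
    }

    /** Frozen-frame rate as a percentage of total rendering time. */
    public double frozenFrameRatePercent() {
        return totalDurationMs == 0 ? 0 : 100.0 * frozenDurationMs / totalDurationMs;
    }
}
```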

As shown in the following image, we can also define scenarios based on activities, fragments, or specific operations. By breaking down the average frame rate and frozen frame rate per scenario, we can narrow the scope of investigation even further.

2. Lifecycle Monitoring

The duration and invocation count of the lifecycle callbacks of Activity, Service, and Receiver components are also performance indicators we care about. For example, an Activity's onCreate() should not take more than 1 second, otherwise it delays how soon the user sees the page. Although Services and Receivers are background components, their lifecycle callbacks also run on the main thread, so they need attention as well.

For component lifecycles we should adopt stricter monitoring and report the complete data. On the backend we can then see the startup time and invocation count of every component's lifecycle.
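
For Activities, a simple way to get this data is the framework's lifecycle callbacks. The sketch below times onCreate() using the pre/post callbacks that were added in API 29; on older versions, compile-time instrumentation or the pluginization hooks mentioned later are needed instead. The class name and threshold are illustrative.

```java
import android.app.Activity;
import android.app.Application;
import android.os.Bundle;
import android.os.SystemClock;
import android.util.Log;

// Minimal sketch of Activity lifecycle timing via framework callbacks (API 29+ for pre/post).
public class LifecycleTimer implements Application.ActivityLifecycleCallbacks {
    private static final long SLOW_CREATE_MS = 1000; // "onCreate should not exceed 1 second"
    private long createStart;

    public static void install(Application app) {
        app.registerActivityLifecycleCallbacks(new LifecycleTimer());
    }

    @Override
    public void onActivityPreCreated(Activity activity, Bundle savedInstanceState) {
        createStart = SystemClock.uptimeMillis();
    }

    @Override
    public void onActivityPostCreated(Activity activity, Bundle savedInstanceState) {
        long cost = SystemClock.uptimeMillis() - createStart;
        if (cost > SLOW_CREATE_MS) {
            // Report the slow onCreate together with the activity name.
            Log.w("LifecycleTimer",
                    activity.getClass().getSimpleName() + " onCreate took " + cost + "ms");
        }
    }

    // Remaining callbacks left empty for brevity.
    @Override public void onActivityCreated(Activity a, Bundle b) {}
    @Override public void onActivityStarted(Activity a) {}
    @Override public void onActivityResumed(Activity a) {}
    @Override public void onActivityPaused(Activity a) {}
    @Override public void onActivityStopped(Activity a) {}
    @Override public void onActivitySaveInstanceState(Activity a, Bundle b) {}
    @Override public void onActivityDestroyed(Activity a) {}
}
```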

Once we found that the launch count of two services was ten times higher than the others; the investigation showed that they kept starting each other. Receivers have the same problem, and both go through System Server. We once had a logging module that used broadcasts for inter-process communication and sent thousands of requests per second, which caused the system's System Server to stall. So the invocation count of each component's lifecycle is also a very valuable reference.

In addition to the lifecycle of the four main components, we also need to monitor the launch count and duration of each process’s lifecycle. By analyzing the data below, we can determine if certain processes are being frequently launched.

As for how to implement lifecycle monitoring, we could use the hooking techniques from pluginization frameworks, but after Android P I no longer recommend that approach; compile-time instrumentation is the better choice. I will cover three instrumentation techniques, Aspect, ASM, and ReDex, in a later lecture, so stay tuned.

3. Thread Monitoring

Managing Java threads is a headache for many applications: dozens or even hundreds of threads are created during startup, and most of them are not managed by any thread pool and run completely on their own.

Meanwhile, some threads with high priority or high activity use too much CPU, which reduces the responsiveness of the main thread's UI, so these threads deserve special attention when optimizing.

For Java threads, I would generally monitor the following two aspects.

  • Thread count: We need to monitor both the number of threads and how they are created, for example whether they go through our designated thread pool; this can be checked with a GOT hook on the thread's nativeCreate() function. The main purpose is thread convergence, that is, reducing the number of threads.

  • Thread time: Monitor each thread's user time (utime), system time (stime), and priority, mainly to find the threads whose utime + stime is high and that are therefore consuming too much CPU. As mentioned in a previous exercise, some threads with short lifecycles are easy to miss, so this needs to be combined with thread-creation monitoring. A rough Java-side sampling sketch follows this list.
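
Here is a rough Java-side sampling sketch of these two checks (the GOT hook on nativeCreate() is native code and not shown). Thread count is taken from /proc/self/task, and per-thread utime/stime from /proc/self/task/<tid>/stat; the class name is illustrative.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

// Rough sketch: sample thread count and per-thread CPU time from /proc for this process.
public class ThreadSampler {

    /** Number of threads in the current process. */
    public static int threadCount() {
        File[] tasks = new File("/proc/self/task").listFiles();
        return tasks == null ? 0 : tasks.length;
    }

    /** Prints name, utime and stime (in clock ticks) for every thread of this process. */
    public static void dumpThreadCpuTimes() throws IOException {
        File[] tasks = new File("/proc/self/task").listFiles();
        if (tasks == null) return;
        for (File task : tasks) {
            try (BufferedReader reader =
                         new BufferedReader(new FileReader(new File(task, "stat")))) {
                String stat = reader.readLine();
                if (stat == null) continue;
                // The thread name is field 2 of stat, wrapped in parentheses and possibly
                // containing spaces, so split after the closing parenthesis.
                int close = stat.lastIndexOf(')');
                String name = stat.substring(stat.indexOf('(') + 1, close);
                String[] fields = stat.substring(close + 2).split(" ");
                // utime is field 14 and stime is field 15 of stat, which are
                // indexes 11 and 12 of the fields remaining after the name.
                long utime = Long.parseLong(fields[11]);
                long stime = Long.parseLong(fields[12]);
                System.out.println(name + " utime=" + utime + " stime=" + stime);
            }
        }
    }
}
```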

At this point you may be wondering: is monitoring really the main theme of stutter optimization? Stutter can have many causes: slow functions, slow I/O, thread contention, locks, and so on. In many cases, fixing a stutter is not hard; the real difficulty is quickly identifying the stutter points and finding their true cause with the help of additional information.

We have plenty of stutter analysis tools for local use; the hard part of stutter optimization is porting them to the production environment so that we can collect richer stutter information at the lowest possible performance cost. Of course, some stutters are caused by I/O, storage, or network problems, and there will be dedicated lectures on optimizing those later.

Summary #

Today we learned several methods for monitoring stuttering. As technology advances, we have discovered some flaws in the old solutions and have been constantly iterating and evolving to find better solutions.

Facebook’s Profilo achieves fast retrieval of Java call stacks by leveraging the idea of JVM’s AsyncGetCallTrace and adapting it to Android Runtime. Systrace uses Linux’s ftrace, and Simpleperf takes inspiration from Linux’s perf tool. It’s still the familiar recipe, the familiar taste. Many innovative things are actually based on products that were developed 10 years ago in Java and Linux.

As I mentioned at the beginning of this column, it’s important to stay focused and not be easily distracted. It will greatly benefit our growth to deepen our understanding and learn some low-level technologies. In our daily development, we shouldn’t just settle for completing requirements. We should learn to think more about memory, stutters, and other performance issues. By thinking and doing more than others, we will naturally progress faster.

Homework #

After reading the methods I shared about optimizing stuttering, I believe you must have many good ideas and methods as well. Today’s homework is to share your “killer move” for optimizing stuttering. Please share your learnings, practice, gains, and reflections in the comments section.

Exercise #

In the previous issue I mentioned Linux's ftrace mechanism, which is what systrace is built on. Moreover, Profilo uses some clever tricks to implement a "systrace" that can be used online. So how is it implemented?

Through today’s Sample, you can learn about its implementation ideas. When you are familiar enough with these underlying mechanisms, you may not be limited to using them locally, but can move them online.

Of course, to better understand this Sample, you may also need to pick up some background knowledge about ftrace and atrace. You will find that this is indeed Linux knowledge from ten years ago, yet it is still very useful today.

  1. Introduction to ftrace, Using ftrace (Part 1), Using ftrace (Part 2).

  2. Introduction to atrace, Implementation of atrace.

Feel free to click “Please Share with Friends” to share today’s content with your friends and invite them to learn together. Finally, don’t forget to submit today’s homework in the comments section. I have prepared a generous “Study Boosting Gift Pack” for students who complete the homework diligently. Looking forward to improving together with you.