02 Crash Optimization Part 2: How to Analyze Application Crashes #

In the detective comic “Detective Conan”, no matter where Conan goes, he always runs into new “cases”. A programmer’s “daily life” is much the same: every day at work we face all kinds of difficult problems, and crashes are among the most common.

Fixing a crash, like solving a case, takes experience: the more problems we analyze, the more proficient we become, and the faster and more accurately we can locate the cause. There are also plenty of techniques to consider: what information should we pay attention to at the “crime scene”? How do we find more “witnesses” and “clues”? What is the general process of “investigating a case”? And which investigation methods suit which types of “cases”?

“There is always only one truth,” and crashes are not to be feared. Through today’s learning, I hope you can become the Sherlock Holmes of the coding world.

Scene of the Crash #

The scene of the crash is our “primary crime scene” and it contains many valuable clues. The more information we uncover here, the clearer our direction for further analysis becomes, instead of relying on blind guesses.

The operating system is the “bystander” of the entire crash process and also our most important “witness.” A good crash capture tool knows what system information to collect and what content to dig deeper into in certain scenarios, thereby better helping us solve problems.

Next, let’s take a closer look at what information should be collected at the scene of the crash.

  1. Crash Information

From the basic crash information, we can make preliminary judgments about the crash.

  • Process name, thread name. Whether the crashed process is a foreground process or a background process, and whether the crash occurs on the UI thread.

  • Crash stack trace and type. Whether the crash belongs to a Java crash, Native crash, or ANR (Application Not Responding), we focus on different points for different types of crashes. It is particularly important to look at the top of the crash stack trace to see whether the crash occurred in the system’s code or in our own code.

    Process Name: ‘com.sample.crash’ Thread Name: ‘MyThread’

    java.lang.NullPointerException
        at …TestsActivity.crashInJava(TestsActivity.java:275)

Sometimes, in addition to the crashed thread, we also want to get the logs of other key threads. For example, in the above example, although the MyThread thread crashed, I also want to know the current call stack of the main thread.
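
To make this concrete, here is a minimal sketch of capturing the basic crash information plus the main thread’s stack from an uncaught-exception handler. It is only an illustration under my own assumptions: the class name JavaCrashHandler and the reportCrash() helper are made up, not part of any particular crash SDK.

import android.os.Looper;
import android.os.Process;
import android.util.Log;

// A minimal sketch of a Java crash handler: record the crashing thread,
// its stack, and the main thread's current stack, then hand the exception
// back to the previous handler so the default crash behavior is preserved.
public class JavaCrashHandler implements Thread.UncaughtExceptionHandler {
    private final Thread.UncaughtExceptionHandler previous =
            Thread.getDefaultUncaughtExceptionHandler();

    public void install() {
        Thread.setDefaultUncaughtExceptionHandler(this);
    }

    @Override
    public void uncaughtException(Thread thread, Throwable ex) {
        StringBuilder report = new StringBuilder();
        report.append("Process: ").append(Process.myPid())
              .append(" Thread: ").append(thread.getName()).append('\n');
        report.append(Log.getStackTraceString(ex));

        // Also capture the main thread's stack if it is not the crashing thread.
        Thread main = Looper.getMainLooper().getThread();
        if (main != thread) {
            report.append("Main thread stack:\n");
            for (StackTraceElement element : main.getStackTrace()) {
                report.append("    at ").append(element).append('\n');
            }
        }
        reportCrash(report.toString()); // hypothetical helper: persist now, upload on next launch

        if (previous != null) {
            previous.uncaughtException(thread, ex);
        }
    }

    private void reportCrash(String report) { /* e.g. write to the app's files dir */ }
}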

  2. System Information

System information sometimes contains crucial clues that are very helpful in problem solving.

  • Logcat. This includes the running logs of the application and the system. Due to system permission restrictions, the Logcat we obtain may only contain lines related to the current app. Among them, the system’s event Logcat records some basic facts about the app’s execution; the tags are defined in the file /system/etc/event-log-tags. (A minimal collection sketch follows this list.)

    system logcat:
    10-25 17:13:47.788 21430 21430 D dalvikvm: Trying to load lib …

    event logcat:
    10-25 17:13:47.788 21430 21430 I am_on_resume_called: Lifecycle
    10-25 17:13:47.788 21430 21430 I am_low_memory: System memory is low
    10-25 17:13:47.788 21430 21430 I am_destroy_activity: Destroy Activity
    10-25 17:13:47.888 21430 21430 I am_anr: ANR and its reasons
    10-25 17:13:47.888 21430 21430 I am_kill: APP is killed and its reasons

  • Device model, system, manufacturer, CPU, ABI, Linux version, etc. We will collect up to dozens of dimensions, which will be very helpful for finding common problems later on.

  • Device status: whether it is rooted, whether it is an emulator. Some problems are caused by Xposed or app-cloning (multi-instance) software, and these need to be treated differently.
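
To make the Logcat collection above concrete, here is a minimal sketch that dumps the most recent lines of the main and events buffers with the logcat command when a crash is captured; the class name and the line count of 200 are arbitrary choices for this example.

import java.io.BufferedReader;
import java.io.InputStreamReader;

// A minimal sketch: since Jelly Bean an app can read its own log lines, so we
// simply run "logcat -b <buffer> -d -v threadtime -t <lines>" and collect the output.
public final class LogcatDumper {
    public static String dump(String buffer, int lines) {
        StringBuilder sb = new StringBuilder();
        try {
            Process process = new ProcessBuilder(
                    "logcat", "-b", buffer, "-d", "-v", "threadtime", "-t", String.valueOf(lines))
                    .redirectErrorStream(true)
                    .start();
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(process.getInputStream()));
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
            reader.close();
        } catch (Exception e) {
            sb.append("logcat dump failed: ").append(e);
        }
        return sb.toString();
    }
}

// Usage at crash time:
//   String systemLog = LogcatDumper.dump("main", 200);
//   String eventLog  = LogcatDumper.dump("events", 200);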

  3. Memory Information

Many crashes are directly related to memory, such as OOM (Out of Memory), ANR, and virtual memory exhaustion. If we divide the user’s phone memory into two categories, “below 2GB” and “above 2GB,” we will find that the crash rate of “below 2GB” users is several times higher than that of “above 2GB” users.

  • Available system memory. Regarding the system’s memory status, we can directly read the file /proc/meminfo. When the available memory of the system is very small (less than 10% of MemTotal), issues like OOM, excessive GC, and frequent system crashes can easily occur.

  • Application memory usage. This includes Java heap memory, RSS (Resident Set Size), and PSS (Proportional Set Size), from which we can determine the size and distribution of the application’s own memory usage. PSS and RSS are calculated from /proc/self/smaps, from which we can further derive more detailed classifications such as apk, dex, and so on.

  • Virtual memory. The total can be obtained from /proc/self/status, and its detailed layout from /proc/self/maps. We often don’t pay much attention to virtual memory, but many issues such as OOM and tgkill are caused by running out of it. (A small sketch of reading these files follows below.)

Name:     com.sample.name   // Process name
FDSize:   800               // Number of file descriptors currently used by the process
VmPeak:   3004628 kB        // Peak virtual memory size of the process
VmSize:   2997032 kB        // Current virtual memory size of the process
Threads:  600               // Number of threads in the process

In general, for a 32-bit process on a 32-bit CPU, reaching about 3GB of virtual memory may already cause allocation failures; on a 64-bit CPU, a 32-bit process usually gets 3–4GB of virtual address space. Of course, if we ship 64-bit processes, virtual memory is rarely a problem. Google Play has required 64-bit support since August 2019. Although more than 90% of devices in China already support 64-bit, it will take longer before domestic stores support publishing separate builds per CPU architecture.
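
As a concrete illustration of the /proc files mentioned above, here is a minimal sketch that pulls a few fields out of /proc/meminfo and /proc/self/status. The field names (MemTotal, MemAvailable, VmSize, VmPeak, Threads, FDSize) are real; the ProcStatusReader class itself is made up for this example.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

// A minimal sketch of reading memory fields from /proc.
// Lines look like "MemAvailable:  1024000 kB" or "Threads:  600".
public final class ProcStatusReader {
    public static Map<String, Long> readFields(String path, String... keys) {
        Map<String, Long> result = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                for (String key : keys) {
                    if (line.startsWith(key + ":")) {
                        String digits = line.replaceAll("[^0-9]", ""); // strip label and "kB"
                        if (!digits.isEmpty()) {
                            result.put(key, Long.parseLong(digits));
                        }
                    }
                }
            }
        } catch (Exception ignored) {
            // /proc/self is always readable by the process itself; be defensive anyway
        }
        return result;
    }
}

// Usage when a crash is captured:
//   Map<String, Long> sys  = ProcStatusReader.readFields("/proc/meminfo", "MemTotal", "MemAvailable");
//   Map<String, Long> self = ProcStatusReader.readFields("/proc/self/status",
//           "VmSize", "VmPeak", "Threads", "FDSize");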

  4. Resource Information

Sometimes we may find that both the application’s heap memory and device memory are abundant, but there are still memory allocation failures, which may be related to resource leaks.

  • File descriptors (fd). The file descriptor limit can be read from /proc/self/limits. In general, a single process can open at most 1024 file descriptors, and going past about 800 is already dangerous. All file descriptors and their corresponding file names should be written to the log to help investigate file or thread leaks. (A small enumeration sketch follows this list.)
opened files count 812:
0 -> /dev/null
1 -> /dev/log/main4
2 -> /dev/binder
3 -> /data/data/com.crash.sample/files/test.config
...
  • Thread count. The current number of threads can be obtained from the status file mentioned above. Each thread may occupy 2MB of virtual memory, and excessive threads will create pressure on virtual memory and file descriptors. Based on my experience, if the number of threads exceeds 400, it can be dangerous. All thread IDs and their corresponding thread names should be output to the log for further investigation of thread-related problems.
threads count 412:
1820 com.sample.crashsdk
1844 ReferenceQueueD
1869 FinalizerDaemon
...
  • JNI. When using JNI, if not careful, it is easy to encounter crashes such as reference invalidation or reference overflow. We can use DumpReferenceTables to count the JNI reference table and further analyze whether there are JNI leaks or other issues.
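
To ground the fd and thread checks above, here is a minimal sketch that enumerates the process’s own file descriptors and thread names from /proc/self; Os.readlink and the /proc layout are real, while the ResourceDumper class is made up for this example.

import android.system.Os;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;

// A minimal sketch: enumerate our own fds and thread names via /proc/self.
public final class ResourceDumper {

    // Each entry in /proc/self/fd is a symlink named by the fd number,
    // pointing at the file, socket or pipe it refers to.
    public static String dumpFds() {
        StringBuilder sb = new StringBuilder();
        File[] fds = new File("/proc/self/fd").listFiles();
        if (fds == null) return "unable to list /proc/self/fd";
        sb.append("opened files count ").append(fds.length).append(":\n");
        for (File fd : fds) {
            String target;
            try {
                target = Os.readlink(fd.getAbsolutePath());
            } catch (Exception e) {
                target = "?";
            }
            sb.append(fd.getName()).append(" -> ").append(target).append('\n');
        }
        return sb.toString();
    }

    // Each directory under /proc/self/task is a thread; its comm file holds the thread name.
    public static String dumpThreads() {
        StringBuilder sb = new StringBuilder();
        File[] tasks = new File("/proc/self/task").listFiles();
        if (tasks == null) return "unable to list /proc/self/task";
        sb.append("threads count ").append(tasks.length).append(":\n");
        for (File task : tasks) {
            String name = "?";
            try (BufferedReader reader = new BufferedReader(new FileReader(new File(task, "comm")))) {
                name = reader.readLine();
            } catch (Exception ignored) {
            }
            sb.append(task.getName()).append(' ').append(name).append('\n');
        }
        return sb.toString();
    }
}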

  5. Application Information

In addition to the system, our application knows itself better and can provide more relevant information.

  • Crash scenario. At which Activity or Fragment did the crash occur? In which business logic?

  • Key operation path. Unlike the detailed logging we do during development, here we only record the user’s key operation path, which greatly helps us reproduce the crash. (A tiny breadcrumb sketch follows this list.)

  • Other custom information. Different applications may focus on different aspects. For example, Netease Cloud Music may be interested in the currently playing music, and QQ Browser may be interested in the current webpage or video being played. In addition, information such as running time, whether patches are loaded, and whether it is a fresh installation or upgrade are also crucial.
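
As one simple way to record such a key operation path, here is a tiny sketch of a bounded breadcrumb trail; the class name Breadcrumbs and the capacity of 100 are arbitrary choices for illustration.

import android.text.TextUtils;
import java.util.ArrayDeque;
import java.util.Deque;

// A minimal breadcrumb trail: keep only the most recent key user operations
// (e.g. "open PlayerActivity", "click pay button") and attach them to crash reports.
public final class Breadcrumbs {
    private static final int MAX = 100; // arbitrary bound on how much history we keep
    private static final Deque<String> TRAIL = new ArrayDeque<>();

    public static synchronized void record(String step) {
        if (TRAIL.size() >= MAX) {
            TRAIL.removeFirst(); // drop the oldest entry
        }
        TRAIL.addLast(System.currentTimeMillis() + " " + step);
    }

    public static synchronized String dump() {
        return TextUtils.join("\n", TRAIL);
    }
}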

In addition to the above common information, for specific crashes, we may also need to obtain specific information such as disk space, battery level, network usage, etc. Therefore, a good crash capture tool should collect enough information for us according to different scenarios, providing more clues for analysis and problem localization. Of course, data collection should respect user privacy and ensure sufficient encryption and anonymization.

Crash Analysis #

With so much on-site information, we can start our real “solving the case” journey. For the vast majority of “cases”, as long as we are willing to put in the effort, we can ultimately uncover the truth. Do not fear problems: after patient and careful analysis we can always spot some abnormality or key detail, and we must also be bold enough to question and verify. Below I will walk you through the “three steps” of crash analysis.

Step 1: Determine the focus

To confirm and analyze the focus, the key is to find important information in the logs and have a rough understanding of the problem. Generally, I suggest focusing on the following points in this step.

1. Confirm the severity. Solving crashes also depends on cost-effectiveness. We prioritize resolving top crashes or crashes that have a significant impact on the business, such as crashes during startup or payment processes. I once worked hard for several days to solve a major crash, but the next version of the product deleted the entire feature, which left me feeling frustrated.

2. Basic crash information. Determine the type of crash and the description of the exception to make a rough judgment about the crash. Generally, most simple crashes can be concluded at this step.

  • Java crash. Java crash types are relatively clear, such as NullPointerException indicating a null pointer and OutOfMemoryError indicating insufficient resources. At this point, you need to further check the “memory information” and “resource information” in the logs.

  • Native crash. You need to look at the signal, code, fault address, and the Java stack at the time of the crash. For the meaning of the various signals, see Crash Signal Introduction. The most common signals are SIGSEGV and SIGABRT: the former is generally caused by null or wild pointers, the latter mainly by ANRs and explicit calls to abort().

  • ANR. In my experience, first check the stack of the main thread to see if it is due to lock waiting. Then, check the ANR log for information such as iowait, CPU, GC, system server, etc., to further determine if it is an I/O issue, CPU competition issue, or freezing due to a large number of GC.

3. Logcat. Logcat usually contains valuable clues, and logs with Warning and Error log levels need special attention. From Logcat, we can see some system behaviors and the state of the phone at the time, such as “am_anr” when an ANR occurs, and “am_kill” when the app is killed. Logs output by different systems and manufacturers may vary. If you cannot find the cause of the problem from one crash log or gain useful information, do not give up. I recommend checking more crash logs for the same crash point.

4. Resource conditions. Combining the basic crash information, let’s take a look at whether it is related to “memory information” or “resource information.” For example, is it due to insufficient physical memory, insufficient virtual memory, or file handle (fd) leakage?

Whether in the resource information or in Logcat, anything related to memory and threads deserves special attention; many crashes are caused by using them improperly.

Step 2: Find commonalities

If the above methods still cannot effectively locate the problem, we can try to find any commonalities among such crashes. Finding commonalities can help us further identify differences, which brings us closer to solving the problem.

Collected system information such as device model, OS version, ROM, manufacturer, and ABI can all be used as aggregation dimensions for spotting commonalities: does the crash only occur when Xposed is installed, only on x86 phones, only on a particular Samsung model, or only on Android 5.0? Application information can be analyzed the same way, for example the link being opened, the video being played, or the user’s country and region.

Finding commonalities can provide more specific guidance for reproducing the problem in your next step.

Step 3: Attempt to reproduce

If we already have a rough understanding of the cause of the crash and want to further confirm more information, we need to try to reproduce the crash. If we have no clue whatsoever about the crash, we can try to reproduce it based on the user’s operation path and then analyze the cause of the crash.

“As long as I can reproduce it locally, I can solve it.” I believe many developers and testers have said this. The confidence comes mainly from the fact that, with a stable reproduction path, we can bring in all kinds of tools, such as extra logging, a debugger, or GDB, to analyze the crash further. Looking back on developing Tinker, we ran into all sorts of weird problems: a manufacturer changing an underlying implementation, or a new Android version making changes, and we would have to Google, read the source code, sometimes even hunt down the manufacturer’s ROM or flash a ROM ourselves. That painful experience taught me that many difficult problems demand patience: speculate repeatedly, test repeatedly on a limited rollout, and verify repeatedly.

Difficult Problem: System Crash

System crashes often make us feel helpless. They may be caused by a bug in a particular Android version or by modifications a manufacturer made to its ROM. In such cases the crash stack may not contain any of our own code, which makes it hard to locate the problem directly. For these difficult problems, my approach is the following:

1. Find possible causes. Through the commonalities mentioned above, we can first determine whether it’s a problem with a specific system version or with a specific ROM from a manufacturer. Although the crash logs may not contain our own code, we can find some suspicious points through the operation paths and the logs.

2. Try to avoid it. Review the suspicious calls to see whether an inappropriate API is being used, or whether there is another way to achieve the same thing and sidestep the problem.

3. Solve it with a hook. This can be a Java hook or a Native hook. Let me give an example of a system crash I solved recently. We found a Toast-related system crash in production that only occurred on Android 7.0: the window’s token was already invalid by the time the Toast was shown, which can happen when the window is destroyed before the Toast gets displayed.

android.view.WindowManager$BadTokenException: 
	at android.view.ViewRootImpl.setView(ViewRootImpl.java)
	at android.view.WindowManagerGlobal.addView(WindowManagerGlobal.java)
	at android.view.WindowManagerImpl.addView(WindowManagerImpl.java)
	at android.widget.Toast$TN.handleShow(Toast.java)

Why doesn’t Android 8.0 have this problem? After inspecting the source code of Android 8.0, we found the following modification:

try {
  mWM.addView(mView, mParams);
  trySendAccessibilityEvent();
} catch (WindowManager.BadTokenException e) {
  /* ignore */
}

After much consideration, we decided to follow the Android 8.0 approach and simply catch the exception ourselves. The key is finding the hook point, and this case is relatively simple: Toast has a field named mTN, and the object it references holds a Handler; we only need to proxy that Handler and swallow the exception, as sketched below.
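
Here is a hedged sketch of what that Java hook can look like on Android 7.x (API 24/25). It relies on the internal field names Toast.mTN and TN.mHandler, which are not public API, so treat it as an illustration of the idea rather than a guaranteed implementation.

import android.os.Build;
import android.os.Handler;
import android.os.Message;
import android.view.WindowManager;
import android.widget.Toast;
import java.lang.reflect.Field;

// A sketch: on Android 7.x, replace Toast$TN.mHandler with a wrapper so the
// BadTokenException thrown inside handleShow() is swallowed, mimicking the
// try/catch Android 8.0 added around mWM.addView().
public final class SafeToast {
    public static void show(Toast toast) {
        if (Build.VERSION.SDK_INT == Build.VERSION_CODES.N
                || Build.VERSION.SDK_INT == Build.VERSION_CODES.N_MR1) {
            hook(toast);
        }
        toast.show();
    }

    private static void hook(Toast toast) {
        try {
            Field tnField = Toast.class.getDeclaredField("mTN");
            tnField.setAccessible(true);
            Object tn = tnField.get(toast);
            Field handlerField = tn.getClass().getDeclaredField("mHandler");
            handlerField.setAccessible(true);
            Handler origin = (Handler) handlerField.get(tn);
            handlerField.set(tn, new CatchingHandler(origin));
        } catch (Throwable ignored) {
            // The internal structure differs on this ROM; fall back to the normal behavior.
        }
    }

    private static class CatchingHandler extends Handler {
        private final Handler origin;

        CatchingHandler(Handler origin) {
            super(origin.getLooper());
            this.origin = origin;
        }

        @Override
        public void handleMessage(Message msg) {
            origin.handleMessage(msg); // keep the original message handling
        }

        @Override
        public void dispatchMessage(Message msg) {
            try {
                super.dispatchMessage(msg); // covers both posted Runnables and messages
            } catch (WindowManager.BadTokenException e) {
                // ignore, the same choice Android 8.0 made
            }
        }
    }
}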

If you do everything mentioned above, 95% or more of crashes can be solved or avoided, and most system crashes can be handled in the same way. Of course, there will always be difficult problems that depend on the user’s real environment, where we would like capabilities such as dynamic tracing and debugging. In later articles we will also cover advanced techniques such as xlog logging, remote diagnostics, and dynamic analysis, which help us debug difficult problems encountered in production. Stay tuned.

Crash prevention and mitigation is a long-term effort, and we want to prevent crashes as much as possible, nipping them in the bud. This touches the entire application lifecycle: developer training, compile-time checks, and static analysis, as well as standardized testing, gray (staged) releases, and the release process itself.

Crash optimization is not isolated either, it is related to the content we will discuss later on memory, UI lag, I/O, etc. Perhaps after you have finished the entire course and look back, you will have a different understanding.

Summary #

Today we introduced some analysis methods, special techniques, and solutions to difficult and common problems related to crashes. Of course, crash analysis should be specific to the problem at hand, and different types of applications may have different focuses. We should not limit ourselves to the methods mentioned above.

Let me share some personal feelings. When solving crashes, especially difficult ones, I often feel anxious and worried. Sometimes, after fixing one problem, other problems disappear along with it, just like in the match-three game “Happy Elimination”. Sometimes, when I can’t solve a problem I feel frustrated, and when I finally do solve it I feel even more frustrated: it turns out to be a tiny oversight in the code, yet it cost a month of my youth and quite a few gray hairs.

Homework #

In the long-term battle against crashes, you surely have some classic, elegant victories you would like to share with your classmates, and no doubt some puzzling problems as well. Today’s homework is to share your thoughts and methods for resolving crashes, and to summarize what you learn from working through the sample.

If you want to challenge crashes, then the Top 20 crashes are opponents we cannot avoid. Among them, there are many difficult system crash issues, and TimeoutException is a more classic one.

java.util.concurrent.TimeoutException: 
         android.os.BinderProxy.finalize() timed out after 10 seconds
at android.os.BinderProxy.destroy(Native Method)
at android.os.BinderProxy.finalize(Binder.java:459)

Today’s sample provides a way to “completely resolve” TimeoutException, mainly hoping that you can better learn the approach to solving system crashes.

  1. Analyze the source code. We found that TimeoutException is thrown by the system’s FinalizerWatchdogDaemon.

  2. Look for ways to avoid it. We tried calling its stop() method, but in production we ran into synchronization problems on versions before Android 6.0.

  3. Look for other hook points. By following the code’s dependency relationships, we found a clever hook point (a hedged sketch follows this list).
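
As a hedged sketch of that kind of hook (the sample’s actual code may differ), one commonly exploited detail is that when the watchdog detects a finalizer timeout, it reports the synthetic TimeoutException through the uncaught-exception handler rather than throwing it on one of our threads, so a filtering default handler can swallow just that case:

import java.util.concurrent.TimeoutException;

// A sketch relying on non-public behavior of libcore's Daemons class:
// FinalizerWatchdogDaemon builds a TimeoutException and hands it to the
// uncaught-exception handler, so filtering it out keeps the process alive.
// Verify this per Android version and, as noted below, enable it only in a gray release.
public final class FinalizerWatchdogFilter {
    public static void install() {
        final Thread.UncaughtExceptionHandler previous =
                Thread.getDefaultUncaughtExceptionHandler();
        Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler() {
            @Override
            public void uncaughtException(Thread thread, Throwable ex) {
                if (ex instanceof TimeoutException
                        && "FinalizerWatchdogDaemon".equals(thread.getName())) {
                    return; // swallow only this specific system "crash"
                }
                if (previous != null) {
                    previous.uncaughtException(thread, ex); // keep normal crash reporting
                }
            }
        });
    }
}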

You can refer to the sample for the final implementation, but it is recommended to enable it only during a gray release. It is worth mentioning that although such “dark magic” can help us solve certain problems, it must be used with caution; for example, some process keep-alive tricks do not limit how frequently they revive the process, which may cause the whole system to freeze.

Feel free to click “Invite a friend to read” to share today’s content with your friends and invite them to learn together. Finally, don’t forget to submit your homework in the comments section. I have also prepared a generous “Study Fuel Package” for students who complete the homework seriously. Looking forward to exploring and progressing together with you.