01 Crash Optimization (Part 1): All About Crashes #

Whenever I meet developers from other products on various occasions, the conversation inevitably turns technical. The first question that usually comes up is, “What is the crash rate of your product?”

Programmer A proudly says, “One percent.”

Programmer B next to him shoots a disdainful look and shouts, “One in a thousand!”

“One in ten thousand,” programmer C says, and the whole room falls silent.

The crash rate is a fundamental indicator of application quality; we all agree on that. But does “one in ten thousand” necessarily beat my “one percent”? I think this question is not simply a matter of comparing two numbers.

Today, let’s talk about all things related to “crashes.” I will start by discussing the two types of crashes in Android, and then delve into how to objectively measure this crash indicator and how to approach stability in relation to crashes.

Two types of crashes in Android #

We all know that crashes in Android can be divided into Java crashes and Native crashes.

To put it simply, a Java crash occurs when an uncaught exception is thrown in Java code, bringing the program down. How are Native crashes generated, then? They are usually caused by Native code accessing an illegal memory address, by alignment errors, or even by an explicit program abort; each of these raises a corresponding signal, and the unhandled signal crashes the program.

Therefore, “crash” refers to an abnormality in the program, and the crash rate of a product is closely related to how we capture and handle these exceptions. Capturing Java crashes is relatively simple, but many students still have only a limited understanding of how to capture Native crashes. Below, I will focus on introducing the process and difficulties of capturing Native crashes.

  1. Process of capturing Native crashes

If you are not very familiar with the basic knowledge of Native crash mechanisms, I suggest you read “Android Platform Native Code Crash Capture Mechanism and Implementation”. Here, I will mainly discuss the entire process of capturing and analyzing a complete Native crash.

  • Compilation end: When compiling C/C++ code, it is necessary to preserve the symbol information files.

  • Client end: When a crash is captured, as much useful information as possible is collected and written into log files, and then uploaded to the server at an appropriate time.

  • Server end: Read the log files reported by the client, search for suitable symbol files, and generate readable C/C++ call stacks.

  2. Difficulties in capturing Native crashes

Chromium’s Breakpad is currently the most mature solution for capturing Native crashes, but many people think Breakpad is too complicated. In fact, I believe that capturing Native crashes is not easy to begin with. It’s like when Tinker was designed: if you only want 90% reliability, then most of the code can indeed be eliminated. But if you want to achieve 99% reliability and still be reliable under various adverse conditions, the effort spent later will be much greater than in the initial stage.

Therefore, among the three processes mentioned above, the most crucial part is ensuring that the client can still generate crash logs under various extreme conditions. Because when a crash occurs, the program is in an unsafe state, and if not handled properly, a secondary crash can easily occur. So, what are some challenging situations when generating crash logs?

Scenario 1: File descriptor (fd) leaks have exhausted the fd limit, so the log file cannot be created. What to do?

Response: Reserve a file descriptor in advance, so that a log file can still be written even when no new fds can be opened.
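A minimal sketch of this fd-reservation trick (names are illustrative, not from any particular SDK): the fd is opened while the process is still healthy, and the handler later writes through it with plain write(2), which is async-signal-safe.

```c
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

static int g_log_fd = -1;

/* Call early in process startup, while fds are still available. */
void prepare_crash_log(const char *path) {
    g_log_fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
}

/* Safe to call from a signal handler: only the reserved fd and the
   async-signal-safe write(2) are used. */
void write_crash_line(const char *msg, size_t len) {
    if (g_log_fd >= 0) {
        ssize_t r = write(g_log_fd, msg, len);
        (void)r;
    }
}
```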

Scenario 2: The crash log fails to generate because of a stack overflow. What to do?

Response: A stack overflow leaves no room on the thread's stack for the signal handler to run, so we normally register an alternate signal stack with sigaltstack. In some special cases we may even need to replace the current stack directly, which means reserving that space on the heap in advance.

Scenario 3: The heap is exhausted, so the crash log cannot be generated. What to do?

Response: In this situation we cannot safely allocate memory, and we dare not call STL or even most libc functions, because their internal implementations allocate from the heap. Allocating at this point risks heap corruption or a secondary crash. Breakpad is thorough here: it wraps Linux Syscall Support (LSS) to issue system calls directly and avoid going through libc.
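To show what “no heap, no libc formatting” looks like in practice, here is an illustrative hand-rolled formatter that uses only static storage and write(2). This is a sketch of the idea, not Breakpad's LSS wrapper:

```c
#include <stddef.h>
#include <unistd.h>

/* Static buffer reserved at load time: nothing below touches the heap. */
static char g_buf[64];

/* Minimal unsigned-to-decimal conversion; returns the length written. */
static size_t format_unsigned(unsigned long v, char *out) {
    char tmp[24];
    size_t n = 0;
    do { tmp[n++] = (char)('0' + v % 10); v /= 10; } while (v != 0);
    for (size_t i = 0; i < n; i++) out[i] = tmp[n - 1 - i];
    return n;
}

/* Writes "signal <n>\n" using only write(2) and static storage. */
void report_signal(int fd, int sig) {
    static const char prefix[] = "signal ";
    size_t len = 0;
    for (size_t i = 0; i < sizeof(prefix) - 1; i++) g_buf[len++] = prefix[i];
    len += format_unsigned((unsigned long)sig, g_buf + len);
    g_buf[len++] = '\n';
    ssize_t r = write(fd, g_buf, len);
    (void)r;
}
```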

Scenario 4: What to do if crash logs fail to generate due to heap corruption or a second crash?

Response: Breakpad forks a child process from the crashing process to collect the crash information. In addition, when Java-related information needs to be collected, we usually do that in a child process as well. This way, even if a secondary crash occurs, only that part of the information is lost, and the parent process can continue gathering the rest. In some special cases we may even need to fork a grandchild process from the child.
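The fork-and-collect idea can be sketched as follows (illustrative, not Breakpad's implementation; the two demo collectors exist only to exercise the success path and the secondary-crash path):

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run one risky collection step in a child process, so a secondary
   crash loses only this step's output. Returns 0 if the child
   finished cleanly, -1 otherwise. */
int collect_in_child(void (*collector)(void)) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {        /* child: do the risky work, then exit */
        collector();
        _exit(0);
    }
    int status = 0;
    waitpid(pid, &status, 0);
    /* The parent survives even if the child was killed by a signal. */
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}

/* Demo collectors: one succeeds, one simulates a secondary crash. */
void noop_collector(void) {}
void crashing_collector(void) { raise(SIGSEGV); }
```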

Of course, Breakpad has its own problems. For example, the minidump it generates is a binary file containing a lot of unimportant information, so its size easily reaches several megabytes. The minidump format is not useless, though: it has advanced uses, such as being loaded into gdb for debugging, where it can even show the values of passed parameters. Chromium plans to replace Breakpad with Crashpad in the future, but on mobile it is still too early for that.

Sometimes we prefer to stick to Android's plain-text tombstone format and add extra information that we consider important. In that case we need to modify Breakpad's implementation. Such modifications are quite common, for example adding the logcat output, the Java call stack, and other useful context at the moment of the crash. I will cover this in more detail in the next article.

To thoroughly understand Native crash capture, we need to have a certain understanding of virtual machine operations, assembly, and other internal knowledge. Creating a highly available crash collection SDK is not easy. It requires years of technical accumulation and consideration of numerous details. Every failure path or second crash scenario requires countermeasures or backup plans.

  3. Choosing the right crash service

For many small and medium-sized companies, I do not recommend implementing such a complex system by themselves. Instead, they can choose third-party services. Currently, there are various platforms available, including Tencent’s Bugly, Alibaba’s Woodpecker, NetEase Cloud Catcher, Google’s Firebase, and more.

Of course, when it comes to platform selection, in terms of productization and community maintenance, Bugly is the best option in China. In terms of technical depth and capture capabilities, Woodpecker, created by Alibaba’s UC Browser Kernel team, is the best.

How to objectively measure crashes #

After gaining a better understanding of crashes, how can we objectively measure them?

To measure a metric, we first need to standardize the calculation method. If we want to evaluate the scope of user impact caused by crashes, we would look at the UV crash rate.

UV Crash Rate = UV with crashes / Logged-in UV

A user is counted as long as they experience at least one crash, so the UV crash rate is closely tied to how long users keep the app running; this is also why WeChat’s UV crash rate is not considered low in the industry (not that I’m shifting the blame). Of course, we can also look at other metrics such as the PV crash rate, startup crash rate, and repeat crash rate, whose calculation methods are similar.
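As a trivial sanity check on the definition above (the helper name is mine, not from the article):

```c
/* UV crash rate: fraction of logged-in users who crashed at least once. */
double uv_crash_rate(long crashed_uv, long logged_in_uv) {
    if (logged_in_uv <= 0) return 0.0;
    return (double)crashed_uv / (double)logged_in_uv;
}
```

With 100 crashed users out of 10,000 logged-in users, the rate is 0.01, programmer A's “one percent.”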

Why calculate the startup crash rate separately? Because startup crashes hurt users the most: when an app fails to even start, we often cannot rescue it with a hotfix. Apps with complex startup flows, involving splash ads, promotional campaigns, and the delivery of all kinds of resources and configuration, are especially prone to such problems. Operations-heavy apps such as WeChat Reading, Momo, Taobao, and Tmall therefore use a technique called “safe mode” to protect the client’s startup path: after detecting a failed startup, the app gives the user a chance to recover.

Now let’s go back to the programmers’ “Mount Hua sword contest” (a martial-arts showdown of champions) from the beginning of the article, and reveal the “secret techniques” behind their crash rates.

Programmer B wrapped every thread and task in try-catch blocks, silently “swallowing” all Java crashes. Whether the program then misbehaves in other ways is left for God to worry about; at least the “one in a thousand” target is met.

Programmer C decided that Native crashes were too hard to solve, hit upon the “brilliant” idea of simply not collecting them at all, and happily reported “one in ten thousand” to his boss.

After understanding the “exclusive secret” behind these impressive numbers, I wonder what you think? In fact, programmers B and C are real cases, and their user base is not small. Overemphasis on technical metrics is a common phenomenon in China. Crash rate is just a number, and our starting point should be to provide users with a better experience.

How to Objectively Measure Stability #

So far, we have discussed what crashes are and how to objectively measure them. However, is crash rate equivalent to overall application stability? The answer is no. In addition to crashes, we often encounter another issue called ANR (Application Not Responding).

When ANR occurs, the system will display a dialog box that interrupts the user’s actions, which is highly intolerable for users. This brings us to another question - how do we detect ANR exceptions in the application? Generally speaking, there are two common approaches.

1. Use a FileObserver to monitor changes to /data/anr/traces.txt. Unfortunately, many newer ROMs no longer grant apps permission to read this file, so alternative channels are needed: overseas, Google Play services can be used, while in China WeChat relies on the Hardcoder framework. Hardcoder is a communication framework independent of the Android system that lets the app and the manufacturer’s ROM hold a real-time “dialogue,” fully mobilizing system resources to improve the app’s speed and rendering and thus the overall user experience.
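Android’s FileObserver is a thin wrapper over Linux inotify, so the native side of this approach can be sketched as below. This is illustrative only: as noted above, newer ROMs deny read access to /data/anr, and the trigger callback here merely stands in for the system writing the trace file (all names and paths are mine):

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Demo trigger: appends a byte, standing in for the system
   writing the ANR trace file. */
void touch_file(const char *path) {
    FILE *f = fopen(path, "a");
    if (f) { fputs("x", f); fclose(f); }
}

/* Watch one file, invoke the trigger, and return the mask of the
   first event observed (0 on error). */
uint32_t watch_for_write(const char *path, void (*trigger)(const char *)) {
    int fd = inotify_init();
    if (fd < 0) return 0;
    if (inotify_add_watch(fd, path, IN_MODIFY | IN_CLOSE_WRITE) < 0) {
        close(fd);
        return 0;
    }
    trigger(path);  /* events are queued before we read */
    char buf[sizeof(struct inotify_event) + 256]
        __attribute__((aligned(8)));
    ssize_t n = read(fd, buf, sizeof(buf));
    close(fd);
    if (n < (ssize_t)sizeof(struct inotify_event)) return 0;
    return ((const struct inotify_event *)(const void *)buf)->mask;
}
```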

2. Monitor the running time of the message queue. This approach cannot accurately determine if an ANR exception has occurred, nor can it obtain the complete ANR log. In my opinion, this approach is better suited for evaluating performance in terms of app lag.

Reflecting back on when I designed Tinker, in order to ensure that hot patching does not affect the application’s startup, Tinker also designed a simple “safe mode” for the patch loading process. During startup, Tinker checks the previous type of application exit. If there were three consecutive abnormal exits detected, the patch will be automatically cleared. Therefore, in addition to common crashes, there are also other situations that can lead to abnormal application exits.

Before discussing what constitutes an abnormal exit, let’s take a look at the various scenarios in which an application can exit.

  • Voluntary termination. Process.killProcess(), exit(), etc.

  • Crash. Java or Native crashes occur.

  • System reboot: caused by a system anomaly, a power failure, a user-initiated reboot, and so on. We can detect this by comparing the time elapsed since boot with the previously recorded value.

  • Killed by the system. Killed by low memory killer, swiped away from the system’s task manager, etc.

  • ANR.

We can set a flag during application startup, update the flag after voluntary termination or crash, and then check this flag on the next startup to confirm if an abnormal exit occurred during runtime. Out of the five exit scenarios mentioned above, we exclude voluntary termination and crashes (which are separately counted), and hope to monitor the remaining three types of abnormal exits. In theory, this abnormal exit detection mechanism can achieve 100% coverage.
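The flag mechanism described above can be sketched with a simple file-based flag (paths and names are illustrative; a real app would also record crashes separately, as the article notes):

```c
#include <stdio.h>
#include <unistd.h>

/* On startup: if the flag file still exists, the previous run never
   exited cleanly (e.g. killed by the system, power loss, freeze).
   Either way, recreate the flag to mark this session as running. */
int was_abnormal_exit(const char *flag_path) {
    int abnormal = access(flag_path, F_OK) == 0;
    FILE *f = fopen(flag_path, "w");
    if (f) fclose(f);
    return abnormal;
}

/* On voluntary termination (and from the crash handler, since crashes
   are counted separately): clear the flag. */
void mark_clean_exit(const char *flag_path) {
    unlink(flag_path);
}
```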

Through this abnormal exit detection, we can identify issues such as ANR, low memory killer, system forced termination, freezing, power outages, and other problems that cannot be captured through normal means. Of course, there may be some false positives in the abnormal rate, such as when a user swipes the application away from the system’s task manager. For large-scale data analysis, it can still help us discover hidden problems in the code.

Therefore, we have a new metric to measure application stability, which is the abnormal rate.

Abnormal Rate (UV) = Number of abnormal exits or crashes / Number of logged-in users (UV)

Recently, we found that the proportion of abnormal exits rose significantly in a gray-release (staged rollout) version of our application. After investigation, we discovered a serious bug in video playback that could freeze or even reboot users’ phones, a problem that traditional crash collection can hardly detect.

Based on the application’s foreground and background states, we can categorize abnormal exits into foreground and background exits. “Killed by the system” is the main cause of background exits, but of course, we pay more attention to the abnormal exits in the foreground, which are more closely related to ANR, OOM, and other abnormal situations.

Through the abnormal rate, we can comprehensively evaluate the stability of the application. For online monitoring, it is necessary to improve the crash alert mechanism. In WeChat, we can achieve a 5-minute level crash alert, ensuring that major online issues can be discovered in a timely manner, and a decision can be made quickly to either release a new version or dynamically hot fix the problem.

Summary #

Today, I covered the two types of crashes in Android, focusing on the capture process for Native crashes and its difficulties. Building a highly available crash collection SDK is not easy, as it involves Linux signal handling, memory allocation, assembly, and other low-level concepts. The stronger your fundamentals, the easier this low-level material will be to learn.

Next, we discussed how to calculate the crash rate. The crash rate is related to the application duration, complexity, and the crash collection SDK. Besides the crash rate, we also learned about the current methods of collecting ANR and the issues encountered. Finally, we introduced a new stability monitoring indicator called the exception rate.

As technical professionals, we should not blindly chase a single crash-rate number; the user experience should come first. Trying to hide problems usually backfires. We should not use try-catch blocks to indiscriminately mask real issues. Instead, we should work from the source: understand the root cause of a crash so the rest of the pipeline stays healthy. And when fixing a crash, look at the bigger picture rather than just the single stack trace, and think about how to fix and prevent the whole class of similar crashes.

Managing crashes is a long-term process. In the next article of this series, I will focus on discussing methodologies for analyzing application crashes. Additionally, if you pay attention, you will notice that I have included many hyperlinks in this article, and there will be similar cases in future articles. Therefore, after reading this article or during the reading process, if you don’t understand related background information or concepts, you should take some time to read the surrounding articles. Of course, if you still don’t understand after reading, you can leave a comment in the comment section.

Homework #

Breakpad is a cross-platform open-source project. The homework for today is to use Breakpad to capture a native crash, and write your summary and thoughts in the comment section after studying and practicing.

Of course, in the column’s GitHub Group, I have also provided a Sample for you to practice with. If you haven’t used Breakpad before, you just need to compile it directly. I hope that through a simple process of capturing a native crash and generating and parsing a minidump file, you can deepen your understanding of the Breakpad working mechanism through practice.

I want to emphasize again, please make sure to participate in our post-lesson exercises. Develop the good habit of immediately practicing what you have learned from the very beginning. Only by doing this can you maximize learning efficiency and gradually approach the goal of “becoming an expert”. Of course, students who submit their assignments seriously also have the opportunity to receive study encouragement rewards. Now, it’s all up to you!

Feel free to click “Ask a friend to read” and share today’s content with your friends, inviting them to study together. Finally, don’t forget to submit today’s homework in the comment section. I have also prepared generous “study encouragement rewards” for students who complete their assignments seriously. Looking forward to making progress with you.