32 Online Troubleshooting: How to Investigate and Track Issues

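When tackling hard-to-diagnose production issues, the following methods and tools can help us investigate and track them down:

Collect and analyze logs: Add logging to the system and collect key data, such as exception stack traces and network request and response details. Careful analysis of these logs can reveal clues and the root cause.
Monitoring and alerting: Monitor key metrics and performance data in real time, such as CPU, memory, and network usage, so that anomalies are detected quickly and raise alerts. When a problem occurs, act promptly to limit its impact.
Debugging and tracing: Use a debugging tool, such as a breakpoint debugger, to step through code line by line in the development environment and inspect variable values, in order to pinpoint where the problem occurs. Tracing tools, such as a log tracer, can record method calls and returns and help us understand the order and flow of execution.
Performance analysis and optimization: Use performance analysis tools, such as profilers, to find the application’s bottlenecks and the parts that need optimizing. This helps improve the system’s response time and efficiency.
Disaster recovery and backup: Set up disaster recovery mechanisms and a backup strategy to cope with emergencies and data loss. For example, use redundant servers and databases and back up data regularly to keep the system available and stable.
Follow up on user feedback: Communicate effectively with users, understand the problems they hit in detail, and collect as much information as possible, such as device model and operating system version. This information makes it easier to reproduce and investigate the problem accurately.
Team collaboration and knowledge sharing: Build an internal knowledge base or document-sharing platform where team members can record and share their experience in solving problems. This helps resolve similar problems quickly and raises the technical level of the whole team.

With these methods and tools, we can investigate and track hard-to-diagnose production issues more effectively, and ultimately improve system stability and user experience.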

User Log #

Difficult problems fall into two categories: crashes and non-crashes. What are the traditional ways to troubleshoot them?

Trying to reproduce locally. Whether it is a crash or a functional problem, as long as there is a stable reproduction path, we can analyze it repeatedly with whatever tools we like. However, truly difficult problems are often hard to reproduce, because they may depend on the user’s device model, locally stored data, and other environmental factors.
Sending temporary builds or gray-release builds. Sending a temporary build to a user is cumbersome, and it can take a long time to resolve the problem. Often we cannot even reach the user, so the only option is a gray release (a staged rollout to a subset of users). And to gradually narrow down the problem, we may need several rounds of gray releases.

Oh, how we wish we had some “weapons” to help engineers collect sufficient information as quickly as possible at a very low cost, in order to quickly troubleshoot and solve problems.

1. Xlog

In the daily development process, we often use Logcat logs to troubleshoot and locate issues in the code.

For issues encountered in production, we also hope to have complete logs from users. Even if the problem cannot be reproduced, it may be possible to locate the specific cause through the logs. As the saying goes, “an army is maintained for a thousand days to be used for a single moment”: client logs only show their value when a problem occurs and cannot easily be reproduced. However, to make sure logs are available at that critical moment, the entire program lifecycle needs to be logged, so the choice of logging solution is crucial.

In the past, because of performance and reliability concerns, we usually enabled logging dynamically for only a small number of users. How can we implement a high-performance logging solution that does not lose logs and is also secure? WeChat implemented its own high-performance log module, Xlog, in 2014, and open-sourced it as part of Mars on GitHub in 2016. For more implementation details of Xlog, you can refer to the source code or the conference talk.

The introduction of Xlog allows all users to log constantly without worrying too much about the impact on application performance. However, Xlog is just a high-performance log tool, and whether it can ultimately solve our production problems depends on how we use it.
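To make "high performance without losing logs" more concrete, here is a minimal sketch of one general idea that Xlog is known for: buffering log lines in a memory-mapped file, so that writes are plain memory copies and the operating system can still write the data back to the file if the app process dies. This is not Xlog's actual code (Xlog is written in C++ and also compresses and encrypts); the class name `MmapLogBuffer` and the zero-byte end marker are assumptions made for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: a fixed-size log buffer backed by a memory-mapped file.
final class MmapLogBuffer {
    private final MappedByteBuffer buffer;

    MmapLogBuffer(File file, int capacityBytes) {
        MappedByteBuffer mapped;
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw");
             FileChannel channel = raf.getChannel()) {
            // Mapping grows the file to capacityBytes; the mapping stays valid
            // after the channel is closed.
            mapped = channel.map(FileChannel.MapMode.READ_WRITE, 0, capacityBytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        buffer = mapped;
        // Recover the write position left by a previous run: unused bytes are zero.
        int end = 0;
        while (end < capacityBytes && buffer.get(end) != 0) {
            end++;
        }
        buffer.position(end);
    }

    // Appends one line; returns false when the buffer is full
    // (a real logger would flush to a log file and reset instead).
    synchronized boolean append(String line) {
        byte[] bytes = (line + "\n").getBytes(StandardCharsets.UTF_8);
        if (bytes.length > buffer.remaining()) {
            return false;
        }
        buffer.put(bytes);
        return true;
    }

    // Returns everything written so far, including lines from earlier runs.
    synchronized String snapshot() {
        byte[] out = new byte[buffer.position()];
        for (int i = 0; i < out.length; i++) {
            out[i] = buffer.get(i);
        }
        return new String(out, StandardCharsets.UTF_8);
    }

    // Asks the OS to write the mapped pages back to the file now.
    synchronized void flush() {
        buffer.force();
    }
}
```

Opening a second `MmapLogBuffer` on the same file picks up where the first one stopped, which is the property that lets logs written just before a crash be uploaded on the next launch.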

Therefore, WeChat has established strict logging standards and regularly checks the pulled logs against these rules. If any violations are found, certain penalties will be imposed. Here are some of the logging standards, and I’ve selected some to share with you.

With logging, both too much and too little are a problem. Too much logging costs performance and buries the useful lines in noise; with too little, there is not enough information to analyze and locate a problem. There is no strict rule for how much to log or how to log it; the whole team has to work it out gradually through long-term practice. At first, people may not care about, or be willing to add, logs in key code paths. But as more and more difficult problems are solved through the logging platform, the accumulating success stories will gradually turn logging into a team habit.

2. Logan

For mobile applications, we may have various types of logs, such as code logs, crash logs, tracking logs, and user behavior logs. Because different types of logs have their own characteristics, the logs are usually scattered. For example, when we want to investigate a problem, we need to check different logs on different log platforms. In order to solve this problem, Meituan proposed the idea of a unified log platform and also open-sourced their own mobile basic log library Logan on GitHub.

Logan integrates various log platforms to create a unified log platform, further improving the efficiency of developers in troubleshooting. Whether it is Logan or Xlog, logs are generally reported in the following two ways.

Push-pull. Use push commands to pull logs from specific users.

Active reporting. Logs are actively reported when users report problems or encounter crashes.
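The two reporting paths above can be sketched as a single client-side decision. This is only an illustration under assumed names (`LogReportPolicy`, `Trigger`, and the idea that a pull command carries a set of target user IDs are all hypothetical), not the actual Logan or Xlog API.

```java
import java.util.Set;

// Hypothetical sketch of the client-side decision "should I upload my logs now?".
final class LogReportPolicy {
    enum Trigger { PULL_COMMAND, USER_FEEDBACK, CRASH }

    private final String localUserId;

    LogReportPolicy(String localUserId) {
        this.localUserId = localUserId;
    }

    // targetUserIds is only meaningful for PULL_COMMAND: the users named
    // in the command that the server pushed.
    boolean shouldUpload(Trigger trigger, Set<String> targetUserIds) {
        switch (trigger) {
            case USER_FEEDBACK:
            case CRASH:
                // Active reporting: the client uploads on its own initiative.
                return true;
            case PULL_COMMAND:
                // Push-pull: upload only if the pushed command targets this user.
                return targetUserIds.contains(localUserId);
            default:
                return false;
        }
    }
}
```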

Have we achieved perfection with user logs? Not quite. Manually placed log points only cover so much, and if a critical location was not instrumented in advance, we may have to ship a new build. To address this, Meituan built Holmes, a dynamic logging system for Android, on top of Logan.

The implementation of Holmes is similar to Meituan’s Robust hotfix approach, where each method needs to be instrumented to record the method execution path. In other words, a piece of code is inserted at the beginning of the method, which will record the method signature, process, thread, time, and other information to form a complete execution log. However, this log system has a lot of technical challenges, so it is generally only enabled dynamically for users who encounter problems.
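As a rough picture of what such instrumentation records, the sketch below shows the kind of call a build-time pass could insert as the first statement of every method. It is not Holmes's implementation; `MethodTrace`, its bounded in-memory buffer, and passing the process ID in through the constructor (on Android it could come from `android.os.Process.myPid()`) are assumptions for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of a method-level execution log with a bounded buffer.
final class MethodTrace {
    static final class Entry {
        final String signature;
        final long pid;
        final String thread;
        final long timeMillis;

        Entry(String signature, long pid, String thread, long timeMillis) {
            this.signature = signature;
            this.pid = pid;
            this.thread = thread;
            this.timeMillis = timeMillis;
        }
    }

    private final long pid;
    private final int capacity;
    private final Deque<Entry> entries = new ArrayDeque<>();

    MethodTrace(long pid, int capacity) {
        this.pid = pid;
        this.capacity = capacity;
    }

    // The call an instrumentation pass would insert at the start of each method.
    synchronized void enter(String signature) {
        if (entries.size() == capacity) {
            entries.removeFirst(); // keep only the most recent calls
        }
        entries.addLast(new Entry(signature, pid,
                Thread.currentThread().getName(), System.currentTimeMillis()));
    }

    // Copy of the recorded path, oldest call first, ready to be reported.
    synchronized List<Entry> snapshot() {
        return new ArrayList<>(entries);
    }
}
```

Even in this toy form, every method call pays for a lock, an allocation, and a clock read, which hints at why such a system is normally switched on only for selected users.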

Although this approach is instructive, I don’t think it is very practical. First, instrumenting every method has a significant impact on package size and performance, which makes the solution too heavyweight. Second, many difficult problems are intermittent; even if we enable logging for a user after the problem has occurred, there is no guarantee that the problem will occur again.

Dynamic Debugging #

Developers often say to testers, “If you can reproduce it locally, I can fix it.” Locally, we can verify a problem over and over by adding logs or by using a debugger such as GDB.

For remote users, it’s exciting to think about having the same dynamic debugging capabilities as we do locally. Is there a solution to enable remote dynamic debugging?

1. Remote Debugging

Dynamic debugging, or dynamic tracing, is an advanced debugging technique. In fact, it’s not a new topic. Well-known solutions such as DTrace and SystemTap on Linux, and BTrace for Java, are already very mature. I recommend reading the articles “A Casual Introduction to Dynamic Tracing” and “Exploration of Java Dynamic Tracing Technology”; the former in particular is truly enlightening.

On the Android platform, can we achieve dynamic debugging for users? Before answering this question, let’s think about the underlying principles of debugging in Android Studio.

In fact, our class monitor Birdy has already talked about this. In the article “Android JVM TI Mechanism Explained,” we discussed the Debugger Architecture. The Java debugging framework relies on JPDA (Java Platform Debugger Architecture) to define an independent and complete debugging system, which consists of the following three parts:

JVM TI: Java Virtual Machine Tools Interface (debuggee).
JDWP: Java Debug Wire Protocol (channel).
JDI: Java Debug Interface (debugger).

If you want to learn more about the Java debugging framework, you can review the reference links in the article “Android JVM TI Mechanism Explained.”

For Android, its debugging framework is an extension based on the Java debugging framework. It mainly includes Android Studio (JDI), ddmlib, adb server, adb daemon, and Android applications.

In order to achieve remote debugging for users, we need to modify two parts of this framework.

JDWP (transport channel): Instead of using the system’s adb, we want the user’s debugging information to be sent over a network channel.
JDI (frontend display): For displaying client debugging data, we cannot easily reuse Android Studio. We need to implement our own data display interface.
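To see why the transport can be swapped, it helps to remember that JDWP is just a byte protocol: a connection starts with the 14-byte ASCII handshake `JDWP-Handshake`, and every command packet has an 11-byte big-endian header (length, id, flags, command set, command) followed by its data. The sketch below builds such a packet; the class name `JdwpPackets` is hypothetical, and how the bytes travel (a socket, a long-lived push connection, and so on) is left open on purpose.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that builds raw JDWP packets for any byte transport.
final class JdwpPackets {
    // Both sides send this string once before exchanging packets.
    static final byte[] HANDSHAKE = "JDWP-Handshake".getBytes(StandardCharsets.US_ASCII);

    static final int HEADER_LENGTH = 11;

    // Command set 1 is VirtualMachine; command 1 in that set is Version.
    static final int VIRTUAL_MACHINE = 1;
    static final int VERSION = 1;

    static byte[] command(int id, int commandSet, int command, byte[] data) {
        ByteBuffer buf = ByteBuffer.allocate(HEADER_LENGTH + data.length); // big-endian by default
        buf.putInt(HEADER_LENGTH + data.length); // total packet length, header included
        buf.putInt(id);                          // chosen by the sender, echoed in the reply
        buf.put((byte) 0);                       // flags: 0x00 for a command, 0x80 for a reply
        buf.put((byte) commandSet);
        buf.put((byte) command);
        buf.put(data);
        return buf.array();
    }
}
```

For example, `JdwpPackets.command(1, JdwpPackets.VIRTUAL_MACHINE, JdwpPackets.VERSION, new byte[0])` yields the 11 bytes of a VirtualMachine.Version request, which a custom frontend could send over its own channel instead of adb.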

For details on the implementation, you can refer to the article “Exploration and Implementation of Android Remote Debugging” by Meituan. The overall process is as follows:

Of course, unlike debugging a local debug build, debugging for real users also means working around ProGuard obfuscation and the fact that release builds are not debuggable. Overall, this solution has great technical value and can deepen our understanding of the Java debugging framework. But it is not practical, because in most cases it is difficult to debug without the user’s cooperation, and the debugging session may run into all kinds of situations that are hard to control.

However, as a backup option, we can use this approach to achieve “wireless debugging” locally (without adb), or perform debugging on obfuscated packages.

2. Dynamic Deployment

If remote debugging is not practical, is there any other way to debug without the user’s knowledge?

Updating code without the user noticing is exactly what dynamic deployment offers. Moreover, dynamic deployment is naturally suited to troubleshooting difficult problems.

Precision. Through the release platform, we can push a dynamic update only to the users who have the problem. We can also target a group of users: for example, if an issue only occurs on a specific Huawei model, we can deploy to that model alone.
Scenario. For troubleshooting difficult problems, we generally only need to add logs or make simple changes to the logic, and dynamic deployment is entirely sufficient for that.
Repeatable and reversible. For difficult problems, we may need to try different solutions again and again, and dynamic deployment fully supports this. Once the problem is solved, we can roll back the patches that are no longer needed.
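The "precision" and "reversible" points above can be sketched as a small targeting rule evaluated by the release platform or the client. The names here (`PatchRule`, `appliesTo`, `revoke`) and the model string used in the usage note are hypothetical; a real release platform would have many more dimensions, such as app version and OS version.

```java
import java.util.Set;

// Hypothetical sketch of a patch targeting rule that can be revoked later.
final class PatchRule {
    private final Set<String> userIds;
    private final Set<String> deviceModels;
    private boolean revoked;

    PatchRule(Set<String> userIds, Set<String> deviceModels) {
        this.userIds = userIds;
        this.deviceModels = deviceModels;
    }

    // A patch is delivered if the user is named explicitly
    // or the device belongs to a targeted model.
    synchronized boolean appliesTo(String userId, String deviceModel) {
        if (revoked) {
            return false;
        }
        return userIds.contains(userId) || deviceModels.contains(deviceModel);
    }

    // Roll back: stop delivering the patch once the problem is solved.
    synchronized void revoke() {
        revoked = true;
    }
}
```

A rule built with one reporting user and one device model (say, a made-up `"HUAWEI-EXAMPLE"`) reaches exactly those users, and a single `revoke()` call withdraws the diagnostic patch afterwards.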

I still remember that to track down a crash in libhw.so, we spent a month and shipped more than 30 dynamic deployments, repeatedly adding logs and hook points, before we finally solved the problem.

3. Remote Control

Dynamic deployment has two drawbacks: it is slow to take effect (from a few minutes to more than ten minutes), and it cannot reach 100% of users (for example, when a change requires modifying AndroidManifest, or when the user’s phone has no free storage). For specific kinds of problems, we can instead handle them by issuing predefined rules.

Remote network diagnostics is a classic example. If a user reports that a certain webpage cannot be opened, a command issued locally or remotely can run a complete check of the user’s network request path, to find out whether it is a problem with the user’s network or with DNS, at which stage of the request the error occurred, and what the error code is.
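A staged check like this can be sketched as an ordered list of probes that stops at the first failure and reports which stage failed and why. The class `NetworkDiagnosis`, its stage names, and the 3-second connect timeout are assumptions for illustration; this is not the Mars SDT module, and a real diagnosis would add further stages such as the HTTP request itself.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;

// Hypothetical sketch: run network probes in order and report the failing stage.
final class NetworkDiagnosis {
    private final Map<String, Callable<String>> stages = new LinkedHashMap<>();

    NetworkDiagnosis stage(String name, Callable<String> check) {
        stages.put(name, check);
        return this;
    }

    // Runs the stages in insertion order; later stages are skipped after a failure.
    String run() {
        StringBuilder report = new StringBuilder();
        for (Map.Entry<String, Callable<String>> e : stages.entrySet()) {
            try {
                report.append(e.getKey()).append(": ok (")
                      .append(e.getValue().call()).append(")\n");
            } catch (Exception ex) {
                report.append(e.getKey()).append(": FAILED (")
                      .append(ex.getMessage()).append(")\n");
                break;
            }
        }
        return report.toString();
    }

    // Example wiring with real probes: DNS resolution, then a TCP connect.
    static NetworkDiagnosis forHost(String host, int port) {
        return new NetworkDiagnosis()
            .stage("dns", () -> InetAddress.getByName(host).getHostAddress())
            .stage("tcp", () -> {
                try (Socket socket = new Socket()) {
                    socket.connect(new InetSocketAddress(host, port), 3000);
                    return "connected";
                }
            });
    }
}
```

The returned report is plain text, so the client can upload it together with the user's feedback or in response to a remote command.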

Mars also has a dedicated network diagnostic module, SDT. Let’s take a moment to review the overall knowledge structure diagram of Mars.

Beyond remote network diagnostics, troubleshooting and tracking network problems is a very big topic in itself. It spans the whole access chain of a business request, from domain name resolution and traffic scheduling, through the unified access layer, to the business service call, and it is part of a larger network platform.

We can use the traceId generated by the client to collect and integrate client logs, server invocation logs, self-built CDN logs, etc., to establish a monitoring platform based on users and provide problem localization functions. For example, Google’s Dapper, Alibaba’s EagleEye, WeChat’s clickstream platform, QQ’s end-to-end monitoring platform, etc., are all implemented through this approach.
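The core of the traceId idea fits in a few lines: the client generates an ID per request, every system that touches the request logs it, and the platform later joins the logs on that ID. The sketch below is only an illustration; `TraceLogs`, `Line`, and the use of a random UUID as the traceId format are assumptions, and none of the platforms named above necessarily work this way internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Hypothetical sketch: correlate log lines from different systems by traceId.
final class TraceLogs {
    static final class Line {
        final String source;   // e.g. "client", "gateway", "cdn"
        final String traceId;
        final String message;

        Line(String source, String traceId, String message) {
            this.source = source;
            this.traceId = traceId;
            this.message = message;
        }
    }

    // Generated on the client and sent along with the request,
    // for example in a request header.
    static String newTraceId() {
        return UUID.randomUUID().toString().replace("-", "");
    }

    // Picks out every line that belongs to one request, keeping input order.
    static List<String> timeline(String traceId, List<Line> allLines) {
        List<String> result = new ArrayList<>();
        for (Line line : allLines) {
            if (line.traceId.equals(traceId)) {
                result.add(line.source + ": " + line.message);
            }
        }
        return result;
    }
}
```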

Predefined rules such as remote network diagnostics, deleting certain files, or reporting certain information all presuppose that we have already fallen into a particular pitfall. More often than not, we have hit the same pitfall countless times and only set up a matching diagnostic rule once we could no longer bear it. Is there no way to simply call some Java code without going through dynamic deployment?

This is where the very powerful Lua scripting language comes in. The once-famous Wax hotfix framework for iOS and Tencent’s hot update solution for Unity3D are both implemented with Lua. Lua’s VM is very small, less than 200KB, which keeps the time and memory overhead under control. We can send a command to a target user, dynamically execute a piece of code and report the result, or take snapshots of certain objects and parameters during method execution.

Below is an example of using Lua with Android.

-- Lua script function
function setText(tv)
    tv:setText("set by Lua." .. s)   -- the global s is injected from Java
    tv:setTextSize(30)
end

// Android invocation. Assumes `lua` is an initialized LuaJava-style LuaState
// (for example via AndroLua) that has already loaded the script above.
lua.pushString("from java");     // push the value to be injected
lua.setGlobal("s");              // pop it and store it as the Lua global s
lua.getGlobal("setText");        // push the Lua function onto the stack
lua.pushJavaObject(textView);    // push the argument
lua.pcall(1, 0, 0);              // call with 1 argument, 0 results, no error handler

For the use of Lua, you can refer to the official documentation. In order to make it easier for us to use Lua in Android, many open source libraries have also provided better encapsulation for Lua, such as AndroLua, and Alibaba also has a dynamic interface framework based on Lua, LuaViewSDK.

Meituan’s Holmes also uses Lua to add a broad set of capabilities, such as DB queries, reporting plain text, SharedPreferences queries, obtaining Context objects, querying permissions, appending local tracking points, and uploading files. Because Lua scripts are so powerful, many large-scale apps have also integrated Lua on Android.

Summary #

For super applications like Meituan, Alipay, and Taobao, there may be thousands of people developing and collaborating on a single application across different platforms and businesses. With a large volume of business, collaboration across different regions, and various types of businesses, it can be time-consuming and exhausting whenever problems arise.

It is precisely because of this repeated “pain” that we have WeChat’s user logs and clickstream platform, as well as Meituan’s Logan and Holmes unified logging system. A team’s so-called “improving quality and increasing efficiency” is about identifying these pain points within the team and working out how to relieve them. Whether it is automating processes or developing new tools and platforms, it all serves this goal.

Homework #

In your work, what classic difficult problems have you encountered or solved? And what powerful troubleshooting tools are there for difficult problems? Please leave a comment to share with me and other classmates.

Whether it’s pushing or pulling user logs or issuing remote debugging commands, we need the ability to distinguish between users. For applications like WeChat, which have strong login requirements, we can use WeChat IDs as user identifiers. But how do we collect logs before the user logs in?

As for user identification, Google has published its own best practices. For most applications that do not strictly require login, it is very important to build their own user identification system. Such a system needs to consider factors such as drift rate (how often the same user ends up with a new ID), collision rate (how often different users end up with the same ID), and cross-application compatibility. Popular solutions in the industry include Alibaba’s UTDID and Tencent’s MTA ID.
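As a starting point for thinking about these factors, here is a minimal sketch of the simplest scheme: generate a random ID on first launch and persist it in app-private storage. It is not UTDID or MTA ID; the class `InstallId` and the key name are hypothetical, and the `Map` merely stands in for persistent storage such as SharedPreferences.

```java
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch: an install-scoped identifier for users who are not logged in.
final class InstallId {
    private static final String KEY = "install_id"; // hypothetical storage key

    // Returns the stored ID, creating and saving a random one on first use.
    static synchronized String getOrCreate(Map<String, String> store) {
        String id = store.get(KEY);
        if (id == null) {
            id = UUID.randomUUID().toString(); // random, so collisions are very unlikely
            store.put(KEY, id);
        }
        return id;
    }
}
```

This keeps the collision rate low, but the ID drifts whenever app data is cleared or the app is reinstalled, and it is not shared across applications; those gaps are exactly what a fuller design has to address.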

Today’s homework is to answer how an application can implement its own user identification system. Please write your answer in the comments.
