
Bonus Lesson 3: Locating Application Issues - A Sound Troubleshooting Approach Matters #

We have already reached the 13th lecture in this course. I would like to thank all the students for their continuous learning and for leaving high-quality comments in the discussion section. Some of these comments are about sharing their own experiences and pitfalls, some provide detailed answers to the post-lecture questions, and others raise excellent questions that further enrich the content of this course.

Some students have mentioned that the case studies in this course are very practical and are often encountered in their work. As I mentioned in the preface, roughly 40% of these 100 cases and approximately 130 small pitfalls come from real production incidents that I have experienced or witnessed, and the remaining 60% from my experience developing business projects and from regularly reviewing other people’s code. I have indeed invested a lot of effort in organizing these case studies, and I am especially grateful for the recognition from all the students. I hope that you will keep up your learning and continue to discuss with me in the comments section.

Some students have also provided feedback that having a systematic approach to troubleshooting is crucial. They hope that when they encounter problems, they will be able to calmly and efficiently identify the root cause. Therefore, in today’s lecture, I will share with you the insights I have gained over the years in emergency troubleshooting. These are all strategies and experiences I have accumulated in my roles as a technical leader and architect. I hope that they will be helpful to you. Of course, I also look forward to hearing from you about your own troubleshooting strategies in your day-to-day work.

Different Environments Require Different Approaches for Troubleshooting #

When it comes to troubleshooting, the first thing we need to understand is the environment in which the issue occurs.

If you are troubleshooting in your own development environment, you can use any tool you are familiar with, including step-by-step debugging. As long as the issue can be reproduced, troubleshooting should not be too difficult; at most, you may need to step into JDK or third-party library code while debugging to analyze it.

If you are troubleshooting in a testing environment, fewer debugging options are available. However, you can still use tools such as jvisualvm, which ships with the JDK, or Arthas, developed by Alibaba, to attach to the remote JVM process and investigate the issue. In addition, the testing environment lets you fabricate data and apply load to make issues easier to reproduce, which helps the investigation.

Troubleshooting in a production environment is often the most challenging. On one hand, production environments enforce strict security controls and usually do not allow debugging tools to attach to remote processes. On the other hand, restoring service takes priority over root-cause analysis, so it is hard to set aside enough time for investigation. Yet because production carries real traffic at high volume, sits behind strict network and security controls, and involves a complex environment, it is also where problems occur most often.

Next, I will explain in detail how to troubleshoot issues in a production environment.

Troubleshooting Production Issues Relies Heavily on Monitoring #

In fact, troubleshooting a problem is like solving a case. When an issue occurs in production, the application must be restored as quickly as possible, so it is usually impossible to preserve the complete scene for investigation and experimentation. Whether there is enough information to understand the past and reconstruct the scene therefore becomes the key to solving the case. The information here mainly refers to logs, monitoring, and snapshots.

Logs speak for themselves, with two main points to note:

Ensure that error and exception information can be recorded completely in file logs;

Ensure that the log level of the program in production is INFO or higher. When recording logs, it is important to use reasonable log priorities: DEBUG for development debugging, INFO for important process information, WARN for issues requiring attention, and ERROR for errors that interrupt the process.
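As a minimal sketch of these two points, assuming SLF4J as the logging facade with a Logback-style backend (the OrderService class and its fields are hypothetical, used only to illustrate the levels), note in particular that the exception object should be passed as the last argument so the full stack trace reaches the file log:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical service used only to illustrate log levels
public class OrderService {

    private static final Logger log = LoggerFactory.getLogger(OrderService.class);

    public void placeOrder(String orderId, int retryCount) {
        log.debug("raw request for order {}", orderId);      // DEBUG: development debugging only
        log.info("start processing order {}", orderId);      // INFO: important process information
        if (retryCount > 3) {
            log.warn("order {} already retried {} times", orderId, retryCount); // WARN: needs attention
        }
        try {
            // ... business logic ...
        } catch (Exception e) {
            // ERROR: pass the exception as the last argument so the complete
            // stack trace is recorded in the file log, not just e.getMessage()
            log.error("failed to process order {}", orderId, e);
            throw e;
        }
    }
}
```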

Regarding monitoring: to troubleshoot problems in production effectively, the development and operations teams need sufficient monitoring, and it should be built up in multiple layers.

At the host level, monitor CPU, memory, disk, network, and other resources. If the application is deployed in virtual machines or a Kubernetes cluster, then in addition to the physical machine’s basic resources, the virtual machines or Pods should also be monitored. How many layers to monitor depends on how the application is deployed: each layer of operating system needs its own layer of monitoring.

At the network level, monitor dedicated line bandwidth, basic information of switches, and network latency.

All middleware and storage should be well monitored too: not only the processes’ basic metrics such as CPU, memory, disk IO, and network usage, but also each component’s important internal metrics. For example, the well-known monitoring tool Prometheus provides a large number of exporters for integrating with various middleware and storage systems.

At the application level, monitor common metrics such as class loading, memory, GC, and threads of the JVM process (for example, use Micrometer for application monitoring). In addition, ensure that application logs and GC logs can be collected and saved.
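A minimal sketch of what application-level JVM monitoring with Micrometer might look like; in a real deployment you would typically use a backend-specific registry (for example a Prometheus registry) wired up by Spring Boot Actuator, so treat the SimpleMeterRegistry here purely as a self-contained illustration:

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.jvm.ClassLoaderMetrics;
import io.micrometer.core.instrument.binder.jvm.JvmGcMetrics;
import io.micrometer.core.instrument.binder.jvm.JvmMemoryMetrics;
import io.micrometer.core.instrument.binder.jvm.JvmThreadMetrics;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class JvmMonitoring {

    public static void main(String[] args) {
        // In production this would usually be a backend-specific registry
        // (Prometheus, etc.); SimpleMeterRegistry keeps the sketch self-contained.
        MeterRegistry registry = new SimpleMeterRegistry();

        // Bind the common JVM metrics mentioned above: class loading, memory, GC, threads
        new ClassLoaderMetrics().bindTo(registry);
        new JvmMemoryMetrics().bindTo(registry);
        new JvmGcMetrics().bindTo(registry);
        new JvmThreadMetrics().bindTo(registry);

        // Print what was registered, e.g. jvm.memory.used, jvm.threads.live
        registry.getMeters().forEach(meter -> System.out.println(meter.getId().getName()));
    }
}
```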

Now let’s take a look at snapshots. Here, “snapshot” refers to a snapshot of the application process at a certain moment. In general, we set the JVM parameters -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath=… for Java applications in the production environment to retain heap snapshots when an OutOfMemoryError occurs. In this course, we have also used the MAT tool multiple times to analyze heap snapshots.
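For reference, a typical set of startup flags might look like the line below. The heap-dump flags are the ones mentioned above; the GC-log flag shown uses the JDK 9+ unified logging syntax (on JDK 8 you would use -Xloggc:&lt;file&gt; together with -XX:+PrintGCDetails instead), and all paths are placeholders:

```
java -XX:+HeapDumpOnOutOfMemoryError \
     -XX:HeapDumpPath=/data/dumps \
     -Xlog:gc*:file=/data/logs/gc.log:time,uptime \
     -jar app.jar
```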

With the means to understand the past and reconstruct the scene in place, let’s look at the approach to locating the problem.

The Process of Analyzing and Locating Problems #

When locating a problem, the first step is to identify which layer it originates from: for example, whether it is an issue with the Java application itself or caused by external factors. We can start by checking whether the program reports any exceptions; exception messages are usually specific enough to point us in the general direction right away. If the problem is purely one of resource consumption, there may be no exception at all, in which case we locate it by combining monitoring metrics with the observable symptoms.

In general, program issues can be attributed to the following three aspects.

First, bugs introduced by a new release, which can usually be mitigated immediately by rolling back. The root cause of such problems can then be investigated afterwards by analyzing the differences between the two versions.

Second, external factors, such as problems with the host, middleware, or database. Investigation of such problems splits into two categories: host-level problems and component-level (middleware or storage) problems.

For host-level problems, we can use tools to investigate the following:

  • CPU-related problems can be investigated using tools such as top, vmstat, pidstat, and ps.
  • Memory-related problems can be investigated using tools such as free, top, ps, vmstat, cachestat, and sar.
  • IO-related problems can be investigated using tools such as lsof, iostat, pidstat, sar, iotop, df, and du.
  • Network-related problems can be investigated using tools such as ifconfig, ip, nslookup, dig, ping, tcpdump, and iptables.

For component problems, we can investigate the following:

  • Check if there are issues with the host where the component is located.
  • Examine basic information about the component’s processes and observe various monitoring metrics.
  • Look for log outputs of the component, especially error logs.
  • Enter the component’s console and use certain commands to inspect its operation.

Third, insufficient system resources that cause the system to hang or become unresponsive. These usually require restarting and scaling out first to restore service, with analysis afterwards; however, it is best to keep one node aside as the “scene” for further investigation. Insufficient system resources generally manifest in four ways: high CPU usage, memory leaks or OOM (Out of Memory) issues, IO issues, and network-related problems.

For high CPU usage, if the “scene” is still available, the analysis process is as follows (a consolidated command sketch follows the list):

  • Firstly, run the command top -Hp pid on the Linux server to determine which thread within the process is causing high CPU usage.
  • Then, press uppercase P to sort the threads by CPU usage and convert the thread ID that clearly occupies the CPU to hexadecimal.
  • Lastly, search for this thread ID in the thread stack output of the jstack command to locate the calling stack of the problematic thread at that time.
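Putting the three steps together, a hedged command sketch on Linux might look like this (&lt;pid&gt; and the thread id 12345 are placeholders):

```
top -Hp <pid>                            # per-thread view of the process; press uppercase P to sort by CPU
printf '%x\n' 12345                      # convert the busy thread id (12345 is a placeholder) to hex -> 3039
jstack <pid> | grep -A 20 'nid=0x3039'   # locate that thread's call stack in the jstack output
```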

If it is not possible to directly run the top command on the server, we can use sampling to locate the problem. Run the jstack command every fixed interval of time (e.g., 10 seconds) and compare the samples to determine which threads are consistently in the running state, thus identifying the problem thread.
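A minimal sketch of that sampling approach (&lt;pid&gt; is a placeholder); comparing the resulting files shows which threads stay RUNNABLE with the same or similar stacks:

```
for i in 1 2 3; do
  jstack <pid> > jstack.$i.txt   # take a thread-stack snapshot
  sleep 10                       # wait a fixed interval between samples
done
```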

If the “scene” is no longer available, we can analyze it through the process of elimination. High CPU usage is generally caused by the following factors:

  • Sudden pressure: This can be confirmed from the traffic on the load balancer in front of the application or from the log volume. Reverse proxies such as Nginx record URLs, so the proxy’s access log allows more specific localization. We can also observe the JVM’s thread count through monitoring. If the high CPU usage is caused by pressure and the program’s resource usage otherwise looks normal, we can further locate the hot methods using load testing together with a profiler such as jvisualvm. If resource usage is abnormal, such as thousands of threads being created, tuning needs to be considered.
  • GC (Garbage Collection): In this case, we can confirm it by monitoring GC-related indicators and analyzing the GC log of the JVM. If GC pressure is confirmed, it is likely that memory usage will also be abnormal. In this case, further analysis should be performed using the process for memory issues.
  • Infinite loops or abnormal processing flows in the program: such problems can be analyzed together with the application logs. Normal execution usually produces logs, so pay close attention to abnormal log volume.

For memory leaks or OOM issues, the simplest analysis method is to take a heap dump and examine it with MAT (Memory Analyzer Tool). A heap dump contains a complete view of the heap together with thread stack information; by looking at the dominator tree and the histogram, we can immediately see which objects consume large amounts of memory and quickly locate memory-related problems. We will cover this in detail in Appendix 5.

Note that a Java process’s memory usage includes more than the heap: it also includes memory used by threads (number of threads × thread stack size) and the metadata area (Metaspace), and each of these areas can trigger an OOM. We can analyze this by observing metrics such as the thread count and the number of loaded classes, and we should also check whether any JVM parameters are set unreasonably and are limiting resource usage.
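As one illustrative way (not mentioned above, so treat it as an assumption) to look beyond the heap: the JDK’s jcmd can report per-area native memory usage if the process was started with -XX:NativeMemoryTracking=summary (which adds some overhead), and /proc gives a quick thread count on Linux; &lt;pid&gt; is a placeholder:

```
jcmd <pid> VM.native_memory summary   # heap, thread stacks, metaspace, code cache, ...
ls /proc/<pid>/task | wc -l           # rough count of threads in the process
```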

IO-related problems are generally not caused by the Java process itself, unless code issues such as resource leaks are involved.

Network-related problems are usually caused by external factors as well. Connectivity issues can generally be located with the help of the exception messages. For performance or transient-interruption issues, a quick judgment can be made with tools such as ping; if necessary, tcpdump or Wireshark can be used for further analysis.

Nine Points to Consider When Analyzing and Locating Problems #

Sometimes, when we analyze and locate problems, we may fall into pitfalls or lose direction. In such situations, you can refer to my nine insights.

First, consider the “chicken and egg” problem. For example, when we find that the business logic is slow and the number of threads is increasing, we need to consider two possibilities:

  1. The program logic or an external dependency is problematic, making the business logic execute slowly, so more threads are needed to handle the same amount of traffic. For example, at 10 TPS (transactions per second), if a request originally completed in 1 second, 10 threads were enough; if each request now takes 10 seconds, 100 threads are needed (see the quick calculation after this list).

  2. The request volume has increased, driving up the number of threads; the application itself then runs short of CPU, and the added context switching slows processing down further.
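The arithmetic in the first case is just Little’s Law, which relates the number of busy threads N to throughput X and response time R:

$$N \approx X \times R,\qquad 10\ \text{TPS} \times 1\ \text{s} = 10\ \text{threads},\qquad 10\ \text{TPS} \times 10\ \text{s} = 100\ \text{threads}$$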

When problems arise, we need to consider both internal performance and incoming traffic to determine whether the “slow” is the cause or the result.

Second, consider finding patterns through classification. When there is no clue to locate the problem, we can try to summarize patterns.

For example, if we have 10 application servers for load balancing, we can analyze the logs to determine whether the problems are evenly distributed or concentrated on one machine. Similarly, if application logs generally record thread names, we can analyze whether the logs are concentrated on a certain type of thread when problems occur. Furthermore, if we find that a large number of TCP connections have been established, we can use netstat to analyze which service these connections mainly connect to.
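For the last case, one possible Linux one-liner for grouping established connections by remote address (IPv4 shown; on newer systems ss -ant can replace netstat):

```
netstat -ant | awk '$6 == "ESTABLISHED" {split($5, a, ":"); print a[1]}' | sort | uniq -c | sort -rn | head
```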

If patterns can be summarized, it is likely that we have found a breakthrough point.

Third, analyze problems against the actual call topology instead of making assumptions. For example, when Nginx returns a 502 error, we generally infer that a downstream service has failed, preventing the gateway from forwarding requests; but we cannot simply assume that the downstream service is our Java program. In a topology where Nginx proxies Kubernetes’ Traefik Ingress, the chain is Nginx->Traefik->Application, and if we only check the health of the Java program we will never find the root cause.

Similarly, even though we use Spring Cloud Feign for service invocation, a connection timeout is not necessarily the server’s fault. The client may be invoking the server through a fixed URL rather than through Eureka with client-side load balancing, meaning it connects to an Nginx proxy instead of directly to the application; in that case the client’s connection timeout is actually caused by the Nginx proxy being down.

Fourth, consider resource limitation issues. Observe various curve indicators. If you find that a curve steadily rises and then stabilizes at a certain level, it generally means that the resources have reached their limits or bottlenecks.

For example, when observing the network bandwidth curve, if it rises to around 120 MB/s and then flattens out, it is very likely that the Gigabit network card or the transmission bandwidth has been saturated (1 Gbps is roughly 125 MB/s). Similarly, if the number of active database connections rises to 10 and then stays there, the connection pool is probably exhausted. Whenever you see such curves in monitoring, pay attention to them.

Fifth, consider the interactions between resources. CPU, memory, IO, and network are like the organs of a human body: they depend on one another, and a severe bottleneck in one resource may trigger a chain reaction in the others.

For example, a memory leak that prevents objects from being reclaimed can result in a large number of Full GC operations, causing increased CPU usage. Similarly, when we frequently cache data in memory queues for asynchronous IO processing, problems with the network or disk can lead to a surge in memory usage. Therefore, when troubleshooting, we should consider this factor to avoid misjudgments.

Sixth, when troubleshooting network issues, three aspects need to be considered: the client side, the server side, and the transmission path. For example, slow database access might be caused by client-side factors such as an undersized connection pool, GC pauses, or CPU saturation; it might be a problem on the transmission path, such as fiber links, firewalls, or routing table settings; or it might genuinely be a server-side issue. These possibilities need to be ruled out one by one.

For server-side slowness, the MySQL slow log can usually provide insights. For slowness on the transmission path, a simple ping test can help identify the issue. If both of these are ruled out and only some clients experience slow access, the clients themselves should be suspected. When access to third-party systems, services, or storage is slow, it is not always reasonable to immediately blame the server side.
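For instance, if the MySQL slow log is enabled, a hedged sketch of the first two checks might look like this (the slow-log path and the db-host name are placeholders that vary by installation):

```
mysqldumpslow -s t -t 10 /var/lib/mysql/slow.log   # top 10 statements sorted by total query time
ping -c 10 db-host                                 # rough check of latency and packet loss on the path
```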

Seventh, snapshot tools and trend analysis tools should be used in combination. For example, jstat, top, and the various monitoring curves are trend analysis tools: they let us observe how metrics change over time and narrow down the approximate problem area. jstack and heap-snapshot analysis with MAT (Memory Analyzer Tool) are snapshot analysis tools: they let us examine the details of the application at a specific moment.
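For example, a trend-style view of GC behaviour with jstat, sampling every 5 seconds for 12 samples (&lt;pid&gt; is a placeholder):

```
jstat -gcutil <pid> 5000 12   # eden/survivor/old/metaspace utilization plus GC counts and times
```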

In general, trend analysis tools are used first to summarize patterns, and then snapshot analysis tools are used to troubleshoot problems. Reversing this order can lead to misjudgments because snapshot analysis tools reflect only the program state at a specific moment. It is not enough to draw conclusions based solely on the analysis of a single snapshot. If trend analysis tools are not available, at least multiple snapshots should be compared.

Eighth, do not be quick to suspect the monitoring system. I once read an analysis of an aviation accident in which the pilot saw all of the aircraft’s fuel tanks showing low fuel on the instruments. His first reaction was that the fuel gauges had malfunctioned; he was reluctant to believe the aircraft was genuinely low on fuel, and shortly afterwards the engines were starved of fuel and shut down. Similarly, when an application runs into problems, we usually check the various monitoring systems, yet sometimes we trust our own experience more than what the monitoring charts show, which can send the entire investigation in the wrong direction.

If you truly suspect that there is an issue with the monitoring system, you can check whether it is displaying normal data for applications that are not experiencing problems. If it is normal, then you should trust the monitoring system rather than relying on your own experience.

Ninth, if the root cause cannot be determined due to missing monitoring data or other reasons, there is a risk of the same problem recurring, and the following three tasks need to be carried out:

Make up for the missing logs, monitoring, and snapshots, so that the data needed to identify the root cause will be available if the problem recurs;

Set up real-time alerts based on the symptoms of the problem to ensure prompt detection when problems occur.

Consider building a hot-standby deployment; when the problem recurs, traffic can be switched over quickly to restore service while the old system is preserved as the scene for investigation.

Key Review #

Today, I summarized and shared with you the approach to analyzing production environment issues.

First, analyzing problems always requires evidence; guessing alone is not enough, so the monitoring foundation must be laid in advance. Monitoring should be built in multiple layers, including the infrastructure/operations layer, the application layer, and the business layer. When troubleshooting, cross-reference the metrics from these layers for a comprehensive analysis.

Second, when locating a problem, first classify the cause roughly: for example, internal versus external, CPU-related versus memory-related, only interface A affected versus the whole application. Then refine the investigation step by step, always working from the larger scope down to the smaller. When you hit a wall during troubleshooting, step back from the details, review the points involved from a broader perspective, and then re-examine the problem.

Third, analyzing problems often relies on experience, and it is difficult to find a complete methodology. When encountering major problems, it is often necessary to rely on intuition to quickly identify the most likely areas, and luck may even play a role here. I also shared my nine experiences with you and recommended that you think and summarize more during your routine problem-solving, and extract more approaches and effective tools for analyzing problems.

Finally, it is worth mentioning that after identifying the cause of a problem, we must keep good records and conduct a post-mortem review. Every failure and problem is a valuable resource; the review is not only about recording the problem, but more importantly about driving improvements. When conducting the review, we need to do the following:

Record complete timelines, handling measures, reporting processes, and other information;

Analyze the fundamental causes of the problem;

Provide short-, medium-, and long-term improvement plans, including but not limited to code changes, SOP, and processes, and record the progress of each plan for closure;

Regularly organize team reviews of past failures.

Reflection and Discussion #

If you open an app and find a blank homepage, is this a client compatibility issue or a server issue? If it is a server issue, how can we further refine the localization? Do you have any analysis ideas?

When analyzing and locating issues, what monitoring or tools would you use?

Have you encountered any problems or incidents that took a long time to locate or left a deep impression on you? I am Zhu Ye. Please leave a comment in the comment section to share your thoughts. You are also welcome to share this article with your friends or colleagues for discussion.