27 JVM Problem Investigation and Analysis Part 1: Tuning Experiences #

Generally speaking, as long as the system architecture is designed reasonably, most of the time the system will run smoothly. System crashes and other failures are rare events. In other words, business development is the main focus in most software engineering projects, which is why some people joke that “interviews are about building rockets, but actual work is about tightening screws.”

The main purposes of troubleshooting and analysis are:

  • Problem resolution and troubleshooting
  • Identification of system risks and vulnerabilities

Based on their complexity, problems can be divided into two categories:

  • Routine problems
  • Difficult and complex issues

Routine problems are usually discovered and resolved during the development process, so issues encountered in production environments tend to be more complex and occur in unforeseen areas. Based on our years of experience in troubleshooting, complex issues can be approached in two ways:

  • Systematic and logical troubleshooting
  • Speculative troubleshooting based on historical experience

The latter approach can waste a lot of time, and its outcome often comes down to luck. Worse still, because it is essentially guesswork, the process is completely unpredictable. When time is tight, this creates pressure within the team and can even lead to blame shifting and finger-pointing.


When a system experiences performance issues or failures, it is necessary to troubleshoot and investigate the problem from various levels to determine if the issue lies with the JVM.

Why is troubleshooting so difficult? #

Challenges of troubleshooting in production environments #

When troubleshooting specific problems in production environments, there are often many limitations that make the troubleshooting process painful.

1. The impact on customers must be kept as short as possible

When faced with customer complaints, the fastest way to resolve the issue may be to say, “Just restart the machine to restore normal operation”.

It is natural to want to avoid any impact on users by using the quickest method.

However, restarting the machine may destroy the fault scene, making it difficult to identify the root cause of the problem.

Once the instance is restarted, the evidence of what actually happened is lost, preventing us from learning from the incident and gaining insights.

Even if the restart resolves the current symptom, the underlying problem still exists, a time bomb that may go off again and again.
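
If a restart truly cannot be avoided, it is worth spending a minute preserving the scene first. Below is a minimal sketch, assuming a Linux host with the JDK tools on the PATH; the application name, PID handling, and output paths are placeholders:

# Find the application process (hypothetical main class name)
PID=$(pgrep -f MyApplication)

# Thread dump with lock information
jstack -l $PID > /tmp/threads-$(date +%s).txt

# Heap histogram (much cheaper than a full heap dump)
jmap -histo $PID > /tmp/histo-$(date +%s).txt

# Keep the GC log and recent kernel messages, if available
cp gc.log /tmp/ 2>/dev/null
dmesg | tail -n 100 > /tmp/dmesg-tail.txt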

2. Security-related limitations

Next are the limitations related to security, which result in production environments being isolated and segregated. In general, developers may not have permissions to access the production environment. Without access to the production environment, remote troubleshooting is the only option, which involves all the associated issues:

  • Each operation that needs to be executed requires multiple people to participate or review, which not only increases the time required to execute a single operation but also may result in the loss of some information during the communication process.

Especially when publishing temporary patch programs to the production environment, the trial and error situation of “hoping it works” may actually make things worse.

Because the testing and deployment process may consume hours or even days, it further increases the time consumed to solve the problem.

If multiple iterations are required to deploy such “patch programs that may not work”, it may take several weeks to resolve the issue.

3. Problems caused by tools

Another important point is the tools that need to be used: Installing and using certain tools in specific scenarios may make the situation worse.

For example:

  • Taking a heap dump of the JVM may cause it to pause for several seconds or longer (see the example after this list).
  • Printing more fine-grained logs may introduce additional overhead, such as I/O contention and disk pressure.
  • Added probes or profilers can carry significant overhead of their own, causing an already slow system to freeze completely.
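
As a concrete illustration of the first point, this is roughly what taking a heap dump looks like with the standard JDK tools (the PID 12345 and file paths are placeholders). The JVM sits at a safepoint while the dump is written, and the live option additionally triggers a full GC first:

# Dump the whole heap to a file; application threads are paused while this runs
jmap -dump:format=b,file=/tmp/heap.hprof 12345

# Dump only live objects; this forces a full GC first, adding to the pause
jmap -dump:live,format=b,file=/tmp/heap-live.hprof 12345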

Therefore, applying patches to the system or adding new remote monitoring programs may ultimately take many days. Because fault diagnosis in the production environment involves so many obstacles, it is natural that most of the time we perform it in the development or testing environment instead.

Issues to Consider When Diagnosing in Testing and Development Environments #

If problem diagnosis and troubleshooting are performed in the development or testing environment, the troubles in the production environment can be avoided.

However, because the development and production environments are configured differently, a new problem arises: bugs or performance issues seen in production are often difficult to reproduce elsewhere.

For example:

  • The data sources used in the testing environment and the production environment are different. This means that performance issues caused by data volume may not be reproduced in the testing environment.
  • The usage patterns of certain problems may not be easy to reproduce (sometimes referred to as “ghost problems”). For example, concurrent problems that occur only on February 29, or problems that occur only when multiple users access a certain feature simultaneously. It is also difficult to troubleshoot without knowing the cause in advance.
  • The applications themselves may differ between the two environments: production deployments may use different operating systems, cluster topologies, startup parameters, and package versions.

These difficulties can lead to an awkward situation of “It can’t be, it works fine on my machine.”

In short, because the environment you are working in differs from the actual production environment, you may run into puzzling obstacles when troubleshooting certain problems.

In addition to the constraints of a specific environment, there are other factors that can also lead to unpredictability in the troubleshooting process.

What preparations need to be made #

This section shares some hands-on experience, but we hope it never has to serve as your emergency playbook (just as doctors would rather you never need the hospital).

It is best to have comprehensive system monitoring and targeted contingency plans in place during normal times, and to conduct exercises regularly.

Master industry-specific knowledge #

Skills can be divided into internal and external ones: internal skills are the underlying fundamentals, while external skills are the tools and techniques built on top of them. Analyzing and troubleshooting problems requires a great deal of specialized background knowledge; without it, you will not even know where to start guessing, and without a direction it is hard to verify whether a guess is correct.

To have the ability to troubleshoot complex problems, you need to have a certain understanding of related horizontal knowledge in the industry and preferably have a deep understanding and experience in the specific problem domain you are facing, often referred to as being a “T-shaped” talent.

What domain-related knowledge is required for JVM problem troubleshooting? Below are some fundamentals:

  • Proficiency in the Java language
  • Basic knowledge of JVM
  • Knowledge in the field of concurrency
  • Understanding of computer architecture and composition principles
  • Knowledge of TCP/IP network systems
  • Understanding of protocols such as HTTP and servers like Nginx
  • Knowledge in the database field
  • Skills in using search engines

And various skills and techniques that can be derived from these domains.

Troubleshooting is an essential process. As long as a system is being used by people, it is inevitable that some faults will occur. Therefore, we need to have a clear understanding of the current state and problems of the system. We cannot bypass the challenges brought about by different environments, but it is also impossible to “become an expert in 21 days.”

In the tech development industry, the 10,000-hour rule always holds true. Besides accumulating 10,000 hours of training time to become an expert in a field, there are actually faster solutions to alleviate the pain caused by troubleshooting.

Conduct Sampling Analysis in the Development Environment #

There is nothing wrong with sampling analysis of code, especially before the system goes live.

On the contrary, understanding the hotspots and memory consumption of various parts of an application can effectively prevent certain problems from affecting users in the production environment.

Although differences in data, usage patterns, and environment mean this technique can only simulate part of what happens in production, it still lets you assess risks in advance and locate the cause more quickly when problems do occur.
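
One low-friction way to do this kind of sampling is Java Flight Recorder, which ships with recent JDKs (JDK 11+, and OpenJDK 8u262+). A rough sketch against a locally running service; the PID 12345 and file names are placeholders:

# Start a two-minute profiling recording on a running JVM
jcmd 12345 JFR.start name=dev-profile settings=profile duration=120s filename=/tmp/dev-profile.jfr

# Check the recording status
jcmd 12345 JFR.check

# Inspect the hottest stack samples once the recording has finished (the jfr tool ships with JDK 11+)
jfr print --events jdk.ExecutionSample /tmp/dev-profile.jfr | head -n 50

Tools such as VisualVM or async-profiler serve the same purpose; what matters is knowing the application's baseline hotspots and allocation behavior before it goes live.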

Verification in the Testing Environment #

Investing appropriate resources in the quality assurance domain, especially through automated continuous integration and continuous delivery processes, can expose many problems early on. Thorough and comprehensive testing will further reduce accidents in the production environment. However, these tasks often lack resources, with comments like “The functionality is already completed, why should we spend manpower and resources on it when there is no return on investment?”

In practice, it is difficult to prove the reasonableness of investments in quality checks.

Work labeled “performance testing” or “acceptance testing” ultimately competes for resources with clear, measurable business goals (new feature development).

Now, when developers push for tasks like “implementing performance optimization,” if the priority is not elevated, these tasks will pile up and forever remain on the To-Do list.

To demonstrate the reasonableness of such investments, we need to link return on investment with quality. If reducing P1 fault incidents in the production environment by 80% allows us to generate twice the amount of profit, we can then persuade relevant personnel to do these tasks properly. Conversely, if we cannot demonstrate the benefits of our improvement efforts, we might not have the resources to improve quality.

There is a vivid example: Why don’t children in poor rural areas go to school?

Parents say: Because we are poor, they don’t go to school.

But why are they poor?

Parents say: Because they didn’t go to school, they are poor.

Proper Monitoring in the Production Environment #

A system is like a human body: as long as it is alive, it will get sick at some point. Therefore, the first thing we must accept is that problems will inevitably occur in the production environment, whether they are caused by bugs, human error, or natural disasters.

Even organizations as advanced as NASA/SpaceX occasionally experience rocket explosions, so we need to be prepared for problems that occur online.

No matter how much analysis and testing we do, there will always be some overlooked areas where accidents happen.

Since it is impossible to avoid it altogether, troubleshooting in the production environment will always be necessary. In order to better complete our work, we need to monitor the status of the system in the production environment.

  • When a problem occurs, ideally, we already have the relevant information to solve it.
  • If we have the necessary information, we can quickly skip the steps of problem reproduction and information collection.

Unfortunately, there is no silver bullet in the field of monitoring. Even the latest technology cannot provide all the information in different scenarios.

A typical web application system should at least integrate the following components:

  • Log monitoring. Consolidate the logs from all server nodes so that the technical team can quickly search for relevant information, visualize logs, and trigger alerts on anomalies. The most common solution is the ELK stack: Logstash collects and processes the logs, Elasticsearch stores and indexes them, and Kibana provides visualization and querying.
  • System monitoring. Consolidate system metrics from the infrastructure and visualize them for easy querying. Pay attention to CPU, memory, network, and disk usage to spot system issues, and configure alerts on these metrics.
  • Application Performance Monitoring (APM) and user experience monitoring. Focus on the interactions of individual users to show performance and availability issues as users actually perceive them. At the very least, this lets us identify which service failed at which point in time. Integrating technologies such as Micrometer, Pinpoint, Skywalking, Plumbr, etc., can quickly pinpoint issues in the code.

Make sure to perform system performance analysis and conduct testing acceptance in the development environment before the system is released, to reduce production failures.

Understanding the production deployment environment and implementing monitoring allows us to respond more quickly and predictably when failures occur.

Top-down Division of JVM Issues #

The previous section discussed general problem diagnosis and optimization strategies:

Implement monitoring, locate the problem, verify the results, summarize and generalize.

Now let’s take a look at the issues that exist in the JVM domain.

(Figure: a top-down breakdown of the JVM's main subsystems)

From the above diagram, the JVM can be divided into these parts:

  • Execution engine, including GC and JIT compiler.
  • Class loading subsystem, which usually has issues during development.
  • JNI part, which generally exists outside the JVM.
  • Runtime data area; Java divides memory into two major blocks: heap and stack.

With this understanding, when we accumulate knowledge, we can tackle each part in a top-down manner.

JVM issues in the production environment mainly focus on GC and memory. Stack memory, thread analysis, and other issues primarily assist in diagnosing problems with Java programs themselves.

If there are any unclear areas in these related knowledge points, please go back and reread the previous chapters.

I think these foundational technologies and knowledge need to be read and practiced 2-3 times in order to have a solid grasp. After all, understanding and mastery come from ourselves.

Standard JVM Parameter Configuration #

A reader friend asked:

I would like to summarize the JVM parameter settings in the course. What steps should be followed to set them?

As of now (March 2020), there are over 1000 configurable JVM parameters, with over 600 of them related to GC and memory configurations. From this parameter ratio, it can also be seen that the key areas of JVM troubleshooting and performance tuning are still GC and memory.
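
You can get a rough count for your own JDK (the exact number varies by version and vendor, and a few header lines are included in the count):

# Count the flags the JVM exposes
java -XX:+PrintFlagsFinal -version | wc -l

# Include diagnostic and experimental flags as well
java -XX:+UnlockDiagnosticVMOptions -XX:+UnlockExperimentalVMOptions -XX:+PrintFlagsFinal -version | wc -l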

Having too many parameters is a big problem and makes it difficult to get started. It is also time-consuming to learn and understand.

However, in the vast majority of business scenarios, there are only about 10 commonly used JVM configuration parameters.

Let’s start with an example, and readers can add or subtract as needed.

# Set heap memory
-Xmx4g -Xms4g
# Specify GC algorithm
-XX:+UseG1GC -XX:MaxGCPauseMillis=50
# Specify number of GC parallel threads
-XX:ParallelGCThreads=4
# Print GC details
-XX:+PrintGCDetails -XX:+PrintGCDateStamps
# Specify GC log file
-Xloggc:gc.log
# Specify the maximum size of Metaspace
-XX:MaxMetaspaceSize=2g
# Set the stack size for each thread
-Xss1m
# Automatically dump the heap when an OutOfMemoryError occurs
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/usr/local/

We have introduced these parameters in previous chapters.

In addition, there are also some commonly used property configurations:

# Default connect and read timeouts for java.net connections (milliseconds)
-Dsun.net.client.defaultConnectTimeout=2000
-Dsun.net.client.defaultReadTimeout=2000
# Specify time zone
-Duser.timezone=GMT+08
# Set default file encoding to UTF-8
-Dfile.encoding=UTF-8
# Specify random number entropy source
-Djava.security.egd=file:/dev/./urandom
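
Put together, a startup command using the flags above might look like the following sketch; the jar name, log paths, and sizes are placeholders to adapt to your own service, and the GC logging flags shown are the JDK 8 style (on JDK 9+ use -Xlog:gc* instead):

java -Xmx4g -Xms4g -Xss1m \
  -XX:+UseG1GC -XX:MaxGCPauseMillis=50 -XX:ParallelGCThreads=4 \
  -XX:MaxMetaspaceSize=2g \
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/app/gc.log \
  -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/app/ \
  -Dfile.encoding=UTF-8 -Duser.timezone=GMT+08 \
  -jar my-app.jar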

Simple Troubleshooting Manual #

General Approach for Troubleshooting #

If you use familiar tools and have a clear understanding of troubleshooting rules, then environmental limitations are not a big issue anymore.

In fact, engineers responsible for troubleshooting and problem solving usually do not have a pre-planned process.

To be honest, have any of you ever done something like the following shell operations:

# Check current path
pwd

# List files in the current directory
ls -l

# Check system load
top

# Check available memory
free -h

# Check available disk space
df -h

# Check disk usage of current directory
du -sh *

# System activity report
sar
-bash: sar: command not found

# Install sysstat on Linux
# apt-get install sysstat
# yum -y install sysstat

# View the manual
man sar

# View recent reports
sar 1

# ???
sar -G 1 3
sar: illegal option -- G

# View the manual
man sar

# ......

If you feel familiar with the process mentioned above, don’t worry, everyone goes through it.

Most engineers lack experience in troubleshooting and performance tuning, making it difficult to use standard practices.

There’s nothing embarrassing about it - unless you’re someone like Brendan Gregg or Peter Lawrey, it’s hard to accumulate 10,000 hours of troubleshooting experience and become an expert in this field.

Lacking experience, you often need to use different tools to collect information about the problem at hand (a sketch of typical capture commands follows this list), such as:

  • Collecting various metrics (CPU, memory, disk IO, network, etc.)
  • Analyzing application logs
  • Analyzing GC logs
  • Capturing and analyzing thread dumps
  • Capturing heap dumps for analysis
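
For reference, a few commands commonly used for the Java-specific items above; the PID 12345 and output paths are placeholders (heap dumps were shown earlier with jmap):

# GC behavior: per-generation utilization plus GC counts and times, sampled every second, 30 times
jstat -gcutil 12345 1000 30

# Per-thread CPU usage inside the process; convert a hot TID to hex and match it against the nid field of the thread dump
top -Hp 12345

# Thread dump with lock information
jstack -l 12345 > /tmp/threads.txt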

The easiest things to troubleshoot are system hardware and operating system issues, such as CPU, memory, network, and disk IO.

The choice of tools available to us is almost unlimited, but using unfamiliar tools may cost more time than it saves in actually solving the problem.

Quantify performance tuning #

Recalling the content of our course, there are three quantifiable performance metrics:

  • System capacity: such as hardware configuration, design capacity;
  • Throughput: the most intuitive metric is TPS (Transactions Per Second);
  • Response time: the overall latency, including both server-side processing delay and network delay.

They can be specifically extended to single-machine concurrency, overall concurrency, data volume, user count, budget costs, etc.

A simple process #

Different scenarios and different problems require different troubleshooting methods, and there is no fixed routine for determining problems.

Operations that can be done in advance include:

  • Training: Preparing relevant domain knowledge, skills, and tools usage techniques in advance.
  • Monitoring: As mentioned earlier, mainly three parts - business logs, system performance, APM (Application Performance Monitoring) metrics.
  • Warnings: Timely alerts when failures occur; warnings when metrics exceed thresholds.
  • Identifying risk points: Understanding system architecture and deployment structure, analyzing single points of failure, scalability bottlenecks, etc.
  • Assessing system performance and service levels: For example, availability, stability, concurrency, scalability, etc.

Different companies may have their own accident handling guidelines, which may involve these factors:

  • Relevant personnel: including development, operation, management, QA, customer support, etc.
  • Incident level, severity, scope of impact, urgency.
  • Reporting, communication, consultation.
  • Problem investigation, diagnosis, localization, monitoring, analysis.
  • Incident summary, root cause analysis, preventing recurrence.
  • Improvements and optimizations, such as using new technologies, optimizing architecture, etc.

Points to investigate #

  1. Query business logs to discover problems with basic services and external API dependencies, such as high request pressure, traffic spikes, fallbacks, and circuit breaking.

  2. View system resources and monitoring information:

  • Hardware information, operating system platform, system architecture.
  • Investigate CPU load.
  • Insufficient memory.
  • Disk usage, hardware failures, full disk partitions, IO waits, IO-intensive workloads, data loss, concurrency contention, etc.
  • Investigate the network: traffic saturation, response timeouts, no response, DNS issues, network fluctuations, firewall issues, physical failures, network parameter tuning, timeouts, connection counts, etc.

  3. View performance metrics, including real-time monitoring and historical data, to discover phenomena such as freezes, hiccups, and slow response times:

  • Investigate databases: concurrent connections, slow queries, indexes, disk space usage, memory usage, network bandwidth, deadlocks, TPS, query volume, redo logs, undo logs, binlog, proxies, tool bugs. Possible optimizations include clustering, master-slave replication, read-only instances, sharding, and partitioning.
  • Big data, middleware, and JVM parameters.

  4. Investigate system logs, for example for restarts, crashes, and kills.

  5. APM (Application Performance Monitoring), such as identifying slow request chains.

  6. Investigate the application itself:

  • Check configuration files: startup parameters, Spring configuration, JVM monitoring parameters, database parameters, log parameters, APM configuration.
  • Memory issues, such as memory leaks, memory overflow, memory amplification caused by batch processing, GC issues, etc.
  • Investigate GC: determine the GC algorithm and GC KPIs (total GC time, maximum pause time), and analyze GC logs and monitoring indicators: allocation rate, promotion rate, memory usage, etc. Adjust the memory configuration when necessary (a jcmd-based sketch follows this list).
  • Investigate threads: thread states, concurrent thread counts, thread dumps, lock resources, lock waiting, deadlocks.
  • Investigate code: security vulnerabilities, inefficient code, algorithm optimization, storage optimization, architectural adjustments, refactoring, fixing business code bugs, third-party libraries, XSS (Cross-Site Scripting), CORS (Cross-Origin Resource Sharing), regular expressions.
  • Unit testing: coverage, boundary values, mock testing, integration testing.

  7. Eliminate resource contention and the noisy neighbor effect.

  8. Analysis methods for difficult problems:

  • Thread dumps.
  • Memory dumps.
  • Sampling analysis.
  • Code adjustments, asynchronous processing, peak shaving.
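
For the GC and thread items in point 6, the jcmd utility bundled with the JDK is a convenient single entry point. A minimal sketch, with the PID 12345 as a placeholder:

# Which flags is the JVM actually running with?
jcmd 12345 VM.flags

# Current heap layout and usage
jcmd 12345 GC.heap_info

# Class histogram: which classes occupy the most memory
jcmd 12345 GC.class_histogram | head -n 20

# Thread dump, equivalent to jstack
jcmd 12345 Thread.print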

In conclusion, the rapid development of the software field to this day has provided us with rich resources and tools. Diligent learning and practice, mastering some common routines, and skillfully using various tools are the keys to our technical growth and the ability to solve problems quickly.