14 Q&A How to Analyze Java Programs with perf Tool #

Hello, I am Ni Pengfei.

Today is our second Q&A session, and the topic is the perf tool that we have used many times before. The content mainly covers questions about how perf was used in the earlier cases.

perf is very effective for performance analysis and is a core tool that everyone needs to master. perf can be used in many different ways, but don't worry: for now, you only need to know how to use perf record and perf report. And if you don't recognize some of the kernel symbols in the call stacks that perf displays, it's fine to skip them for the time being; it won't affect our analysis.

As before, to make them easier to learn from and understand, the questions are not arranged strictly in the order of the original articles. If you need to review the original content, you can scan the QR code at the bottom right of each question.

Question 1: When using the perf tool, I see hexadecimal addresses instead of function names #

This is also a frequently raised issue: on a CentOS system, the perf tool shows only hexadecimal addresses instead of function names.

In fact, if you observe the bottom line of the perf interface, you will find a warning message:

Failed to open /opt/bitnami/php/lib/php/extensions/opcache.so, continuing without symbols

This indicates that perf cannot find the shared libraries that the analyzed process depends on. In this case, many dependent libraries actually cannot be found, but perf only shows a warning on the last line of the interface, so this is the only one you see.

This problem is also commonly encountered when analyzing Docker container applications because the libraries that container applications depend on are included in the image.

For this situation, I have summarized the following four solutions.

The first method is to recreate, outside the container, the same paths to the dependent libraries. This method is theoretically feasible, but I do not recommend it: on the one hand, hunting down all of these dependent libraries is tedious, and more importantly, creating these paths pollutes the environment of the container host.

The second method is to run perf inside the container. However, this requires the container to run in privileged mode, whereas in practice applications usually run as normal, unprivileged containers. Therefore, you generally cannot run perf analysis inside the container, because the required permissions are missing.

For example, if you run perf record in a normal container, you will see the following error message:

$ perf_4.9 record -a -g
perf_event_open(..., PERF_FLAG_FD_CLOEXEC) failed with unexpected error 1 (Operation not permitted)
perf_event_open(..., 0) failed unexpectedly with error 1 (Operation not permitted)

Of course, you can in fact allow unprivileged users to run perf event analysis by configuring /proc/sys/kernel/perf_event_paranoid (for example, setting it to -1).
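For reference, here is a minimal sketch of how that setting could be relaxed on the host (these are standard sysctl/procfs commands; whether doing so is acceptable depends on your security requirements):

# Check the current restriction level
$ cat /proc/sys/kernel/perf_event_paranoid

# Temporarily allow unprivileged perf event access (not persistent across reboots)
$ sudo sysctl -w kernel.perf_event_paranoid=-1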

However, as I mentioned earlier, for the sake of safety, I do not recommend this method.

The third method is to specify the symbol path as the path of the container file system. For example, for the application in Lesson 05, you can execute the following command:

$ mkdir /tmp/foo
$ PID=$(docker inspect --format {{.State.Pid}} phpfpm)
$ bindfs /proc/$PID/root /tmp/foo
$ perf report --symfs /tmp/foo

# Don't forget to unbind after use
$ umount /tmp/foo/

Note, however, that you need to install the bindfs tool first. The basic function of bindfs is directory binding (similar to mount --bind). Here you need bindfs version 1.13.10 (which was also its latest release at the time of writing).

If your system only provides an older version, you can download the source code from GitHub and compile and install it yourself.
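As a rough sketch of the source build (the download URL and the FUSE build dependency are assumptions based on how the bindfs project usually distributes releases; check its page for the exact tarball), the steps follow the usual autotools pattern:

$ wget https://bindfs.org/downloads/bindfs-1.13.10.tar.gz
$ tar xzf bindfs-1.13.10.tar.gz
$ cd bindfs-1.13.10

# Standard autotools build; requires a C compiler and the FUSE development headers
$ ./configure
$ make
$ sudo make install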

The fourth method is to save the analysis records outside the container and then view the results inside the container. In this way, both the library and symbol paths are correct.

For example, you can do it like this. First, run perf record -g -p <pid> on the host for a while (say, 15 seconds) and then stop it with Ctrl+C.
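Assuming, purely for illustration, that the target process has PID 27935, the recording step on the host would look something like this:

# Sample call graphs of the target process; press Ctrl+C after about 15 seconds
$ perf record -g -p 27935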

Then, copy the generated perf.data file to the container for analysis:

$ docker cp perf.data phpfpm:/tmp
$ docker exec -i -t phpfpm bash

Next, continue to run the following commands in the bash of the container to install perf and use perf report to view the report:

$ cd /tmp/
$ apt-get update && apt-get install -y linux-tools linux-perf procps
$ perf_4.9 report

However, there are two points to note.

First, the version of the perf tool. In the last step, the tool we run is perf_4.9, the version installed inside the container, rather than the plain perf command. This is because the perf command is actually a thin wrapper (a symlink or script, depending on the distribution) that dispatches to a build matching the running kernel, while the perf version installed in the image may not be consistent with the kernel version of the virtual machine.
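A simple way to spot such a mismatch is to compare the running kernel release with the perf binaries that are actually available, roughly like this:

# On the host: the kernel release that the plain perf wrapper tries to match
$ uname -r

# Inside the container: list the perf binaries shipped by the image
$ ls /usr/bin/perf*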

In addition, the php-fpm image is based on Debian, so the command for installing the perf tool is not exactly the same as on Ubuntu. For example, on Ubuntu the installation looks like this:

$ apt-get install -y linux-tools-common linux-tools-generic linux-tools-$(uname -r)

While in the php-fpm container, you should execute the following command to install perf:

$ apt-get install -y linux-perf

After following these steps, you can see the call stack of sqrt inside the container.

In fact, beyond our case, you may also run into this problem with non-containerized applications. If the symbol table of the ELF binary was removed with strip when your application was built, you will likewise only see function addresses.

Nowadays disk space is cheap enough. Keeping the symbols makes the compiled files larger, but that is a negligible cost for the disk as a whole, so for the convenience of debugging I recommend keeping them.
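As a quick check, assuming your binary is at /usr/local/bin/app (a hypothetical path), the file command will tell you whether it has been stripped:

# The output ends with "stripped" or "not stripped"; a stripped binary leaves perf with only addresses
$ file /usr/local/bin/app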

By the way, the installation methods of various tools mentioned in the case can be regarded as the basic skills for our column study. I hope you can familiarize yourself with and master them. As mentioned earlier, if you do not know how to install something, you can first check the documentation, and if it still doesn’t work, you can search the internet or leave a comment in the article to ask questions.

Here, I also want to commend many students who have shared the methods they have discovered in the comments. Recording and sharing is a good habit.

Question 2: How to analyze a Java program using the perf tool #

This question is actually an extension of the previous perf question. For applications like Java that run on the JVM, the runtime stack is managed by the JVM itself. Therefore, at the system level, you can only see the JVM's function stack, not the Java application's stack directly.

perf_events actually supports JIT-compiled code, but it needs a /tmp/perf-PID.map file for symbol translation. You can use the open-source project perf-map-agent to generate this symbol table.

In addition, to generate the complete call stack, you need to enable the -XX:+PreserveFramePointer option in the JDK. Since this involves a lot of Java knowledge, I won’t go into detail here. If your application is based on Java, you can refer to Netflix’s technical blog Java in Flames to see the detailed usage steps.
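To give you a feel for the workflow, here is a rough sketch under a few assumptions (app.jar, the <pid> placeholder, and the perf-map-agent script location are illustrative; the authoritative steps are in the perf-map-agent project and the Java in Flames post):

# Start the JVM with frame pointers preserved so perf can walk the Java stacks (app.jar is hypothetical)
$ java -XX:+PreserveFramePointer -jar app.jar &

# Sample the Java process for about 15 seconds (replace <pid> with the real process ID)
$ perf record -g -p <pid> -- sleep 15

# Generate /tmp/perf-<pid>.map; the script ships with perf-map-agent (path depends on where you built it)
$ ./perf-map-agent/bin/create-java-perf-map.sh <pid>

# perf report can now resolve the JIT-compiled Java symbols
$ perf report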

Speaking of this, I also want to emphasize one thing: when learning performance optimization, don’t limit yourself to a specific programming language or performance tool and get stuck on the details of the language or tool.

Understanding the overall analysis approach is what we should do first. Because the principles and approaches of performance optimization are applicable in any programming language.

Question 3: Why are many symbols not showing call stacks in perf reports? #

perf report is a tool that visualizes perf.data. In the case from Lesson 08, I directly presented the final result without going into detail about its parameters. It’s likely that many students encountered the same issue when running it on their machines and saw the following interface.

In this interface, you can clearly see that only swapper displays a call stack; none of the other symbols, including the app application from our case, show their stack traces.

We have encountered this situation before. So what should we do when we encounter output from performance tools that we can’t understand? Of course, we should consult the tool’s manual. For example, you can execute the command man perf-report to find the explanation for the -g parameter:

-g, --call-graph=<print_type,threshold[,print_limit],order,sort_key[,branch],value>
       Display call chains using type, min percent threshold, print limit,
       call order, sort key, optional branch and value. Note that ordering is
       not fixed so any parameter can be given in an arbitrary order. One
       exception is the print_limit which should be preceded by threshold.

           print_type can be either:
           - flat: single column, linear exposure of call chains.
           - graph: use a graph tree, displaying absolute overhead rates. (default)
           - fractal: like graph, but displays relative rates. Each branch of
                      the tree is considered as a new profiled object.
           - folded: call chains are displayed in a line, separated by semicolons
           - none: disable call chain display.

           threshold is a percentage value which specifies a minimum percent to be
           included in the output call graph.  Default is 0.5 (%).

           print_limit is only applied when stdio interface is used.  It's to
           limit number of call graph entries in a single hist entry.  Note
           that it needs to be given after threshold (but not necessarily
           consecutive).  Default is 0 (unlimited).

           order can be either:
           - callee: callee based call graph.
           - caller: inverted caller based call graph.
           Default is 'caller' when --children is used, otherwise 'callee'.

           sort_key can be:
           - function: compare on functions (default)
           - address: compare on individual code addresses
           - srcline: compare on source filename and line number

           branch can be:
           - branch: include last branch information in callgraph when available.
                     Usually more convenient to use --branch-history for this.

           value can be:
           - percent: display overhead percent (default)
           - period: display event period
           - count: display event count

From this description, we can see that the -g option is equivalent to --call-graph, and its argument is a comma-separated list of options meaning, in order: the print type, the minimum percentage threshold, the print limit, the call order, the sort key, the branch option, and the value type.

We can see that the default parameters here are graph,0.5,caller,function,percent. The detailed meanings are explained in the documentation, so I won’t repeat them here.

Now let's go back to our problem. The call stacks are not fully displayed, and the relevant parameter is the minimum threshold. According to the manual, a symbol's call stack is displayed only when its event proportion is above this threshold.

The default value for threshold is 0.5%, which means that the call stack will be displayed when the event proportion exceeds 0.5%. Looking at the event proportion of our app in the case, it is only 0.34%, which is lower than 0.5%. So it’s normal that we can’t see the call stack of the app.

In this case, you just need to set a threshold smaller than 0.34% for perf report so that you can see the desired call graph. For example, execute the following command:

$ perf report -g graph,0.3

Then you will get the new output interface shown below. After expanding app, you can see its call stack.

Question 4: How to understand the perf report output #

At this point, you may have wondered why we didn't simply use perf to solve the problem from the very beginning, instead of running so many other tools first. This question actually provides a good explanation.

In the perf report interface from the previous question, you must have noticed that swapper has a proportion as high as 99%. Intuitively, that is what we should look at first, so why didn't we?

In fact, once you understand the principle of the swapper, it’s easy to understand why we can ignore it.

When you see swapper, you may first think of the swap partition. In fact, swapper has nothing to do with swap: it creates the init process during system initialization and then turns itself into a lowest-priority idle task. In other words, swapper runs only when no other task is runnable on the CPU, so you could call it the “idle task”.

Returning to our question, in the perf report interface, if you expand its call stack, you will see that the swapper’s clock events are spent on do_idle, which means executing the idle task.

Therefore, when analyzing the case, we simply ignore the 99% proportion at the top and instead analyze app, which accounts for only 0.3%. From here you can also understand why we didn't start the analysis with perf.

In a multitasking system, events with a larger count do not necessarily indicate performance bottlenecks. So, observing a large value alone does not necessarily indicate any problem. To determine if there is a bottleneck, you need to observe multiple aspects and multiple indicators to cross-validate. This is also a point I repeatedly emphasized in the guide.

In addition, the meanings of Children and Self are actually explained in detail in the manual, and an example is given to explain how their percentages are calculated. In simple terms,

  • Self is the proportion taken up by the symbol itself (the symbol, shown in the last column, can be understood as a function);

  • Children is the sum of proportions occupied by other symbols called by this symbol (can be understood as sub-functions, including direct and indirect calls).

As mentioned in the comments, many performance tools do have some impact on system performance. Taking perf as an example, it needs to track various events in the kernel, which inevitably brings some overhead. Although this has little impact on most applications, it can be a disaster for certain specific applications (such as those sensitive to clock cycles).

Therefore, when using performance tools, you should indeed consider the impact of the tools themselves on system performance. In such cases, you need to understand the principles of these tools. For example,

  • Dynamic trace tools like perf will bring some performance loss to the system.

  • Tools like vmstat and pidstat, which obtain their metrics by directly reading the /proc file system, will not cause noticeable performance loss.

I am glad to see such strong enthusiasm for learning in the comments. Under many articles, a large number of comments ask me to recommend books and learning materials, and this is also something I am happy to see. Studying this column is definitely not the entirety of your performance optimization journey. Being able to introduce you to entry-level materials, help you solve practical problems, and even spark your enthusiasm for learning makes me very happy.

In the article “How to learn Linux performance optimization”, I have introduced Brendan Gregg, who is undoubtedly a master of performance optimization. You can basically see his performance tool chart in various Linux performance optimization articles.

Therefore, my favorite book on performance optimization is the one he wrote, “Systems Performance: Enterprise and the Cloud”. This book has also been translated into Chinese, titled “性能之巅:洞悉系统、企业与云计算”.

From the publication date, this book is indeed quite old, as the English version was published in 2013. But the reason why classics become classics is because they do not become outdated. The performance analysis ideas and many performance tools in this book are still applicable today.

Additionally, I recommend following his personal website http://www.brendangregg.com/, especially the Linux Performance page, which gathers a lot of material on Linux performance optimization, such as:

  • the Linux performance tools chart;
  • performance analysis references;
  • videos of talks on performance optimization.

However, much of the content there involves a lot of kernel knowledge and is not beginner-friendly. But if you want to become an expert, hard work and perseverance are unavoidable. So when you read these materials, don't give up as soon as you hit something you don't understand; it is normal not to understand everything the first time through. Overcome the fear, keep moving forward, and many of the earlier problems will resolve themselves; the second or third reading will be much easier.

As I’ve said before, hold onto the main line and don’t waver. Start with the basic principles, grasp the idea of performance analysis, and then gradually delve deeper into the details. Don’t try to bite off more than you can chew.

Finally, feel free to continue writing your questions in the comment section. I will continue to answer them. My goal remains unchanged: I hope to turn the knowledge in the articles into your capabilities. We not only practice in real scenarios, but also progress through communication.