50 Case Study: How to Use Dynamic Tracing (Part One) #

Hello, I’m Ni Pengfei.

In the previous lesson, I used the problem of high CPU usage by the ksoftirqd kernel thread as an example to show you how to analyze high CPU usage in kernel threads. Let’s briefly review.

When a kernel thread shows abnormal resource usage, many of the commonly used process-level performance tools cannot be applied to it directly. In this case, we can use perf, which ships with the kernel, to observe the thread’s behavior, identify its hot functions, and narrow down the performance bottleneck. However, the summary reports generated by perf are not very intuitive, so I usually recommend flame graphs to assist with troubleshooting.

In fact, while perf is analyzing the system’s kernel threads, those threads keep running normally, which is why this approach is also called dynamic tracing.

Dynamic tracing technology collects runtime information of the kernel or application program through probe mechanisms, enabling you to analyze and locate the issues you want to troubleshoot without modifying the code of the kernel and application program.

In the past, when troubleshooting and debugging performance issues, we often needed to set a series of breakpoints for the application program (such as using GDB), and then analyze the application program’s state at these breakpoints manually or through scripts (such as GDB’s Python extension). Alternatively, we would add a series of logs to find clues from them.

However, breakpoints often interrupt the normal operation of the application, and adding new logs often requires recompiling and deploying. Although these methods are still widely used today, they are often time-consuming and labor-intensive when troubleshooting complex performance issues, and they can have a huge impact on the normal operation of the application.

In addition, there are plenty of performance problems that these methods handle poorly. For example, some issues occur only rarely and only show up in the production environment; reproducing such hard-to-reproduce problems well enough to set breakpoints or add logs is a huge challenge.

The emergence of dynamic tracing technology provides a perfect solution for these problems: it does not require stopping the service or modifying the application code. It can analyze the root causes of the problem while everything continues to run normally in the original way.

Moreover, compared with previous process-level tracing methods (such as ptrace), dynamic tracing often only brings minimal performance overhead (usually 5% or less).

Since dynamic tracing has so many benefits, what dynamic tracing methods are there, and how do we use them? That is what I will show you today. Because dynamic tracing involves a lot of material, I will explain it in two parts, and today’s lesson is the first part.

Dynamic Tracing #

When it comes to dynamic tracing, we have to mention DTrace, which originated from the Solaris system. DTrace is the pioneer of dynamic tracing technology, providing a universal observation framework and allowing for free extension using the D language.

The working principle of DTrace is shown in the following diagram. Its runtime resides permanently in the kernel; users submit tracing scripts written in the D language through the dtrace command, and the in-kernel runtime executes them. DTrace can trace all user-mode and kernel-mode events, and it keeps the performance overhead minimal through a series of optimizations.

(Image from BSDCan)
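To give you a feel for the D language, here is a minimal one-liner of the kind you could run on a system where DTrace is available (for example Solaris or macOS); the probe and output format are purely illustrative:

# Print the process name and file path each time the open() system call is entered
# (only works where DTrace itself runs, e.g. Solaris or macOS)
$ dtrace -n 'syscall::open:entry { printf("%s opened %s\n", execname, copyinstr(arg0)); }'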

Although DTrace itself still cannot run on Linux today, it has had a significant influence on dynamic tracing in Linux. Many engineers have tried to port DTrace to Linux, and the best known of these efforts is SystemTap, promoted by Red Hat.

Like DTrace, SystemTap defines its own scripting language so that users can extend it freely as needed. Unlike DTrace, however, SystemTap has no resident kernel runtime: it compiles each script into a kernel module and inserts it into the kernel for execution. As a result, SystemTap is slower to start and depends on a complete set of debug symbols.

(Image from Dynamic Tracing Technology)
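For comparison, here is a minimal SystemTap sketch, assuming the systemtap package and kernel debug symbols are installed; it probes the same do_sys_open kernel function that we will trace with ftrace later in this lesson:

# Print the process name and PID each time do_sys_open() is entered; Ctrl+C to stop
$ stap -e 'probe kernel.function("do_sys_open") { printf("%s(%d) opened a file\n", execname(), pid()) }'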

In summary, in order to trace events in the kernel or user space, both DTrace and SystemTap associate the user-provided trace processing functions (generally called Actions) with probes, which are the detection points relied upon by various dynamic tracing technologies.

Event Sources of Dynamic Tracing #

According to the different event types, the event sources used in dynamic tracing can be divided into three categories: static probes, dynamic probes, and hardware events. Their relationship is shown in the following diagram:

(Image from Brendan Gregg Blog)

Among them, hardware events are usually generated by performance monitoring counters (PMC), which include various hardware performance metrics such as CPU cache, instruction cycles, branch prediction, etc.
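For example, you can list these hardware events and count a few of them with perf (exactly which events are available depends on your CPU):

# List the hardware events supported on this machine
$ perf list hw

# Count a few PMC-based metrics while running ls
$ perf stat -e cycles,instructions,cache-references,cache-misses ls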

Static probes are probes that are defined in the code in advance and compiled into the application or kernel. They are only executed when the corresponding tracing functionality is enabled; when it is disabled, they do nothing. Common static probes include tracepoints in the kernel and USDT (Userland Statically Defined Tracing) probes.

  • Tracepoints are probe points inserted into the source code, guarded by control conditions, that allow processing functions to be attached later. For example, in the kernel, the most common form of static tracing is printk, which simply outputs log messages. The Linux kernel defines a large number of tracepoints, which can be enabled or disabled through kernel compile options.

  • USDT (Userland Statically Defined Tracing) probes require inserting DTRACE_PROBE() calls into the source code and compiling them into the application. Many applications ship with built-in USDT probes, including MySQL and PostgreSQL.

Dynamic probes are probes that are not defined in the code beforehand but can be added dynamically at runtime, for example on function calls and returns. Dynamic probes let you add probe points on demand in the kernel or in applications, which gives you much greater flexibility. There are two common types of dynamic probes: kprobes for kernel mode and uprobes for user mode, as described below.

  • kprobes are used to trace functions in kernel mode, including kprobe for function calls and kretprobe for function returns.
  • uprobes are used to trace functions in user mode, including uprobe for function calls and uretprobe for function returns.

Note that kprobes require CONFIG_KPROBE_EVENTS to be enabled at kernel compile time, while uprobes require CONFIG_UPROBE_EVENTS to be enabled.
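To verify this on your own machine, and to see the raw kprobe interface in action, you can try something like the sketch below. The kernel config path is typical for Ubuntu, “myopen” is just a probe name chosen for this example, and the output lines are illustrative:

# Check that kprobe/uprobe event support is compiled into the kernel
$ grep -E 'CONFIG_(KPROBE|UPROBE)_EVENTS' /boot/config-$(uname -r)
CONFIG_KPROBE_EVENTS=y
CONFIG_UPROBE_EVENTS=y

# Register a kprobe named "myopen" on do_sys_open, confirm it, then remove it
$ echo 'p:myopen do_sys_open' >> /sys/kernel/debug/tracing/kprobe_events
$ cat /sys/kernel/debug/tracing/kprobe_events
p:kprobes/myopen do_sys_open
$ echo '-:myopen' >> /sys/kernel/debug/tracing/kprobe_events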

Dynamic Tracing Mechanisms #

Building upon these probes, Linux also provides a series of dynamic tracing mechanisms such as ftrace, perf, and eBPF.

ftrace was originally used for function tracing and was later extended to support various kinds of event tracing. Its usage interface is similar to the procfs interface we discussed earlier: through debugfs (and, since kernel 4.1, tracefs as well), ftrace exposes its interface to user space as ordinary files.

With this, you can interact with ftrace by reading or writing files within the mountpoint (usually the /sys/kernel/debug/tracing directory), allowing you to trace the runtime events of the kernel or applications without requiring additional tools.

perf is a familiar tool. We have used its event recording and analysis functionality in many previous cases. This is actually a simple form of static tracing mechanism. With perf, you can also customize dynamic events (perf probe) to focus on events you are truly interested in.
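I will cover perf probe in detail in the next lesson, but as a quick preview, adding and recording a dynamic probe on do_sys_open looks roughly like this (commands assume a kernel with perf probe support):

# Add a dynamic probe on the kernel function do_sys_open
$ perf probe --add do_sys_open

# Record the new probe event system-wide for 10 seconds, then inspect the samples
$ perf record -e probe:do_sys_open -aR sleep 10
$ perf script

# Remove the probe when you are done
$ perf probe --del probe:do_sys_open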

eBPF is an extension of BPF (Berkeley Packet Filter). It not only supports event tracing mechanisms but also allows for free extension through custom BPF code written using the C language. Therefore, eBPF is essentially a runtime resident in the kernel and can be considered the Linux version of DTrace.

In addition, there are many tools outside the kernel that provide rich dynamic tracing functionalities. The most common ones include SystemTap, which we have mentioned before, BCC (BPF Compiler Collection) that we have used multiple times, as well as sysdig, which is commonly used for container performance analysis.
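For instance, once BCC is installed, a single tool invocation can trace every open() call on the system. The package and tool names below are what Ubuntu 18.04 typically uses (bpfcc-tools, with a -bpfcc suffix on each tool); other distributions may differ:

# Ubuntu
$ apt-get install -y bpfcc-tools
# Trace open() syscalls system-wide until Ctrl+C
$ opensnoop-bpfcc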

When analyzing a large number of events, using flame graphs as we mentioned in the previous lesson can visualize a large amount of data, allowing you to intuitively discover potential issues.

Next, I will show you how to use these mechanisms to dynamically trace the execution of the kernel and applications through several examples. The following examples are based on the Ubuntu 18.04 system and are also applicable to other systems.

Note: All commands below are assumed to be run as the root user. If you are logged in as a regular user, please run the command sudo su root to switch to the root user.

ftrace #

First let’s take a look at ftrace. As mentioned earlier, ftrace provides an interface for user space through debugfs (or tracefs). So when using ftrace, we often start by switching to the mount point of debugfs.

$ cd /sys/kernel/debug/tracing
$ ls
README                      instances            set_ftrace_notrace  trace_marker_raw
available_events            kprobe_events        set_ftrace_pid      trace_options
...

If this directory does not exist, it means that your system has not mounted debugfs. You can execute the following command to mount it:

$ mount -t debugfs nodev /sys/kernel/debug
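On kernels 4.1 and later, you can also mount tracefs directly, which exposes the same tracing files without the rest of debugfs:

$ mount -t tracefs nodev /sys/kernel/tracing
$ cd /sys/kernel/tracing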

ftrace provides multiple tracers for tracking different types of information, such as function calls, interrupt disable, process scheduling, and so on. The specific supported tracers depend on the system configuration. You can execute the following command to query all the supported tracers:

$ cat available_tracers
hwlat blk mmiotrace function_graph wakeup_dl wakeup_rt wakeup function nop

Among these, “function” traces the execution of functions, while “function_graph” traces the call relationships between functions and presents them in a graph-like, indented output. These are the two most commonly used tracers.
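To illustrate the difference, here is a minimal sketch of the plain function tracer; the individual files it touches are explained step by step in the example below:

# Report every call to do_sys_open, without the nested call graph
$ echo do_sys_open > set_ftrace_filter
$ echo function > current_tracer
$ echo 1 > tracing_on; ls > /dev/null; echo 0 > tracing_on
$ head trace

# Restore the defaults
$ echo > set_ftrace_filter
$ echo nop > current_tracer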

In addition to tracers, before using ftrace, you also need to specify the tracing targets, which include kernel functions and kernel events.

  • Functions are the names of functions in the kernel.

  • Events are predefined trace points in the kernel source code.

Similarly, you can execute the following commands to query the supported functions and events:

$ cat available_filter_functions
$ cat available_events
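Both files are long, so in practice you usually grep them for the function or event you care about. For example (the second command assumes your kernel exposes syscall tracepoints):

# Confirm that do_sys_open can be traced, and look for open-related tracepoints
$ grep do_sys_open available_filter_functions
$ grep sys_enter_open available_events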

With an understanding of this basic information, next, I will use the “ls” command as an example to show you how to use ftrace.

To list files, the “ls” command opens directory files through the “open” system call, and in the kernel, the corresponding function for “open” is “do_sys_open”. So the first step we need to do is to set the function to be traced as “do_sys_open”:

$ echo do_sys_open > set_graph_function

Next, the second step is to configure the tracing options, enable function call tracing, and trace the calling processes:

$ echo function_graph > current_tracer
$ echo funcgraph-proc > trace_options

Then, the third step is to start the tracing:

$ echo 1 > tracing_on

The fourth step is to execute an “ls” command, and then stop the tracing:

$ ls
$ echo 0 > tracing_on

The fifth and final step is to view the tracing results:

$ cat trace
# tracer: function_graph
#
# CPU  TASK/PID         DURATION                  FUNCTION CALLS
# |     |    |           |   |                     |   |   |   |
 0)    ls-12276    |               |  do_sys_open() {
 0)    ls-12276    |               |    getname() {
 0)    ls-12276    |               |      getname_flags() {
 0)    ls-12276    |               |        kmem_cache_alloc() {
 0)    ls-12276    |               |          _cond_resched() {
 0)    ls-12276    |   0.049 us    |            rcu_all_qs();
 0)    ls-12276    |   0.791 us    |          }
 0)    ls-12276    |   0.041 us    |          should_failslab();
 0)    ls-12276    |   0.040 us    |          prefetch_freepointer();
 0)    ls-12276    |   0.039 us    |          memcg_kmem_put_cache();
 0)    ls-12276    |   2.895 us    |        }
 0)    ls-12276    |               |        __check_object_size() {
 0)    ls-12276    |   0.067 us    |          __virt_addr_valid();
 0)    ls-12276    |   0.044 us    |          __check_heap_object();
 0)    ls-12276    |   0.039 us    |          check_stack_object();
 0)    ls-12276    |   1.570 us    |        }
 0)    ls-12276    |   5.790 us    |      }
 0)    ls-12276    |   6.325 us    |    }
...

In the final output:

  • The first column indicates the running CPU;

  • The second column shows the task name and process PID;

  • The third column shows the time spent executing the function (DURATION);

  • The last column is the function call graph.

You can see that the function call graph visually depicts the call relationship between different functions through indentation at different levels.
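Once you are done, it is good practice to reset the tracer so that it stops collecting data in the background. A minimal cleanup sequence looks like this:

# Restore the default (no-op) tracer and clear the graph filter and option
$ echo nop > current_tracer
$ echo > set_graph_function
$ echo nofuncgraph-proc > trace_options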

Of course, you may have noticed the downside of ftrace: this five-step process is cumbersome and not very convenient to use. Don’t worry, though. trace-cmd already wraps these steps up for you, so you can complete the whole workflow with a single command-line tool.

You can run the following command to install trace-cmd:

# Ubuntu
$ apt-get install trace-cmd
# CentOS
$ yum install trace-cmd

Once installed, the original five-step tracing process can be simplified to the following two steps:

$ trace-cmd record -p function_graph -g do_sys_open -O funcgraph-proc ls
$ trace-cmd report
...
              ls-12418 [000] 85558.075341: funcgraph_entry:                   |  do_sys_open() {
              ls-12418 [000] 85558.075363: funcgraph_entry:                   |    getname() {
              ls-12418 [000] 85558.075364: funcgraph_entry:                   |      getname_flags() {
              ls-12418 [000] 85558.075364: funcgraph_entry:                   |        kmem_cache_alloc() {
              ls-12418 [000] 85558.075365: funcgraph_entry:                   |          _cond_resched() {
              ls-12418 [000] 85558.075365: funcgraph_entry:        0.074 us   |            rcu_all_qs();
              ls-12418 [000] 85558.075366: funcgraph_exit:         1.143 us   |          }
              ls-12418 [000] 85558.075366: funcgraph_entry:        0.064 us   |          should_failslab();
              ls-12418 [000] 85558.075367: funcgraph_entry:        0.075 us   |          prefetch_freepointer();
              ls-12418 [000] 85558.075368: funcgraph_entry:        0.085 us   |          memcg_kmem_put_cache();
              ls-12418 [000] 85558.075369: funcgraph_exit:         4.447 us   |        }
              ls-12418 [000] 85558.075369: funcgraph_entry:                   |        __check_object_size() {
              ls-12418 [000] 85558.075370: funcgraph_entry:        0.132 us   |          __virt_addr_valid();
              ls-12418 [000] 85558.075370: funcgraph_entry:        0.093 us   |          __check_heap_object();
              ls-12418 [000] 85558.075371: funcgraph_entry:        0.059 us   |          check_stack_object();
              ls-12418 [000] 85558.075372: funcgraph_exit:         2.323 us   |        }
              ls-12418 [000] 85558.075372: funcgraph_exit:         8.411 us   |      }
              ls-12418 [000] 85558.075373: funcgraph_exit:         9.195 us   |    }
...

You will find that the output of trace-cmd is similar to that of cat trace.

From this example, you can see that when you want to understand the execution flow of a particular kernel function, ftrace lets you trace it and follow its execution step by step.

Summary #

Today, I have taught you common dynamic tracing methods. Dynamic tracing refers to using probes provided in the kernel to track the behavior of a running system or application, assisting in identifying performance bottlenecks.

In Linux systems, common dynamic tracing methods include ftrace, perf, eBPF, and SystemTap. When you have located a specific kernel function but are unsure about its implementation details, you can use ftrace to trace its execution process. As for other dynamic tracing methods, I will continue to explain them in detail in the next lesson.

Reflection #

Finally, I have a question for you to ponder. In today’s case, we used the ftrace mechanism provided by the Linux kernel to understand the execution process of kernel functions. In the previous lesson, we used perf and flame graphs to observe the kernel’s call stack.

Based on these two cases, what do you think are the differences between these two methods? When you need to understand the behavior of the kernel, how do you choose between the two, or in other words, which method is suitable for what kind of scenarios? You can summarize your thoughts based on today’s content and your own operation records.

Feel free to discuss with me in the comments section, and feel free to share this article with your colleagues and friends. Let’s practice in real-world scenarios and make progress through communication.