51 Case Study: How to Use Dynamic Tracing (Part Two) #

Hello, I’m Ni Pengfei.

In the previous section, I introduced you to common dynamic tracing methods. Dynamic tracing is the process of tracking the behavior of a system or application while it is running, using probes provided by the kernel. This helps identify performance bottlenecks.

By using dynamic tracing, you can gain insights into the behavior of an application or the kernel without modifying the code or restarting the service. This is particularly effective for troubleshooting online issues, especially those that are difficult to reproduce.

In Linux systems, common dynamic tracing methods include ftrace, perf, eBPF, and SystemTap. In the previous lesson, we specifically studied the usage of ftrace. Today, let’s take a look at a few other methods together.

perf #

perf is an old friend of ours by now. In previous cases, we have used it multiple times to find hot functions in applications or in the kernel and thus locate performance bottlenecks. In the case of high CPU usage in kernel threads, we also used flame graphs to visualize the events perf recorded, which made it much more intuitive to pinpoint the problem.

However, in our previous use of perf record/top, we first sampled the events and then evaluated the frequency of function calls based on the samples. In fact, perf’s capabilities go far beyond this. For example,

  • perf can be used to analyze various hardware events such as CPU cache, CPU migration, branch prediction, and instruction cycles;

  • perf can also be used to dynamically trace only the events of interest.

Next, we will once again use the kernel function do_sys_open and the user space function readline as examples to see how to use perf for dynamic tracing.

Just like with ftrace, you can use the perf list command to query all the supported events:

$ perf list

Then, add the --event option to the various perf subcommands to specify the events you want to trace. If the predefined events do not meet your actual needs, you can also use perf probe to dynamically add events. Furthermore, in addition to tracing kernel events, perf can also be used to trace user-space functions.
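For instance, here is a quick sketch of counting a few of the hardware events mentioned above with perf stat, and of recording a predefined scheduler tracepoint; exact event names and availability depend on your CPU (and many hardware counters are unavailable inside virtual machines):

# Count hardware events while running ls (event availability varies by CPU/VM)
$ perf stat -e cycles,instructions,cache-misses,branch-misses -- ls

# Record a predefined tracepoint system-wide for 1 second
$ perf record -e sched:sched_switch -aR sleep 1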

Let’s first look at an example of perf tracing a kernel function, do_sys_open. You can run the perf probe command to add a probe for do_sys_open:

$ perf probe --add do_sys_open
Added new event:
  probe:do_sys_open    (on do_sys_open)
You can now use it in all perf tools, such as:
    perf record -e probe:do_sys_open -aR sleep 1

After successfully adding the probe, you can use it in all perf subcommands. For example, the output above shows a sample perf record command line; executing the variant below samples do_sys_open calls system-wide for 10 seconds:

$ perf record -e probe:do_sys_open -aR sleep 10
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.148 MB perf.data (19 samples) ]

Once the sampling is successful, you can execute perf script to view the sampling results:

$ perf script
            perf 12886 [000] 89565.879875: probe:do_sys_open: (ffffffffa807b290)
           sleep 12889 [000] 89565.880362: probe:do_sys_open: (ffffffffa807b290)
           sleep 12889 [000] 89565.880382: probe:do_sys_open: (ffffffffa807b290)
           sleep 12889 [000] 89565.880635: probe:do_sys_open: (ffffffffa807b290)
           sleep 12889 [000] 89565.880669: probe:do_sys_open: (ffffffffa807b290)

The output also lists the task name, the process PID, and the CPU on which do_sys_open was called. However, for the open system call, knowing that it was called is not enough; we also want to know which files the process actually opened. So, in practice, we would like the trace to also display the arguments of these functions.

For kernel functions, you can of course refer to the kernel source code to find all its parameters. However, there is a simpler method, which is to query directly from the debug symbol table. By executing the command below, you can find all the parameters of do_sys_open:

$ perf probe -V do_sys_open
Available variables at do_sys_open
        @<do_sys_open+0>
                char*   filename
                int     dfd
                int     flags
                struct open_flags       op
                umode_t mode

From here, we can see that the file path we are interested in is the first character-pointer parameter (i.e., a string), and its name is filename. If this command fails, it means the debug symbol table is not installed. In that case, you can run the command below to install the debug information and try again:

# Ubuntu
$ apt-get install linux-image-`uname -r`-dbgsym
# CentOS
$ yum --enablerepo=base-debuginfo install -y kernel-debuginfo-$(uname -r)

Once you have identified the parameter names and types, you can add them to the probe. However, since we have already added a probe with the same name, you need to delete the old probe before adding the new one:

# Delete the old probe first
$ perf probe --del probe:do_sys_open

# Add the probe with parameters
$ perf probe --add 'do_sys_open filename:string'
Added new event:
  probe:do_sys_open    (on do_sys_open with filename:string)
You can now use it in all perf tools, such as:
    perf record -e probe:do_sys_open -aR sleep 1

After adding the new probe, you can re-execute the record and script subcommands to sample and view the records:

# Resample the record
$ perf record -e probe:do_sys_open -aR ls

# View results
$ perf script
    perf 13593 [000] 91846.053622: probe:do_sys_open: (ffffffffa807b290) filename_string="/proc/13596/status"
      ls 13596 [000] 91846.053995: probe:do_sys_open: (ffffffffa807b290) filename_string="/etc/ld.so.cache"
      ls 13596 [000] 91846.054011: probe:do_sys_open: (ffffffffa807b290) filename_string="/lib/x86_64-linux-gnu/libselinux.so.1"
      ls 13596 [000] 91846.054066: probe:do_sys_open: (ffffffffa807b290) filename_string="/lib/x86_64-linux-gnu/libc.so.6"
...

# Don't forget to delete the probe after use
$ perf probe --del probe:do_sys_open

Now, you can see the files that are being opened every time the open function is called. By the way, do these results look familiar to you?

In fact, when we trace the system calls of a process using strace, we often see these dynamic libraries. For example, when tracing the ls command using strace, you can get the following results:

$ strace ls
...
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
...
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libselinux.so.1", O_RDONLY|O_CLOEXEC) = 3
...

You might be wondering why we need perf when strace is already so easy to use. Many people only see strace's simplicity and ease of use and overlook its impact on the performance of the traced process. In principle, strace is built on the ptrace system call, which brings two problems.

  • Since ptrace is a system call, every traced event involves context switches between kernel mode and user mode. When the event rate is high, these frequent context switches inevitably hurt the performance of the original service.

  • ptrace requires the use of the SIGSTOP signal to suspend the target process. This signal control and process suspension will affect the behavior of the target process.

Therefore, I do not recommend using strace (or other performance tools based on ptrace) for troubleshooting and debugging in performance-sensitive applications such as databases.
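If you want a rough feel for that overhead, one crude comparison (nowhere near a rigorous benchmark) is to time a syscall-heavy command with and without strace attached:

# Rough comparison only; the real impact depends on the syscall rate of your workload
$ time ls -lR /usr/share > /dev/null
$ time strace -f -o /dev/null ls -lR /usr/share > /dev/null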

Inspired by strace and built on the kernel's utrace mechanism, perf also provides a trace subcommand, which is the recommended replacement for strace. Because perf trace is based on kernel events rather than ptrace, it naturally performs much better when tracing processes.

The usage of perf trace is similar to strace, as shown below:

$ perf trace ls
         ? (         ): ls/14234  ... [continued]: execve()) = 0
     0.177 ( 0.013 ms): ls/14234 brk(                                       ) = 0x555d96be7000
     0.224 ( 0.014 ms): ls/14234 access(filename: 0xad98082                 ) = -1 ENOENT No such file or directory
     0.248 ( 0.009 ms): ls/14234 access(filename: 0xad9add0, mode: R        ) = -1 ENOENT No such file or directory
     0.267 ( 0.012 ms): ls/14234 openat(dfd: CWD, filename: 0xad98428, flags: CLOEXEC ) = 3
     0.288 ( 0.009 ms): ls/14234 fstat(fd: 3</usr/lib/locale/C.UTF-8/LC_NAME>, statbuf: 0x7ffd2015f230 ) = 0
     0.305 ( 0.011 ms): ls/14234 mmap(len: 45560, prot: READ, flags: PRIVATE, fd: 3 ) = 0x7efe0af92000
     0.324 Dockerfile  test.sh ( 0.008 ms): ls/14234 close(fd: 3</usr/lib/locale/C.UTF-8/LC_NAME> ) = 0
     ...

Not only that, perf trace can also trace system calls system-wide (i.e., across all processes), whereas strace can only trace a specific process.
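For instance, here is a minimal sketch of system-wide tracing with perf trace; run it as root, and note that option behavior can vary a little between perf versions:

# Trace system calls from all processes for 1 second
$ perf trace -a sleep 1

# Or print a per-task summary of system calls instead of every single event
$ perf trace -s -a sleep 1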

The second perf example is a user-space library function. Taking the readline function called by bash as an example, you can trace library functions in a similar way (based on uprobes).

The purpose of readline is to read user input from the terminal and return this data to the caller. Therefore, unlike the open system call, we are more concerned about the result of the readline function call.

By executing the following command and specifying the path of the bash binary with -x, you can dynamically trace library functions. This is actually tracing all commands executed by users in bash:

# Add readline probe to /bin/bash
$ perf probe -x /bin/bash 'readline%return +0($retval):string'

# Record samples
$ perf record -e probe_bash:readline__return -aR sleep 5

# View results
$ perf script
    bash 13348 [000] 93939.142576: probe_bash:readline__return: (5626ffac1610 <- 5626ffa46739) arg1="ls"

# Delete the probe after tracing is complete
$ perf probe --del probe_bash:readline__return

Of course, if you are unsure about the probe format, you can also use the following command to query all supported functions and function parameters:

# Query all functions
$ perf probe -x /bin/bash --funcs

# Query function parameters
$ perf probe -x /bin/bash -V readline
Available variables at readline
        @<readline+0>
                char*   prompt

Similar to kernel functions, if you want to view the function names and parameters of an ordinary application, the application's binary must include debug information.
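If you are not sure whether a binary contains debug information, a quick check is to look at its ELF sections; on many distributions the symbols are shipped separately in -dbg or -debuginfo packages, so the exact situation varies:

# Look for .debug_* sections in the binary
$ readelf -S /bin/bash | grep debug

# "not stripped" in the file output also hints that symbols are present
$ file /bin/bash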

eBPF and BCC #

The functionalities of ftrace and perf are already quite rich, but they share a common shortcoming: they are not flexible enough and cannot be freely extended with scripts the way DTrace can.

eBPF is, in effect, the Linux version of DTrace. It can be freely extended using the C language (these extensions are compiled to BPF bytecode by LLVM and then loaded into the kernel for execution). The following image illustrates how eBPF tracing works:

(Image source: THE NEW STACK)

From the image, you can see that eBPF execution involves three steps:

  • Generate BPF bytecode from the user tracing program;

  • Load the bytecode into the kernel and execute it;

  • Output the results to the user space.

Therefore, in terms of usage, eBPF is more complicated than ftrace and perf that we’ve seen before.

In fact, in eBPF tracing, the compilation, loading, and map operations are common to all tracing programs. BCC (BPF Compiler Collection) was created to abstract these steps away: it wraps the various eBPF event sources (such as kprobes, uprobes, and tracepoints) and data operations in Python interfaces (Lua is also supported). This means that, when using BCC for dynamic tracing, you only need to write a simple script.

However, please note that because BCC needs to interact with data structures in the kernel, the core event handling logic still needs to be written in C.

As for the installation method of BCC, I have already introduced it in the caching case of memory modules. If you haven’t installed it yet, you can execute the following command to install BCC (for installation on other systems, please refer to this link):

# Ubuntu
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 4052245BD4284CDD
echo "deb https://repo.iovisor.org/apt/$(lsb_release -cs) $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/iovisor.list
sudo apt-get update
sudo apt-get install bcc-tools libbcc-examples linux-headers-$(uname -r)

# RHEL 7.6
yum install bcc-tools

After installation, BCC will place all the examples (including Python and Lua) in the /usr/share/bcc/examples directory:

$ ls /usr/share/bcc/examples
hello_world.py  lua  networking  tracing
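Before writing your own script, the bundled hello_world.py is a handy sanity check that BCC actually works with your kernel; the exact path and the Python interpreter name (python vs. python3) may differ on your distribution:

# Prints a trace line each time a new process is cloned; press Ctrl+C to stop
$ sudo python3 /usr/share/bcc/examples/hello_world.py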

Next, let’s take do_sys_open as an example again and see how to achieve the same dynamic tracing with eBPF and BCC.

Usually, we can divide the use of BCC into the following four steps.

First, similar to using other Python modules, import the required modules before using them:

from bcc import BPF

Second, define the event and the function that handles it. The handler must be written in C; it is passed to the BPF module as a string and is used to initialize the BPF object imported above:

# define BPF program (""" is used for multi-line string).
# '#' indicates comments for python, while '//' indicates comments for C.
prog = """
#include <uapi/linux/ptrace.h>
#include <uapi/linux/limits.h>
#include <linux/sched.h>
// define output data structure in C
struct data_t {
    u32 pid;
    u64 ts;
    char comm[TASK_COMM_LEN];
    char fname[NAME_MAX];
};
BPF_PERF_OUTPUT(events);

// define the handler for do_sys_open.
// ctx is required, while the other params depend on the traced function.
int hello(struct pt_regs *ctx, int dfd, const char __user *filename, int flags){
    struct data_t data = {};
    data.pid = bpf_get_current_pid_tgid();
    data.ts = bpf_ktime_get_ns();
    if (bpf_get_current_comm(&data.comm, sizeof(data.comm)) == 0) {
        bpf_probe_read(&data.fname, sizeof(data.fname), (void *)filename);
    }
    events.perf_submit(ctx, &data, sizeof(data));
    return 0;
}
"""
# load BPF program
b = BPF(text=prog)
# attach the kprobe for do_sys_open, and set handler to hello
b.attach_kprobe(event="do_sys_open", fn_name="hello")

The third step is to define an output function and bind it to the BPF event:

# process event
start = 0
def print_event(cpu, data, size):
    global start
    # event's type is data_t
    event = b["events"].event(data)
    if start == 0:
            start = event.ts
    time_s = (float(event.ts - start)) / 1000000000
    print("%-18.9f %-16s %-6d %-16s" % (time_s, event.comm, event.pid, event.fname))

# loop with callback to print_event
b["events"].open_perf_buffer(print_event)

The final step is to execute the event loop and start tracing the do_sys_open calls:

# print header
print("%-18s %-16s %-6s %-16s" % ("TIME(s)", "COMM", "PID", "FILE”))
# start the event polling loop
while 1:
    try:
        b.perf_buffer_poll()
    except KeyboardInterrupt:
        exit()

Save the steps above to a file named trace-open.py, and then run it with Python. If all goes well, you should see output like the following:

$ python trace-open.py
TIME(s)            COMM             PID    FILE
0.000000000        irqbalance       1073   /proc/interrupts
0.000175401        irqbalance       1073   /proc/stat
0.000258802        irqbalance       1073   /proc/irq/9/smp_affinity
0.000290102        irqbalance       1073   /proc/irq/0/smp_affinity

From the output, you can see that the irqbalance process (and possibly other processes) is opening many files, and irqbalance relies on the contents read from these files to perform interrupt load balancing.

From this simple example, you can also see that using eBPF and BCC has a higher learning curve than using ftrace and perf. To develop your own dynamic tracing program with BCC, you need to be familiar with C language, Python language, the characteristics of the traced event or function (such as the parameters and return format of kernel functions), and the various data manipulation methods provided by eBPF.

However, despite the higher learning curve, the powerful flexibility of eBPF has made it the most popular and widely discussed dynamic tracing technology.

Of course, the BCC package also includes many ready-to-use utility tools, which are installed by default in the /usr/share/bcc/tools/ directory. The usage scenarios of these tools are shown in the following figure:

(Image from Linux Extended BPF (eBPF) Tracing Tools)

These tools can generally be used directly. They are also the best reference materials when writing other dynamic tracing scripts. However, there is one thing you need to pay special attention to: many of the new features of eBPF require relatively new kernel versions (as shown in the following figure). If some tools cannot run, it is likely because they use features that are not supported by the current kernel.

(Image from Linux Extended BPF (eBPF) Tracing Tools)
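As a quick illustration of how ready-to-use these tools are, the opensnoop tool in /usr/share/bcc/tools does essentially what our hand-written trace-open.py script does, in a single command (assuming bcc-tools is installed and your kernel supports the required features):

# Trace open()/openat() calls system-wide; run as root and press Ctrl+C to stop
$ sudo /usr/share/bcc/tools/opensnoop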

SystemTap and sysdig #

In addition to the aforementioned ftrace, perf, eBPF, and BCC, SystemTap and sysdig are also commonly used dynamic tracing tools.

SystemTap is also a dynamic tracing technology that can be freely extended through scripts. Before the emergence of eBPF, SystemTap was the dynamic tracing mechanism on Linux that came closest to DTrace in functionality. However, note that SystemTap has remained outside the mainline kernel for a long time (whereas eBPF has been part of the kernel from the start).

Therefore, in terms of stability, SystemTap works well only on RHEL systems, and it is prone to various abnormal issues on other systems. Of course, on the flip side, supporting older kernel versions such as 3.x is a huge advantage of SystemTap compared to eBPF.
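As a point of comparison, a minimal SystemTap sketch of the same do_sys_open tracing might look like the following; it assumes the systemtap package and the debuginfo matching your running kernel are installed:

# Print the process name and the file being opened; press Ctrl+C to stop
$ stap -e 'probe kernel.function("do_sys_open") { printf("%s: %s\n", execname(), kernel_string($filename)) }'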

sysdig, on the other hand, emerged with the rise of container technology and is mainly used for dynamic tracing of containers. It combines the strengths of a number of existing performance tools. I like to describe sysdig's characteristics with this formula: sysdig = strace + tcpdump + htop + iftop + lsof + docker inspect.

In the latest version (kernel version >= 4.14), sysdig can also be extended through eBPF, making it possible to trace various functions and events in the kernel.
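As a rough illustration, here is how you might watch file opens from the ls command using sysdig's filter syntax (assuming sysdig is installed; on older kernels the event may be open rather than openat):

# Show openat events from processes named ls; run as root and press Ctrl+C to stop
$ sudo sysdig "evt.type=openat and proc.name=ls"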

How to choose a tracing tool #

By now, you might feel overwhelmed by the various dynamic tracing tools available, and you’re probably wondering how to choose the right tool for your specific scenario. As mentioned before, the selection of performance tools depends on their specific working principles.

In the previous two lessons, we’ve already introduced the principles and characteristics of common tools. Now, you can start thinking about differentiating the tools based on various scenarios. For example:

  • In scenarios where high flexibility is not necessary, using perf to sample performance events and then using flame graphs for analysis is the most common approach.

  • For analyzing event or function-call statistics (such as observing the distribution of I/O sizes), use SystemTap or eBPF, together with custom scripts, to process the data.

I have also summarized several common scenarios for dynamic tracing and recommended suitable tools for each. You can save this table for easy reference and use.

Summary #

Today, I mainly taught you about dynamic tracing methods such as perf, eBPF, and BCC, and summarized how to choose the appropriate method for different scenarios.

In the Linux system, common dynamic tracing methods include ftrace, perf, eBPF, and SystemTap. For most performance issues, using perf with flame graphs is a good method. If that doesn’t meet your requirements, then:

  • In newer kernel versions, eBPF and BCC are the most flexible dynamic tracing methods;

  • In older kernel versions, especially in RHEL systems where eBPF support is restricted, SystemTap is often a better choice.

Furthermore, when using dynamic tracing techniques, in order to obtain detailed information about the analysis target, debug symbols for the kernel and application programs are generally required. Dynamic tracing is actually performed on these symbols (including functions and events), so easily readable and understandable symbols can help speed up the dynamic tracing process.

Reflection #

Finally, I would like to invite you to discuss the dynamic tracing technology as you understand it. Have you ever used dynamic tracing in a real-world environment? With so many dynamic tracing methods available, how do you usually choose? You can summarize your thoughts by combining today’s content with your own operational records.

Feel free to discuss with me in the comments section, and feel free to share this article with your colleagues and friends. Let’s practice in real-world scenarios and improve through communication.