
Bonus: How I Use Tracepoints to Analyze Kernel Bugs #

Hello, I'm Yafang Shao.

This series is aimed at application developers and operations engineers. That is why the examples in the course deliberately avoid analyzing kernel bugs and avoid explaining, function by function, why the kernel behaves the way it does. My hope is that the course lowers the barrier to entry for the kernel, so that more people can understand kernel mechanisms with less effort, solve application problems more effectively, and improve application performance.

However, there are also kernel developers among the learners, so I have written this bonus lesson on analyzing kernel bugs, hoping to share some of my experience with them.

Throughout the course you will have noticed that I strongly recommend tracepoints and ftrace. There is a good reason for that, and in this lesson I will show you how I use tracepoints to analyze kernel bugs.

The Source Code of Tracepoints: an Exquisite Piece of Work #

If you have ever looked at the kernel code behind tracepoints, I am sure you were deeply impressed by its dazzling maze of macro definitions. When I first saw these macros I was also confused and did not know where to start, but I worked through them fairly quickly. To prove to myself that I really understood them, I even added a few definitions to these macros myself.

If you can work through these macros, you have a clear picture of how the tracepoint machinery works. Of course, this lesson is not about dissecting the tracepoint source code, and not understanding it will not stop you from using tracepoints; still, for a kernel developer it would be a pity.
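
For readers who have not looked at that code, here is a minimal, illustrative sketch of the user-facing side of those macros: declaring an event with TRACE_EVENT(). The event name sample_alloc and its fields are made up for this illustration; it is not an existing kernel tracepoint, nor the macro work mentioned above.

/*
 * Illustrative only: in real kernel code this lives in a trace header
 * under include/trace/events/ with TRACE_SYSTEM defined.
 */
#include <linux/tracepoint.h>

TRACE_EVENT(sample_alloc,
	/* prototype of the generated trace_sample_alloc() function */
	TP_PROTO(unsigned int order, unsigned int gfp_flags),
	/* arguments forwarded from the call site */
	TP_ARGS(order, gfp_flags),
	/* fields stored in the ring buffer for each hit */
	TP_STRUCT__entry(
		__field(unsigned int, order)
		__field(unsigned int, gfp_flags)
	),
	/* how those fields are filled in when the tracepoint fires */
	TP_fast_assign(
		__entry->order = order;
		__entry->gfp_flags = gfp_flags;
	),
	/* how the event is rendered in the trace output */
	TP_printk("order=%u gfp_flags=%#x", __entry->order, __entry->gfp_flags)
);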

Because I use tracepoints so often, I also keep an eye on their development. For example, recently, at my urging, Steven Rostedt, the maintainer of the tracing subsystem, added another macro to the tracepoint code. I pushed for this macro so that tracepoint functions can be used from header files, avoiding the overhead of an extra function call. With this new macro, you can easily use tracepoints in header files.

The kernel bug I’m about to talk about next is analyzed with the help of tracepoints.

Why does memory allocation fail on the slow path? #

Once, the business personnel reported that they occasionally encountered failures when starting the program. After checking the kernel log, I found the following error messages (this system is CentOS-7, corresponding to kernel version 3.10):

kworker/31:0: page allocation failure: order:5, mode:0x104050
CPU: 31 PID: 635928 Comm: kworker/31:0 Tainted: G 
0000000000104050 000000009a44a60e ffff882016b93808 ffffffff81686b13
ffff882016b93898 ffffffff81187010 0000000000000000 ffff88207ffd8000
0000000000000005 0000000000104050 ffff882016b93898 000000009a44a60e
 Call Trace:
[<ffffffff81686b13>] dump_stack+0x19/0x1b
[<ffffffff81187010>] warn_alloc_failed+0x110/0x180
[<ffffffff816826a7>] __alloc_pages_slowpath+0x6b7/0x725
[<ffffffff8118b5c5>] __alloc_pages_nodemask+0x405/0x420
[<ffffffff811cf77a>] alloc_pages_current+0xaa/0x170
[<ffffffff81185eee>] __get_free_pages+0xe/0x50
[<ffffffff811db01e>] kmalloc_order_trace+0x2e/0xa0
[<ffffffff811e05d9>] __kmalloc_track_caller+0x219/0x230
[<ffffffff8119f78f>] krealloc+0x4f/0xa0
[<ffffffffa07eebe6>] osdmap_set_max_osd+0x76/0x1d0 [libceph]
[<ffffffffa07f14f6>] ceph_osdmap_decode+0x216/0x600 [libceph]
[<ffffffffa07ecce4>] handle_one_map+0x224/0x250 [libceph]
[<ffffffffa07ed98f>] ceph_osdc_handle_map+0x6cf/0x720 [libceph]
[<ffffffffa07e3340>] dispatch+0x350/0x7c0 [libceph]
[<ffffffffa07deecf>] try_read+0x4df/0x1260 [libceph]
[<ffffffffa07dfd09>] ceph_con_workfn+0xb9/0x650 [libceph]
[<ffffffff810a845b>] process_one_work+0x17b/0x470
[<ffffffff810a9296>] worker_thread+0x126/0x410
[<ffffffff810b0a4f>] kthread+0xcf/0xe0
[<ffffffff81697118>] ret_from_fork+0x58/0x90
Mem-Info:
active_anon:13891624 inactive_anon:358552 isolated_anon:0
active_file:1652146 inactive_file:14520052 isolated_file:0
unevictable:0 dirty:472 writeback:0 unstable:0
slab_reclaimable:1071185 slab_unreclaimable:201037
mapped:53234 shmem:1046930 pagetables:31965 bounce:0
free:966225 free_pcp:185 free_cma:0
Node 0 DMA free:15864kB min:48kB low:60kB high:72kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15948kB managed:15864kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 1700 64161 64161
Node 0 DMA32 free:261328kB min:5412kB low:6764kB high:8116kB active_anon:303220kB inactive_anon:2972kB active_file:51840kB inactive_file:239320kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1952512kB managed:1741564kB mlocked:0kB dirty:0kB writeback:0kB mapped:1160kB shmem:15968kB slab_reclaimable:687588kB slab_unreclaimable:183756kB kernel_stack:6640kB pagetables:724kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 62460 62460
Node 0 Normal free:272880kB min:198808kB low:248508kB high:298212kB active_anon:26813396kB inactive_anon:912660kB active_file:2646440kB inactive_file:30619568kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:65011712kB managed:63960000kB mlocked:0kB dirty:36kB writeback:0kB mapped:144152kB shmem:2567180kB slab_reclaimable:1870364kB slab_unreclaimable:262608kB kernel_stack:12880kB pagetables:45840kB unstable:0kB bounce:0kB free_pcp:652kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:32 all_unreclaimable? no

lowmem_reserve[]: 0 0 0 0
Node 1 Normal free:3315332kB min:205324kB low:256652kB high:307984kB active_anon:28449880kB inactive_anon:518576kB active_file:3910304kB inactive_file:27221320kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:67108864kB managed:66056916kB mlocked:0kB dirty:1852kB writeback:0kB mapped:67624kB shmem:1604572kB slab_reclaimable:1726788kB slab_unreclaimable:357784kB kernel_stack:18928kB pagetables:81296kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 0*4kB 1*8kB (U) 1*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15864kB
Node 0 DMA32: 36913*4kB (UEM) 14087*8kB (UEM) 44*16kB (UEM) 17*32kB (UEM) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 261596kB
Node 0 Normal: 69629*4kB (UEM) 411*8kB (UEM) 1*16kB (E) 3*32kB (E) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 281916kB
Node 1 Normal: 241701*4kB (UEM) 240734*8kB (UEM) 24010*16kB (UEM) 990*32kB (UEM) 175*64kB (UEM) 20*128kB (UEM) 3*256kB (EM) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3323044kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB

From the above log, it can be seen that this is a memory allocation failure in __alloc_pages_slowpath(), which is why this log is printed:

__alloc_pages_slowpath
{
    ...
nopage:
    // Error log printed here
    warn_alloc_failed(gfp_mask, order, NULL);
    return NULL;
}

At this time, the size of memory being requested is order 5, which corresponds to 32 consecutive pages. Following that, we can see the specific memory usage of each node. There are 2 nodes in total on this machine:

Node 0 DMA      free:15864kB    min:48kB
Node 0 DMA32    free:261328kB   min:5412kB
Node 0 Normal   free:272880kB   min:198808kB
Node 1 Normal   free:3315332kB  min:205324kB

From this we can see that the free memory of every zone is above its min watermark, and the margin is well over 32 pages (128KB). In other words, judging only by the amount of free memory, each zone should be able to satisfy this request. So why does the allocation still fail?

Next, let’s analyze the reasons for the failure together.

Investigating each possible scenario #

For version 3.10 of the kernel, the reasons for failure in the slow path of memory allocation can be divided into the following three scenarios:

  • Special GFP flags
  • Process state
  • Insufficient reclaim and compact capabilities

Now, let’s investigate each scenario to determine the cause.

GFP flags #

At this moment the GFP flags are 0x104050, which decomposes into the following flags:

#define ___GFP_WAIT             0x10u
#define ___GFP_IO               0x40u
#define ___GFP_COMP             0x4000u
#define ___GFP_KMEMCG           0x100000u
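
As a quick sanity check (plain user-space C, not kernel code), ORing those four values together reproduces the mode printed in the log:

#include <stdio.h>

int main(void)
{
	/* ___GFP_WAIT | ___GFP_IO | ___GFP_COMP | ___GFP_KMEMCG */
	unsigned int mode = 0x10u | 0x40u | 0x4000u | 0x100000u;

	printf("mode: %#x\n", mode);	/* prints "mode: 0x104050", matching the log */
	return 0;
}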

Have you ever wondered why the kernel log prints a raw hex value rather than the GFP flag names? If you have, congratulations: you already think like a kernel developer. If you have not, it is a habit worth building. I think it would be better to print the flag names directly, and when I checked the latest kernel code I found that this has already been changed upstream, so other kernel developers clearly had the same idea. It also shows how frustrating developing on an old kernel can be: the bug you just analyzed may long since have been fixed upstream, which greatly limits what you can contribute.

From the previous call stack, we can see that memory allocation is performed in the function osdmap_set_max_osd(). The corresponding kernel code is as follows:

osdmap_set_max_osd
    addr = krealloc(map->osd_addr, max*sizeof(*addr), GFP_NOFS);
    if (!addr)
        return -ENOMEM;

Here, we see that the GFP flags used are GFP_NOFS, which is defined as:

#define GFP_NOFS        (__GFP_WAIT | __GFP_IO)

GFP_NOFS accounts for the 0x50 part of the flags in the log; the remaining bits, ___GFP_COMP and ___GFP_KMEMCG, are added by the kmalloc path itself, because this is a large (compound-page) allocation that is accounted to the kmem cgroup. Checking these flags against the conditions that lead to goto nopage; in __alloc_pages_slowpath(), we can see that none of them triggers the nopage path on its own.

Now, a brief word on the purpose of GFP_NOFS: it tells direct reclaim not to call back into filesystem code (for example, to write back dirty file pages), because doing so while the caller already holds filesystem locks can deadlock in certain scenarios. For an example of such a deadlock, see my unfinished patch: PATCH: xfs: avoid deadlock when trigger memory reclaim in ->writepages. The problem exists both in 3.10 and in the latest kernel. The reason I have not finished the patch is that it depends on another patch set that I currently have very little time to work on. The code and description of that patch have the details, so I won't repeat them here.

Now that we have ruled out the possibility of nopage caused by the GFP flags, let’s continue to examine the other two scenarios.

Process flag current->flags #

In the warn_alloc_failed() function, we can see that if the failure is related to the process's own state (for example, the task is exiting or has been picked by the OOM killer), the SHOW_MEM_FILTER_NODES flag is cleared before the memory information is printed, and the per-zone details would not appear the way they do in our log.
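
The relevant check in the 3.10 mm/page_alloc.c looks roughly like the following; treat it as a condensed sketch rather than an exact quote:

warn_alloc_failed(gfp_t gfp_mask, int order, ...)
{
    unsigned int filter = SHOW_MEM_FILTER_NODES;
    ...
    /*
     * Exceptions for contexts that may allocate beyond current's limits:
     * OOM victims, tasks already in reclaim, and exiting tasks.
     */
    if (!(gfp_mask & __GFP_NOMEMALLOC))
        if (test_thread_flag(TIF_MEMDIE) ||
            (current->flags & (PF_MEMALLOC | PF_EXITING)))
            filter &= ~SHOW_MEM_FILTER_NODES;
    ...
}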

Since our log does contain the full per-zone details, the memory allocation failure on the slow path is not due to the process state.

Insufficient reclaim and compact capabilities #

Now, the only remaining scenario is “insufficient reclaim and compact capabilities”.

Based on the information we gathered earlier, we know that there is actually a considerable amount of free memory at this time, so we can exclude the possibility of reclaim being unable to meet the requirements. So, only compact remains as a factor. Specifically, since contiguous pages are being requested and the system does not have enough contiguous pages, compact cannot fulfill the request and this leads to the allocation failure.

Under what circumstances would compact fail? Let’s continue to review the code:

try_to_compact_pages
    int rc = COMPACT_SKIPPED;
    int may_enter_fs = gfp_mask & __GFP_FS;
    int may_perform_io = gfp_mask & __GFP_IO;

    /*
     * Compaction is skipped for order-0 requests and for requests that
     * are not allowed to enter the filesystem or perform I/O.
     */
    if (!order || !may_enter_fs || !may_perform_io)
        return rc;

We can see that __GFP_FS is not set, so compaction cannot be performed, and the function directly returns COMPACT_SKIPPED.

Now that we understand the problem, we need to verify it in the production environment to confirm whether this is the actual cause.

Analyzing Production Environment Using Tracepoints #

Tracepoints are a low-overhead tracing mechanism that does not significantly impact performance in a production environment.

Before using tracepoints for analysis, we need to understand what events can be traced.

Since our current issue is “compact fail,” we need to trace the “direct compact” event. There are tracepoints related to compact in newer kernel versions, and we can simply enable the corresponding tracepoint. However, for the 3.10 kernel version, there is no tracepoint related to compact. In this case, we need to use the kprobe mechanism, and the simplest way is to utilize either the kprobe_events functionality provided by ftrace or the function tracer functionality. Let’s take the function tracer as an example to trace “direct compact”:

$ echo function > /sys/kernel/debug/tracing/current_tracer
$ echo __alloc_pages_direct_compact > /sys/kernel/debug/tracing/set_ftrace_filter
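
Once the tracer is configured, the hits can be read back from the pipe. For completeness, the kprobe_events route mentioned above would look roughly like the following; the probe name direct_compact is just an illustrative label, not a predefined event:

$ cat /sys/kernel/debug/tracing/trace_pipe

$ echo 'p:direct_compact __alloc_pages_direct_compact' >> /sys/kernel/debug/tracing/kprobe_events
$ echo 1 > /sys/kernel/debug/tracing/events/kprobes/direct_compact/enable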

With these commands, whenever direct compact is attempted, the corresponding information is written to trace_pipe. However, this alone does not show what happened inside the compaction, so we need to combine it with other methods to see whether compaction was actually performed. There are many ways to do that; here we will use a relatively simple one based on reading the source code:

__alloc_pages_direct_compact
    try_to_compact_pages
        /* Check if the GFP flags allow compaction */
        if (!order || !may_enter_fs || !may_perform_io)
            return rc;

        /* If direct compact is possible, the COMPACTSTALL event is counted */
        count_compact_event(COMPACTSTALL);

From the source code, we can see that if direct compact is possible, there will be a corresponding COMPACTSTALL event, and this event will be recorded in /proc/vmstat:

$ cat /proc/vmstat | grep compact

This way, we can determine whether compaction actually happened when __alloc_pages_direct_compact was called. In addition, direct compaction goes hand in hand with direct reclaim, so we also need to look at the direct reclaim details to see whether it can successfully reclaim memory. For that we can use the direct reclaim tracepoints, which are already available in the 3.10 kernel:

$ echo 1 > /sys/kernel/debug/tracing/events/vmscan/mm_vmscan_direct_reclaim_begin/enable
$ echo 1 > /sys/kernel/debug/tracing/events/vmscan/mm_vmscan_direct_reclaim_end/enable

After tracing these events, we can see the information before and after direct compact.

The vmstat indicators before direct compact are:

$ cat /proc/vmstat | grep compact
compact_migrate_scanned 690141298
compact_free_scanned 186406569096
compact_isolated 332972232
compact_stall 87914
compact_fail 40935
compact_success 46979

The events during the compact process are:

The vmstat indicators after the compact process are:

$ cat /proc/vmstat | grep compact
compact_migrate_scanned 690141298
compact_free_scanned 186406569096
compact_isolated 332972232
compact_stall 87914
compact_fail 40935
compact_success 46979

From the results, we can see that the compact_stall counter did not change across the window, which means try_to_compact_pages() never actually performed compaction. Meanwhile, nr_reclaimed=3067 in the direct reclaim event shows that enough pages could be reclaimed, so direct reclaim is not the problem, and order=5, gfp_flags=GFP_NOFS|GFP_COMP in the same event matches the earlier log. This tracing data therefore confirms our earlier deduction: __GFP_FS is not set, so compaction is skipped.

Now, looking again at the log from the moment the allocation failed, we can see that the free lists actually do contain memory at the requested order (since GFP_DMA is not set, allocation starts from the Normal zone):

Node 0 Normal: 69629*4kB (UEM) 411*8kB (UEM) 1*16kB (E) 3*32kB (E) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 281916kB
Node 1 Normal: 241701*4kB (UEM) 240734*8kB (UEM) 24010*16kB (UEM) 990*32kB (UEM) 175*64kB (UEM) 20*128kB (UEM) 3*256kB (EM) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3323044kB

We can see that Node 1 has plenty of contiguous blocks of order 5 (128KB) and larger. So why can't the allocation be satisfied from those zone->free_area[order] lists?

The answer lies in the fact that the watermark for that zone does not meet the requirements (see __zone_watermark_ok()):

__zone_watermark_ok
{
    ...
    for (o = 0; o < order; o++) {
        /* Pages of order o cannot serve a request of a higher order */
        free_pages -= z->free_area[o].nr_free << o;

        /* Require fewer higher-order pages to be free */
        min >>= 1;

        if (free_pages <= min)
            return false;
    }
    ...
}

For Node 1, the free memory in the 4K/8K/16K/32K/64K lists (orders 0 through 4) adds up to 3319716kB, the min watermark for that zone is 205324kB, and the total free memory shown in its buddy list is 3323044kB. Plugging these numbers into the loop above:

3323044kB - 3319716kB = 3328kB    (free memory left in blocks of order 5 and above)
205324kB >> 5 = 6416kB            (the min watermark scaled down for order 5)

Since 3328kB is less than 6416kB, the order-5 watermark check fails.
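
Here is a small stand-alone sketch (plain user-space C, not kernel code) that replays the __zone_watermark_ok() loop with the Node 1 numbers from the buddy-list line above, just to make the arithmetic concrete:

#include <stdio.h>

int main(void)
{
	/* Node 1 Normal per-order free block counts, taken from the log above */
	long nr_free[11] = { 241701, 240734, 24010, 990, 175, 20, 3, 0, 0, 0, 0 };
	long free_pages = 3323044 / 4;	/* total free memory of the zone, in 4kB pages */
	long min = 205324 / 4;		/* the zone's min watermark, in 4kB pages */
	int order = 5;			/* the failing allocation order */

	for (int o = 0; o < order; o++) {
		free_pages -= nr_free[o] << o;	/* order-o blocks cannot serve this request */
		min >>= 1;			/* require fewer higher-order pages */
		if (free_pages <= min) {
			printf("after discounting orders 0..%d: free=%ldkB <= min=%ldkB, check fails\n",
			       o, free_pages * 4, min * 4);
			return 0;
		}
	}
	printf("watermark ok\n");
	return 0;
}

Running it prints free=3328kB against min=6416kB, the same numbers as in the comparison above.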

Based on this tracing data, we can conclude that the memory allocation failed on the slow path because this particular request is not allowed to perform compaction, while the system's free memory is fragmented into too many low-order blocks, so the watermark check for the requested order cannot be satisfied.

Solution #

Since the Normal zone still has plenty of free blocks of order 5 and above, and the overall watermark is met, an order-5 allocation should not fail; this is a kernel defect. When I checked the latest upstream code, I found that the defect had already been fixed there. Once again, you can see how frustrating it is to develop on an old kernel.

Regarding the upstream solution, you can refer to this patch and its dependencies:

mm, page_alloc: only enforce watermarks for order-0 allocations

If you are unable to solve this problem by modifying the kernel, you can take some mitigating measures instead (example commands follow the list).

  • Mitigation option 1: use drop_caches to clean up the page cache. This approach has many flaws; for details, see the page cache module of this course, so I won't go into them here.
  • Mitigation option 2: adjust the memory fragmentation index so that contiguous pages can still be assembled in emergency situations. The corresponding file is /proc/sys/vm/extfrag_threshold, with a default value of 500. We can lower this value appropriately, although in this case lowering it will not have a very obvious effect.
  • Mitigation option 3: perform manual compaction by writing to /proc/sys/vm/compact_memory.
  • Mitigation option 4: lower vm.vfs_cache_pressure to reclaim more page cache, thereby reducing the number of order-0 pages in the free list.
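
For reference, options 1 and 3 come down to writing the standard procfs files (as root, and with the usual caution on a production machine):

$ echo 1 > /proc/sys/vm/drop_caches      # option 1: drop the clean page cache
$ echo 1 > /proc/sys/vm/compact_memory   # option 3: trigger compaction on all nodes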

With this, our analysis of this problem is complete.

Summary #

In newer kernels, we can also use eBPF to analyze kernel bugs. For example, in our production environment I used eBPF to trace a business-latency jitter caused by a kernel bug in fadvise; you can find the details in the patch I contributed: PATCH: mm,fadvise: improve the expensive remote LRU cache draining after FADV_DONTNEED. This further reinforced my belief that kernel developers can accomplish more only by developing on newer kernel versions.

Also, powerful as eBPF is, it still has not managed to replace ftrace in my mind.