18 Case Study: Does the Business Need to Use Transparent Huge Pages for Better Performance? #

Hello, I’m Shao Yafang.

The case study for this lesson comes from a stability issue that I helped a business team analyze many years ago. The team reported that CPU utilization on some of their servers would suddenly spike and then quickly recover, with each episode lasting only a few seconds to a few minutes. On the monitoring charts, it looked like a series of spikes.

Because issues like this are common, I’d like to share the process of diagnosing this particular problem with you, so that you’ll know how to analyze CPU utilization spikes step by step when you encounter them in the future.

CPU utilization is a broad concept. When confronted with a high CPU utilization problem, we first need to determine what the CPU is busy doing: handling interrupts, waiting for I/O, executing kernel functions, or executing user functions. This is where fine-grained CPU utilization monitoring comes in, because these specific metrics can greatly assist us in analyzing the problem.

Refining CPU Utilization Monitoring #

Here, let’s take the commonly used top command as an example and look at the finer-grained CPU utilization indicators it reports (the display may vary slightly between top versions):

%Cpu(s): 12.5 us, 0.0 sy, 0.0 ni, 87.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st

The top command displays several indicators: us, sy, ni, id, wa, hi, si, and st, and together they sum to 100. You may wonder whether monitoring CPU utilization at this level of detail adds noticeable overhead. It does not, because CPU utilization monitoring usually just parses the /proc/stat file, which already contains these refined indicators.
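If you are curious where these numbers come from, here is a minimal sketch that computes the same breakdown directly from two /proc/stat samples. The field indices follow the order documented in proc(5) (user, nice, system, idle, iowait, irq, softirq, steal), and the /tmp file names are arbitrary:

#!/bin/sh

# Take two samples of the aggregate "cpu" line, one second apart.
grep '^cpu ' /proc/stat > /tmp/stat.1
sleep 1
grep '^cpu ' /proc/stat > /tmp/stat.2

# Fields 2-9 are user, nice, system, idle, iowait, irq, softirq and steal
# (in USER_HZ ticks); the deltas between the two samples give each share.
awk 'NR == FNR { for (i = 2; i <= 9; i++) prev[i] = $i; next }
     {
         total = 0
         for (i = 2; i <= 9; i++) { d[i] = $i - prev[i]; total += d[i] }
         printf "us %.1f  ni %.1f  sy %.1f  id %.1f  wa %.1f  hi %.1f  si %.1f  st %.1f\n",
                100*d[2]/total, 100*d[3]/total, 100*d[4]/total, 100*d[5]/total,
                100*d[6]/total, 100*d[7]/total, 100*d[8]/total, 100*d[9]/total
     }' /tmp/stat.1 /tmp/stat.2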

Let’s continue to look at the specific meanings of these indicators, which you can also find in the top manual:

us, user    : time running un-niced user processes
sy, system  : time running kernel processes
ni, nice    : time running niced user processes
id, idle    : time spent in the kernel idle handler
wa, IO-wait : time waiting for I/O completion
hi : time spent servicing hardware interrupts
si : time spent servicing software interrupts
st : time stolen from this vm by the hypervisor

The specific meanings and considerations of the above indicators are as follows:

[Table: the meaning of each CPU utilization indicator and what to watch out for]

Among these, id and wa are time when the CPU is not doing work, while the rest are time when it is. The main difference between id and wa is that id means the CPU has nothing to do, whereas wa means the CPU wants to do something but cannot. You can also think of wa as a special kind of idle: it occurs when at least one thread on that CPU is blocked waiting for I/O.

By monitoring CPU utilization in this more fine-grained way, we found that the high CPU utilization in this case was caused by a jump in sys utilization: for a few seconds, sys utilization would rise above 15% while usr stayed below 30%, and then everything returned to normal.

Therefore, we now need to capture the scene of the sudden spike in sys utilization.

Capturing High sys Usage Scenes #

As we mentioned earlier, high sys utilization on a CPU means that kernel functions are taking up too much time, so we need to capture which kernel functions are being executed when sys utilization spikes. There are many ways to capture kernel functions (typical invocations are sketched after this list), such as:

  • Using perf to capture CPU hotspots and see which kernel functions consume the most CPU time when sys utilization is high;
  • Using perf’s call-graph feature to view the call stacks, i.e., through which paths a thread reached those functions;
  • Using perf’s annotate feature to find which statements inside a kernel function are the most time-consuming;
  • Using ftrace’s function_graph tracer to see how long these kernel functions take and on which paths they spend the most time.
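To make the list concrete, here are typical invocations of these tools. This is only a rough sketch: the exact options and the tracing directory path depend on your perf and kernel versions.

# perf: sample on-CPU stacks system-wide for 10 seconds, then inspect them.
$ perf record -a -g -- sleep 10
$ perf report                  # hotspots, including kernel functions
$ perf annotate <symbol>       # per-instruction cost within a hot function

# ftrace: time spent inside kernel functions along each call path.
$ cd /sys/kernel/debug/tracing
$ echo function_graph > current_tracer
$ cat trace | head -50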

However, these commonly used tracing methods are better suited to collecting information over a period of time; they are not a good fit for a transient problem that appears and disappears within seconds.

So for this kind of transient state, what I wanted was a system snapshot that records what the CPU is doing at that moment, so that we can combine it with the kernel source code to work out why sys utilization is high.

There is a tool that can capture this kind of transient system state: sysrq (System Request). sysrq is a tool I often use when analyzing kernel problems. With it, you can take a snapshot of current memory usage or of the running tasks, trigger a vmcore that saves the full system state, and even kill the largest memory consumer when memory is tight. sysrq can fairly be called a powerful tool for analyzing many difficult problems.

To use sysrq to analyze problems, first, you need to enable sysrq. I suggest you enable all the functions of sysrq. You don’t have to worry about any extra overhead, and there is no risk. The enabling method is as follows:

$ sysctl -w kernel.sysrq=1
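If you want this setting to survive a reboot, you can also persist it, assuming your distribution reads /etc/sysctl.conf at boot:

$ echo "kernel.sysrq = 1" >> /etc/sysctl.conf
$ sysctl -p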

After enabling sysrq, you can use its t command to save a snapshot of the current tasks and see which tasks exist in the system and what each of them is doing. The usage is as follows:

$ echo t > /proc/sysrq-trigger

The task snapshot is then printed to the kernel ring buffer, and you can view it with the dmesg command:

$ dmesg
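Since a full task dump can be long, I suggest saving it to a file right away; on busy machines you may also need a larger kernel log buffer (the log_buf_len= boot parameter), otherwise the oldest lines of the snapshot get overwritten before you read them. For example (the file name is arbitrary):

$ dmesg > /tmp/task_snapshot.log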

To capture this kind of transient state, I wrote a script. Below is a simple example:

#!/bin/sh

# Poll CPU utilization once a minute; when sys is high while usr is low,
# trigger a sysrq task dump so that the moment is captured for later analysis.
while [ 1 ]; do
     # Take the second of the two top samples (the first one reflects
     # averages since boot rather than the current utilization).
     top -bn2 | grep "Cpu(s)" | tail -1 | awk '{
         # $2 is usr, $4 is sys.
         if ($2 < 30.0 && $4 > 15.0) {
              # save the current usr and sys into a time-stamped tmp file
              while ("date" | getline date) {
                   split(date, str, " ");
                   prefix=sprintf("%s_%s_%s_%s", str[2], str[3], str[4], str[5]);
              }

              sys_usr_file=sprintf("/tmp/%s_info.highsys", prefix);
              print $2 > sys_usr_file;
              print $4 >> sys_usr_file;

              # run sysrq to dump a task snapshot
              system("echo t > /proc/sysrq-trigger");
         }
     }'
     sleep 1m
done

This script checks for moments when sys utilization is above 15% while usr utilization is low, which indicates that the CPU is spending too much time in the kernel; when that happens, it runs sysrq to save a task snapshot. As you can see, the script only checks once a minute, because I did not want it to add noticeable overhead; in our team, several machines hit this condition two or three times a day, and on some of them each episode lasted a few minutes, so this interval was sufficient. If the problem you encounter occurs less frequently or lasts for a shorter time, you will need a more precise capture method.
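If you want to try something like this yourself, a simple way to keep the script running in the background is shown below (catch_highsys.sh is just a hypothetical file name):

$ chmod +x catch_highsys.sh
$ nohup ./catch_highsys.sh > /dev/null 2>&1 &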

Transparent Huge Pages: Water Can Carry a Boat, but Also Capsize It #

After deploying the script, we captured the problem in the act. From the information output by dmesg, we found that the threads in the R state were performing compaction (memory defragmentation). The call stack of one of these threads is shown below (this is a fairly old kernel, version 2.6.32):

    java          R  running task        0 144305 144271 0x00000080
     ffff88096393d788 0000000000000086 ffff88096393d7b8 ffffffff81060b13
     ffff88096393d738 ffffea003968ce50 000000000000000e ffff880caa713040
     ffff8801688b0638 ffff88096393dfd8 000000000000fbc8 ffff8801688b0640
    
    Call Trace:
     [<ffffffff81060b13>] ? perf_event_task_sched_out+0x33/0x70
     [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
     [<ffffffff810686da>] __cond_resched+0x2a/0x40
     [<ffffffff81528300>] _cond_resched+0x30/0x40
     [<ffffffff81169505>] compact_checklock_irqsave+0x65/0xd0
     [<ffffffff81169862>] compaction_alloc+0x202/0x460
     [<ffffffff811748d8>] ? buffer_migrate_page+0xe8/0x130
     [<ffffffff81174b4a>] migrate_pages+0xaa/0x480
     [<ffffffff81169660>] ? compaction_alloc+0x0/0x460
     [<ffffffff8116a1a1>] compact_zone+0x581/0x950
     [<ffffffff8116a81c>] compact_zone_order+0xac/0x100
     [<ffffffff8116a951>] try_to_compact_pages+0xe1/0x120
     [<ffffffff8112f1ba>] __alloc_pages_direct_compact+0xda/0x1b0
     [<ffffffff8112f80b>] __alloc_pages_nodemask+0x57b/0x8d0
     [<ffffffff81167b9a>] alloc_pages_vma+0x9a/0x150
     [<ffffffff8118337d>] do_huge_pmd_anonymous_page+0x14d/0x3b0
     [<ffffffff8152a116>] ? rwsem_down_read_failed+0x26/0x30
     [<ffffffff8114b350>] handle_mm_fault+0x2f0/0x300
     [<ffffffff810ae950>] ? wake_futex+0x40/0x60
     [<ffffffff8104a8d8>] __do_page_fault+0x138/0x480
     [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
     [<ffffffff81527910>] ? thread_return+0x4e/0x76e
     [<ffffffff8152d45e>] do_page_fault+0x3e/0xa0
     [<ffffffff8152a815>] page_fault+0x25/0x30

From this call stack, we can see that the Java thread was requesting a THP (do_huge_pmd_anonymous_page). A THP, or Transparent Huge Page, is a contiguous 2MB block of physical memory. Because no contiguous 2MB region was available in physical memory at that moment, direct compaction was triggered. The memory compaction process works as follows:

This process is not complicated. During compaction, the thread will scan the used movable pages from front to back, and then scan the free pages from back to front. After the scanning is complete, these movable pages will be relocated to the free pages, resulting in a contiguous block of 2MB physical memory, which allows THP to successfully allocate memory.

This direct compaction is time-consuming. On top of that, in the 2.6.32 kernel it requires taking coarse-grained locks that can be heavily contended, so while compacting, the thread also periodically checks (_cond_resched) whether a higher-priority task needs to run and, if so, yields the CPU first, which stretches the execution time even further. This is why sys utilization stayed high. You can also see these details in the comments in the kernel source code:

/*
 * Compaction requires the taking of some coarse locks that are potentially
 * very heavily contended. Check if the process needs to be scheduled or
 * if the lock is contended. For async compaction, back out in the event
 * if contention is severe. For sync compaction, schedule.
 * ...
 */
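If you suspect direct compaction but cannot catch a stack in time, /proc/vmstat also keeps counters for it; checking whether they grow during the spikes is a cheap way to confirm the suspicion (the exact set of counter names varies between kernel versions):

$ grep compact /proc/vmstat
# e.g. compact_stall   - how many times a process stalled in direct compaction
#      compact_fail    - stalls that still failed to produce a contiguous block
#      compact_success - stalls that succeeded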

After finding the cause, in order to quickly resolve these issues in the production environment, we disabled THP on the business server. After disabling THP, the system became stable, and there have been no more issues with high sys utilization. You can use the following command to disable THP:

$ echo never > /sys/kernel/mm/transparent_hugepage/enabled
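If you want to check the current policy on your own machines first, or control the related defragmentation behaviour, the knobs live in the same sysfs directory (the set of options shown differs between kernel versions; the bracketed word is the active one):

$ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
$ cat /sys/kernel/mm/transparent_hugepage/defrag

To keep THP disabled across reboots, you can also pass transparent_hugepage=never on the kernel boot command line.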

After disabling THP in the production environment, we also evaluated its performance impact on this business in an offline test environment. We found that THP did not bring any significant performance improvement, even when there was no memory pressure and memory compaction was never triggered. This got me thinking: what kind of business is THP actually suitable for?

This brings us to the purpose of THP. In short, THP lets one page table entry map a much larger block of memory (a huge page). Fewer page faults are needed to map the same amount of memory, and the TLB hit rate improves because fewer page table entries are required. If the data the process needs to access all sits within such a huge page, the huge page stays hot: it remains cached, and its page table entry stays in the TLB, which, as we know from the earlier lesson on the storage hierarchy, improves performance. Conversely, if the application has poor data locality and keeps randomly touching data scattered across many different huge pages within a short time, the advantage of huge pages disappears.
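One rough way to judge whether a workload might benefit from huge pages is to measure its TLB miss rate before changing anything. Here is a sketch using perf; the events shown are generic cache events, so check perf list on your machine for what is actually supported, and <pid> is a placeholder for the target process:

$ perf stat -e dTLB-loads,dTLB-load-misses -p <pid> -- sleep 10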

Therefore, when optimizing a business with huge pages, the first thing to consider is its data locality: try to gather the business’s hot data together so that it can fully benefit from huge pages. For example, during my time at Huawei, when we optimized performance with huge pages, we would gather the hot data of the business together, place that hot data on huge pages, and then compare against the same workload without huge pages. We found this brought more than a 20% performance improvement, and on architectures with a small TLB (such as MIPS) more than 50%. We also made many huge-page optimizations to the kernel code along the way, but I won’t go into those here.

Regarding the use of THP, here are a few suggestions:

  • Do not configure /sys/kernel/mm/transparent_hugepage/enabled as “always”; configure it as “madvise” instead. If you are unsure how to configure it, set it to “never.”
  • If you want to optimize your business with THP, it is best to modify the business code so that it explicitly marks which data should use huge pages; after all, the business knows its own data access patterns best. A minimal sketch of this approach follows this list.
  • If modifying the business code is too cumbersome, then optimize the kernel’s THP code instead.
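Here is a minimal sketch of the madvise-based approach mentioned above: switch the global policy to madvise so that only memory regions the application has opted in to via madvise(MADV_HUGEPAGE) are backed by huge pages, and then check how much of a given process is actually using THP (<pid> is a placeholder):

$ echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
$ grep AnonHugePages /proc/<pid>/smaps | awk '{sum += $2} END {print sum " kB"}'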

Okay, this is the end of the lesson.

Class Summary #

Let’s recap the key points of this class:

  • Refine CPU utilization monitoring. When the CPU utilization is high, you need to examine which specific indicator is high.
  • sysrq is a powerful tool for analyzing high CPU utilization in kernel mode, as well as for analyzing many difficult kernel issues. You need to understand how to use it.
  • THP (Transparent Huge Pages) can bring performance improvements to your business, but it can also cause serious stability issues. It is best to use it with madvise. If you are unsure how to use it, it is better to disable it.

Homework #

There are three types of homework for this lesson. You can choose the one that suits you:

  • If you are an application developer, how can you observe how much memory in the system is backed by Transparent Huge Pages (THP)?
  • If you are a junior kernel developer, what pages can be migrated during compaction, and what pages cannot be migrated?
  • If you are a senior kernel developer, suppose you need to make a program’s code (text) segment use hugetlbfs. What do you think needs to be done?

Feel free to discuss with me in the comments.

Thank you for reading. If you found this lesson helpful, you are welcome to share it with your friends. See you in the next lecture.