19 Case Study: Should a Business with High Network Packet Throughput Enable NIC Features? #

Hello, I’m Shaoyafang.

Based on our discussion of CPU utilization in the previous lesson, you already know that an application should minimize CPU overhead elsewhere and spend its CPU time executing user code: the higher the usr utilization, the more CPU time goes to the application itself, while a low usr means the CPU is not being used efficiently for the application. In Lesson 18 we also looked at cases where CPU time was wasted in sys. In today's lesson, we will look at business performance degradation caused by the CPU spending too much time in softirq, a problem we often encounter in production environments. I will walk through a relevant case study and the common observation methods for this type of problem.

How do interrupts and business processes interfere with each other? #

This is a case I encountered a few years ago. The business team reported that, in order to improve Queries per Second (QPS), they had enabled Receive Packet Steering (RPS) to simulate a multi-queue NIC. Unexpectedly, enabling RPS led to a significant drop in QPS, and they did not know why.
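
For context, RPS is enabled per receive queue by writing a CPU bitmask to the rps_cpus file under sysfs. The following is a minimal sketch; the interface name eth0 and the bitmask f (CPUs 0-3) are illustrative and need to be adapted to your system:

# Allow CPUs 0-3 (hex bitmask f) to process packets arriving on receive
# queue 0 of eth0; eth0 and the mask are illustrative values.
$ echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus

# Writing 0 disables RPS for that queue again.
$ echo 0 > /sys/class/net/eth0/queues/rx-0/rps_cpus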

In fact, performance regressions caused by a specific change are relatively easy to analyze. The simplest approach is to compare performance data from before and after the change. Even if you are not clear about what RPS is or the mechanism behind it, you can collect the relevant performance indicators for comparison and narrow down where the problem lies. These indicators include CPU, memory, I/O, and network metrics, and we can use dstat to observe how they change.

Performance indicators before the business enabled RPS:

$ dstat 
You did not select any stats, using -cdngy by default.
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw 
 64  23   6   0   0   7|   0  8192B|7917k   12M|   0     0 |  27k 1922 
 64  22   6   0   0   8|   0     0 |7739k   12M|   0     0 |  26k 2210 
 61  23   9   0   0   7|   0     0 |7397k   11M|   0     0 |  25k 2267 

Performance indicators after enabling RPS:

$ dstat 
You did not select any stats, using -cdngy by default.
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw 
 62  23   4   0   0  12|   0     0 |7096k   11M|   0     0 |  49k 2261 
 74  13   4   0   0   9|   0     0 |4003k 6543k|   0     0 |  31k 2004 
 59  22   5   0   0  13|   0  4096B|6710k   10M|   0     0 |  48k 2220 

We can see that after enabling RPS, CPU utilization rises: siq (software interrupt utilization) increases significantly, and int (the hardware interrupt rate) also increases noticeably. Meanwhile, the net columns show that network throughput has dropped. In other words, network throughput fell while both hardware and software interrupts rose, so we can infer that the drop in throughput is most likely caused by the increase in interrupts.

Next, we need to analyze what types of software and hardware interrupts have increased in order to find the source of the problem.

There are many hardware interrupts in the system, and we can check the occurrence frequency of these interrupts using the /proc/interrupts file:

$ cat /proc/interrupts 

If you want to know the details of a specific interrupt, such as interrupt affinity, you can check it using /proc/irq/[irq_num], for example:

$ cat /proc/irq/123/smp_affinity
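
As a hedged example, smp_affinity holds a hexadecimal CPU bitmask, and writing a new mask changes which CPUs the interrupt may be delivered to. The IRQ number 123 here is only illustrative; use a number that actually appears in /proc/interrupts:

# Read the current affinity mask (bit N set means CPU N may handle IRQ 123).
$ cat /proc/irq/123/smp_affinity

# Restrict IRQ 123 to CPU1 only (bitmask 0x2).
$ echo 2 > /proc/irq/123/smp_affinity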

Software interrupts can be viewed using /proc/softirqs:

$ cat /proc/softirqs 

Of course, you can also write a script to track how often the various hardware and software interrupts fire, which makes it easier to spot which interrupt is occurring too frequently. You probably have some understanding of the difference between hardware and software interrupts: softirqs handle the work that a hardware interrupt handler cannot finish in a short time. Because hardware interrupt handlers run very briefly, hardware interrupts generally do not have much impact on the business unless they fire extremely frequently. The reverse, however, happens easily: processes can interfere with hardware interrupts, because there are many places in the kernel where interrupts are disabled. For example, if a process keeps interrupts disabled for too long, network packets may not be processed in time, causing business performance to fluctuate.
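
Here is a minimal sketch of such a script. It samples /proc/softirqs once per second and prints how many NET_RX softirqs fired across all CPUs in that second; the same idea works for /proc/interrupts:

#!/bin/bash
# Print the per-second increase of NET_RX softirqs, summed over all CPUs.
prev=0
while true; do
    # Sum the NET_RX counters of every CPU (field 1 is the "NET_RX:" label).
    cur=$(awk '/NET_RX/ {for (i = 2; i <= NF; i++) s += $i} END {print s}' /proc/softirqs)
    [ "$prev" -ne 0 ] && echo "NET_RX softirqs/s: $((cur - prev))"
    prev=$cur
    sleep 1
done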

We have encountered such a case in production, where interrupts being disabled for too long caused performance fluctuations: the interrupt-disabling logic in the cat /proc/slabinfo command caused response-time (RT) jitter. That command counts the slab objects in the system before displaying them, and interrupts are disabled while the counting takes place. If there are too many slabs in the system, interrupts stay disabled for too long, network packet processing is blocked, and ping latency rises noticeably. Therefore, in production environments you should avoid collecting /proc/slabinfo as much as possible, otherwise it may cause business jitter.

Because of this danger, the access permissions of /proc/slabinfo were changed from 0644 (as in kernel 2.6.32) to 0400 in later kernels, so only the root user can read it, which avoids the problem to some extent. If your system is still running a 2.6.32 kernel, you need to pay special attention to this issue.

If you want to analyze problems caused by interrupts being disabled for too long, a simple approach is to use ftrace's irqsoff tracer. It can measure not only how long interrupts stay disabled but also where they were disabled. Note that irqsoff depends on the CONFIG_IRQSOFF_TRACER kernel configuration option; if your kernel was not built with it, you will need to use other tracing methods.

So how do you use irqsoff? First, you need to check if your system supports the irqsoff tracer:

$ cat /sys/kernel/debug/tracing/available_tracers

If the displayed content includes irqsoff, it means that the system supports this function and you can enable it for tracing:

$ echo irqsoff > /sys/kernel/debug/tracing/current_tracer

Next, you can observe the irqsoff events in the system through the /sys/kernel/debug/tracing/trace_pipe and trace files.
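
As a hedged sketch of a typical irqsoff workflow (assuming debugfs is mounted at /sys/kernel/debug):

# Longest interrupts-disabled interval recorded so far, in microseconds.
$ cat /sys/kernel/debug/tracing/tracing_max_latency

# The trace file shows the call path that kept interrupts disabled the longest.
$ cat /sys/kernel/debug/tracing/trace

# Switch back to the nop tracer when you are done.
$ echo nop > /sys/kernel/debug/tracing/current_tracer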

Compared with hardware interrupts, software interrupts run for a longer time, and they preempt the CPU from the currently running process, forcing it to wait while the softirq executes. Softirqs are therefore more likely to introduce latency into the business. So how do we observe the frequency and execution time of software interrupts? You can use the following two tracepoints:

/sys/kernel/debug/tracing/events/irq/softirq_entry
/sys/kernel/debug/tracing/events/irq/softirq_exit

These two tracepoints mark the entry and exit of a software interrupt; subtracting the entry timestamp from the exit timestamp gives that softirq's execution time. As for how to analyze the data collected from tracepoints, we have discussed that several times in previous lessons, so I will not repeat it here.
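
A minimal sketch of enabling these tracepoints with ftrace; matching each softirq_exit event with the preceding softirq_entry event of the same vector on the same CPU gives the execution time:

# Enable the softirq entry/exit tracepoints.
$ echo 1 > /sys/kernel/debug/tracing/events/irq/softirq_entry/enable
$ echo 1 > /sys/kernel/debug/tracing/events/irq/softirq_exit/enable

# Read the events as they arrive; each line carries a timestamp, the softirq
# vector, and its name (e.g. NET_RX).
$ cat /sys/kernel/debug/tracing/trace_pipe

# Disable the tracepoints when you are done.
$ echo 0 > /sys/kernel/debug/tracing/events/irq/softirq_entry/enable
$ echo 0 > /sys/kernel/debug/tracing/events/irq/softirq_exit/enable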

If your kernel version is relatively new and supports eBPF, you can also use the softirqs tool in BCC (tools/softirqs.py). It counts the number of software interrupts and their execution time, which makes it convenient to analyze the business latency they introduce.
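
For example (the install path and options below are typical for a BCC package but may differ on your distribution):

# Total time spent in each softirq type, printed every second, 10 times.
$ /usr/share/bcc/tools/softirqs 1 10

# Add -d to print the execution-time distribution as histograms instead.
$ /usr/share/bcc/tools/softirqs -d 10 1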

To prevent softirqs from consuming too much CPU time and starving processes, the kernel introduced the ksoftirqd mechanism: if the pending softirqs cannot all be handled in a short period, the kernel wakes up ksoftirqd to handle the rest. ksoftirqd is a per-CPU kernel thread; each CPU has its own, so CPU1, for example, runs the ksoftirqd/1 thread. After being woken up, ksoftirqd/1 checks the softirq vector table and processes the pending softirqs. If you use ps to check the priority of ksoftirqd/1, you will find that it is just an ordinary thread (its nice value is 0):

$ ps -eo "pid,comm,ni" | grep softirqd
    9 ksoftirqd/0       0
   16 ksoftirqd/1       0
   21 ksoftirqd/2       0
   26 ksoftirqd/3       0
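
If you suspect that softirqs are being deferred to ksoftirqd, a quick way to confirm it is to watch how much CPU those threads consume, for example with pidstat from the sysstat package (a hedged sketch):

# Report CPU usage of all tasks whose command name contains "ksoftirqd",
# sampled once per second.
$ pidstat -C ksoftirqd 1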

In summary, there is still much room for improvement in the kernel’s handling of software interrupts.

How does softirq affect business? #

After observing the hardware and software interrupts, we found that enabling RPS (Receive Packet Steering) increases the number of CAL (Function Call Interrupts) hardware interrupts. CAL is a way to trigger a hardware interrupt from software, specifying which CPU should run which interrupt handler. It is the mechanism behind inter-processor interrupts (IPIs): one CPU uses a CAL interrupt to ask another CPU to execute a specific interrupt handler.

If you are familiar with the mechanism of RPS, you should know that RPS uses CAL to allow other CPUs to receive network packets. To verify this, we can observe the CPU utilization using the mpstat command.

Here is the CPU utilization before enabling RPS:

$ mpstat -P ALL 1
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   70.18    0.00   19.28    0.00    0.00    5.86    0.00    0.00    0.00    4.68
Average:       0   73.25    0.00   21.50    0.00    0.00    0.00    0.00    0.00    0.00    5.25
Average:       1   58.85    0.00   14.46    0.00    0.00   23.44    0.00    0.00    0.00    3.24
Average:       2   74.50    0.00   20.00    0.00    0.00    0.00    0.00    0.00    0.00    5.50
Average:       3   74.25    0.00   21.00    0.00    0.00    0.00    0.00    0.00    0.00    4.75

And here is the CPU utilization after enabling RPS:

$ mpstat -P ALL 1
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   66.21    0.00   17.73    0.00    0.00   11.15    0.00    0.00    0.00    4.91
Average:       0   68.17    0.00   18.33    0.00    0.00    7.67    0.00    0.00    0.00    5.83
Average:       1   60.57    0.00   15.81    0.00    0.00   20.80    0.00    0.00    0.00    2.83
Average:       2   69.95    0.00   19.20    0.00    0.00    7.01    0.00    0.00    0.00    3.84
Average:       3   66.39    0.00   17.64    0.00    0.00    8.99    0.00    0.00    0.00    6.99

We can see that after enabling RPS, softirq processing is spread more evenly across the CPUs. Previously only CPU 1 was handling softirqs; with RPS enabled, every CPU handles them, and the softirq utilization on CPU 1 drops somewhat. This is exactly what RPS is for: balancing softirq processing across CPUs so that no single CPU becomes a bottleneck. You can also see that, overall, enabling RPS increases the total %soft utilization compared to before.

In theory, with more CPUs processing softirqs, more network packets should be handled per unit of time and overall system throughput should improve. In our case, however, enabling RPS caused business QPS to drop.

The answer can also be found in the CPU utilization. Before enabling RPS, the CPU utilization was already high, exceeding 90%. This means that the CPU was already overloaded. Enabling RPS consumes additional CPU time to simulate NIC multi-queue features, which further burdens the CPU and squeezes the processing time for user processes. Therefore, we can see a decrease in the %usr utilization after enabling RPS.

%usr indicates the share of time spent executing user processes: the higher %usr is, the more time goes to running business code. If %usr drops even though the business code has not been optimized, less time is being spent executing business logic, and that is a dangerous signal.

From this, we can conclude that the essence of RPS is to implement a NIC feature (multi-queue) on the CPU, sacrificing CPU time to improve network throughput. If your system is already busy, enabling it only makes a bad situation worse. So bear in mind that using RPS requires the system's overall CPU utilization to not be too high.

After identifying the problem, we disabled RPS on the system. If you have a newer NIC, it may support hardware multi-queue. Hardware multi-queue balances the load inside the NIC itself, and in such a scenario it can be very helpful. While RPS moves a NIC feature up onto the CPU, the opposite direction is offload: pushing CPU work down to the NIC so that the CPU has more time to execute user code. NIC offload features, however, are beyond the scope of this article.
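
As a hedged sketch, you can use ethtool to check whether the NIC supports hardware multi-queue and how many queues are currently enabled; eth0 and the queue count are illustrative:

# Show the maximum and currently configured number of hardware queues (channels).
$ ethtool -l eth0

# Enable 8 combined queues if the hardware supports it.
$ ethtool -L eth0 combined 8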

That concludes this lesson.

Lesson Summary #

Let's briefly review the key points of this lesson:

  • Hard interrupts, soft interrupts, and the kernel thread ksoftirqd: these are the places where business jitter most easily arises, and you need to know how to observe them.
  • The impact of hard interrupts on the business mainly comes from how frequently they fire; conversely, hard interrupts are themselves easily delayed by threads, because there are many places in the kernel where interrupts are disabled.
  • If a softirq handler runs for too long, it delays user threads. Avoid time-consuming softirq handlers in your system; if they exist, optimize them.
  • ksoftirqd runs at the same priority as ordinary user threads, so softirqs that are deferred to ksoftirqd may themselves be delayed.
  • The essence of RPS is implementing a NIC feature (multi-queue) on the CPU, sacrificing CPU time to improve network throughput. Evaluate whether to enable it based on your business scenario; if your NIC supports hardware multi-queue, use that directly instead.

Homework #

There are two types of homework for this lesson. You can choose according to your own situation.

  • Beginner:

What happens if software interrupts and hardware interrupts are each disabled for too long?

  • Advanced:

If you wanted to track how long a network packet stays in the kernel buffer before the application reads it, how would you go about tracing it?

Feel free to discuss with me in the comments section.

Thank you for reading. If you found this lesson helpful, please feel free to share it with your friends. See you in the next lesson.