14 User Information Query: How to Solve the Network Soft Interrupt Bottleneck #

Hello, I’m Gao Lou.

In this lesson, let’s continue with another interface: User Information Query. Through it, we will look at the impact of high network soft interrupts on TPS. This kind of judgment comes up in many performance projects, and the difficulty is that many people cannot connect soft interrupts to slow response times and reduced TPS. Today, I will walk you through solving this problem.

At the same time, I will also explain how to run a pure network-layer benchmark against the hardware configuration and software deployment, so that we can verify our judgment and propose a targeted optimization. The final effect of the optimization will be shown through a TPS comparison.

Pressure Data #

Let’s first look at the pressure data for the user information query. Since we are currently testing a single interface and the user information query requires a logged-in state, we first need to generate some token data and then drive the user information query interface.
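For context, here is a minimal sketch of this preparation from the command line; the login path, response field, and query path below are assumptions for illustration only, not the project’s actual API definition:

    # Log in once to obtain a token (endpoint and field names are assumed)
    TOKEN=$(curl -s -X POST "http://<gateway>/sso/login" \
              -d 'username=test0001&password=xxxxxx' | jq -r '.token')

    # Call the user information query interface with that token
    curl -s "http://<gateway>/member/info" -H "Authorization: Bearer ${TOKEN}"

In the real scenario, a batch of such tokens is generated in advance and parameterized into the pressure tool, so that each virtual user carries its own login state.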

After preparing the token data, the first user information query is as follows:

This run is just an exploratory test, and the long duration is deliberate, to make troubleshooting easier. From the graph above, the starting point of this interface is good, reaching around 750 TPS.

However, the performance bottleneck is also obvious: the response time keeps climbing as the number of pressure threads increases, while TPS has hit its ceiling. For such an interface, we could optimize it or leave it alone, because its current TPS already meets our requirements. Still, in the spirit of “living is all about making trouble”, we will analyze where its bottleneck lies.

Following the analytical approach we discussed earlier, let’s analyze this problem below.

See the Architecture Diagram #

From the link monitoring tool, we can pull out the architecture diagram, which is simple and direct, and we don’t need to draw it again. It’s really a must-know skill for lazy people.

From the diagram above, we can see that the path for user information query is User - Gateway - Member - MySQL.

You may ask: aren’t there also Redis, MongoDB, and Monitor in the diagram? Yes, and we need to keep them in mind. This interface uses Redis, so if Redis has a problem or becomes slow, we will have to analyze it. MongoDB is not used, so we don’t need to worry about it. The Monitor service is the Spring Boot Admin service, which we can ignore for now; we will come back to it when we need it.

Note that this step is an introduction for analysis to ensure that we will not get confused later.

Splitting Response Time #

In the scene data, we clearly see that the response time has slowed down, so we need to know where the slowness is. According to the architecture diagram above, we know the path of the user information query interface. Now we need to split this response time and see how long each segment takes.

If you have an APM tool, you can use it directly to check the time spent on each segment. If you don’t have one, that’s fine: as long as you can draw the architecture diagram and split the time, it doesn’t matter which method you use.

In addition, I want to mention that please don’t overly believe the advertisements of APM tool vendors. We still need to see the effectiveness. While pursuing technology, we also need to judge rationally whether it is really necessary.

Specifically, the split times are as follows:

  • User - Gateway

  • Time spent on Gateway

  • Gateway - Member

  • Time spent on Member

  • Member - DB

I have organized the above split times into our architecture diagram:

Architecture Diagram

Seeing this diagram, the idea becomes very clear, right? Based on the time split in the diagram, we clearly see that more time is spent on the Member service, so the next step is to focus on the Member service.

Global Monitoring and Analysis #

As usual, let’s start by looking at the global monitoring:

Among them, worker-8 is using the most CPU, so let’s start with that.

I want to emphasize that the mindset of global monitoring is not about what data we happen to look at, but about what data we should look at. At this point, you must have the complete set of global counters in mind. For example, for Kubernetes we need a map like this in our heads:

In other words, we need to list all the global monitoring counters and then check them one by one.
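To make this concrete, a first-pass sweep of the OS-level global counters on a single worker node might look like the following; each command covers one family of counters:

    # CPU split per core: us / sy / si / wa / id
    top -b -n 1 | head -n 15

    # Run queue, context switches, memory and swap activity
    vmstat 1 5

    # Disk utilization, await and throughput (from the sysstat package)
    iostat -x 1 3

    # Per-interface network throughput and packet rates
    sar -n DEV 1 3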

Actually, it’s not just about listing them; there also needs to be a corresponding logic behind them. How do we build that logic? It depends on the performance analyst’s foundational knowledge. I often say that to do comprehensive performance analysis, you must have a solid and broad foundation in computer fundamentals. I drew a diagram of this before, and I have now revised it, as shown below:

The content in this picture is something we often encounter in performance analysis. Some people might say that these things are beyond the scope of skills for performance engineers. So I want to emphasize again that what I have been talking about is performance engineering. In the performance analysis of a project, I do not limit the scope of technology. As long as it is applicable, we need to use it for analysis.

Earlier we saw that the CPU on worker-8 is the most heavily used, so let’s check whether the service under test, the Member service, is actually running on worker-8.

From the above graph, it’s clear that the Member service is indeed running on worker-8.
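If you prefer to confirm the pod placement from the command line rather than from a monitoring screen, something like the following works (the grep pattern and the presence of metrics-server are assumptions about the environment):

    # Show which node each pod is scheduled on
    kubectl get pods -o wide -A | grep member

    # Per-node CPU and memory usage (requires metrics-server)
    kubectl top nodes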

The next step is to go into this node and take a look. If we find that all the CPU consumption is due to user CPU (us), then I think we can end this benchmark test. Because for an application, it is reasonable for us CPU to be high.

Someone doing third-party testing once came to me and said that the client didn’t like to see high CPU usage and had asked him to find ways to lower it. He had no idea how to do it, so he came to me for advice.

I asked him what the goal of the test was. He replied that the client didn’t care about TPS; they just wanted the CPU usage reduced. I told him it was simple: just reduce the load and the CPU will naturally come down. I meant it as a sarcastic remark, but he actually did it, and the client accepted it! Reflecting on it later, I realized I had steered the performance industry in the wrong direction.

From a professional perspective, when dealing with clients who don’t understand, it’s best to communicate in a language that they can understand. However, we should not compromise when we shouldn’t. This is the value of professionalism, not giving the clients whatever they ask for.

Now let’s take a look at the top data on this node:

[root@k8s-worker-8 ~]# top
top - 02:32:26 up 1 day, 13:56,  3 users,  load average: 26.46, 22.37, 14.54
Tasks: 289 total,   1 running, 288 sleeping,   0 stopped,   0 zombie
%Cpu0  : 73.9 us,  9.4 sy,  0.0 ni,  3.5 id,  0.0 wa,  0.0 hi, 12.5 si,  0.7 st
%Cpu1  : 69.8 us, 12.5 sy,  0.0 ni,  4.3 id,  0.0 wa,  0.0 hi, 12.8 si,  0.7 st
%Cpu2  : 71.5 us, 12.7 sy,  0.0 ni,  4.2 id,  0.0 wa,  0.0 hi, 10.9 si,  0.7 st
%Cpu3  : 70.3 us, 11.5 sy,  0.0 ni,  6.1 id,  0.0 wa,  0.0 hi, 11.5 si,  0.7 st
KiB Mem : 16266296 total,  3803848 free,  6779796 used,  5682652 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  9072592 avail Mem 
    
PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
30890 root      20   0 7791868 549328  15732 S 276.1  3.4  23:17.90 java -Dapp.id=svc-mall-member -javaagent:/opt/skywalking/agent/sky+
18934 root      20   0 3716376   1.6g  18904 S  43.9 10.3 899:31.21 java -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitFor+
1059 root      20   0 2576944 109856  38508 S  11.1  0.7 264:59.42 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-
1069 root      20   0 1260592 117572  29736 S  10.8  0.7 213:48.18 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.
15018 root      20   0 5943032   1.3g  16496 S   6.9  8.6 144:47.90 /usr/lib/jvm/java-1.8.0-openjdk/bin/java -Xms2g -Xmx2g -Xmn1g -Dna+
4723 root      20   0 1484184  43396  17700 S   5.9  0.3  89:53.45 calico-node -felix

In this output, we can see that the soft interrupt CPU (si) is above 10%, and that is only an instantaneous value; in the continuously fluctuating data there are many moments when it is even higher, which means si CPU consumption is on the high side. This is the kind of data we need to pay attention to.

Directed Monitoring and Analysis #

Let’s take a closer look at the changes in soft interrupts. Since soft interrupts consume CPU, we need to examine the soft interrupt counter:

Soft Interrupt Counter

The image above is a snapshot of one moment; in actual observation we need to watch over a longer period. Pay attention to the numbers with a white background in the image: the faster these values grow, the more interrupts are being handled. In my observation, the fastest-growing counter was NET_RX.
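If you want to reproduce this observation yourself, watching /proc/softirqs directly is enough; watch -d highlights the counters that changed between refreshes:

    # Refresh every second and highlight the changing softirq counters;
    # focus on the NET_RX and NET_TX rows
    watch -d -n 1 cat /proc/softirqs

    # Per-CPU si usage over time (from the sysstat package)
    mpstat -P ALL 1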

Note, however, that the NET_RX counter keeps increasing even under normal conditions, so by itself it cannot tell us whether the load is reasonable; we have to look at it together with the si CPU. Also, under network traffic it is not only soft interrupts that keep increasing, hard interrupts do too.

From the image above, we can see that the network interrupts are evenly balanced across the CPUs, so this is not a single-queue NIC problem. Let’s also look at the network bandwidth.

Network Bandwidth

Only about 50 Mb of bandwidth is being used, yet si CPU has already reached 10%. The bandwidth is nowhere near saturated while interrupts are already high, which indicates that our traffic consists mostly of small packets.
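To double-check the small-packet hypothesis, it is enough to compare the packet rate with the throughput on the node: a high rxpck/s combined with a low rxkB/s means the average packet is small. The interface-name pattern in the grep below is an assumption about the environment:

    # Per-interface throughput (rxkB/s) and packet rate (rxpck/s)
    sar -n DEV 1

    # Check how NIC interrupts are spread across CPUs; a single-queue NIC
    # would pile all the counts onto one CPU
    grep -iE 'eth|ens|virtio' /proc/interrupts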

Therefore, we make the following adjustments. The adjustment direction is to increase the queue length and buffer size of the network, so that the application can receive more packets.

    # Increase the network device backlog queue length
    net.core.netdev_max_backlog = 10000    # original value: 1000

    # Increase the number of packets a device can process in one polling round
    net.core.dev_weight = 128              # original value: 64

    # Microseconds a socket busy-polls the device queue for incoming packets
    net.core.busy_poll = 100               # original value: 0

    # Increase the default and maximum socket receive buffer sizes
    net.core.rmem_default = 2129920        # original value: 212992
    net.core.rmem_max = 2129920            # original value: 212992

    # Increase the Tomcat accept queue length to 10000 (original value: 1000)
    server:
      port: 8083
      tomcat:
        accept-count: 10000
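For reference, the kernel-level parameters above can be applied and made persistent in the usual sysctl way; the file name below is an arbitrary choice:

    # Apply the kernel-level tuning at runtime on the worker node (run as root)
    sysctl -w net.core.netdev_max_backlog=10000
    sysctl -w net.core.dev_weight=128
    sysctl -w net.core.busy_poll=100
    sysctl -w net.core.rmem_default=2129920
    sysctl -w net.core.rmem_max=2129920

    # To persist them, put the same key = value lines into
    # /etc/sysctl.d/99-net-tuning.conf and run: sysctl --system

    # Verify that the values took effect
    sysctl net.core.netdev_max_backlog net.core.rmem_max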

After a series of vigorous operations, we were full of hope. However, after checking the TPS curve again, we found that it didn’t really help. Let’s sing the song “Cool” together.

I thought carefully about the logic of sending and receiving data. Since the high us CPU comes from the upper-layer application and the high si CPU comes from network interrupts, we still need to start from the network layer. So I set out to verify what network bandwidth can actually be achieved. First, let me list the current hardware configuration.

Hardware Configuration

We directly tested the network using iperf3. The experimental setup is as follows:

Network Test

From the above data, we can see a significant difference in si between pure network tests at different layers. After the traffic passes through the KVM + Kubernetes + Docker stack, the network loss is surprisingly high, and si CPU also increases significantly.
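For reference, the point-to-point tests at each layer were of this general shape; the server address and write sizes below are placeholders, not the exact commands used:

    # On the receiving side (bare metal, VM, or inside a pod):
    iperf3 -s -p 5201

    # On the sending side: run for 30 seconds, report every 5 seconds
    iperf3 -c <server-ip> -p 5201 -t 30 -i 5

    # Small-packet variant: shrink the write size to stress the interrupt path
    iperf3 -c <server-ip> -p 5201 -t 30 -l 256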

This also explains why many companies are giving up virtualization and directly using physical machines to run Kubernetes.

Since the current Kubernetes cluster uses IPIP mode in the Calico plugin, and BGP mode may be more efficient, we changed IPIP mode to BGP. This step is also aimed at reducing the soft interrupts generated by receiving network packets.

So, what is the difference between IPIP and BGP? IPIP encapsulates the IP packet inside another IP packet, that is, it uses an extra IP layer as a tunnel; normally IP sits directly on top of MAC and needs no such tunnel. BGP, a path-vector routing protocol, instead achieves reachability by maintaining routing tables, so no extra encapsulation layer is needed. If you are not familiar with the underlying differences between IPIP and BGP, I suggest studying the relevant networking fundamentals on your own.
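As a rough sketch of the switch itself, assuming calicoctl is installed and the default pool is named default-ipv4-ippool (names differ between clusters), the change boils down to turning off IPIP encapsulation on the IP pool so that routes are exchanged purely over BGP:

    # Inspect the current IP pool configuration
    calicoctl get ippool default-ipv4-ippool -o yaml

    # Switch ipipMode from Always to Never (pure BGP routing) and re-apply
    calicoctl get ippool default-ipv4-ippool -o yaml \
      | sed 's/ipipMode: Always/ipipMode: Never/' \
      | calicoctl apply -f -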

After changing the IPIP mode to BGP, let’s first test the difference in pure network transmission efficiency without application pressure:

Network Test Results

Based on the above test results, the specific values of bandwidth in different network modes and packet sizes are summarized as follows:

Bandwidth Summary

It can be seen that the network capability of BGP is indeed stronger. The difference is significant.

Next, let’s continue testing the interface, and the results are as follows:

Interface Test Results

Let’s also check the soft interrupts and see whether they have decreased under BGP mode:

    top - 22:34:09 up 3 days, 55 min,  2 users,  load average: 10.62, 6.18, 2.76
    Tasks: 270 total,   2 running, 268 sleeping,   0 stopped,   0 zombie
    %Cpu0  : 51.6 us, 11.5 sy,  0.0 ni, 30.0 id,  0.0 wa,  0.0 hi,  6.6 si,  0.3 st
    %Cpu1  : 54.4 us,  9.4 sy,  0.0 ni, 28.2 id,  0.0 wa,  0.0 hi,  7.7 si,  0.3 st
    %Cpu2  : 55.9 us, 11.4 sy,  0.0 ni, 26.9 id,  0.0 wa,  0.0 hi,  5.9 si,  0.0 st
    %Cpu3  : 49.0 us, 12.4 sy,  0.0 ni, 32.8 id,  0.3 wa,  0.0 hi,  5.2 si,  0.3 st
    KiB Mem : 16266296 total,  7186564 free,  4655012 used,  4424720 buff/cache
    KiB Swap:        0 total,        0 free,        0 used. 11163216 avail Mem

Optimization Effects #

From the adjustments made above, we can see that the soft interrupts have indeed been reduced significantly. However, we still want to see how this optimization affects TPS, so let’s look at the TPS after the optimization.

TPS Optimization Result

There is a decrease in si CPU usage:

si CPU Usage

Summary #

When an interface already meets the business requirements, from a cost perspective we shouldn’t keep spending time tinkering with it. From a technical perspective, however, we should understand the performance of each interface to the point of “knowing where the final bottleneck is”. This makes further optimization in subsequent work much easier.

In this lesson’s example, we started the analysis from the si CPU; by examining soft interrupts and running pure network tests, we pinpointed the Kubernetes network mode and then chose a more reasonable one. The whole process went through a long chain, and this kind of reasoning is what I keep calling the “chain of evidence” in my lectures.

Finally, I want to emphasize again that performance analysis must have a chain of evidence. Without a chain of evidence, performance analysis is just being a rogue. We need to be honest and upright drivers.

Homework #

I have two questions for you to think about:

  1. Why do we think of testing pure network bandwidth when we see a high NET_RX interrupt?
  2. Can you summarize the evidence chain of the case study in this lesson?

Remember to discuss and exchange your thoughts with me in the comments section. Every thought you have will help you progress further.

If you found this article helpful, feel free to share it with your friends and learn and improve together. See you in the next lesson!