
12 Opening the Homepage (Part 2): How to Balance and Utilize Hardware Resources #

Hello, I’m Gao Lou.

Regarding the performance issue with the open-homepage interface, in the last lesson we determined that the Gateway was consuming a large share of the response time, almost 100 milliseconds. So we set out to locate where that time was being spent on the Gateway.

In the first phase, we focused on the host where the application runs and found that the host runs a total of four virtual machines. In the second phase, we checked the CPU mode of the physical machine and tried to improve performance by changing the CPU running mode. However, the problem was still not resolved: TPS did not improve, and the response time remained long.

In today’s lesson, we enter the third phase and continue analyzing other bottlenecks, such as high wa CPU, unbalanced resource utilization, and network bandwidth. Among these, balanced resource utilization is an aspect of performance analysis that is often overlooked yet extremely important: we usually focus on the counters showing abnormal values rather than on how resources should be allocated in the first place.

In our case, the system uses Kubernetes (k8s) to manage resources, so we must pay attention to balancing resource utilization, to avoid a situation where poorly performing services are allocated the same resources as well-performing ones. In addition, network resources in k8s span multiple layers, so they deserve special attention as well.

While studying this lesson, I suggest you think more about the issue of balanced resource utilization. Now, let’s begin today’s class.

Locating the response time consumption on the Gateway #

Phase 3: High wa CPU on the NFS server #

Following our analysis logic, we still start by looking at the global monitoring data and work from global counters toward targeted ones, the “global-targeted” approach that is my usual sequence.

Therefore, let’s now take a look at the global monitoring counters and get the following view:

[root@lenvo-nfs-server ~]# top
top - 00:12:28 up 32 days,  4:22,  3 users,  load average: 9.89, 7.87, 4.71
Tasks: 217 total,   1 running, 216 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.0 us,  4.0 sy,  0.0 ni, 34.8 id, 61.2 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  4.7 sy,  0.0 ni, 27.8 id, 67.6 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  0.0 us,  6.1 sy,  0.0 ni,  0.0 id, 93.9 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  0.0 us,  7.6 sy,  0.0 ni,  3.4 id, 82.8 wa,  0.0 hi,  6.2 si,  0.0 st
KiB Mem :  3589572 total,    82288 free,   775472 used,  2731812 buff/cache
KiB Swap:  8388604 total,  8036400 free,   352204 used.  2282192 avail Mem

We can see that the wa CPU utilization is high; Cpu2’s wa has already exceeded 90%. We know that wa CPU is the percentage of CPU time spent waiting for IO to complete during reads and writes. So why is it so high now? Could it be that there are a lot of write operations?
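Before digging into the disk itself, you can also cross-check the per-core iowait over time with mpstat from the sysstat package; its %iowait column corresponds to the wa value shown by top:

# Per-CPU utilization, one sample per second, five samples
mpstat -P ALL 1 5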

At this point, we need to pay attention to the status of IO, as slow IO is definitely a performance problem. By using the iostat command, we can see the IO status as follows:

[root@lenvo-nfs-server ~]# iostat -x -d 1
Linux 3.10.0-693.el7.x86_64 (lenvo-nfs-server) 	Dec 26, 2020 	_x86_64_	(4 CPU)
..................
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00   94.00   39.00 13444.00 19968.00   502.44   108.43  410.80   52.00 1275.59   7.52 100.00


Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    18.00  137.00  173.00 17712.00 43056.00   392.05   129.46  601.10   38.80 1046.38   3.74 115.90
..................

As you can see, the disk utilization (%util) has reached 100%, which means the disk is saturated and IO is very slow.

Next, let’s check the Block Size and calculate whether the current IO is random or sequential. Although most operating systems default to a Block Size of 4096, it’s always better to double-check for peace of mind.

First, let’s determine the format of the disk:

[root@lenvo-nfs-server ~]# cat /proc/mounts
...................
/dev/sda5 / xfs rw,relatime,attr2,inode64,noquota 0 0
...................
[root@lenvo-nfs-server ~]#

With the above command, we can see that the disk is in XFS format. Now, let’s use the following command to view the Block Size:

[root@lenvo-nfs-server ~]# xfs_info /dev/sda5
meta-data=/dev/sda5              isize=512    agcount=4, agsize=18991936 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=75967744, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=37093, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
[root@lenvo-nfs-server ~]#

The result shows that the Block Size is 4096 bytes. From the iostat data we can also infer that the reads and writes are largely sequential rather than random. Let’s do a quick calculation to confirm the sequential-write behavior. If every write were purely random, that is, one 4 KB block per write request, then writing 43,056 KB per second would require:

\(\frac{43056 \times 1024}{4096} = 10{,}764\) writes per second

But in reality there were only 173 write operations per second, so the writes are indeed being merged and written sequentially.

The question is, how many blocks are written at a time?

\(\frac{43056 \times 1024}{173 \times 4096} \approx 62\) blocks per write

So roughly 62 blocks are written per operation. This tells us that the sequential-write capability is still decent. For conventional spinning disks, heavily random writes are significantly slower, while mostly sequential writes can be much faster.
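If you want to redo this arithmetic quickly, a throwaway shell calculation works; the figures below are simply the ones from the iostat and xfs_info output above:

# Sanity-check the block math using the iostat and xfs_info values above
WKBS=43056   # wkB/s reported by iostat
WS=173       # w/s reported by iostat
BS=4096      # block size reported by xfs_info
echo "writes needed if every write were one block: $(( WKBS * 1024 / BS )) per second"
echo "blocks merged into each actual write:        $(( WKBS * 1024 / (WS * BS) ))"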

Did you notice? Although the disk has decent sequential-write capability, there is still a considerable amount of IO wait. So next we need to find out which program is doing the writing. Here we use the iotop command:

Total DISK READ: 20.30 M/s | Total DISK WRITE: 24.95 M/s
Actual DISK READ: 20.30 M/s | Actual DISK WRITE: 8.27 M/s
TID  PRIO  USER  DISK READ  DISK WRITE  SWAPIN  IO    COMMAND
12180 be/4 root    2.39 M/s   16.01 M/s   0.00 % 35.94 % [nfsd]
12176 be/4 root    3.20 M/s    0.00 B/s   0.00 % 32.25 % [nfsd]
12179 be/4 root    3.03 M/s    6.43 M/s   0.00 % 32.23 % [nfsd]
12177 be/4 root    2.44 M/s  625.49 K/s   0.00 % 31.64 % [nfsd]
12178 be/4 root    2.34 M/s   1473.47 K/s  0.00 % 30.43 % [nfsd]
12174 be/4 root    2.14 M/s    72.84 K/s  0.00 % 29.90 % [nfsd]
12173 be/4 root    2.91 M/s   121.93 K/s  0.00 % 24.95 % [nfsd]
12175 be/4 root    1894.69 K/s  27.71 K/s  0.00 % 24.94 % [nfsd]
...............

We can see that the IO is generated by NFS. So where does the NFS traffic come from? From the traffic data below, we can see that it comes from the machines that have mounted NFS shares exported by this server. That was a deliberate choice when we deployed the application: since this machine has a large-capacity disk, we exported it over NFS and mounted it on multiple hosts so that none of them would run short of disk space.

                        191Mb               381Mb               572Mb               763Mb          954Mb
└───────────────────┴───────────────────┴───────────────────┴───────────────────┴───────────────────
172.16.106.119:nfs                   => 172.16.106.130:multiling-http        1.64Mb  2.04Mb  3.06Mb
                                     <=                                      26.2Mb  14.5Mb  19.8Mb
172.16.106.119:nfs                   => 172.16.106.100:apex-mesh             1.43Mb  2.18Mb  3.79Mb
                                     <=                                      25.5Mb  14.2Mb  14.4Mb
172.16.106.119:nfs                   => 172.16.106.195:vatp                   356Kb  1.27Mb  1.35Mb
                                     <=                                      9.71Mb  7.04Mb  7.41Mb
172.16.106.119:nfs                   => 172.16.106.56:815                    7.83Kb  4.97Kb  4.81Kb
                                     <=                                       302Kb   314Kb   186Kb
172.16.106.119:nfs                   => 172.16.106.79:device                 11.0Kb  7.45Kb  7.57Kb
                                     <=                                      12.4Kb  22.0Kb  28.5Kb
172.16.106.119:ssh                   => 172.16.100.201:cnrprotocol           2.86Kb  2.87Kb  5.81Kb
                                     <=                                       184b    184b    525b
169.254.3.2:60010                    => 225.4.0.2:59004                      2.25Kb  2.40Kb  2.34Kb
                                     <=                                         0b      0b      0b
169.254.6.2:60172                    => 225.4.0.2:59004                      2.25Kb  2.40Kb  2.34Kb
                                     <=                                         0b      0b      0b
172.16.106.119:nfs                   => 172.16.106.149:986                      0b   1.03Kb   976b
                                     <=                                         0b   1.26Kb  1.11Kb


───────────────────────────────────────────────────────────────────────────────────────────────────
TX:             cum:   37.0MB   peak:   31.9Mb                      rates:   3.44Mb  5.50Mb  8.22Mb
RX:                     188MB            106Mb                               61.8Mb  36.2Mb  41.8Mb
TOTAL:                  225MB            111Mb                               65.2Mb  41.7Mb  50.1Mb

From Total DISK WRITE and Total DISK READ, we can see that this machine can only sustain roughly 20 MB/s of reads and writes. There is not much we can do about that: since this machine’s disk throughput is limited, we have to give up the idea of funneling all writes through it. However, to make sure that its IO capability does not become a bottleneck for the application, we will try two actions:

  • First, move the MySQL data files off the NFS server (a rough sketch follows this list).
  • Second, move the application logs off the NFS server as well.
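For reference, relocating a MySQL data directory off an NFS mount usually looks roughly like the sketch below. The paths and the service name are assumptions for illustration, not values taken from this environment.

# Rough sketch: move the MySQL data directory from the NFS mount to a local disk
# (paths and the systemd unit name are assumptions; adjust them to your layout)
systemctl stop mysqld
rsync -av /nfs/mysql-data/ /data/mysql/                    # copy the data to a local disk
chown -R mysql:mysql /data/mysql                           # keep ownership intact
sed -i 's|^datadir=.*|datadir=/data/mysql|' /etc/my.cnf    # point MySQL at the new location
systemctl start mysqld
# On SELinux-enabled systems, the new directory may also need the proper security context.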

We then execute the scenario, hoping for a better result.

However, after reviewing the TPS and response-time curves, I am sorry to say the results have not improved: TPS is still very low and fluctuates greatly.

Apparently our efforts have not had much effect. How tragic! So, fate forces us into the fourth phase.

Phase 4: Hardware resources exhausted, but TPS remains low #

What should we investigate in this phase? We still start from the global monitoring data. Let’s take a look at the resource overview of all the hosts:

From the graph above, we can see that the CPU utilization of the virtual machine k8s-worker-8 is already very high, reaching 95.95%. So let’s log in to this virtual machine and take a closer look at the detailed global monitoring data:

Because the CPUs are not over-committed, we can clearly see that the CPU of k8s-worker-8 is exhausted. From the process list, we can see that the CPU is being consumed by the service behind the interface we are currently testing. And this virtual machine runs not only the Portal process but many other services as well.

So let’s schedule the Portal service onto a less busy worker, for example worker-3 (6C16G):
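There are several ways to pin a workload to a particular node in k8s. A minimal sketch uses a node label plus a nodeSelector; the Deployment name portal and the label key used here are assumptions for illustration:

# Label the target node, then point the Deployment at it through a nodeSelector
kubectl label node k8s-worker-3 dedicated=portal
kubectl patch deployment portal -p \
  '{"spec":{"template":{"spec":{"nodeSelector":{"dedicated":"portal"}}}}}'
# The rolling update then reschedules the Portal Pod onto k8s-worker-3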

We get the following result:

We can see that TPS has improved to nearly 300, which shows some performance gain. However, this is still no better than the result we had at the very beginning, before any optimization, when we could already reach about 300 TPS. So let’s continue to analyze where the current bottleneck lies.

Let’s start by looking at the performance data of the hosts:

Among them, the CPU utilization of worker-8 has reached 90.12%. Why is the CPU still so high? Let’s continue to run top and take a look at the performance data of worker-8:

As you can see, the process at the top of the process table is the Gateway service, indicating that the Gateway process consumes the most CPU. In this case, we naturally want to see if the threads in this process are busy.

From the graph above, we can see that the Gateway threads stay in the Runnable state, so the worker threads are indeed busy. In worker-8’s earlier performance data, the si CPU had reached around 16%. Putting these together, let’s look at the real-time soft interrupt data:
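From the shell, a common way to watch the soft interrupt counters change in real time is to watch /proc/softirqs; for our purpose the NET_RX row (network receive) is the one to keep an eye on:

# Refresh every second and highlight the counters that changed since the last refresh
watch -d -n 1 'cat /proc/softirqs'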

We can see that the network soft interrupt counters keep increasing, confirming that network soft interrupts are what is driving up the si CPU. This finding follows the chain of evidence below:

Let’s also take a look at how much network bandwidth is being used:
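In this environment the bandwidth figures come from the monitoring dashboards. If you only have a shell, sar from the sysstat package gives the same per-interface view:

# One sample per second for every network interface;
# rxkB/s and txkB/s are the receive/transmit rates to compare against the NIC's capacity
sar -n DEV 1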

We can see that the network bandwidth is not high.

Putting together the Gateway worker threads, the soft interrupt data, and the network bandwidth: the Gateway only forwards requests and has little business logic or other constraints. So, as for why TPS is not increasing, we cannot find any explanation other than limited network forwarding capability.

This reasoning requires some background knowledge. We usually judge whether the network is sufficient by looking at bandwidth, but that alone is not enough. When there are too many small packets, throughput cannot scale linearly with bandwidth: even though the bandwidth usage here is not high, the flood of small packets still drives up the network soft interrupts and causes queues to build.

In that case, let’s move the Gateway from worker-8 to worker-2 (6C16G) to reduce contention for network soft interrupts, and then look at the overall performance of the cluster:

Looks good: the CPU utilization of worker-8 has dropped to 56.65%, while that of worker-3 has risen to 70.78%. A few network-bandwidth panels have turned red, which we will analyze later. At the very least, we can see that more load is now getting through.

Let’s go back and look at the load-test results:

The TPS has reached around 1000! Great achievement! Let’s celebrate with a TPS comparison graph:

Actually, at this point we could end the benchmark scenario for opening the home page, because we have already optimized it beyond the requirement. However, from a technical point of view, every system has an optimization limit, and we still want to know where that limit is.

Phase 5: Using up the hardware resources #

At present, the load has pushed worker-3’s CPU usage to its highest point, 70.78%. So we need to use this machine’s hardware resources as fully as possible, because only when resources are completely used up can we determine the upper limit of the system’s capacity. This is why I always emphasize splitting performance optimization into two stages: first use up the resources, then increase capacity. And it does not have to be the CPU; using up any other resource will do as well.

Since the CPU resources of worker-3 have already reached 70.78% usage, let’s see how the threads in this application are occupying the CPU.

As you can see, the threads here are indeed busy.

In this case, let’s increase the maximum values of the Tomcat thread pool and the JDBC connection pool to 80 and see how TPS performs. (Please note that this is just an attempt; there is no particular reason behind the value. In subsequent testing we will adjust it based on the actual results, since we want the thread count to be neither too large nor too small.)
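Where exactly these two maximums are configured depends on the stack. For a typical Spring Boot service using the default HikariCP connection pool, the settings would look roughly like the sketch below; the property names vary between Spring Boot versions, so treat this only as an illustration:

# application.yml (sketch; adjust property names to your Spring Boot version)
server:
  tomcat:
    threads:
      max: 80                  # Tomcat worker thread pool maximum
spring:
  datasource:
    hikari:
      maximum-pool-size: 80    # JDBC connection pool maximum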

To apply pressure directly to a single node, we skip the Ingress and send the load straight to the service, that is, segmented load testing. To do that, we expose the service through a NodePort so the load generator can reach the Pod directly, and modify the load script accordingly; a minimal sketch is shown below.
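Exposing the service through a NodePort can be as simple as patching its Service type; the Service name portal-svc below is an assumption for illustration:

# Switch the Service to NodePort so the load generator can hit the node directly
kubectl patch svc portal-svc -p '{"spec":{"type":"NodePort"}}'
# Check which node port was assigned, then point the load script at <nodeIP>:<nodePort>
kubectl get svc portal-svc

After rerunning the scenario against the NodePort, the result is as follows: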

TPS is still fluctuating greatly. Let’s take a look at the global monitoring:

From the above image, we can see that the bandwidth of a few hosts is flagged red, while the other resource usage rates are not particularly high. As mentioned earlier, when analyzing network issues we should not stop at the bandwidth number alone; still, since the bandwidth is flagged here, the next step is to find out where that bandwidth is going.

Let’s check the traffic in the monitoring tool. We can see that some applications unrelated to the test are indeed occupying bandwidth, and not just a little:

Next, let’s take a look at the overall bandwidth and see that more than 4GB has already been used:

To find out whether applications unrelated to the system under test are eating into the bandwidth and thereby dragging down TPS, let’s remove the bandwidth-heavy applications we can see in the list, such as Weave Scope and some of the monitoring tools, since the list shows they occupy a considerable amount of bandwidth.

Then, we test again and find that TPS has improved and has become more stable:

As you can see, TPS has now increased to around 1200, indicating that bandwidth does have a significant impact on TPS.

Next, let’s check the network queues. We find that there is already a significant Recv-Q backlog on the server where the application is hosted:

Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name     Timer
tcp      759      0 10.100.69.229:8085      10.100.140.32:35444     ESTABLISHED 1/java               off (0.00/0/0)
tcp      832      0 10.100.69.229:34982     10.96.224.111:3306      ESTABLISHED 1/java               keepalive (4871.85/0/0)
tcp     1056      0 10.100.69.229:34766     10.96.224.111:3306      ESTABLISHED 1/java               keepalive (4789.93/0/0)
tcp      832      0 10.100.69.229:35014     10.96.224.111:3306      ESTABLISHED 1/java               keepalive (4888.23/0/0)
tcp     3408      0 10.100.69.229:34912     10.96.224.111:3306      ESTABLISHED 1/java               keepalive (4855.46/0/0)
tcp     3408      0 10.100.69.229:35386     10.96.224.111:3306      ESTABLISHED 1/java               keepalive (5019.30/0/0)
tcp     3392      0 10.100.69.229:33878     10.96.224.111:3306      ESTABLISHED 1/java               keepalive (4495.01/0/0)
tcp      560      0 10.100.69.229:35048     10.96.224.111:3306      ESTABLISHED 1/java               keepalive (4888.23/0/0)
tcp     1664      0 10.100.69.229:34938     10.96.224.111:3306      ESTABLISHED 1/java               keepalive (4855.46/0/0)
tcp      759      0 10.100.69.229:8085      10.100.140.32:35500     ESTABLISHED 1/java               off (0.00/0/0)
tcp      832      0 10.100.69.229:35114     10.96.224.111:3306      ESTABLISHED 1/java               keepalive (4921.00/0/0)
tcp     1056      0 10.100.69.229:34840     10.96.224.111:3306      ESTABLISHED 1/java               keepalive (4822.69/0/0)
tcp     1056      0 10.100.69.229:35670     10.96.224.111:3306      ESTABLISHED 1/java               keepalive (5117.60/0/0)
tcp     1664      0 10.100.69.229:34630     10.96.224.111:3306      ESTABLISHED 1/java               keepalive (4757.16/0/0)

From here, it can be seen that the network has become the next bottleneck (we will discuss this further in future lessons).

If you want to continue tuning, you can start from the application code and make the application itself faster. However, for a benchmark scenario, achieving such a TPS for an interface without any caching, on a single 6C16G virtual machine node, is already quite good.

Next, we will need to work on other interfaces, so we will end the optimization for this interface here.

Summary #

The benchmark scenario for opening the home page involves many aspects. The initial information gathering, such as tracing the access path, reviewing the code logic, and doing a trial run of the scenario, is all preparation for the analysis that follows.

Seeing a high response time and then splitting that time across the call chain is what I have always emphasized in RESAR performance engineering as the “starting point” of analysis. Before that step, we are only looking at data from the load-testing tool, listing numbers without really analyzing them.

There are various means for splitting the time: use whatever you prefer, such as logs, APM tools, or even packet capture. Once the time is split and we see that a certain node’s response time is high, we need to decide what to do next. This is where the “global-targeted” monitoring and analysis approach I keep emphasizing comes into play.

In each phase, you must clearly define the direction and goal of the optimization, otherwise it is easy to lose focus. This is especially true for those who like to click around very quickly; it is easy to lose the thread. I advise you to slow down and think through the next step before acting.

The entire process above relies on the performance analysis decision tree I have described: walk down the tree from the top, layer by layer, calmly and without haste.

As long as you have the determination, you can achieve it.

Homework #

Finally, I have three questions for you to think about.

  1. When the st CPU is high, what should you look at?

  2. When the wa CPU is high, what should you look at?

  3. Why do we want to fully utilize the hardware resources?

Remember to discuss and exchange your ideas with me in the comments section. Each thought will help you move forward.

If you have gained something from reading this article, feel free to share it with your friends and learn and progress together. See you in the next lesson!