12 Opening the homepage, part 2: How to balance and utilize hardware resources #
Hello, I’m Gao Lou.
Regarding the performance issue with the open-homepage interface, in the last lesson we determined that the Gateway was consuming response time, almost 100 milliseconds of it. So we set out to locate exactly where that time was going on the Gateway.
In the first phase, we focused on the host where the application runs and found that it carries a total of four virtual machines. In the second phase, we checked the CPU mode of the physical machine and tried to improve performance by changing the CPU running mode. However, the problem persisted: TPS did not improve, and the response time remained long.
In today’s lesson, we enter the third phase and continue analyzing other bottleneck points, such as high wa CPU, unbalanced resource usage, and network bandwidth. Among these, balanced resource utilization is an aspect of performance analysis that is often overlooked yet extremely important: we usually focus on the abnormal values reported by counters rather than on whether resources are allocated sensibly in the first place.
In our case, the system uses k8s to manage resources, so we must pay attention to balanced resource utilization; otherwise, poorly performing services may end up allocated the same resources as well-performing ones. In addition, network resources in k8s span multiple layers, so they deserve special attention too.
While studying this lesson, I suggest you think more about the issue of balanced resource utilization. Now, let’s begin today’s class.
Locating the response time consumption on the gateway #
Phase 3: High wa CPU on NFS server #
According to the analysis logic, we still start by looking at the global monitoring data, following the “global-targeted” approach, which is my usual sequence.
Therefore, let’s now take a look at the global monitoring counters and get the following view:
[root@lenvo-nfs-server ~]# top
top - 00:12:28 up 32 days, 4:22, 3 users, load average: 9.89, 7.87, 4.71
Tasks: 217 total, 1 running, 216 sleeping, 0 stopped, 0 zombie
%Cpu0 : 0.0 us, 4.0 sy, 0.0 ni, 34.8 id, 61.2 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.0 us, 4.7 sy, 0.0 ni, 27.8 id, 67.6 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.0 us, 6.1 sy, 0.0 ni, 0.0 id, 93.9 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us, 7.6 sy, 0.0 ni, 3.4 id, 82.8 wa, 0.0 hi, 6.2 si, 0.0 st
KiB Mem : 3589572 total, 82288 free, 775472 used, 2731812 buff/cache
KiB Swap: 8388604 total, 8036400 free, 352204 used. 2282192 avail Mem
We can see that the wa CPU utilization is high, with Cpu2’s wa already above 90%. The wa CPU is the percentage of CPU time spent waiting for IO to complete during reads and writes. So why is it so high now? Could it be that there are a lot of write operations?
At this point, we need to pay attention to the status of IO, as slow IO is definitely a performance problem. By using the iostat command, we can see the IO status as follows:
[root@lenvo-nfs-server ~]# iostat -x -d 1
Linux 3.10.0-693.el7.x86_64 (lenvo-nfs-server) Dec 26, 2020 _x86_64_ (4 CPU)
..................
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 94.00 39.00 13444.00 19968.00 502.44 108.43 410.80 52.00 1275.59 7.52 100.00
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 18.00 137.00 173.00 17712.00 43056.00 392.05 129.46 601.10 38.80 1046.38 3.74 115.90
..................
As you can see, IO utilization has reached 100%, meaning the disk is saturated and IO is very slow.
Next, let’s check the Block Size and work out whether the current IO is random or sequential. Although most operating systems default to a Block Size of 4096, it’s always worth double-checking for peace of mind.
First, let’s determine the format of the disk:
[root@lenvo-nfs-server ~]# cat /proc/mounts
...................
/dev/sda5 / xfs rw,relatime,attr2,inode64,noquota 0 0
...................
[root@lenvo-nfs-server ~]#
With the above command, we can see that the disk is in XFS format. Now, let’s use the following command to view the Block Size:
[root@lenvo-nfs-server ~]# xfs_info /dev/sda5
meta-data=/dev/sda5 isize=512 agcount=4, agsize=18991936 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=0 spinodes=0
data = bsize=4096 blocks=75967744, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal bsize=4096 blocks=37093, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
[root@lenvo-nfs-server ~]#
The result shows that the Block Size is 4096. We can also see that reads and writes are mostly sequential rather than random. Let’s do a quick calculation to confirm the sequential-write claim. If every write were a random, single-block write, the write throughput above would require:
\(\frac{43056 \times 1024}{4096} = 10{,}764\) writes per second
But in reality there were only 173 writes per second, so the writes are indeed sequential.
The question is, how many blocks are written at a time?
\(\frac{43056 \times 1024}{173 \times 4096} \approx 62\) blocks
So about 62 blocks are written per request, which shows that the sequential-write capability is decent. On conventional hard drives, heavily random writes are significantly slower, while mostly sequential writes can be much faster.
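The two calculations above can be checked directly from the iostat fields (wkB/s = 43056, w/s = 173) and the 4096-byte block size; a quick sketch in shell arithmetic:

```shell
# Figures taken from the iostat output and xfs_info above.
wkBps=43056   # wkB/s: kilobytes written per second
wps=173       # w/s: write requests per second
bs=4096       # filesystem block size in bytes

# If every write touched a single random block, this many writes/s would be needed:
random_writes=$(( wkBps * 1024 / bs ))

# Blocks actually written per request; a large value means big, merged,
# sequential requests rather than scattered single-block random writes:
blocks_per_write=$(( wkBps * 1024 / (wps * bs) ))

echo "$random_writes $blocks_per_write"   # prints: 10764 62
```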
Did you notice? Although the hard drive has good sequential-write capability, there is still considerable waiting time. So next we need to find out which program is doing the writing, using the iotop command:
Total DISK READ: 20.30 M/s | Total DISK WRITE: 24.95 M/s
Actual DISK READ: 20.30 M/s | Actual DISK WRITE: 8.27 M/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
12180 be/4 root 2.39 M/s 16.01 M/s 0.00 % 35.94 % [nfsd]
12176 be/4 root 3.20 M/s 0.00 B/s 0.00 % 32.25 % [nfsd]
12179 be/4 root 3.03 M/s 6.43 M/s 0.00 % 32.23 % [nfsd]
12177 be/4 root 2.44 M/s 625.49 K/s 0.00 % 31.64 % [nfsd]
12178 be/4 root 2.34 M/s 1473.47 K/s 0.00 % 30.43 % [nfsd]
12174 be/4 root 2.14 M/s 72.84 K/s 0.00 % 29.90 % [nfsd]
12173 be/4 root 2.91 M/s 121.93 K/s 0.00 % 24.95 % [nfsd]
12175 be/4 root 1894.69 K/s 27.71 K/s 0.00 % 24.94 % [nfsd]
...............
We can see that the IO comes from NFS. So where does the NFS traffic come from? From the data below, it comes from the machines that have mounted NFS disks. This was our original deployment idea: since this machine has a large-capacity disk, we mounted it over NFS on multiple hosts to make sure disk space would be sufficient.
191Mb 381Mb 572Mb 763Mb 954Mb
└───────────────────┴───────────────────┴───────────────────┴───────────────────┴───────────────────
172.16.106.119:nfs => 172.16.106.130:multiling-http 1.64Mb 2.04Mb 3.06Mb
<= 26.2Mb 14.5Mb 19.8Mb
172.16.106.119:nfs => 172.16.106.100:apex-mesh 1.43Mb 2.18Mb 3.79Mb
<= 25.5Mb 14.2Mb 14.4Mb
172.16.106.119:nfs => 172.16.106.195:vatp 356Kb 1.27Mb 1.35Mb
<= 9.71Mb 7.04Mb 7.41Mb
172.16.106.119:nfs => 172.16.106.56:815 7.83Kb 4.97Kb 4.81Kb
<= 302Kb 314Kb 186Kb
172.16.106.119:nfs => 172.16.106.79:device 11.0Kb 7.45Kb 7.57Kb
<= 12.4Kb 22.0Kb 28.5Kb
172.16.106.119:ssh => 172.16.100.201:cnrprotocol 2.86Kb 2.87Kb 5.81Kb
<= 184b 184b 525b
169.254.3.2:60010 => 225.4.0.2:59004 2.25Kb 2.40Kb 2.34Kb
<= 0b 0b 0b
169.254.6.2:60172 => 225.4.0.2:59004 2.25Kb 2.40Kb 2.34Kb
<= 0b 0b 0b
172.16.106.119:nfs => 172.16.106.149:986 0b 1.03Kb 976b
<= 0b 1.26Kb 1.11Kb
───────────────────────────────────────────────────────────────────────────────────────────────────
TX: cum: 37.0MB peak: 31.9Mb rates: 3.44Mb 5.50Mb 8.22Mb
RX: 188MB 106Mb 61.8Mb 36.2Mb 41.8Mb
TOTAL: 225MB 111Mb 65.2Mb 41.7Mb 50.1Mb
From Total DISK WRITE and Total DISK READ, we can see that read and write throughput is only around 20M. Since this machine’s IO capacity is limited, we have to give up on the idea of writing everything to it over NFS. However, to make sure this machine’s IO capability does not become a bottleneck for the application, we try two more actions:
- First, move the MySQL data files.
- Second, move the logs.
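For the first action, the core change is repointing MySQL’s datadir from the NFS mount to a local disk. A hypothetical my.cnf fragment (the paths are illustrative assumptions, not taken from the actual environment):

```ini
# /etc/my.cnf -- sketch only; paths are assumptions
[mysqld]
# Data files moved off the NFS mount onto a local disk
# (copy the files and fix ownership before restarting mysqld):
datadir=/data/mysql/data
socket=/data/mysql/mysql.sock
```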
We then execute the scenario, hoping for a better result.
However, after reviewing the TPS and RT curves, I am sorry to report that the results have not improved: TPS is still very low and fluctuates heavily.
Apparently our efforts have not had much effect. Tragic! So fate forces us into the fourth phase.
Phase 4: Hardware resources exhausted, but TPS remains low #
What should we investigate in this phase? We still need to look at global monitoring data. Let’s take a look at the resource overview of all hosts:
From the graph above, we can see that the CPU utilization of the virtual machine k8s-worker-8 is already very high, reaching 95.95%. So let’s log in to this virtual machine and take a closer look at the detailed global monitoring data:
Because the CPU is not oversubscribed, we can clearly see that the CPU of k8s-worker-8 is exhausted. From the process list, the CPU is being consumed by the very interface service we are testing, and this virtual machine hosts not only the Portal process but many other services as well.
So let’s schedule the Portal service to a less busy worker, such as moving it to worker-3 (6C16G):
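In k8s, one way to steer a service onto a specific worker is a nodeSelector on its Deployment; a hedged sketch (the deployment name and label are assumptions, not taken from the actual cluster):

```yaml
# Sketch: pin the Portal Deployment onto worker-3.
# Prerequisite (assumed): kubectl label node k8s-worker-3 app-node=portal
apiVersion: apps/v1
kind: Deployment
metadata:
  name: portal            # hypothetical deployment name
spec:
  template:
    spec:
      nodeSelector:
        app-node: portal  # Pods schedule only onto nodes carrying this label
```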
We get the following result:
We can see that TPS has improved, reaching nearly 300, which indicates some gain in performance. However, this is still no better than our very first, unoptimized result; after all, we could already reach 300 TPS at the beginning. So let’s keep analyzing where the current bottleneck lies.
Let’s start by looking at the performance data of the hosts:
Among them, the CPU utilization of worker-8 has reached 90.12%. Why is its CPU still so high? Let’s run top on worker-8 and look at its performance data:
As you can see, the process at the top of the process table is the Gateway service, indicating that the Gateway process consumes the most CPU. In this case, we naturally want to see if the threads in this process are busy.
We can see from the graph above that all the threads in the Gateway stay in the Runnable state, so the worker threads are indeed very busy. In the earlier performance data of worker-8, the si CPU had reached around 16%. Putting these together, let’s take a look at the real-time soft-interrupt data:
We can see that the network soft interrupts keep increasing, confirming that it is indeed network soft interrupts driving up the si CPU. This finding completes our chain of evidence, which looks like this:
Let’s also take a look at how big the network bandwidth is:
We can see that the network bandwidth is not high.
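Coming back to the soft interrupts for a moment: if you want to sample the counters yourself, the per-CPU counts live in /proc/softirqs; a minimal sketch:

```shell
# Sample the network soft-interrupt counters twice, one second apart.
# NET_RX rows that keep climbing on some CPUs indicate receive-side
# soft-interrupt load, which is exactly what pushes si CPU up.
grep -E 'NET_RX|NET_TX' /proc/softirqs
sleep 1
grep -E 'NET_RX|NET_TX' /proc/softirqs
```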
Putting together the Gateway worker threads, the soft-interrupt data, and the network bandwidth: the Gateway only forwards requests and has little business logic or throttling of its own. So for the stalled TPS, we can find no explanation other than limited network-forwarding capacity.
This judgment requires some background knowledge. We usually use network bandwidth to decide whether the network is sufficient, but that is not enough: when a network carries too many small packets, throughput cannot grow linearly with bandwidth. So even though the bandwidth here is not high, it can still drive network soft interrupts up and make queues appear.
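The small-packet effect is easy to make concrete with an average-packet-size estimate; the byte and packet counts below are purely illustrative, not measurements from this environment:

```shell
# Average packet size = bytes per second / packets per second. A value far
# below the 1500-byte Ethernet MTU means mostly small packets, i.e. many
# soft interrupts even though the bandwidth figure looks modest.
rx_bytes_per_s=5240000    # ~41.9 Mb/s expressed in bytes (illustrative)
rx_packets_per_s=52400    # hypothetical packet rate over the same second
echo "avg packet size: $(( rx_bytes_per_s / rx_packets_per_s )) bytes"   # prints: avg packet size: 100 bytes
```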
In that case, let’s move this Gateway from worker-8 to worker-2 (6C16G) to reduce the contention for network soft interrupts, and then look at the overall performance of the cluster:
Looks good: the CPU utilization of worker-8 has dropped to 56.65%, while that of worker-3 has risen to 70.78%. A few network-bandwidth panels have turned red, which we will analyze later; at the very least, we can see the load has increased.
Let’s go back and look at the stress situation:
The TPS has reached around 1000! Great achievement! Let’s celebrate with a TPS comparison graph:
Actually, at this point we could end the benchmark scenario for opening the home page, because we have already optimized it beyond the requirement. From a technical perspective, however, a system always has an optimization limit, and we still want to know where that limit is.
Phase 5: Using up the hardware resources #
Currently, the load has pushed worker-3’s CPU usage to a peak of 70.78%. So we want to use this machine’s hardware to the fullest, because only when resources are fully used can we determine the upper limit of system capacity. This is why I always emphasize splitting performance optimization into two phases: first use up the resources, then increase capacity. Even if it is not the CPU, using up some other resource will do.
Since the CPU resources of worker-3 have already reached 70.78% usage, let’s see how the threads in this application are occupying the CPU.
As you can see, the threads here are indeed busy.
In this case, let’s increase the maximum Tomcat thread count and JDBC connection pool size to 80 and see how TPS behaves. (Note that this is just an attempt; there is no precise rationale behind the value. In subsequent testing we will adjust it based on actual behavior, since we want the thread count to be neither excessive nor insufficient.)
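For a Spring Boot service, the two ceilings might be raised like this in application.yml (property names assume Spring Boot with the default HikariCP pool; this is a sketch, not the project’s actual configuration):

```yaml
# Sketch: raise both the Tomcat worker-thread cap and the JDBC pool cap to 80.
server:
  tomcat:
    threads:
      max: 80                # Tomcat worker threads (Spring Boot 2.1+ property path)
spring:
  datasource:
    hikari:
      maximum-pool-size: 80  # max JDBC connections in the HikariCP pool
```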
To apply load directly to a single node, we skip the Ingress and send the load straight to the service, following the segmented-testing approach: we expose the Pod’s service through a NodePort and modify the load script accordingly. The result is as follows:
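For reference, exposing the service directly without the Ingress might look like the following NodePort sketch (service name, labels, and ports are all assumptions):

```yaml
# Sketch: publish the Portal Pods on port 30085 of every node so the load
# generator can target one node directly, bypassing the Ingress.
apiVersion: v1
kind: Service
metadata:
  name: portal-nodeport    # hypothetical name
spec:
  type: NodePort
  selector:
    app: portal            # hypothetical Pod label
  ports:
    - port: 8085           # service port
      targetPort: 8085     # container port
      nodePort: 30085      # node port the load script points at
```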
TPS is still fluctuating greatly. Let’s take a look at the global monitoring:
From the image above, we can see that the bandwidth of a few hosts is red, while other resource usage is not particularly high. As mentioned earlier, when analyzing network issues we should not stop at bandwidth; still, the red panels mean we need to find out where this bandwidth is going.
Let’s check the traffic in the monitoring tool. You can see that there are indeed some non-test applications occupying the bandwidth, and the usage is not small:
Next, let’s take a look at the overall bandwidth and see that more than 4GB has already been used:
To find out which applications unrelated to the system under test are consuming bandwidth and thereby dragging down TPS, let’s remove the bandwidth-hungry applications, such as Weave Scope and the monitoring tools, from the cluster; the list shows they occupy a considerable share of the bandwidth.
Then, we test again and find that TPS has improved and has become more stable:
As you can see, TPS has now increased to around 1200, indicating that bandwidth does have a significant impact on TPS.
Next, let’s check the network queues, and we find that there is already a significant Recv-Q on the server where the application runs:
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name Timer
tcp 759 0 10.100.69.229:8085 10.100.140.32:35444 ESTABLISHED 1/java off (0.00/0/0)
tcp 832 0 10.100.69.229:34982 10.96.224.111:3306 ESTABLISHED 1/java keepalive (4871.85/0/0)
tcp 1056 0 10.100.69.229:34766 10.96.224.111:3306 ESTABLISHED 1/java keepalive (4789.93/0/0)
tcp 832 0 10.100.69.229:35014 10.96.224.111:3306 ESTABLISHED 1/java keepalive (4888.23/0/0)
tcp 3408 0 10.100.69.229:34912 10.96.224.111:3306 ESTABLISHED 1/java keepalive (4855.46/0/0)
tcp 3408 0 10.100.69.229:35386 10.96.224.111:3306 ESTABLISHED 1/java keepalive (5019.30/0/0)
tcp 3392 0 10.100.69.229:33878 10.96.224.111:3306 ESTABLISHED 1/java keepalive (4495.01/0/0)
tcp 560 0 10.100.69.229:35048 10.96.224.111:3306 ESTABLISHED 1/java keepalive (4888.23/0/0)
tcp 1664 0 10.100.69.229:34938 10.96.224.111:3306 ESTABLISHED 1/java keepalive (4855.46/0/0)
tcp 759 0 10.100.69.229:8085 10.100.140.32:35500 ESTABLISHED 1/java off (0.00/0/0)
tcp 832 0 10.100.69.229:35114 10.96.224.111:3306 ESTABLISHED 1/java keepalive (4921.00/0/0)
tcp 1056 0 10.100.69.229:34840 10.96.224.111:3306 ESTABLISHED 1/java keepalive (4822.69/0/0)
tcp 1056 0 10.100.69.229:35670 10.96.224.111:3306 ESTABLISHED 1/java keepalive (5117.60/0/0)
tcp 1664 0 10.100.69.229:34630 10.96.224.111:3306 ESTABLISHED 1/java keepalive (4757.16/0/0)
From here, it can be seen that the network has become the next bottleneck (we will discuss this further in future lessons).
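A quick way to pull only the backed-up sockets out of such a listing is to filter on the Recv-Q column; a sketch that runs on saved output (the two sample lines below stand in for real netstat output):

```shell
# Print only sockets whose receive queue (field 2) is non-zero. A Recv-Q
# that stays above zero means the application reads more slowly than the
# network delivers.
printf '%s\n' \
  'tcp 759 0 10.100.69.229:8085 10.100.140.32:35444 ESTABLISHED 1/java' \
  'tcp 0   0 10.100.69.229:8086 10.100.140.32:35445 ESTABLISHED 1/java' |
awk '$1 == "tcp" && $2 > 0 { print $4, "Recv-Q=" $2 }'
# Live equivalent (assumed): netstat -antp | awk '$1 == "tcp" && $2 > 0'
```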
If you want to keep tuning, you can turn to the application code to make it process requests faster. But for a benchmark of an interface without any caching, reaching such a TPS on a single 6C16G virtual-machine node is already quite good.
Next, we will need to work on other interfaces, so we will end the optimization for this interface here.
Summary #
The benchmark scenario of opening the home page involved many aspects. The initial information gathering, such as tracing the access path, reviewing the code logic, and trial-running the scenario, all prepared the ground for the subsequent analysis.
When we see a high response time and then split that time across components, that is what I have always emphasized in RESAR performance engineering as the “starting point” of analysis. Before this step, we only had the data from the load-testing tool, listed without any analysis.
There are various means for splitting the time; you can use whatever you prefer, such as logs, APM tools, or even packet capture. Once the time is split and a certain node shows a high response time, we analyze it with the “global-targeted” monitoring and analysis approach I have always emphasized.
In each phase, you must clearly define the direction and goal of the optimization; otherwise it is easy to lose focus, especially for those who click around very quickly. I advise you to slow down and think through the next step before acting.
The entire process mentioned above relies on the performance analysis decision tree I mentioned. Go down the tree from the top, layer by layer, remain calm and composed, without haste.
As long as you have the determination, you can achieve it.
Homework #
Finally, I have three questions for you to think about.
- When the st CPU is high, what should you look at?
- When the wa CPU is high, what should you look at?
- Why do we want to fully utilize the hardware resources?
Remember to discuss and exchange your ideas with me in the comments section. Each thought will help you move forward.
If you have gained something from reading this article, feel free to share it with your friends and learn and progress together. See you in the next lesson!