26 Stability Scenarios, Part 1: How to Handle Bottlenecks Caused by Accumulated Business Volume #

Hello, I’m Gao Lou.

According to our RESAR performance theory, after executing benchmark scenarios and capacity scenarios, the next step is stability scenarios.

Engineers who have worked on performance projects probably know the feeling: before running a stability scenario there is a sense of trepidation, because we don't know how the system will behave after running for a long time.

There is also a complicating factor: in stability scenarios, the long runtime makes problems harder to analyze. There are three main reasons for this:

(1) Analysis requires complete, continuous counter monitoring. In a stability scenario it is impractical to watch the performance counters in real time for the whole run, and we cannot know in advance when a problem will appear. So when a problem does need to be analyzed, we must be able to fall back on complete, continuous counter data.

(2) The places in the system where problems will surface as business volume accumulates are also uncertain.

(3) As you know, regression testing of stability scenarios is time-consuming. During analysis and optimization, every parameter adjustment or code change requires a regression run, and each stability run takes hours to set up and execute. So even a seemingly simple optimization can consume a long time in a stability scenario.

For these reasons, before running a stability scenario we must think carefully about which counters to monitor. If a problem appears during the run and we then discover there is no counter data to analyze it with, that is very unfortunate, and it is a situation that happens easily, so pay special attention to it.
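
As a simple safety net, I also like to keep a few OS-level collectors running for the whole scenario, so that basic counters survive even if something in the monitoring stack drops out. A minimal sketch, assuming the sysstat package is installed; the /data/stability path and the 60-second interval are illustrative choices:

# A simple fallback collector: record basic OS counters for the whole run.
mkdir -p /data/stability
nohup vmstat 60 > /data/stability/vmstat.log 2>&1 &
nohup sar -u 60 > /data/stability/sar_cpu.log 2>&1 &   # CPU counters every 60 s
nohup sar -r 60 > /data/stability/sar_mem.log 2>&1 &   # memory counters every 60 s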

Following the monitoring logic described in Lesson 9, before executing the stability scenario we have already listed all the counters to monitor along the “component - module - counter” hierarchy and implemented them with the corresponding tools. Everything seems to be ready. Now let's look at the key points to pay attention to when running stability scenarios.

Key Points of Stability Scenarios #

In stability scenarios, there are two key points that you need to focus on: runtime and stress level.

1. Runtime #

As we mentioned earlier, capacity scenarios are meant to test the maximum capacity that a system can handle, while stability scenarios mainly focus on the performance stability of the system during long-term service and observe the cumulative effects that occur during extended periods of operation. Therefore, runtime is a very important indicator in stability scenarios.

The stability runtime is not the same for every business system; it depends on the system's specific application scenario.

Most systems that must stay up all year do not rely on every node running continuously; they rely on architecture designs that can shift the business load to other nodes when any one node has a problem, using technologies such as DNS partitioning, scalability, and high availability.

However, for our performance projects, even if a system can run continuously without downtime, it is not feasible for stability scenarios to run continuously. This would be like running another production system all year round, which is costly and difficult to maintain, and clearly not realistic.

At this point, the importance of another role comes into play, which is operations and maintenance (O&M).

In the responsibilities of O&M, there is the task of “dealing with various issues that occur in the production environment,” which we usually refer to as being the scapegoat. The job of O&M is to ensure that the system runs smoothly in all scenarios. However, I want to add a few more words: to ensure this, it cannot solely rely on engineers in the O&M position, but requires the concerted efforts of all technical personnel in the company. In other words, the responsibilities of O&M should actually be shouldered by all technical personnel in an enterprise.

Going back to our discussion, we know that O&M will formulate various work tasks to ensure the normal operation of the system. Among these tasks, one important task is to build a comprehensive monitoring system, because it is unrealistic to expect an O&M engineer to keep an eye on the system without blinking. And the global monitoring and targeted monitoring that we mentioned in this course can fully meet the requirements of such monitoring systems.

Why do we mention O&M?

Because the runtime of stability scenarios cannot cover systems that run continuously without downtime, and this requires O&M personnel to ensure the stability of the online state. Overall, there are two main types of work tasks in O&M: one is daily inspections (checking the health status of the system manually or automatically), and the other is O&M actions (manually or automatically completing operations such as archiving and log cleaning).

Some systems have fixed O&M cycles, which are calculated on a daily, weekly, or monthly basis. For systems without fixed O&M cycles, judgment on when to perform O&M actions is based on the information provided by the monitoring system. In the case of well-developed automated O&M, some O&M actions are handled by the automation system; otherwise, it can only rely on human efforts.
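
As a concrete (and purely hypothetical) illustration, a fixed weekly O&M cycle might be expressed as cron entries like these; the paths and the archive script are placeholders, not part of this system:

# A hypothetical fixed weekly O&M cycle expressed as cron entries.
0 2 * * 0   /opt/ops/archive_orders.sh                               # weekly data archiving
30 2 * * 0  find /var/log/mall -name '*.log' -mtime +7 -delete       # weekly log cleanup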

However, whether there is automated O&M or not, every system has an O&M cycle, like the following example:

Now let’s take a closer look at how to calculate the runtime of stability scenarios for the two types of systems mentioned above.

  • Systems with Fixed O&M Cycles

For systems with fixed O&M cycles, defining the stability runtime is relatively easy. We start from the production system's statistics and determine the maximum business volume accumulated within one O&M cycle.

Suppose the production statistics show that the business volume accumulated in the previous O&M cycle was 100 million, and the maximum TPS obtained from the capacity scenario is 1000. Then we can calculate the runtime with the following formula:

\[ \text{Stability Runtime} = 100{,}000{,}000 \ (\text{cumulative business volume}) \div 1000 \ (\text{TPS}) \div 3600 \ (\text{seconds per hour}) \approx 28 \ \text{hours} \]

For systems with fixed O&M cycles, the stability runtime obtained this way is already sufficient.

  • Systems without Fixed O&M Cycles

What about systems without fixed O&M cycles? Some might say the runtime should simply be "as long as possible". But even "as long as possible" needs a limit; in my experience it cannot be used to determine the stability runtime.

Looking at the formula above, the TPS comes from the capacity scenario, and time is the biggest variable, which makes the cumulative business volume the unknown. What we need to do, then, is determine the cumulative business volume.

We know that the **cumulative business volume needs to be determined based on statistical data from past business**. Suppose your system accumulates 10 million transactions per month, and we set the stability target at three months of accumulation (even without a fixed O&M cycle, we still have to choose a time span):

![](../images/b990224ad9c242398eb5a173cd8e51b8.jpg)

The total cumulative business volume is then 30 million.

We can calculate the stability runtime using the formula above:

\[ \text{Stability Runtime} = 30{,}000{,}000 \ (\text{cumulative business volume}) \div 1000 \ (\text{TPS}) \div 3600 \ (\text{seconds per hour}) \approx 8 \ \text{hours} \]

In summary, **regardless of the type of system, it is necessary to determine a cumulative business volume in order to run stability scenarios**.
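
If you want to script this calculation, a small helper is enough; a minimal sketch, assuming the bc calculator is available:

# stability_hours: runtime (hours) = accumulated business volume / max TPS / 3600
stability_hours() {
  echo "scale=2; $1 / $2 / 3600" | bc
}
stability_hours 100000000 1000   # fixed O&M cycle example: about 27.8, round up to 28 hours
stability_hours 30000000  1000   # three months of accumulation: about 8.3 hours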

2. Stress Level #

Now let's look at the stress level, the other prerequisite that must be determined for stability scenarios.

We often come across statements online saying that the load in stability scenarios should be 80% of the maximum TPS. But consider the **goal of stability scenarios: accumulating the required business volume on the system**. In other words, as long as that goal is met, the exact TPS value is not important.

Therefore, **we do not need to bother with the 80% rule and can run the scenario directly at the maximum TPS**. Only a system that runs normally at the maximum TPS has truly withstood the test.

You might ask: what if a system running at the maximum TPS is hit by a sudden burst of traffic that demands an even higher TPS? Note that stability scenarios are not designed to cover peak load. If you need to account for sudden surges in business pressure, I suggest adding dedicated capacity scenarios to verify them.

In addition, handling sudden increases in business volume is not only a matter of adding capacity scenarios to the performance test design; the architecture itself should also include protection mechanisms such as flow control, circuit breaking, and degradation.

So far, we have covered the two important stability conditions.

Next, let's go through a specific example using our e-commerce system in this course to see how to determine stability scenarios.

Scenario Running Data #

Because this is a sample system, let’s set a small goal first: to stably accumulate a business volume of 50 million.

For this system, the maximum TPS obtained in the capacity scenario is 1700. However, as the scenario keeps running, the data volume in the database grows and TPS gradually drops, because I have not implemented database capacity limits or archiving. So let's run the stability scenario with the same pressure threads as in the capacity scenario and verify our reasoning in practice. According to the formula above, the stability runtime is:

\[ \text{Stability Runtime} = 50{,}000{,}000 \ (\text{cumulative business volume}) \div 1700 \ (\text{TPS}) \div 3600 \ (\text{seconds per hour}) \approx 8.16 \ \text{hours} \]

In other words, we need to run the stability scenario for a little over 8 hours.
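
With the helper sketched earlier, the same number falls out directly:

stability_hours 50000000 1700   # -> a little over 8 hours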

Now let’s take a look at the specific running data:

From the data we can see that after the stability scenario had been running for a little more than four hours, the TPS dropped away to nothing and the response time shot up, which clearly indicates a problem.

At this time, the accumulated business volume is:

The total business volume is just over 29 million, far short of our 50-million goal.

Next, let’s analyze what’s going on.

Global Monitoring and Analysis #

Following our usual performance analysis logic, let’s first take a look at the global monitoring data:

As you can see, during the run the CPU usage of several workers stayed above 70%. That by itself is normal and not our main concern: in a stability scenario, as long as the resources can sustain the load, it is fine.

However, the TPS dropped off abruptly during the run. When I checked the resource data of each host, I found this on worker-1:

Here, the data is cut off! So we need to analyze this specific host.

Targeted Monitoring and Analysis #

Phase One of Targeted Analysis #

Based on the timestamp of the problem and the monitoring methods we used before, we investigated step by step (as explained in Lesson 4). Here are the log messages we found on the host:

Feb 20 04:20:41 hp-server kernel: Out of memory: Kill process 7569 (qemu-kvm) score 256 or sacrifice child
Feb 20 04:20:41 hp-server kernel: Killed process 7569 (qemu-kvm), UID 107, total-vm:18283204kB, anon-rss:16804564kB, file-rss:232kB, shmem-rss:16kB
Feb 20 04:20:44 hp-server kernel: br0: port 4(vnet2) entered disabled state
Feb 20 04:20:44 hp-server kernel: device vnet2 left promiscuous mode
Feb 20 04:20:44 hp-server kernel: br0: port 4(vnet2) entered disabled state
Feb 20 04:20:44 hp-server libvirtd: 2021-02-19 20:20:44.706+0000: 1397: error : qemuMonitorIO:718 : Internal error: End of file from qemu monitor
Feb 20 04:20:44 hp-server libvirtd: 2021-02-19 20:20:44.740+0000: 1397: error : qemuAgentIO:598 : Internal error: End of file from agent monitor
Feb 20 04:20:45 hp-server systemd-machined: Machine qemu-3-vm-k8s-worker-1 terminated.

Clearly, worker-1 was killed directly due to insufficient memory on the host machine. Since the issue is memory-related, we need to investigate why the host machine ran out of memory.
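
For reference, entries like these can be located quickly by searching the host's kernel log around the time the TPS curve dropped; a minimal sketch, with an illustrative time window:

# Search the host's kernel messages for OOM-killer events around the failure time.
grep -i 'out of memory' /var/log/messages
# On systemd hosts, the kernel log can also be filtered by time window:
journalctl -k --since "2021-02-20 04:00" --until "2021-02-20 04:30" | grep -iE 'oom|killed process'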

I checked the overcommit parameter (vm.overcommit_memory) on the host machine. This parameter determines whether the operating system allows memory to be overcommitted. In Linux, memory that has been allocated is not necessarily all used, so allowing overcommitment lets the host support more virtual machines. You can check the current setting with the commands sketched after the following list.

Here are the options for this parameter:

  • 0: heuristic overcommit handling (the default). Moderate overcommitment is allowed, but obviously excessive allocation requests are refused.
  • 1: always allow overcommitment. Allocation requests are granted regardless of the current memory state.
  • 2: do not overcommit. The total committed memory may not exceed swap plus a configurable fraction (vm.overcommit_ratio) of physical memory.
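
A quick way to check the current policy and how much memory the kernel has already committed, assuming a standard /proc layout:

# Current overcommit policy (0 = heuristic, 1 = always allow, 2 = strict limit):
cat /proc/sys/vm/overcommit_memory
# Memory the kernel has committed so far versus its commit limit:
grep -E 'CommitLimit|Committed_AS' /proc/meminfo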

Please note that allowing overcommitment does not mean allowing overuse! In our current situation, the host machine has already encountered an OOM (Out of Memory) error, indicating that the memory is indeed insufficient.

This logic is quite interesting: Although Linux permits overcommitting memory, when the memory is truly insufficient, even if it receives a request for overcommitment, it will make the OOM decision to ensure its own normal operation. In other words, it gives you the resources, but you may not be able to make good use of them! Doesn’t this remind you of some unreliable leaders promising big things?

Unfortunately, we still need to analyze the situation rationally and find a solution.

Since worker-1, the virtual machine, was killed, let’s take a look at its memory usage:

Judging from worker-1's own resource usage, if it had been killed for using too much memory itself, it should have been killed between 12:20 and 12:30, since the memory curve shows no significant fluctuation after about 12:30.

So why did it survive until 4:20 in the morning? This tells us that worker-1 was killed not because its own memory use suddenly spiked, but because memory usage on the host machine grew until memory ran short; after the OOM scores were calculated, worker-1 was the one selected. So let's go to the host machine and see which other virtual machines are running:

[root@hp-server log]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 1     vm-k8s-master-1                running
 2     vm-k8s-master-3                running
 4     vm-k8s-worker-2                running
 5     vm-k8s-worker-3                running
 6     vm-k8s-worker-4                running
 7     vm-k8s-worker-1                running

There are a total of 6 virtual machines running on the host machine. The corresponding memory usage of each virtual machine after 12:30 is as follows:

vm-k8s-worker-2:

vm-k8s-worker-3:

vm-k8s-worker-4:

vm-k8s-master-1:

vm-k8s-master-3:

Did you see it? At around 4 o’clock, there was a large memory request on worker-2.

If we wanted to analyze this in detail, we would trace where this memory request came from. However, that kind of analysis is harder to do in a stability scenario: the run is long and many business components are involved, so it is difficult to narrow things down to a precise time window. My suggestion is to analyze it in a benchmark scenario instead.

For now we cannot assert that this memory request is unreasonable. What we need is to keep the system running stably, so let's solve this problem first.

You may wonder: since it was worker-2 that requested the memory, why kill worker-1? This requires an understanding of the Linux OOM killer mechanism.

In the OOM killer mechanism, the process that gets killed is not simply the one using the most memory (although heavy memory usage does raise the odds of being killed); instead, a score is calculated for each process and the one with the highest score is killed.

Each process exposes three values under /proc/<pid>/: oom_adj, oom_score, and oom_score_adj. The score computed by the kernel is recorded in oom_score. The other two are adjustment parameters: oom_adj is the legacy knob kept for compatibility, and oom_score_adj is the newer one; the kernel combines the process's runtime characteristics with this adjustment to produce the final score.

The runtime parameters mentioned here mainly include:

  • Runtime duration (the longer a process survives, the less likely it is to be killed)
  • CPU time consumption (the larger the CPU consumption of a process, the more likely it is to be killed)
  • Memory consumption (the larger the memory consumption of a process, the more likely it is to be killed)

These parameters combined determine which process will be killed.

In our scenario, it was worker-1 that was killed. This indicates that worker-1 had a high score.
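
If you want to see how the scores stack up at any given moment, you can dump them straight from /proc; a minimal sketch, run on the machine in question:

# List the ten processes with the highest OOM scores, i.e. the most likely OOM victims.
for d in /proc/[0-9]*; do
  printf "%s\t%s\t%s\t%s\n" "${d#/proc/}" \
    "$(cat "$d/oom_score" 2>/dev/null)" \
    "$(cat "$d/oom_score_adj" 2>/dev/null)" \
    "$(cat "$d/comm" 2>/dev/null)"
done | sort -t$'\t' -k2 -nr | head -10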

Because there was also higher memory consumption on worker-1 earlier, let’s check how many pods are running on worker-1 and worker-2:

[root@k8s-master-1 ~]# kubectl get pods -o wide --all-namespaces| grep worker-2
default                cloud-nacos-registry-76845b5cfb-bnj76        1/1     Running            0          9h     10.100.140.8     k8s-worker-2   <none>           <none>
default                sample-webapp-755fq                          0/1     ImagePullBackOff   0          19h    10.100.140.7     k8s-worker-2   <none>           <none>
default                skywalking-es-init-4w44r                     0/1     Completed          0          15h    10.100.140.11    k8s-worker-2   <none>           <none>
default                skywalking-ui-7d7754576b-nj7sf               1/1     Running            0          9h     10.100.140.14    k8s-worker-2   <none>           <none>
default                svc-mall-auth-6ccf9fd7c9-qh7j8               1/1     Running            0          151m   10.100.140.21    k8s-worker-2   <none>           <none>
default                svc-mall-auth-6ccf9fd7c9-sblzx               1/1     Running            0          151m   10.100.140.23    k8s-worker-2   <none>           <none>
default                svc-mall-member-df566595c-9zq9k              1/1     Running            0          151m   10.100.140.19    k8s-worker-2   <none>           <none>
default                svc-mall-member-df566595c-dmj67              1/1     Running            0          151m   10.100.140.22    k8s-worker-2   <none>           <none>
kube-system            calico-node-pwsqt                            1/1     Running            8          37d    172.16.106.149   k8s-worker-2   <none>           <none>
kube-system            kube-proxy-l8xf9                             1/1     Running            15         85d    172.16.106.149   k8s-worker-2   <none>           <none>
monitoring             node-exporter-wcsj7                          2/2     Running            18         42d    172.16.106.149   k8s-worker-2   <none>           <none>
nginx-ingress          nginx-ingress-7jjv2                          1/1     Running            0          18h    10.100.140.62    k8s-worker-2   <none>           <none>
[root@k8s-master-1 ~]# kubectl get pods -o wide --all-namespaces| grep worker-1
default                mysql-min-c4f8d4599-fxwf4                    1/1     Running            0          9h     10.100.230.9     k8s-worker-1   <none>           <none>
kube-system            calico-node-tmpfl                            1/1     Running            8          37d    172.16.106.130   k8s-worker-1   <none>           <none>
kube-system            kube-proxy-fr22f                             1/1     Running            13         85d    172.16.106.130   k8s-worker-1   <none>           <none>
monitoring             alertmanager-main-0                          2/2     Running            0          162m   10.100.230.12    k8s-worker-1   <none>           <none>
monitoring             node-exporter-222c5                          2/2     Running            10         7d     172.16.106.130   k8s-worker-1   <none>           <none>
nginx-ingress          nginx-ingress-pjrkw                          1/1     Running            1          18h    10.100.230.10    k8s-worker-1   <none>           <none>
[root@k8s-master-1 ~]# 

Let’s further check the memory usage of the frequently used Pods to see how they are doing:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                  
 7609 27        20   0   12.4g   7.0g  12896 S 118.9 45.0 167:38.02 /opt/rh/rh-mysql57/root/usr/libexec/mysqld --defaults-file=/etc/my.cnf  

Looking at the processes on worker-1, we found that MySQL uses the most memory there; it is a heavy memory consumer, so it is entirely possible for worker-1 to score high enough to be killed when the host runs short of memory.

Next, let's add a few temporary monitors to record the memory usage of some important services, such as Gateway, Member, MySQL, and Redis; a minimal sketch of how such a monitor might be scripted follows.
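
This sketch assumes metrics-server is installed so that kubectl top works; the /data/stability path and the 60-second interval are illustrative choices rather than this course's actual tooling.

# Every minute, record node- and pod-level memory usage, sorted by memory, for later analysis.
mkdir -p /data/stability
nohup bash -c 'while true; do
  {
    date
    kubectl top nodes
    kubectl top pods -A --sort-by=memory | head -20
  } >> /data/stability/mem_watch.log
  sleep 60
done' >/dev/null 2>&1 &

With the monitors in place, we restore all the applications and run the scenario again to see the results: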

memory

This run lasted almost seven hours. You may wonder why the previous scenario ran for only a little over four hours while this one managed seven. The reason is that worker-1 had been killed and the virtual machine restarted, so its state was reset.

Before the previous run we had not restarted the virtual machines, so a certain amount of memory had already been consumed. And in a stability scenario there are all kinds of CRUD operations and the data volume keeps growing, so memory usage climbs over time.

This time, the accumulated business volume is more than 32 million:

business volume

However, a problem occurred again: checking the host's logs, I found that this time it was worker-2 that got killed:

Feb 20 19:42:44 hp-server kernel: Out of memory: Kill process 7603 (qemu-kvm) score 257 or sacrifice child
Feb 20 19:42:44 hp-server kernel: Killed process 7603 (qemu-kvm), UID 107, total-vm:17798976kB, anon-rss:16870472kB, file-rss:0kB, shmem-rss:16kB
Feb 20 19:42:46 hp-server kernel: br0: port 5(vnet3) entered disabled state
Feb 20 19:42:46 hp-server kernel: device vnet3 left promiscuous mode
Feb 20 19:42:46 hp-server kernel: br0: port 5(vnet3) entered disabled state
Feb 20 19:42:46 hp-server systemd-machined: Machine qemu-4-vm-k8s-worker-2 terminated.
Feb 20 19:42:46 hp-server avahi-daemon[953]: Withdrawing address record for fe80::fc54:ff:fe5e:dded on vnet3.
Feb 20 19:42:46 hp-server avahi-daemon[953]: Withdrawing workstation service for vnet3.
[root@hp-server log]# 

In other words, when memory runs short, which worker gets killed is not fixed. At the very least this tells us that the host machine really is killing virtual machines because it does not have enough memory, and the problem is probably not tied to a specific component, since each component's memory consumption is driven by its runtime needs and is reasonable.

Why make this judgment? Because if a fixed worker were always the one killed, we could monitor the technical components running on that worker, see which component's memory grows fastest, and then work out why that component's memory keeps increasing.

But the worker that gets killed is not fixed. According to the OOM logic, the host operating system only invokes the OOM killer when it runs out of memory. As we mentioned earlier, the overcommit parameter is set to 1, which means the host grants memory allocation requests regardless of its current memory state.

The trouble starts when the guests actually use that memory and the host cannot back it: a virtual machine gets killed. In other words, when the host created the KVM guests it overcommitted memory that it cannot actually provide. Under sustained pressure the virtual machines really do need this memory, so they keep asking the host for it, the host runs out, and the OOM killer is triggered.

In this case, we need to work out how much memory has been overcommitted, to see whether the overcommitment we configured is too large and is causing the problem. Let's list the memory allocated to each virtual machine:

![](../images/bb13f78e31734e089dae8cfcde0ec6cb.jpg)

Let’s calculate the total allocated memory:

Total Allocated Memory = 8 * 2 + 16 * 4 = 80G
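
The same total can be pulled straight from libvirt; a minimal sketch, assuming virsh dominfo reports "Max memory" in KiB:

# Sum the memory allocated to all running guests, in GiB.
total_kib=0
for vm in $(virsh list --name); do
  kib=$(virsh dominfo "$vm" | awk '/Max memory/ {print $3}')
  total_kib=$((total_kib + kib))
done
echo "Total allocated to guests: $((total_kib / 1024 / 1024)) GiB"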

However, the host machine’s physical memory is:

[root@hp-server log]# cat /proc/meminfo|grep Total
MemTotal:       65675952 kB
SwapTotal:             0 kB
VmallocTotal:   34359738367 kB
CmaTotal:              0 kB
HugePages_Total:       0
[root@hp-server log]#

In other words, the host machine has only about 65 GB of physical memory in total. So it is no surprise that memory runs out once the guests actually start using what they were promised.

Now let's reduce the memory of the virtual machines so that the host is no longer overcommitted. The new configuration is as follows:

![](../images/67e1d2bb4edd45e1a7847be99d60276d.jpg)

The total allocated memory is calculated as follows:

Total Allocated Memory = 4 * 2 + 13 * 4 = 60G

A total of 60 GB fits within the host's physical memory and leaves a little headroom for the host itself, so this should be sufficient.
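
One way to apply the new sizes is with virsh; a minimal sketch that follows the table above (the changes take effect after each guest is restarted, and virsh edit would work just as well):

# Shrink the guests' memory allocations so the host is no longer overcommitted.
for vm in vm-k8s-master-1 vm-k8s-master-3; do
  virsh setmaxmem "$vm" 4G  --config
  virsh setmem    "$vm" 4G  --config
done
for vm in vm-k8s-worker-1 vm-k8s-worker-2 vm-k8s-worker-3 vm-k8s-worker-4; do
  virsh setmaxmem "$vm" 13G --config
  virsh setmem    "$vm" 13G --config
done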

However, by the time-space trade-off principle in performance analysis, this may cause TPS to drop: with less memory inside the guest operating systems, memory reclamation and paging kick in earlier, producing more page faults. But as long as it is only paging and not an OOM kill, at least the virtual machines will not be killed.

Let’s run the scenario again and see the results:

![](../images/00bd16b4afca4e0cb0cdcc7e0056693d.jpg)

This result looks good. Although TPS dips at times, it is stable overall, and the run lasted more than 12 hours.

Let’s look at the accumulated business volume:

![](../images/c983c6a855af41779b67e663d94bb856.jpg)

This time the accumulated business volume exceeded 72 million, well past our small goal of 50 million. Can we celebrate now?

Don’t celebrate too soon. In the next lesson, you will experience the ups and downs of performance projects.

Summary #

Today, we discussed two key points about stability scenarios: runtime duration and stress level. To make stability scenarios meaningful, these two points are essential prerequisites.

Also keep in mind that stability scenarios aim to expose the issues that arise as business volume accumulates. If the accumulated volume does not reach what production requires, the stability scenario cannot be considered meaningful.

In addition, in this lesson we analyzed a problem caused by physical memory being exhausted on the host. When it comes to memory usage, especially in this Kubernetes plus Docker architecture, resource allocation is crucial. Don't assume that because Kubernetes handles a lot of the allocation work automatically we can just sit back and relax; you will find that plenty of new challenges are still waiting for us.

Homework #

That’s all for today’s content. Let me leave you with two reflection questions:

  1. How can you implement the stability concepts learned in this lesson in your project?
  2. When troubleshooting stability issues, how do you design your monitoring so that you can collect enough data for analysis? How do you handle this in your project?

Remember to discuss and exchange your ideas with me in the comments section. Each reflection will help you make further progress.

If you have gained something from this lesson, feel free to share it with your friends and learn and progress together. See you in the next lesson!