
29 Abnormal Scenarios: How to Simulate Exceptions at Different Component Levels #

Hi, I’m Gao Lou.

In the previous lesson, we discussed that it is not possible to cover every detailed exception scenario in a specific project. Besides the high cost, the benefits may not be significant.

However, when determining the scope of exception scenarios and designing them, we still need to ensure that all components at every level of the architecture are covered and not overlooked. This requires the designer of the exception scenarios to have a sufficient understanding of the architecture, which is also a difficult aspect of the design.

In the current technology market, application exceptions, operating system exceptions, container exceptions, and virtual machine exceptions are the exception scenarios people most often consider. They also cover the most common exception problems in a distributed microservices architecture.

Therefore, in this lesson, I will guide you to address some design challenges starting from these exception scenarios.

Application Exception #

For application-level exceptions, I will use two examples to explain: flow control, and degradation with circuit breaking. Traditional exception-scenario design did not include these two, but in today's fast-moving microservices landscape, flow control and degradation/circuit breaking are essential.

Here, I choose to use the Sentinel tool (if you are not familiar with the tool itself, please search for resources online). It is a traffic protection component designed for distributed service architecture. It mainly helps developers ensure the stability of microservices from multiple dimensions such as flow control, circuit breaking and degradation, system load protection, and hotspot protection.

Of course, you can also use other tools to achieve the same effect, as long as they conform to the design we discussed in the previous lesson.

Flow Control #

For flow control, we first need to know the amount of traffic in the system. In our exception scenario, we can see the traffic of each service through real-time monitoring.

In our architecture, the Order service is obviously a key service, so I will use the Order service for demonstration.

To make flow control take effect, the flow-control threshold must be set lower than the TPS the load can generate. From the optimization results in Lesson 21, we know that querying the order list before payment can reach 700-800 TPS, so I set the flow-control threshold to 100. The rule configuration is as follows:

After configuring, let’s take a look at the real-time traffic of the Order service:

As you can see, the QPS that the Order service lets through is indeed capped at 100, the threshold we set in the flow-control rule. Note that QPS in this tool means requests per second, and since in the pressure tool I use one request per transaction, you can treat QPS as TPS here.

Looking at the values of “Passed QPS” and “Rejected QPS” in the above image, we can see that only about 20% of requests have passed, and the rest have been rejected. Corresponding to the pressure tool, the TPS graph of querying the order list before payment is as follows:

As you can see, the TPS has dropped significantly. However, we also see a large number of errors. At this point, we need to pay attention to whether these errors are reasonable.

An end user should see a friendly prompt such as “The system is busy, please try again later” rather than a raw HTTP error code. So if the errors come from assertions in our script, we need to adjust what the assertions check: in my script I only assert on HTTP 200, so any other response code is counted as an error, which is why the Errors graph shows so many errors.

If you want to handle such errors, you can add logic in the code to return friendly responses. But now we are analyzing performance issues, so we only need to provide optimization suggestions to the developers for this feature.
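
If you want to see for yourself what a rejected request actually returns to the caller, a quick command-line check is enough. This is only a sketch: the URL below is a placeholder for the order-list endpoint in your environment.

    # fire a few requests while the flow-control rule is active and count the returned status codes
    # (the URL is a placeholder; replace it with the real order-list endpoint)
    for i in $(seq 1 20); do
      curl -s -o /dev/null -w '%{http_code}\n' 'http://127.0.0.1:8080/order/list?pageNum=1&pageSize=10'
    done | sort | uniq -c

If everything that comes back is a bare 4xx/5xx status with no friendly message, that is exactly the kind of finding to pass on to the developers.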

What about system resources?

As we can see, the usage of system resources has also decreased. This achieves the effect of flow control, and at least we ensure that this service will not crash.

Looking at the corresponding data in Sentinel, once we delete the flow-control rule, the requests recover:

So, in our case, flow control is effective, and the effect is quite obvious.

Degradation Circuit Breaking #

Let’s take a look at a case study on degradation and circuit breaking.

I have set the maximum response time for the Portal service to 10ms. Please note that in this case study, I am using the “open homepage” business, and the response time for opening the homepage is always greater than 10ms. So, let’s see if the degradation rule is effective.

First, let’s configure the degradation rule. The main parameters are:

  • Resource name, which is the service name.
  • Maximum response time (RT).
  • Slow request ratio threshold, when the ratio of slow requests exceeds the threshold, circuit breaking will be triggered.
  • Circuit breaking duration: how long the circuit stays open; once this time has passed, requests are let through again and TPS can recover.
  • Minimum number of requests: the minimum number of requests within the statistics window before circuit breaking can be triggered at all.
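
Before running the scenario, it can also help to confirm that the rule has actually reached the service instance. A minimal sketch, assuming the Sentinel transport (command-center) module is on the classpath and listening on its default port 8719; both the port and the availability of this endpoint depend on your Sentinel setup.

    # ask the Sentinel client embedded in the Portal service which degrade rules it currently holds
    # (8719 is Sentinel's default command port; change host/port to match your deployment)
    curl 'http://127.0.0.1:8719/getRules?type=degrade'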

img1

Next, let’s run the stress test for opening the homepage and keep it running for a while to see what the TPS curve looks like:

img2

The result is clear. After running the scenario for a while, because the response time is greater than the maximum response time of 10ms set in the degradation rule, all requests are subsequently circuit broken for 10 seconds. During these 10 seconds, TPS drops to zero and errors are reported.

Now let’s look at the TPS curve in Sentinel to see if it matches the TPS curve in the previous graph:

img3

As we can see, the TPS curve in Sentinel is consistent with the TPS curve shown in the previous image, indicating that the degradation and circuit breaking rules are indeed taking effect. Because there are no requests with a response time less than 10ms, the system stays in the circuit-breaking state and keeps reporting errors. After deleting the rules, the TPS recovers.

Now let's raise the maximum response time to 30ms, since the average response time for opening the homepage is below 30ms, and see how degradation and circuit breaking behave in this case. Note that the circuit-breaking duration is still set to 10 seconds.

img4

Now let’s look at the corresponding TPS graph:

img5

The errors occur intermittently: TPS drops at times and recovers at others. Whenever the response time for opening the homepage exceeds 30ms, the circuit opens and TPS is cut off for 10 seconds; when the circuit closes again, it detects new requests slower than 30ms and opens once more. That is why we see this pattern.

This indicates that the degradation and circuit breaking rules we set are effective.

In the two examples above, flow control and degradation/circuit breaking, there are two things you need to judge:

  1. Whether the rules are effective.
  2. What the end user sees. If it is not a user-friendly message, you can report a bug.

Now let’s simulate some operating system-level exceptions.

Operating System Level Exception #

We know that operating systems exist at several levels: on physical machines, on virtual machines, and, in some enterprises, even inside Pods that run a full OS. Here, we simulate exceptions at the virtual-machine level (that is, on our worker machines). If your project calls for complete coverage, you can apply the same logic to each operating-system level.

Here I use CPU, memory, network, and disk to simulate exceptions in the operating system because these are the most important resources in the operating system.

CPU Exception #

Let’s start with CPU exceptions.

Please note that when simulating CPU exceptions, we must know which aspect we are simulating.

If you want to simulate high CPU consumption in the application itself, you need to modify the code. If the CPU is already high without modifying the code, that is an obvious bug; we already covered how to handle such bugs in Lesson 22, so you can review it there.

There are two cases for simulating CPU exceptions:

  1. Simulating exceptions of threads competing for CPU in the application;
  2. Other processes on the same machine competing for CPU with the tested business process.

Here, let’s simulate the exception of CPU being occupied by other processes.

First, let’s check the current CPU consumption:

%Cpu0  : 46.4 us,  2.7 sy,  0.0 ni, 48.8 id,  0.0 wa,  0.0 hi,  2.0 si,  0.0 st
%Cpu1  : 29.4 us,  4.2 sy,  0.0 ni, 64.0 id,  0.0 wa,  0.0 hi,  2.4 si,  0.0 st
%Cpu2  : 37.8 us,  3.8 sy,  0.0 ni, 55.6 id,  0.0 wa,  0.0 hi,  2.4 si,  0.3 st
%Cpu3  : 26.0 us,  4.6 sy,  0.0 ni, 67.4 id,  0.0 wa,  0.0 hi,  1.8 si,  0.4 st
%Cpu4  : 33.7 us,  4.8 sy,  0.0 ni, 59.1 id,  0.0 wa,  0.0 hi,  2.4 si,  0.0 st
%Cpu5  : 29.9 us,  3.8 sy,  0.0 ni, 63.6 id,  0.0 wa,  0.0 hi,  2.7 si,  0.0 st

From the above data, we can see that under the current stress scenario, the us CPU usage is around 30%, and id CPU usage is around 60%. Obviously, the operating system still has idle CPU resources.

Next, we can use the stress command to simulate CPU exhaustion. I plan to occupy all 6 CPUs:

stress -c 6 -t 100
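
If stress happens not to be installed on the worker, a plain bash busy loop gives roughly the same effect; here is a sketch that spins one loop per CPU and stops after 100 seconds.

    # spin up 6 busy loops (one per CPU) and let timeout stop them after 100 seconds
    for i in $(seq 1 6); do
      timeout 100 bash -c 'while :; do :; done' &
    done
    wait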

Then, let’s use the top command to see the effect:

%Cpu0  : 97.3 us,  2.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu1  : 93.5 us,  2.4 sy,  0.0 ni,  2.4 id,  0.0 wa,  0.0 hi,  1.7 si,  0.0 st
%Cpu2  : 98.0 us,  1.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu3  : 98.0 us,  1.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  1.0 si,  0.0 st
%Cpu4  : 97.7 us,  1.3 sy,  0.0 ni,  0.3 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
%Cpu5  : 94.2 us,  3.1 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  2.7 si,  0.0 st

As you can see, the us CPU usage is very high!

Let’s execute the vmstat command again to compare the data before and after the simulation:

Before simulation:
[root@k8s-worker-6 ~]# vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
10  0      0 6804936    140 4640292    0    0     1     5    1    2 12  3 85  0  0
 3  0      0 6806228    140 4640336    0    0     0     0 12290 15879 21  5 74  0  0
 1  0      0 6806972    140 4640336    0    0     0     0 11070 13751 24  5 71  0  0
 1  0      0 6808124    140 4640416    0    0     0     9 10944 13165 27  5 68  0  0
 6  0      0 6806400    140 4640504    0    0     0     0 11591 14836 24  6 71  0  0
11  0      0 6801328    140 4640516    0    0     0     0 11409 13859 31  6 63  0  0
During the simulation:
[root@k8s-worker-6 ~]# vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
27  0      0 7072940    140 4363564    0    0     1     5    1    2 12  3 85  0  0
30  0      0 7072244    140 4363620    0    0     0     0 10523 6329 97  3  0  0  0
40  0      0 7052732    140 4363584    0    0   472   176 11478 8399 95  5  0  0  0
40  0      0 7070636    140 4363660    0    0     0     0 9881 6546 98  2  0  0  0
28  0      0 7074060    140 4363676    0    0     0     0 9919 6520 98  2  0  0  0
38  0      0 7074180    140 4363688    0    0     0     0 10801 7946 97  3  0  0  0
34  0      0 7074228    140 4363692    0    0     0     0 10464 6298 97  3  0  0  0

As you can see, the us CPU usage is high and the CPU run queue (the r column) has grown a lot. The in value hasn't changed much, while cs has dropped significantly. This shows that we haven't simulated real CPU contention inside the application; we have simply consumed CPU resources with another process.

At this point, the graphs in the stress tool look like this:

From the TPS and response time, we can see that the business is indeed slower. However, there are no errors. This scenario is a typical case of the application being slowed down due to insufficient CPU.

Memory Exception #

Memory exceptions are also a major focus of performance analysis. Let's simulate one with the following command, which starts 30 memory workers, each repeatedly allocating 10G (the --vm-bytes value applies per worker) and holding it for 50 seconds, with the whole run capped at 50 seconds:

stress --vm 30 --vm-bytes 10G --vm-hang 50 --timeout 50s
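
While the command runs, you can watch the memory being taken and released from another terminal; either of the following works on most distributions.

    # print memory statistics in megabytes once per second during the stress run
    vmstat -S M 1
    # or refresh free's output every second
    watch -n 1 free -m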

During the stress period, I executed this command twice. Let’s take a look at the TPS graph:

As you can see, the application didn’t report any errors, but the response time increased significantly.

From the examples of CPU anomalies and memory anomalies, you should easily notice that the operating system is quite resilient. Even when resources are scarce, it still tries its best to serve you.

So what are the next steps for these anomalies? First, we need to identify the problem areas and then resolve them. As for the specific analysis steps, we described the RESAR performance engineering analysis logic in section 3.

If you encounter CPU or memory issues in a production environment, the most important question is whether the system can recover quickly. So if, during the simulation, CPU or memory consumption is high, TPS drops or response time rises, and neither recovers before the simulation ends, you can report a bug, because what we expect is for the business to recover.

Network Exception #

Networking is a complex topic with many details. Still, anyone doing performance analysis must understand it, not necessarily in depth, but well enough to judge where a problem lies. I’m going to discuss two common network exceptions: packet loss and latency.

Let me explain first: I use the operating system's tc command to simulate packet loss and latency because it is the simplest and most direct method. Many chaos-engineering tools wrap this same command, and using it directly means there is nothing extra to install, which makes it convenient and fast.

Packet Loss #

Let’s simulate a packet loss of 10% first:

tc qdisc add dev eth0 root netem loss 10%

Then, let’s check the corresponding stress tool curve:

Stress Tool Curve

As you can see, even though only 10% of the packets were lost during the simulation, the TPS dropped from around 200 to around 80.

When the TCP layer detects packet loss, it triggers retransmission according to the TCP retransmission mechanism (you can search for more information about this). If we capture packets at this time, we will see retransmitted packets.

In this case, the response time will increase, but the business has not reached the error level yet, as we can see from the figure above.

To monitor the current network health status, we can use the ping command during the packet loss simulation:

C:\Users\Zee>ping 172.16.106.79 -t

Pinging 172.16.106.79 with 32 bytes of data:
Reply from 172.16.106.79: bytes=32 time=79ms TTL=62
Reply from 172.16.106.79: bytes=32 time=57ms TTL=62
Reply from 172.16.106.79: bytes=32 time=74ms TTL=62
Reply from 172.16.106.79: bytes=32 time=60ms TTL=62
Reply from 172.16.106.79: bytes=32 time=55ms TTL=62
Request timed out.
Request timed out.
Reply from 172.16.106.79: bytes=32 time=71ms TTL=62
Reply from 172.16.106.79: bytes=32 time=75ms TTL=62
Reply from 172.16.106.79: bytes=32 time=71ms TTL=62
Reply from 172.16.106.79: bytes=32 time=71ms TTL=62
Reply from 172.16.106.79: bytes=32 time=62ms TTL=62
Request timed out.
Reply from 172.16.106.79: bytes=32 time=51ms TTL=62
Reply from 172.16.106.79: bytes=32 time=64ms TTL=62
Reply from 172.16.106.79: bytes=32 time=74ms TTL=62
Reply from 172.16.106.79: bytes=32 time=83ms TTL=62
Reply from 172.16.106.79: bytes=32 time=69ms TTL=62

Clearly, packet loss is occurring, right? Logically speaking, packet loss and retransmission will cause a decrease in TPS and an increase in response time, but no errors will occur, thanks to TCP.

However, if the business does not recover on its own during the simulation, you should report a bug: a mature architecture should be able to identify the nodes that are dropping packets and steer traffic away from them, and this is exactly what the cluster's strategy should guarantee.
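
One practical note before moving on: remember to clean up the netem rule after the simulation, otherwise the packet loss stays in place. The interface name eth0 below simply follows the command used above; substitute your actual NIC name.

    # check which qdisc is currently installed on the interface
    tc qdisc show dev eth0
    # remove the netem packet-loss rule once the simulation is finished
    tc qdisc del dev eth0 root netem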

Latency #

Latency issues are common: heavy traffic or contention for network-device resources can both introduce delay. When latency is high, check network devices such as routers and switches to see whether they are the source, and don't forget firewalls, which may also be configured with rules that deliberately add delay.

Here, we add 100ms of delay to the network on the local machine to simulate network latency. The command for simulating network latency is as follows:

tc qdisc add dev eth0 root netem delay 100ms
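
If you want the simulated delay to look more like a real network, netem can also add jitter. Since the command above has already installed a netem qdisc, use change instead of add; eth0 is again just the interface assumed here.

    # turn the fixed 100ms delay into 100ms +/- 20ms of jitter on the existing netem qdisc
    tc qdisc change dev eth0 root netem delay 100ms 20ms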


Determining whether the latency is in effect is easy: just use the ping command. The output will look like this:
    64 bytes from 172.16.106.79: icmp_seq=73 ttl=64 time=0.234 ms
    64 bytes from 172.16.106.79: icmp_seq=74 ttl=64 time=0.259 ms
    64 bytes from 172.16.106.79: icmp_seq=75 ttl=64 time=0.280 ms
    64 bytes from 172.16.106.79: icmp_seq=76 ttl=64 time=0.312 ms
    64 bytes from 172.16.106.79: icmp_seq=77 ttl=64 time=0.277 ms
    64 bytes from 172.16.106.79: icmp_seq=78 ttl=64 time=0.231 ms
    64 bytes from 172.16.106.79: icmp_seq=79 ttl=64 time=0.237 ms
    64 bytes from 172.16.106.79: icmp_seq=80 ttl=64 time=100 ms
    64 bytes from 172.16.106.79: icmp_seq=81 ttl=64 time=100 ms
    64 bytes from 172.16.106.79: icmp_seq=82 ttl=64 time=100 ms
    64 bytes from 172.16.106.79: icmp_seq=83 ttl=64 time=100 ms
    64 bytes from 172.16.106.79: icmp_seq=84 ttl=64 time=100 ms
    64 bytes from 172.16.106.79: icmp_seq=85 ttl=64 time=100 ms
    64 bytes from 172.16.106.79: icmp_seq=86 ttl=64 time=100 ms
    64 bytes from 172.16.106.79: icmp_seq=87 ttl=64 time=100 ms
    

Do you see it? The ping time has directly jumped to 100ms, which is consistent with the latency added to the network.

After the latency appears, the TPS graph of the entire system looks like this:

![](../images/3ccc6e226dd642c79066f472eb8fd310.jpg)

It can be seen clearly that network latency leads to a decrease in TPS and an increase in response time, and the impact is very obvious and direct: we only simulated a 100ms delay, and the response time increased by tens of times.

For the impact of network latency on the business, the coping strategy is still rapid recovery. Here we need to check whether the network has standby resources; if the standby resources do not take over during the simulated period, you can report a bug.

Disk Exception #

There are many tools for simulating disk exceptions. I prefer fio for its simplicity and convenience, so I will use it to simulate a heavy random-write load:

fio --filename=fio.tmp --direct=1 --rw=randwrite --bs=4k --size=1G --numjobs=64 --runtime=100 --group_reporting --name=test-rand-write


Next, let's see if the wa CPU has increased in top:

%Cpu0  : 46.2 us,  4.3 sy,  0.0 ni,  2.4 id, 46.6 wa,  0.0 hi,  0.5 si,  0.0 st
%Cpu1  : 15.5 us,  8.3 sy,  0.0 ni,  2.9 id, 70.9 wa,  0.0 hi,  1.9 si,  0.5 st
%Cpu2  : 13.8 us,  6.9 sy,  0.0 ni,  3.4 id, 74.4 wa,  0.0 hi,  1.5 si,  0.0 st
%Cpu3  : 24.1 us,  7.9 sy,  0.0 ni,  0.0 id, 67.5 wa,  0.0 hi,  0.5 si,  0.0 st
%Cpu4  : 27.1 us,  6.4 sy,  0.0 ni,  0.0 id, 65.5 wa,  0.0 hi,  1.0 si,  0.0 st
%Cpu5  : 19.8 us,  5.9 sy,  0.0 ni,  3.5 id, 69.8 wa,  0.0 hi,  1.0 si,  0.0 st


From the data above, the wa CPU has reached around 70%, which is what we want.
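
Besides top's wa column, it is worth confirming that the device itself is saturated. iostat, from the sysstat package, shows that directly.

    # report extended per-device statistics once per second; watch the %util and await columns
    # for the disk that fio is writing to
    iostat -x -d 1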

Let's take a look at the corresponding TPS graph:

![](../images/151fd9c7198d410fafaea521aed231a8.jpg)

As you can see, the TPS curve drops in the middle, and the response time tends to increase, but no errors are reported.

Although the response time increased, note that the simulation ran for longer than the dip shown in the TPS graph. This indicates that our application does not depend heavily on disk IO. Think about it: the application writes its logs asynchronously, so how strong could that dependency be?

Here I want to clarify that the wa CPU is not actually consumed; it is essentially idle CPU, and the percentage only records the proportion of CPU time slices spent waiting for IO. So although wa looks high, other applications that need CPU can still use those cycles.

For operating system-level abnormalities, we have demonstrated using CPU, memory, network, and disk, which are the most important system resources. Based on this approach, you can make more extensions in specific projects, and there will be many abnormal scenarios that can be designed.

Container Exception #

For the popular Kubernetes+container architecture in the current technology market, it is unacceptable not to handle container-level exceptions.

As we know, the base images of containers vary depending on which image you are using. But let's not worry about that here and instead simulate from the perspective of operating on containers, because when a container runs into an exception, Kubernetes essentially acts on the container as a whole rather than making fine-grained adjustments inside it.

Here, let's look at two operations that Kubernetes commonly performs on containers: killing them and evicting them.

Killing Containers #

To facilitate the operation, let's first assign both portal instances to a single worker.
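
How you pin the two instances onto one worker depends on your cluster. A minimal sketch using a nodeSelector patch, assuming the Deployment is named svc-mall-portal (the name is inferred from the pod names below and may differ in your environment):

    # schedule the portal deployment onto worker-6 only (deployment name assumed)
    kubectl patch deployment svc-mall-portal \
      -p '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/hostname":"k8s-worker-6"}}}}}'
    # make sure two replicas are running
    kubectl scale deployment svc-mall-portal --replicas=2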

Then let's confirm that both portal instances are indeed on this worker, so that we have a baseline to compare against later.

[root@k8s-worker-6 ~]# docker ps |grep portal
c39df7dc8b1b        243a962aa179                             "java -Dapp.id=svc-m…"   About a minute ago   Up About a minute   k8s_mall-portal_svc-mall-portal-5845fcd577-dddlp_default_4ccb1155-5521-460a-b96e-e2a22a82f5ee_0
4be31b5e728b        registry.aliyuncs.com/k8sxio/pause:3.2   "/pause"                 About a minute ago   Up About a minute   k8s_POD_svc-mall-portal-5845fcd577-dddlp_default_4ccb1155-5521-460a-b96e-e2a22a82f5ee_0
c9faa33744e0        243a962aa179                             "java -Dapp.id=svc-m…"   About a minute ago   Up About a minute   k8s_mall-portal_svc-mall-portal-5845fcd577-cx5th_default_06117072-9fe2-4882-8939-3a313cf1b3ad_0
7b876dd6b860        registry.aliyuncs.com/k8sxio/pause:3.2   "/pause"                 About a minute ago   Up About a minute   k8s_POD_svc-mall-portal-5845fcd577-cx5th_default_06117072-9fe2-4882-8939-3a313cf1b3ad_0


You see, there are indeed these two portal instances on this worker.

Now let's kill one pod and see how Kubernetes reacts.

[root@k8s-worker-6 ~]# docker kill -s KILL c39df7dc8b1b
c39df7dc8b1b


Next, let's check the container IDs of the Portal PODs again:

[root@k8s-worker-6 ~]# docker ps |grep portal
080b1e4bd3b3        243a962aa179                             "java -Dapp.id=svc-m…"   58 seconds ago       Up 57 seconds       k8s_mall-portal_svc-mall-portal-5845fcd577-dddlp_default_4ccb1155-5521-460a-b96e-e2a22a82f5ee_1
4be31b5e728b        registry.aliyuncs.com/k8sxio/pause:3.2   "/pause"                 4 minutes ago        Up 4 minutes        k8s_POD_svc-mall-portal-5845fcd577-dddlp_default_4ccb1155-5521-460a-b96e-e2a22a82f5ee_0
c9faa33744e0        243a962aa179                             "java -Dapp.id=svc-m…"   4 minutes ago        Up 4 minutes        k8s_mall-portal_svc-mall-portal-5845fcd577-cx5th_default_06117072-9fe2-4882-8939-3a313cf1b3ad_0
7b876dd6b860        registry.aliyuncs.com/k8sxio/pause:3.2   "/pause"                 4 minutes ago        Up 4 minutes        k8s_POD_svc-mall-portal-5845fcd577-cx5th_default_06117072-9fe2-4882-8939-3a313cf1b3ad_0
[root@k8s-worker-6 ~]#


It is not difficult to see that one container ID has changed and its restart counter at the end of the name went from _0 to _1, indicating that Kubernetes has automatically restarted the killed container.
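
You can watch the same thing from the Kubernetes side instead of docker ps; the RESTARTS column of the affected pod should increase by one.

    # watch the portal pods while the container is being killed and restarted
    kubectl get pods -o wide -w | grep portal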

The corresponding TPS effect is as follows:

![](../images/bef3d332493b438286113865210db55d.jpg)

Because there are two Portal instances, the TPS did not drop to zero, which means the other POD was able to take over the traffic. So although the business reported errors, it recovered quickly. Note that by "recovered" I mean the traffic was taken over by the other container, not that the killed container had already finished starting up.

To verify Kubernetes' ability to handle the restart of exceptional PODs, let's directly kill both portal PODs to try it out.

[root@k8s-worker-6 ~]# docker kill -s KILL 080b1e4bd3b3 c9faa33744e0

    080b1e4bd3b3
    c9faa33744e0
    [root@k8s-worker-6 ~]# docker ps |grep portal
    d896adf1a85e        243a962aa179                             "java -Dapp.id=svc-m…"   About a minute ago   Up About a minute                       k8s_mall-portal_svc-mall-portal-5845fcd577-dddlp_default_4ccb1155-5521-460a-b96e-e2a22a82f5ee_2
    baee61034b5f        243a962aa179                             "java -Dapp.id=svc-m…"   About a minute ago   Up About a minute                       k8s_mall-portal_svc-mall-portal-5845fcd577-cx5th_default_06117072-9fe2-4882-8939-3a313cf1b3ad_1
    4be31b5e728b        registry.aliyuncs.com/k8sxio/pause:3.2   "/pause"                 7 minutes ago        Up 7 minutes                            k8s_POD_svc-mall-portal-5845fcd577-dddlp_default_4ccb1155-5521-460a-b96e-e2a22a82f5ee_0
    7b876dd6b860        registry.aliyuncs.com/k8sxio/pause:3.2   "/pause"                 7 minutes ago        Up 7 minutes                            k8s_POD_svc-mall-portal-5845fcd577-cx5th_default_06117072-9fe2-4882-8939-3a313cf1b3ad_0
    [root@k8s-worker-6 ~]# 

Let's take a look at the corresponding TPS curve:

![](../images/e30caa2321364a31b9d04346b3ec020d.jpg)

Now the effect is obvious: because I started only two instances of the Portal service, killing both Portal PODs made the business report errors immediately, and it took about 1 minute and 30 seconds to recover. Whether that recovery time is acceptable depends on the business's success-rate indicator.

In this example, we see that the container can be automatically restored, indicating that Kubernetes is working, and we only need to pay attention to whether the recovery time meets the success rate indicator of the business.

Evicting Containers #

"Container eviction" is a common problem in Kubernetes when resources are insufficient.

Now, I will directly click on "Evict" in the container management tool to simulate the scenario.

To show that the POD has indeed been moved to another worker after eviction, before simulating, let's first determine the current status of the Order service:

![](../images/ddbb7a0ea7a045acada6a5bd6c91bb46.jpg)

As you can see, this service is in a normal Running state.

Then, let's simulate the eviction of containers. We only need to find the container in the Kubernetes management interface and click the "Evict" button directly.
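
If you are not using a management UI, a rough command-line equivalent of the Evict button is to cordon the node and drain its pods. Note that drain evicts everything on the node, not just one pod, and the exact flags vary slightly across kubectl versions.

    # stop new pods from being scheduled onto the node, then evict the pods already on it
    kubectl cordon k8s-worker-6
    kubectl drain k8s-worker-6 --ignore-daemonsets --delete-emptydir-data
    # when the simulation is over, allow scheduling on the node again
    kubectl uncordon k8s-worker-6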

After the simulation, let's see the effect:

Before eviction:

[root@k8s-master-1 ~]# kubectl get pods -o wide | grep portal
svc-mall-portal-54ddfd6798-766pj   1/1   Running   0   36h   10.100.227.136   k8s-worker-6
svc-mall-portal-54ddfd6798-ckg7f   1/1   Running   0   36h   10.100.227.137   k8s-worker-6

After eviction:

[root@k8s-master-1 ~]# kubectl get pods -o wide | grep portal
svc-mall-portal-7f7f69c6cf-5czlz   1/1   Running   0   47s    10.100.69.242   k8s-worker-3
svc-mall-portal-7f7f69c6cf-7h8js   1/1   Running   0   110s   10.100.140.30   k8s-worker-2
[root@k8s-master-1 ~]#


As you can see, both the POD names and the workers they run on have changed, indicating that Kubernetes has brought the evicted PODs back up on other nodes.

Let's take a look at the corresponding TPS curve:

![](../images/bf394026992b42499579b6a9f7957bdc.jpg)

As you can see, the business also recovered after the containers were evicted.

Virtual Machine Exception #

Earlier, we simulated exceptions from inside the operating system. Now let's change perspective and operate on the entire KVM virtual machine from the outside, to see what effect a virtual-machine exception has.

Here, we will directly kill the virtual machine to simulate an exception. In fact, this exception has already been mentioned in Lesson 26, indicating that it is a relatively common exception scenario.

Killing the Virtual Machine #

First, let's schedule the application microservice onto worker-6; afterwards, we will kill the worker-6 virtual machine directly. Note, however, that you should not pin (bind) the microservice to worker-6, because if it is bound to that node, it will not be able to be rescheduled onto other virtual machines after the kill.

Then, let's perform the action of killing the virtual machine. Here's how you do it:

[root@dell-server-2 ~]# virsh list --all
 Id    Name               State
-----------------------------------
 1     vm-k8s-master-2    running
 2     vm-k8s-worker-5    running
 3     vm-k8s-worker-6    running

To find the virtual machine's process ID, run top and press c to display the full command line, then locate the process for worker-6.
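
If you prefer not to hunt through top, listing the qemu processes works just as well; the process name may be qemu-kvm or qemu-system-x86_64 depending on the distribution.

    # find the process ID of the worker-6 virtual machine
    ps -ef | grep qemu | grep vm-k8s-worker-6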

[root@dell-server-2 ~]# kill -9 3822
[root@dell-server-2 ~]# virsh list --all
 Id    Name               State
-----------------------------------
 1     vm-k8s-master-2    running
 2     vm-k8s-worker-5    running
 -     vm-k8s-worker-6    shut off

[root@dell-server-2 ~]#


As you can see, worker-6 was indeed shut down.
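
As an aside, libvirt offers a less error-prone way to do the same hard stop without looking up the process ID at all.

    # force-stop the virtual machine (roughly equivalent to pulling the power), then bring it back after the test
    virsh destroy vm-k8s-worker-6
    virsh start vm-k8s-worker-6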

Now let's take a look at the corresponding TPS:

![](../images/cc7cda8009f64b7a9994c21e70c920fe.jpg)

As you can see, after worker-6 was killed, TPS dropped straight to zero and errors were reported. After a while, the application was rescheduled elsewhere and the service recovered.

Finally, let's see the effect of the migration:

Before migration #

[root@k8s-master-1 ~]# kubectl get pods -o wide | grep portal
svc-mall-portal-54ddfd6798-766pj   1/1   Running   0   36h   10.100.227.136   k8s-worker-6
svc-mall-portal-54ddfd6798-ckg7f   1/1   Running   0   36h   10.100.227.137   k8s-worker-6

After migration #

[root@k8s-master-1 ~]# kubectl get pods -o wide | grep portal
svc-mall-portal-7f7f69c6cf-5kvtl   1/1   Running   0   4m40s   10.100.69.249   k8s-worker-3
svc-mall-portal-7f7f69c6cf-jz48w   1/1   Running   0   4m50s   10.100.140.24   k8s-worker-2
[root@k8s-master-1 ~]#


As you can see, the application on worker-6 has been scheduled to other nodes (worker-2, worker-3), indicating that new containers have been generated.

Summary #

If you are going to design such exception scenarios, settle on your expectations beforehand. **The most basic expectation for an exception scenario is that the system recovers quickly when the exception occurs; that is the very value of creating these scenarios.** If the system cannot recover quickly, the business degrades along with the exception, and then we need to report bugs and risks.

In this lesson, we simulated application-level exceptions, exceptions inside the operating system, container-level exceptions, and whole-virtual-machine exceptions. These are common exception scenarios in today's microservices architectures, but of course they do not cover everything. You can design the missing exception scenarios based on the exception-scope diagram discussed in the previous lesson to achieve comprehensive coverage.

Homework #

Finally, I have two questions for you to think about:

1. What are the key points in designing exception scenarios?
2. How do you determine the expected outcome of exception scenarios?

Remember to discuss and exchange your thoughts with me in the comment section. Every thought will help you move forward.

If you have gained something from reading this article, feel free to share it with your friends and let's learn and progress together. See you in the next lesson!