
25 Capacity Scenarios, Part 2: How Caching Can Affect Performance #

Hello, I am Gao Lou.

In the previous lesson, we went through three phases of analysis and optimization. We resolved the problem of response time growing over time while the number of stress threads stayed constant, added the missing database indexes, and fixed the uneven Kubernetes scheduling. In the end the TPS curve looked normal, but fate wasn’t going to let me off the hook just because I had tried my best.

Why do I say this? Because the scenario in the previous lesson only ran for a dozen minutes or so, which is nowhere near long enough for a capacity scenario. Even if TPS looks normal under a dozen minutes of load, that does not mean the system is problem-free.

Therefore, I conducted continuous stress testing on the system, and it was during this process that I encountered new problems…

Phase 4 Analysis #

Scenario Stress Data #

Here is the scenario data I collected during the continuous stress run:

From the curve above, we can see that TPS keeps dropping during the continuous stress run, which is unacceptable.

Splitting Response Time #

To investigate this, let’s first look at where the time is going. This is the response time chart after the scenario has been running for a while:

Based on the overall average response times, we could split the time consumption of these interfaces one by one. In fact, the chart already shows that the response time of every business has increased compared with the chart in the previous lesson. Since all of them have slowed down, this is not a problem with one specific business, so we can pick any interface to analyze.

Here I use the interface that generates the order confirmation to split the response time. When we deployed the system, we set SkyWalking’s sampling rate very low, only about 5%, to keep the APM agent from affecting performance and the network.
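For reference, the SkyWalking Java agent controls sampling through the agent.sample_n_per_3_secs setting in agent/config/agent.config. It samples N traces per 3 seconds rather than a strict percentage, so the value below only illustrates the idea and is not our exact production setting:

# agent/config/agent.config
# A negative value or zero means full sampling; a small positive number keeps
# only that many traces every 3 seconds, which keeps the APM overhead low.
agent.sample_n_per_3_secs=5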

The following data is averaged per minute.

  • Gateway:

  • Order:

  • Cart:

  • Member:

  • Auth:

From the data, it looks as if every service contributes to the overall slow response time. Unfortunately, scenarios like this come up all the time.

However, don’t panic. We can still follow the analysis approach from the overall to specific. Let’s first look at the global monitoring data.

Global Monitoring Analysis #

From the first page of the global monitoring, the CPU resource usage is relatively high on worker-4, followed by worker-6:

Let’s take care of them one by one.

First, let’s go to worker-4 and run top and vmstat to capture some key data. Below is more detailed global monitoring data for the worker-4 node:
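For completeness, the data below was captured with commands along these lines (the interval and refresh count are my own choices, nothing special):

vmstat 1            # one line of system-wide statistics per second
top -c -b -n 3      # batch mode, full command lines, three refreshes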

-- vmstat data
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
12  0      0 4484940   1100 4184984    0    0     0     0 45554 21961 58 25 14  0  3
 9  0      0 4484460   1100 4185512    0    0     0     0 45505 20851 60 25 14  0  1
16  0      0 4483872   1100 4186016    0    0     0     0 44729 20750 62 24 12  0  2
15  0      0 4470944   1100 4186476    0    0     0     0 45309 25481 62 24 13  0  2
14  0      0 4431336   1100 4186972    0    0     0     0 48380 31344 60 25 14  0  1
16  0      0 4422728   1100 4187524    0    0     0     0 46735 27081 64 24 12  0  1
17  0      0 4412468   1100 4188004    0    0     0     0 45928 23809 60 25 13  0  2
22  0      0 4431204   1100 4188312    0    0     0     0 46013 24588 62 23 13  0  1
12  0      0 4411116   1100 4188784    0    0     0     0 49371 34817 59 24 15  0  2
16  1      0 4406048   1100 4189016    0    0     0     0 44410 21650 66 23 10  0  1
..................
--- top data
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  935 root      20   0 6817896   1.3g  16388 S 301.7  8.7  71:26.27 java -Dapp.id=svc-mall-gateway -javaagent:/opt/skywalking/agent/skywalking-agent.jar -Dskywalking.agent.service_name=svc+
 1009 101       20   0   13500   2980    632 R  37.6  0.0  10:13.51 nginx: worker process                                                                                                    
 1007 101       20   0   13236   2764    632 R  20.8  0.0   3:17.14 nginx: worker process                                                                                                    
 7690 root      20   0 3272448   3.0g   1796 S  14.2 19.1   2:58.31 redis-server 0.0.0.0:6379                                                                                                
 6545 101       20   0  737896  48804  12640 S  13.9  0.3  12:36.09 /nginx-ingress -nginx-configmaps=nginx-ingress/nginx-config -default-server-tls-secret=nginx-ingress/default-server-secr+
 1108 root      20   0 1423104 106236  29252 S  12.2  0.7  16:28.02 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock                                                   
 1008 101       20   0   13236   2760    628 S   6.9  0.0   0:46.30 nginx: worker process                                                                                                    
 6526 root      20   0  109096   8412   2856 S   6.3  0.1   7:30.98 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/6eb72c56b028b0d5bd7f8df+
 1082 root      20   0 3157420 116036  36328 S   5.3  0.7  11:15.65 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf+
 6319 nfsnobo+  20   0  759868  53880  18896 S   3.0  0.3   3:18.33 grafana-server --homepath=/usr/share/grafana --config=/etc/grafana/grafana.ini --packaging=docker cfg:default.log.mode=c+
 6806 root      20   0 1632160  47276  17108 S   2.6  0.3   5:09.43 calico-node -felix                                                                                                       
    6 root      20   0       0      0      0 S   1.0  0.0   0:14.19 [ksoftirqd/0]
..................

The “in” column in the vmstat output shows that the interrupt rate on this node is quite high, around 45,000 per second. The “cs” (context switch) column is also in the range of 20,000 to 30,000. At the same time, kernel-mode CPU (“sy”) accounts for about 25% of CPU usage. All of this tells us we need to focus on the system-call and interrupt level.

Targeted Monitoring and Analysis #

Let’s first look at the interrupt data. The following screenshot shows all the interrupt data:

From this screenshot, we can see the overall interrupt counts and, from the highlighted (black-on-white) numbers, which counters are changing. But this alone does not lead to any conclusion yet.

Let’s now look at the soft interrupt data:

From the highlighted (black-on-white) numbers, we can see that NET_RX changes significantly (TIMER is the system clock tick, which we do not need to analyze). Furthermore, both the Gateway and Redis run on this worker-4 node, and both services generate a lot of network traffic. Clearly, this is the area we need to dig into.
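If you want to reproduce this view, the interrupt and soft-interrupt counters come straight from procfs; watching them with highlighting is what produces the changing numbers in the screenshots (a small sketch, run on the worker node):

watch -d cat /proc/interrupts   # hardware interrupts; -d highlights what changes
watch -d cat /proc/softirqs     # soft interrupts: NET_RX, NET_TX, TIMER, ...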

Since the network soft interrupts are high, let’s go into the Pod and check the network queues:

As you can see, Recv-Q has values. Recv-Q is the receive queue for network data; if it consistently holds values, the receive side really is backing up.

Please note that this conclusion is not based on a single netstat run; I executed it many times. Because the queue shows up every time, I can conclude that there really is unprocessed data sitting in the Recv-Q receive queue. If it only appeared occasionally, it would not be a big deal.
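A simple way to confirm this (assuming netstat exists in the container image) is to sample the connections repeatedly and keep only the ones with a non-zero Recv-Q; what matters is that the result is non-empty every time:

# inside the Gateway Pod; a Recv-Q that is non-zero on every sample is the signal we care about
while true; do netstat -ant | awk 'NR > 2 && $2 > 0'; echo '-----'; sleep 1; done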

Now, let’s continue analyzing this issue and provide a solution later.

Since there are values in the receive queue, from the data-flow point of view this means the upper-layer application is not reading the data fast enough. So let’s trace the execution time of generateConfirmOrder, the interface that generates the order confirmation (it actually does not matter much which method we trace because, as mentioned earlier, every business is slow):
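The call tree below is Arthas trace output; a command along these lines, run with Arthas attached to the Order service, produces it (the class and method names come from our code, while the options are a reconstruction):

trace com.dunshan.mall.order.service.impl.PortalOrderServiceImpl generateConfirmOrder '1==1' -n 5 -v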

Command execution times exceed limit: 5, so command will exit. You can set it with -n option.
Condition express: 1==1 , result: true
`---ts=2021-02-18 19:20:15;thread_name=http-nio-8086-exec-113;id=3528;is_daemon=true;priority=5;TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@20a46227
    `---[151.845221ms] com.dunshan.mall.order.service.impl.PortalOrderServiceImpl$$EnhancerBySpringCGLIB$$11e4c326:generateConfirmOrder(
        `---[151.772564ms] org.springframework.cglib.proxy.MethodInterceptor:intercept() #5
            `---[151.728833ms] com.dunshan.mall.order.service.impl.PortalOrderServiceImpl:generateConfirmOrder(
                +---[0.015801ms] com.dunshan.mall.order.domain.ConfirmOrderResult:<init>() #8
                +---[75.263121ms] com.dunshan.mall.order.feign.MemberService:getCurrentMember() #8
                +---[0.006396ms] com.dunshan.mall.model.UmsMember:getId() #9
                +---[0.004322ms] com.dunshan.mall.model.UmsMember:getId() #9
                +---[0.008234ms] java.util.List:toArray() #5
                +---[min=0.006794ms,max=0.012615ms,total=0.019409ms,count=2] org.slf4j.Logger:info() #5
                +---[0.005043ms] com.dunshan.mall.model.UmsMember:getId() #9
                +---[28.805315ms] com.dunshan.mall.order.feign.CartItemService:listPromotionnew() #5
                +---[0.007123ms] com.dunshan.mall.order.domain.ConfirmOrderResult:setCartPromotionItemList() #9
                +---[0.012758ms] com.dunshan.mall.model.UmsMember:getList() #10
                +---[0.011984ms] com.dunshan.mall.order.domain.ConfirmOrderResult:setMemberReceiveAddressList() #5
                +---[0.03736ms] com.alibaba.fastjson.JSON:toJSON() #11
                +---[0.010188ms] com.dunshan.mall.order.domain.OmsCartItemVo:<init>() #12
                +---[0.005661ms] com.dunshan.mall.order.domain.OmsCartItemVo:setCartItemList() #12
                +---[19.225703ms] com.dunshan.mall.order.feign.MemberService:listCart() #12
                +---[0.010474ms] com.dunshan.mall.order.domain.ConfirmOrderResult:setCouponHistoryDetailList() #5
                +---[0.007807ms] com.dunshan.mall.model.UmsMember:getIntegration() #13
                +---[0.009189ms] com.dunshan.mall.order.domain.ConfirmOrderResult:setMemberIntegration() #5
                +---[27.471129ms] com.dunshan.mall.mapper.UmsIntegrationConsumeSettingMapper:selectByPrimaryKey() #13
                +---[0.019764ms] com.dunshan.mall.order.domain.ConfirmOrderResult:setIntegrationConsumeSetting() #13
                +---[0.154893ms] com.dunshan.mall.order.service.impl.PortalOrderServiceImpl:calcCartAmount() #13
                `---[0.013139ms] com.dunshan.mall.order.domain.ConfirmOrderResult:setCalcAmount() #13

As you can see, this interface calls a getCurrentMember method, a service on the Member application that retrieves the current user’s information; other services call it as well because they need the token.

From the stack information above, it can be seen that getCurrentMember takes more than 75ms, which is clearly slow. Let’s trace this method to identify where the slowness occurs:

Condition express: 1==1 , result: true
`---ts=2021-02-18 19:43:18;thread_name=http-nio-8083-exec-25;id=34bd;is_daemon=true;priority=5;TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@6cb759d5
    `---[36.139809ms] com.dunshan.mall.member.service.imp.MemberServiceImpl:getCurrentMember(
        +---[0.093398ms] javax.servlet.http.HttpServletRequest:getHeader() #18
        +---[0.020236ms] cn.hutool.core.util.StrUtil:isEmpty() #18
        +---[0.147621ms] cn.hutool.json.JSONUtil:toBean() #19
        +---[0.02041ms] com.dunshan.mall.common.domain.UserDto:getId() #19
        `---[35.686266ms] com.dunshan.mall.member.service.MemberCacheService:getMember() #5

For data like this that has to be captured on the fly, you need to capture it multiple times to confirm. I am only showing one trace here, but I actually captured it several times before settling on this result. From the data above, the getMember method called inside getCurrentMember takes a relatively long time, more than 35ms.

Now let’s take a look at the specific implementation of getMember:

@Override
public UmsMember getMember(Long memberId) {
    // Build the cache key, e.g. mall:ums:member:<memberId>
    String key = REDIS_DATABASE + ":" + REDIS_KEY_MEMBER + ":" + memberId;
    // Fetch the cached member object from Redis
    return (UmsMember) redisService.get(key);
}

The logic of this code is simple: concatenate the key and fetch the corresponding member information from Redis. Since getMember reads from Redis, let’s check Redis’s slowlog:

127.0.0.1:6379> slowlog get
1) 1) (integer) 5
   2) (integer) 1613647620
   3) (integer) 30577
   4) 1) "GET"
      2) "mall:ums:member:2070064"
   5) "10.100.140.46:53152"
   6) ""
2) 1) (integer) 4
   2) (integer) 1613647541
   3) (integer) 32878
   4) 1) "GET"
      2) "mall:ums:member:955622"
   5) "10.100.140.46:53152"
   6) ""
........................

You see, the GET command really is slow, taking well over 10ms (the slowlog durations are in microseconds, and by default only commands slower than 10ms are recorded; the entries above are around 30ms). If this command were rarely executed it might not matter much. However, it is used for user authentication, so an execution time like this is unacceptable.

Why do I say this?

Because from a business perspective, apart from opening the home page and querying products, just about every other business script needs this call. So its slowness does not affect just one business; it drags down a whole group of them.
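For reference, the slowlog threshold and entries can be inspected and cleared with standard redis-cli commands (adjust the host and count as needed):

redis-cli config get slowlog-log-slower-than   # default 10000 (microseconds, i.e. 10ms)
redis-cli slowlog get 10                       # the 10 most recent slow entries
redis-cli slowlog reset                        # clear the log before the next run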

However, just as we were analyzing this, and before we had a chance to optimize anything, a new problem emerged… As the load test continued, I noticed the following:

Look at this, not only is the TPS unstable, but all the requests are also throwing errors. This is really inappropriate!

So, I started investigating the error logs and eventually discovered that the Redis containers were struggling. The following screenshot shows the state of Redis in the architecture:

Clearly, Redis is down! At this point, let’s take a look at the state of the application:

It’s a complete mess! Next, we log in to the worker node where the Redis service is running and check the kernel logs:
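On CentOS 7, the OOM-killer messages live in the kernel ring buffer and in /var/log/messages; something like this is enough to locate them (a sketch):

dmesg -T | grep -i -B 2 -A 10 'oom-killer'
grep -i 'killed process' /var/log/messages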

[ 7490.807349] redis-server invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=807
[ 7490.821216] redis-server cpuset=docker-18cc9a81d8a58856ecf5fed45d7db431885b33236e5ad50919297cec453cebe1.scope mems_allowed=0
[ 7490.826286] CPU: 2 PID: 27225 Comm: redis-server Kdump: loaded Tainted: G               ------------ T 3.10.0-1127.el7.x86_64 #1
[ 7490.832929] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 7490.836240] Call Trace:
[ 7490.838006]  [<ffffffff9af7ff85>] dump_stack+0x19/0x1b
[ 7490.841975]  [<ffffffff9af7a8a3>] dump_header+0x90/0x229
[ 7490.844690]  [<ffffffff9aa9c4a8>] ? ep_poll_callback+0xf8/0x220
[ 7490.847625]  [<ffffffff9a9c246e>] oom_kill_process+0x25e/0x3f0
[ 7490.850515]  [<ffffffff9a933a41>] ? cpuset_mems_allowed_intersects+0x21/0x30
[ 7490.853893]  [<ffffffff9aa40ba6>] mem_cgroup_oom_synchronize+0x546/0x570
[ 7490.857075]  [<ffffffff9aa40020>] ? mem_cgroup_charge_common+0xc0/0xc0
[ 7490.860348]  [<ffffffff9a9c2d14>] pagefault_out_of_memory+0x14/0x90
[ 7490.863651]  [<ffffffff9af78db3>] mm_fault_error+0x6a/0x157
[ 7490.865928]  [<ffffffff9af8d8d1>] __do_page_fault+0x491/0x500
[ 7490.868661]  [<ffffffff9af8da26>] trace_do_page_fault+0x56/0x150
[ 7490.871811]  [<ffffffff9af8cfa2>] do_async_page_fault+0x22/0xf0
[ 7490.874423]  [<ffffffff9af897a8>] async_page_fault+0x28/0x30
[ 7490.877127] Task in /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6e897c3a_8b9f_479b_9f53_33d2898977b0.slice/docker-18cc9a81d8a58856ecf5fed45d7db431885b33236e5ad50919297cec453cebe1.scope killed as a result of limit of /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6e897c3a_8b9f_479b_9f53_33d2898977b0.slice/docker-18cc9a81d8a58856ecf5fed45d7db431885b33236e5ad50919297cec453cebe1.scope
[ 7490.893825] memory: usage 3145728kB, limit 3145728kB, failcnt 176035
[ 7490.896099] memory+swap: usage 3145728kB, limit 3145728kB, failcnt 0
[ 7490.899137] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
[ 7490.902012] Memory cgroup stats for /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod6e897c3a_8b9f_479b_9f53_33d2898977b0.slice/docker-18cc9a81d8a58856ecf5fed45d7db431885b33236e5ad50919297cec453cebe1.scope: cache:72KB rss:3145656KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:3145652KB inactive_file:0KB active_file:20KB unevictable:0KB
[ 7490.962494] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[ 7490.966577] [27197]     0 27197      596      166       5        0           807 sh
[ 7490.970286] [27225]     0 27225   818112   786623    1550        0           807 redis-server
[ 7490.974006] [28322]     0 28322      999      304       6        0           807 bash
[ 7490.978178] Memory cgroup out of memory: Kill process 27242 (redis-server) score 1808 or sacrifice child
[ 7490.983765] Killed process 27225 (redis-server), UID 0, total-vm:3272448kB, anon-rss:3144732kB, file-rss:1760kB, shmem-rss:0kB

It turns out that Redis had exhausted the memory available to it: the Pod’s memory cgroup hit its 3GB limit (usage 3145728kB, limit 3145728kB), and with an OOM score of 1808, the kernel killed the redis-server process without hesitation.

After restarting Redis, let’s observe its memory consumption again. The result is as follows:

[root@k8s-worker-4 ~]# pidstat -r -p 5356 1
Linux 3.10.0-1127.el7.x86_64 (k8s-worker-4) 	2021-02-18 	_x86_64_	(6 CPU)
19:55:52   UID       PID  minflt/s  majflt/s     VSZ    RSS   %MEM  Command
19:55:53     0      5356     32.00      0.00 3272448 1122152   6.90  redis-server
19:55:54     0      5356     27.00      0.00 3272448 1122416   6.90  redis-server
19:55:55     0      5356     28.00      0.00 3272448 1122416   6.90  redis-server
19:55:56     0      5356     28.00      0.00 3272448 1122680   6.90  redis-server
19:55:57     0      5356     21.78      0.00 3272448 1122680   6.90  redis-server
19:55:58     0      5356     38.00      0.00 3272448 1122880   6.90  redis-server
19:55:59     0      5356     21.00      0.00 3272448 1122880   6.90  redis-server
19:56:00     0      5356     25.00      0.00 3272448 1122880   6.90  redis-server

Here I am only showing a small slice of the data; I also kept watching the RSS column (the actual physical memory in use) in the period before Redis was killed, and the memory did keep growing. I then checked the Redis configuration file and found that maxmemory was not configured.

Not configuring maxmemory is not necessarily a problem in itself: if memory runs out, it runs out, and the Pod has a memory limit anyway, right? Unfortunately, the memory available to Redis was not enough; it grew until it hit that limit, and the OOM killer took the process down. That explains the errors in the second half of the TPS chart.
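As a quick check and mitigation (a sketch; the 2gb value is only an example and should be sized comfortably below the Pod limit), maxmemory can be inspected and set at runtime:

redis-cli config get maxmemory          # 0 means no limit
redis-cli config set maxmemory 2gb      # example value; keep it below the Pod memory limit
redis-cli config get maxmemory-policy   # decide whether eviction (e.g. allkeys-lru) is acceptable for this cache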

However, we still need to keep analyzing the slow response times. We saw earlier that the soft interrupts are driven by network traffic, so to reduce how much the two services’ network interrupts affect each other, I will move the Redis and Gateway services onto separate workers later.

As we all know, Redis keeps its data in memory, and pure in-memory operations are very fast. But Redis also has a feature that reaches beyond memory: persistence. We are currently using the AOF persistence strategy, and we have not limited the size of the AOF file.
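For context, AOF behavior is governed by a handful of redis.conf directives. Turning appendonly on is what enables AOF; the fsync and rewrite values shown here are the Redis defaults, not necessarily what our environment used:

appendonly yes                   # AOF persistence enabled
appendfsync everysec             # fsync the AOF once per second (default)
auto-aof-rewrite-percentage 100  # trigger a rewrite when the AOF doubles in size...
auto-aof-rewrite-min-size 64mb   # ...but only once it is at least this large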

This persistence file is stored on an NFS file server, which therefore needs sufficient disk I/O capability. So let’s check the I/O capability on the NFS server. Here is a sample of the data:
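The output below is iostat’s extended device statistics; something along these lines on the NFS server produces it (the device name and interval are my choices):

iostat -x -d sda 1    # extended statistics for the sda device, refreshed every second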

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00   65.00    0.00  6516.00     0.00   200.49     1.85   28.43   28.43    0.00   3.95  25.70


Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00   24.00    0.00   384.00     0.00    32.00     0.15    6.46    6.46    0.00   6.46  15.50


Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    8.00    0.00  1124.00     0.00   281.00     0.07    8.38    8.38    0.00   4.00   3.20


..........................


Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00   11.00    0.00   556.00     0.00   101.09     0.15   13.55   13.55    0.00  10.36  11.40


Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    4.00    0.00    32.00     0.00    16.00     0.08   19.25   19.25    0.00  15.25   

Looking at svctm (the reported average service time per I/O request), we can see that the I/O response time has grown as well. Although svctm is deprecated and no longer recommended in newer versions of sysstat, it is still present in the version we are using, and it does show the I/O response time increasing.

To prove that the I/O response time is related to AOF, let’s first turn AOF off by setting appendonly to no and see what happens (a command sketch follows the list below). If it has an effect, the optimization direction is clear, and we plan the following optimization actions:

  1. Move Redis to a worker with less network demand and observe if the TPS improves. If this step is effective, we don’t need to do the next step;

  2. If there is no improvement after the previous step, turn off AOF and observe the TPS. If there is improvement after turning off AOF, then we need to analyze whether the application really needs Redis persistence. If necessary, a faster hard drive should be used;

  3. Regardless of whether the previous two steps are effective or not, we should consider limiting the size of memory and AOF files for Redis.
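Here is a sketch of the first two actions (the namespace, deployment name, and node label value are assumptions about our cluster, not verified details):

# 1. Pin Redis to a quieter worker, e.g. worker-7, via a nodeSelector patch
kubectl -n mall patch deployment redis \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/hostname":"k8s-worker-7"}}}}}'

# 2. Turn AOF off at runtime and watch TPS (not persisted across restarts
#    unless redis.conf is changed as well)
redis-cli config set appendonly no
redis-cli config get appendonly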

Let’s take a look at the TPS after moving Redis from worker-4 to worker-7:

The TPS is still dropping, and it is not even as high as before, when it could reach 1000. A seemingly correct optimization step has led to lower TPS, which is clearly not what we expected.

We still do not know exactly where the problem is, but our goal all along has been to drain that receive queue. So let’s first confirm whether the network queue has gone down before worrying about how to bring TPS back up.

Looking at the queues on worker-4, Recv-Q no longer has any values:

Now let’s deal with AOF, because even though we moved Redis, TPS did not come back up. So we need to see what impact AOF has.

After turning off AOF, the TPS is as follows:

Overall resources are as follows:

See, there is still an effect, right? We have achieved a stable TPS curve of over 1000.

In this capacity scenario, after four phases of analysis we have achieved good optimization results. However, every performance test should end with a conclusion, so we need to take one more action: keep increasing the load and see how far the system’s maximum capacity goes.

Therefore, we entered the analysis of the fifth stage.

Phase 5 Analysis #

Please note that the most important change in a capacity scenario is the continued increase in the number of threads. As the thread count rises, the amount of parameterized data required changes with it. In a scenario with more threads, we also need to keep an eye on whether resources are used evenly. So, with the Phase 4 optimizations in place, let’s see what this scenario produces.

Scenario Execution Data #

The stress data for the scenario is as follows:

In terms of results, it looks good, as the TPS has reached 1700.

Global Monitoring Analysis #

The data from global monitoring is as follows:

From the two graphs above, we can see that under this level of load the TPS peaks at around 1700, and the system’s overall resource utilization is already quite high.

After going through the baseline and capacity scenarios, we can now come to a conclusion: The system resources have reached a relatively high level of utilization in this maximum capacity scenario.

You have probably heard the saying that circulates in the performance industry: “Performance optimization is endless.” Precisely because of that, we have to choose the point at which a performance project ends.

Take our course case as an example. This system has essentially reached the limit of what technical optimization can deliver, or at least the cost of further technical optimization is high (for example, custom development and re-architecting). If you want more capacity in this situation, all you can do is add nodes and hardware resources and make full use of them.

But pay attention, everyone working on performance projects: the point is not to optimize the system to its absolute best and then size the production resources exactly to that capacity! The cost of problems in the production environment is very high, so we generally build in some redundancy, although how much redundancy varies.

In many enterprises, CPU utilization in the production environment does not exceed 20%. Why so much redundancy? In my experience, most projects are iterated continuously as the business grows, and that is how they end up in this situation. Just imagine how much resource is wasted in a production environment like that.

Speaking of this, we must also talk about how to evaluate capacity at the architectural level. Because for a system with a fixed number of clients, it is easy to determine the overall capacity. However, for systems with a variable number of clients, in order to withstand sudden increases in business capacity, strict design is required, utilizing techniques such as caching, queuing, rate limiting, circuit breaking, and warm-up.

The design of the overall architectural capacity is not something that can be achieved overnight in any enterprise. It requires multiple iterations and years of version updates, constantly evolving with the development of the business. This is not something that can be fully covered in a single article.

Summary #

After completing the baseline scenario, we moved on to the capacity scenario, which is a significant change. In this scenario, we addressed several issues and ultimately reached the following conclusions:

Phase 1: Analyzed the parameterization issue of the stress testing tool and resolved the problem of decreasing TPS (Transactions Per Second) and increasing response time.

Phase 2: Analyzed the database indexes and resolved the issue of low TPS.

Phase 3: Analyzed resource contention and resolved the problem of multiple containers running on the same node.

Phase 4: Analyzed network contention and Redis AOF (Append-Only File), resolving the issue of unstable TPS.

Phase 5: Increased the stress level to determine the overall system capacity.

After completing these actions, we can finally provide a more definitive conclusion: The TPS can reach 1700!

Please remember, for a performance project, not having a conclusion is tantamount to being dishonest. Therefore, I have always emphasized that in performance projects, we must provide conclusions on maximum capacity.

Homework #

That’s all for today’s content. Finally, I will leave you with three questions to think about:

  1. Why is it necessary for a performance project to have conclusions?
  2. When multiple performance problems occur simultaneously, how do we determine their mutual influence?
  3. How do we determine when a system has been optimized to its optimal state?

Be sure to discuss and exchange your thoughts with me in the comments area. Every thought will take you further.

If you have gained something from this lesson, feel free to share it with your friends and learn and progress together. See you in the next lecture!