18 Why Dynamic Parameterization Logic Matters When Confirming Orders from Shopping Cart Information #
Hello, I’m Gao Lou.
Today, let’s take a look at the performance of the interface for confirming orders with shopping cart information and identify areas that need optimization.
In this lesson, I will show you how to use method-level tracing to identify problems with parameters. These are not the parameters passed directly to our interface, but data derived from them and used at deeper levels.
We all understand direct parameterization. But when a parameter generates new data that is then used by subsequent actions, you need to pay attention: the first layer of data may look fine, yet problems can still surface in the actions that follow. Therefore, we must watch how parameterized data changes along the way, that is, dynamically parameterized data.
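To make this concrete, here is a minimal sketch of the difference (all names here are hypothetical illustrations, not the course's actual script): a token generated by a login step feeds the next step, so any duplication in the upstream data silently propagates downstream.

```python
# Minimal sketch of dynamic parameterization: data produced by one step
# (a login token) feeds the next step (adding to the cart).
# All names are hypothetical illustrations, not the course's real script.

def login(username: str) -> str:
    """Simulate a login that returns a per-user token."""
    return f"token-{username}"

def add_to_cart(token: str, product_id: int, carts: dict) -> None:
    """Simulate adding an item to the cart identified by the token."""
    carts.setdefault(token, []).append(product_id)

carts: dict = {}
# Direct parameterization: usernames come straight from a data file.
users = ["u1", "u2", "u2"]          # note the duplicate "u2"
for user in users:
    token = login(user)             # dynamic: the token is generated, not read from the file
    add_to_cart(token, product_id=1001, carts=carts)

# The duplicate upstream user silently piles items into one cart downstream.
print({t: len(items) for t, items in carts.items()})
```

The first-layer data file "looks" fine at a glance, yet the derived tokens collide, which is exactly the kind of issue we will run into later in this lesson.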
In addition, in this lesson, I will also show you how to handle complex issues caused by excessive resource consumption in a node when the application has multiple nodes.
Without further ado, let’s start our analysis today!
Scenario Running Data #
For the interface that confirms the order based on shopping cart information, the performance results of our first run are as follows:
From the graph, we can see that the response time increases with the increase in load, and the TPS only reaches a little over 160, which is a bit low. Now we need to consider increasing the TPS.
Please note that this is a typical performance issue where the TPS is low and the response time keeps increasing.
Following the RESAR performance analysis logic, let’s take a look at the architecture diagram for this interface.
View the Architecture Diagram #
As you can see, this interface involves multiple services, and the architecture diagram is more complex than the ones for other interfaces we’ve seen before.
Next, let’s analyze the response time.
Splitting Response Time #
- Gateway:
- Order:
- Member:
- Cart:
- Portal:
From the above breakdown, it can be seen that Cart has the longest time consumption. Therefore, let’s analyze Cart first.
Let’s also click on the time consumption between Cart and MySQL to see what is happening:
The time between Cart and MySQL doesn’t seem to be long, so we don’t need to consider the SQL time consumption of the database.
Next, let’s analyze the Cart service with long response time.
Phase 1 #
Overall Analysis #
As usual, let’s take a look at the resource consumption at the worker level:
From the above chart, it seems that worker-3 is consuming more resources. Let’s check what services are running on worker-3.
[root@k8s-master-2 ~]# kubectl get pods -o wide | grep k8s-worker-3
cloud-nacos-registry-685b8957d7-vskb6 1/1 Running 0 2d11h 10.100.69.199 k8s-worker-3 <none> <none>
cloud-redis-7f7db7f45c-t5g46 2/2 Running 0 2d8h 10.100.69.196 k8s-worker-3 <none> <none>
elasticsearch-master-2 1/1 Running 0 3h28m 10.100.69.209 k8s-worker-3 <none> <none>
svc-mall-cart-558d787dc7-g6qgh 1/1 Running 0 2d11h 10.100.69.201 k8s-worker-3 <none> <none>
svc-mall-order-fbfd8b57c-kbczh 1/1 Running 0 2d11h 10.100.69.202 k8s-worker-3 <none> <none>
svc-mall-portal-846d9994f8-m7jbq 1/1 Running 0 38h 10.100.69.207 k8s-worker-3 <none> <none>
svc-mall-search-c9c8bc847-h7sgv 1/1 Running 0 161m 10.100.69.210 k8s-worker-3 <none> <none>
[root@k8s-master-2 ~]#
We can see that quite a few services are running on worker-3. Which of them is consuming the most resources? Let's log in to worker-3 and check top:
[root@k8s-worker-3 ~]# top
top - 01:51:35 up 2 days, 12:18, 2 users, load average: 19.48, 18.40, 17.07
Tasks: 319 total, 1 running, 318 sleeping, 0 stopped, 0 zombie
%Cpu0 : 68.6 us, 6.4 sy, 0.0 ni, 19.9 id, 0.0 wa, 0.0 hi, 5.1 si, 0.0 st
%Cpu1 : 66.7 us, 5.8 sy, 0.0 ni, 22.8 id, 0.0 wa, 0.0 hi, 4.8 si, 0.0 st
%Cpu2 : 66.4 us, 6.1 sy, 0.0 ni, 22.7 id, 0.0 wa, 0.0 hi, 4.7 si, 0.0 st
%Cpu3 : 65.7 us, 5.4 sy, 0.0 ni, 23.6 id, 0.0 wa, 0.0 hi, 5.4 si, 0.0 st
%Cpu4 : 66.6 us, 5.7 sy, 0.0 ni, 22.0 id, 0.0 wa, 0.0 hi, 5.7 si, 0.0 st
%Cpu5 : 67.6 us, 5.8 sy, 0.0 ni, 22.5 id, 0.0 wa, 0.0 hi, 4.1 si, 0.0 st
KiB Mem : 16265992 total, 2525940 free, 7015104 used, 6724948 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 8848464 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
32216 root 20 0 8878548 658820 16980 S 280.5 4.1 375:31.82 java -Dapp.id=svc-mall-cart -javaagent:/opt/skywalking/agent/skywalking-agen+
32589 root 20 0 8839408 589196 15892 S 84.1 3.6 171:16.88 java -Dapp.id=svc-mall-order -javaagent:/opt/skywalking/agent/skywalking-age+
24119 root 20 0 8798548 549804 15892 S 65.9 3.4 115:52.74 java -Dapp.id=svc-mall-portal -javaagent:/opt/skywalking/agent/skywalking-ag+
1089 root 20 0 2438956 105708 37776 S 6.3 0.6 248:21.71 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.co+
5470 root 20 0 1154816 14992 1816 S 3.6 0.1 20:15.93 redis-server 0.0.0.0:6379
From the above data, the Cart service clearly consumes the most CPU. However, at around 280% it is far from saturating all 6 CPUs, so we should keep that in mind.
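As a quick sanity check on that claim (my own back-of-the-envelope arithmetic on the top output above), with 6 CPUs the total capacity in top's per-process units is 600%, so Cart alone holds just under half of the node:

```python
# Back-of-the-envelope check against the top output above.
cpus = 6
capacity = cpus * 100.0      # top reports per-process CPU as % of one core

cart_cpu = 280.5             # %CPU of the svc-mall-cart java process

cart_share = cart_cpu / capacity
print(f"Cart uses {cart_share:.0%} of the node's CPU capacity")
# High, but the node is not saturated, which matches the ~20% idle
# shown on each of the six CPUs.
```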
Let’s continue with the targeted analysis.
Targeted Analysis #
Since the Cart service is consuming more CPU, it makes sense to look at what threads are doing in the Cart service.
Most of these threads are in a green Runnable state, indicating they are quite busy. It’s possible that we have configured the number of threads too low. Let’s check the configuration:
server:
port: 8086
tomcat:
accept-count: 1000
threads:
max: 20
min-spare: 5
max-connections: 500
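This configuration caps concurrency at 20 Tomcat worker threads. A rough Little's Law estimate (my own reasoning, assuming an average in-service time of about 100 ms, in line with the ~97 ms trace we will see shortly) shows why that alone could cap TPS near the values we measured:

```python
# Rough capacity estimate for the embedded Tomcat thread pool.
# Assumption (mine, for illustration): average in-service time ~100 ms,
# consistent with the ~97 ms listPromotion trace shown later.
max_threads = 20         # tomcat.threads.max from the config above
avg_service_ms = 100

# Little's Law: concurrency = throughput x time-in-system,
# so max throughput ~= threads / service time.
tps_cap = max_threads * 1000 / avg_service_ms
print(f"Theoretical TPS ceiling ~ {tps_cap:.0f}")
```

With only 20 threads and ~100 ms per request, the ceiling is about 200 TPS, which is in the same neighborhood as the 160-odd TPS we observed, so the suspicion about the thread count is at least plausible.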
Now that we know the thread configuration for the embedded Tomcat in Spring Boot, let’s analyze the methods currently being executed in Cart to see if our assumptions are correct:
When looking at this chart, pay attention to the positions where the lines are longer on the right side. These are the parts that take longer to execute:
- The MySQL-related path takes a long time. Although the MySQL/JDBI/PreparedStatement/execute span itself does not consume much time, the subsequent Balance/Promotion/Cart/CartPromotion processing does.
- The listPromotionnew method itself takes a long time.
Since the slow path is related to MySQL, we can use mysqlreport to monitor the overall MySQL performance:
__ Connections _________________________________________________________
Max used 152 of 151 %Max: 100.66
Total 540 0.0/s
It turns out that the connections are full! Let's quickly increase max_connections from the default of 151 to 500.
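The %Max figure in the report is simply the peak connections used divided by the configured limit (a quick check of the arithmetic, matching how mysqlreport presents it):

```python
# Reproduce the %Max figure from the mysqlreport output above.
max_used = 152
max_connections = 151    # MySQL's default max_connections

pct_max = max_used / max_connections * 100
print(f"%Max: {pct_max:.2f}")
# Anything at or above 100 means the connection pool was exhausted
# and new connection attempts were being refused or queued.
```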
However, even after retesting, the response time did not change. So we need to continue tracing the methods on Cart.
Method-level tracing #
So, we had to dive into method-level tracing to see where the listPromotionnew method we are concerned about is slow.
Based on the previous invocation view, we can write the following tracing statement:
trace -E com.dunshan.mall.cart.controller.CartItemController listPromotionnew -n 5 -v --skipJDKMethod false '1==1'
And we obtained the following result:
`---ts=2021-01-16 15:08:58;thread_name=http-nio-8086-exec-34;id=f8;is_daemon=true;priority=5;TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@56887c8f
`---[97.827186ms] com.dunshan.mall.cart.service.imp.CartItemServiceImpl$$EnhancerBySpringCGLIB$$ac8f5a97:listPromotion()
`---[97.750962ms] org.springframework.cglib.proxy.MethodInterceptor:intercept() #57
`---[97.557484ms] com.dunshan.mall.cart.service.imp.CartItemServiceImpl:listPromotion()
+---[72.273747ms] com.dunshan.mall.cart.service.imp.CartItemServiceImpl:list() #166
+---[0.003516ms] cn.hutool.core.collection.CollUtil:isNotEmpty() #172
+---[0.004207ms] java.util.List:stream() #173
+---[0.003893ms] java.util.stream.Stream:filter() #57
+---[0.003018ms] java.util.stream.Collectors:toList() #57
+---[0.060052ms] java.util.stream.Stream:collect() #57
+---[0.002017ms] java.util.ArrayList:<init>() #177
+---[0.003013ms] org.springframework.util.CollectionUtils:isEmpty() #179
`---[25.152532ms] com.dunshan.mall.cart.feign.CartPromotionService:calcCartPromotion() #181
We can see that within the traced method com.dunshan.mall.cart.service.imp.CartItemServiceImpl:listPromotion(), two calls account for almost all of the time:

- com.dunshan.mall.cart.service.imp.CartItemServiceImpl:list() (about 72 ms)
- com.dunshan.mall.cart.feign.CartPromotionService:calcCartPromotion() (about 25 ms)
Tracing the List function #
We executed the following trace statement in Arthas to trace the com.dunshan.mall.cart.service.imp.CartItemServiceImpl:list function:
trace com.dunshan.mall.cart.service.imp.CartItemServiceImpl list -v -n 5 --skipJDKMethod false '1==1'
And we obtained the following result:
`---ts=2021-01-16 15:19:45;thread_name=http-nio-8086-exec-65;id=23ce;is_daemon=true;priority=5;TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@56887c8f
`---[70.158793ms] com.dunshan.mall.cart.service.imp.CartItemServiceImpl:list()
+---[0.003501ms] com.dunshan.mall.model.OmsCartItemExample:<init>() #150
+---[0.002642ms] com.dunshan.mall.model.OmsCartItemExample:createCriteria() #151
+---[0.002932ms] com.dunshan.mall.model.OmsCartItemExample$Criteria:andDeleteStatusEqualTo() #57
+---[0.00304ms] com.dunshan.mall.model.OmsCartItemExample$Criteria:andMemberIdEqualTo() #57
`---[70.078976ms] com.dunshan.mall.mapper.OmsCartItemMapper:selectByExample() #152
From this trace, we can see that a single select statement accounts for almost all of the time (about 70 ms of the 70.16 ms total). The statement is:
select id, product_id, product_sku_id, member_id, quantity, price, product_pic, product_name, product_sub_title, product_sku_code, member_nickname, create_date, modify_date, delete_status, product_category_id, product_brand, product_sn, product_attr from oms_cart_item WHERE ( delete_status = ? and member_id = ? )
Why does such a simple select statement take so long? Let's set the "why" aside for now and see how many rows the oms_cart_item table contains. Connecting to the database, I found 10612 rows in oms_cart_item, which is not a large amount of data.
In addition, I also checked the indexes and found that they exist, and the execution plan also utilizes the indexes. So, we need to consider whether it is related to the data volume. Let’s see how many rows this select statement actually fetches.
After filling in the SQL's parameter values and executing the query, I found that a single member_id corresponds to more than 500 rows. Did one person really buy 500 items at once?
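The check itself is simple. A sketch of the idea (with hypothetical in-memory data; in practice the member_id values come from querying oms_cart_item) is to group cart rows by member_id and flag anyone whose cart is implausibly large:

```python
# Sketch: detect members with implausibly large carts.
# cart_rows stands in for the member_id column of oms_cart_item;
# the values here are hypothetical.
from collections import Counter

cart_rows = ["m1"] * 3 + ["m2"] * 550 + ["m3"] * 2   # one member_id per row

counts = Counter(cart_rows)
suspicious = {m: n for m, n in counts.items() if n > 100}
print(suspicious)   # one "person" with 550 items is a data smell
```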
Since there are so many records for the same person in the shopping cart information, they must have been added during the execution of the performance script because each user’s shopping cart is initially empty. So, we need to investigate how the script adds items to the shopping cart.
The script for adding items to the shopping cart is simple, it’s just a POST request with the product ID, and there is a Token in the request header of the HTTP protocol to identify the user.
Here, we need to check whether the Token is being reused. The Token parameterization configuration in JMeter is as follows:
The configuration looks fine: we have set the parameterization not to reuse data. So there is only one possibility left: the Tokens themselves are duplicated.

After spot-checking several Tokens, I found many duplicates. That explains why one person's shopping cart holds so many items.

Think about what this means. The script is configured not to reuse data, yet Tokens repeat anyway, which can only mean that the Token parameterization file itself contains duplicate entries.
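Checking a parameter file for duplicates takes only a few lines. A sketch (with a hypothetical in-memory token list; in practice you would read the lines of the JMeter CSV file):

```python
# Sketch: count duplicate tokens in a parameterization file.
# token_lines stands in for the real JMeter CSV contents (hypothetical values).
from collections import Counter

token_lines = [
    "eyJhbGciOi.AAA",
    "eyJhbGciOi.BBB",
    "eyJhbGciOi.AAA",   # duplicate: this is what broke the scenario
]

dupes = {t: n for t, n in Counter(token_lines).items() if n > 1}
print(f"{len(dupes)} duplicated token(s): {dupes}")
```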
What can we do about this? We have to regenerate the Tokens so that none are duplicated when adding products to the shopping cart, thereby avoiding too many items piling up in one person's cart.

What's next? We have only one choice: clear all the data and create reasonable shopping cart data for every user. I spent several hours here creating more than 1.3 million rows of data. Now, let's go back to the scenario:
As you can see, TPS has increased to 300!
Originally, this would have been a success worth celebrating over afternoon tea. However, things don't always go as planned: during the run, I found another problem, which forced us to start the second phase of analysis.
Phase 2 #
Scenario Running Data #
What’s the problem? I found this TPS curve in the stress running data:
As you can see, the TPS drops at regular intervals and then recovers, forming an obvious, and quite large, zigzag pattern.
What should we do? Following the RESAR performance analysis logic, we deal with this problem step by step. We have already seen the architecture diagram earlier, so let's go straight to analyzing the response time.
Analyzing the Response Time #
- Gateway:
- Order:
- Cart:
- Portal:
From the data above, every layer shows some time consumption; no single layer stands out as healthy.
Global Analysis #
Let’s check the current global monitoring data. It can be seen that worker-3 has the highest CPU usage:
Therefore, let’s check which Pods are on worker-3:
[root@k8s-master-3 ~]# kubectl get pods -o wide | grep k8s-worker-3
cloud-nacos-registry-685b8957d7-vskb6 1/1 Running 0 3d7h 10.100.69.199 k8s-worker-3 <none> <none>
cloud-redis-7f7db7f45c-t5g46 2/2 Running 1 3d4h 10.100.69.196 k8s-worker-3 <none> <none>
elasticsearch-master-2 1/1 Running 0 23h 10.100.69.209 k8s-worker-3 <none> <none>
svc-mall-cart-79c667bf56-j76h6 1/1 Running 0 20h 10.100.69.213 k8s-worker-3 <none> <none>
svc-mall-order-fbfd8b57c-kbczh 1/1 Running 0 3d7h 10.100.69.202 k8s-worker-3 <none> <none>
svc-mall-portal-846d9994f8-m7jbq 1/1 Running 0 2d10h 10.100.69.207 k8s-worker-3 <none> <none>
svc-mall-search-c9c8bc847-h7sgv 1/1 Running 0 23h 10.100.69.210 k8s-worker-3 <none> <none>
[root@k8s-master-3 ~]#
It turns out that quite a few services live on worker-3 as well.

Now let's log in to worker-3 and run top. What I actually want to see here is the process table, because I want to determine which service consumes the most resources before deciding which service to deal with.
[root@k8s-worker-3 ~]# top
top - 22:13:01 up 3 days, 8:39, 3 users, load average: 40.34, 30.03, 18.02
Tasks: 326 total, 6 running, 320 sleeping, 0 stopped, 0 zombie
%Cpu0 : 74.5 us, 13.4 sy, 0.0 ni, 7.7 id, 0.0 wa, 0.0 hi, 4.4 si, 0.0 st
%Cpu1 : 66.3 us, 12.1 sy, 0.0 ni, 16.5 id, 0.0 wa, 0.0 hi, 4.7 si, 0.3 st
%Cpu2 : 49.7 us, 32.4 sy, 0.0 ni, 14.9 id, 0.0 wa, 0.0 hi, 2.7 si, 0.3 st
%Cpu3 : 73.2 us, 9.7 sy, 0.0 ni, 12.4 id, 0.0 wa, 0.0 hi, 4.7 si, 0.0 st
%Cpu4 : 76.4 us, 10.5 sy, 0.0 ni, 8.8 id, 0.0 wa, 0.0 hi, 4.1 si, 0.3 st
%Cpu5 : 62.4 us, 16.4 sy, 0.0 ni, 16.1 id, 0.0 wa, 0.0 hi, 4.7 si, 0.3 st
KiB Mem : 16265992 total, 211212 free, 9204800 used, 6849980 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 6650068 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
32485 root 20 0 8895760 700564 16896 S 101.6 4.3 723:03.52 java -Dapp.id=svc-mall-cart -javaagent:/opt/skywalking/agent/skywalking-agent.jar -Dskywalking.agent.service_name=svc-ma+
32589 root 20 0 8845576 778684 15896 S 93.6 4.8 427:04.44 java -Dapp.id=svc-mall-order -javaagent:/opt/skywalking/agent/skywalking-agent.jar -Dskywalking.agent.service_name=svc-m+
24119 root 20 0 8825208 600956 15888 S 67.9 3.7 262:00.83 java -Dapp.id=svc-mall-portal -javaagent:/opt/skywalking/agent/skywalking-agent.jar -Dskywalking.agent.service_name=svc-+
............
In the top output above, we mainly focus on the CPU-intensive processes. It's not hard to see that the Cart, Order, and Portal services, all of which participate in the confirm-cart-and-order business flow, sit on the same worker. At the same time, the load average on this 6C worker has reached 40.
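To put that load average in perspective (my own arithmetic on the top output above):

```python
# Put the load average from the top output above into perspective.
cores = 6
load_1min = 40.34        # 1-minute load average from top

per_core = load_1min / cores
print(f"~{per_core:.1f} runnable/waiting tasks per core")
# Anything much above 1 per core means tasks are queuing for CPU time,
# which is exactly the contention we analyze next.
```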
Targeted Analysis #
From a system perspective, a long CPU run queue is mainly caused by resource contention: threads are constantly being woken up, which can be seen through start_thread:
Now what we need to do is bring the threads down.
How do we bring them down? There are two ways:
- Move the services away, tidy up one by one, divide and conquer.
- Reduce the threads of the services.
Both methods reduce resource contention, but with different trade-offs: the first is more reasonable but consumes more overall resources; the second reduces contention but increases response time.
You may have already noticed that neither of these methods explains the unstable TPS. So why does the TPS drop and then recover? We don't know yet, but following our global-then-targeted analysis approach, let's first look at the resource consumption of worker-3:
During the same time period, I also checked the resource consumption of the other workers on the same physical machine and found that worker-8 looks a bit abnormal. Please note that my examination still follows the logic described in Lesson 3 and the performance analysis decision tree described in Lesson 4; there is no leap here. In fact, we are still at the first-level counters of global monitoring.
Let’s take a closer look at the resource consumption of worker-8:
Now let’s take a look at the execution data of the stress scenario:
From the resource utilization of worker-8 above, there are indeed high moments. Considering the problem of resource contention on the same physical machine, let us move the cart to worker-7 and the order to worker-9, and then look at the TPS:
Colorful ups and downs; really something. Setting aside for now whether TPS increased, the trend alone is enough to break one's heart. Since this is the result, let's fall back on the old method and split the response time again:
- Gateway:
- Order:
- Cart:
- Portal:
- Member:
From the time shown above, the Gateway consumes more time, which is strange, as there should be no problem with the Gateway service. So let’s take a look at what services are running on the Gateway machine:
Oh, what's that at the top? It turns out to be an ES process consuming multiple CPUs. Seeing this, I remembered that a few days ago, in order to increase the capacity of ES, we added CPUs to the ES data and ES client nodes. Since both consume a lot of CPU, giving each of them only one CPU seemed too unfair, so we increased the count to improve their performance.

But I didn't expect the ES data and ES client nodes to have such a big impact on the application.
I adjusted the CPU for ES at that time because a search service in our architecture used it, and at that time, the CPU given to it was only one core, which resulted in a low TPS for the search service. I described this in Lesson 15. If you are not clear, you can take a look again.
At the same time, note that ES data and ES client are not single nodes but several. The problem this causes is that when any ES node consumes too many resources, it affects the worker machine it runs on, which in turn affects the entire physical machine hosting that worker.
Since the process of ES consumes the most resources, what should we do? To validate the problem, I will first stop ES and see if TPS can increase. If it can, then we will consider how to restrict the resources of ES.
After stopping ES, the TPS is as shown in the following figure:
Do you see it? TPS has doubled and no longer drops. A very satisfying result!
So next, we need to consider limiting ES to a fixed worker so that it does not affect the current application.
Summary #
In this lesson, we had two stages of analysis.
In the first stage, we identified data issues. For performance, data is a very important foundational resource, and the correctness of the data directly affects the results of testing.
Newcomers to the performance industry often ask: how much data should a performance script use? To me the answer is clear: in all performance scenarios, the data used should be determined by the actual business logic; only then can we trust the results.
In the second stage, we identified an interesting problem whose complexity lies in the overall architecture. We use KVM, Kubernetes, and Docker as the infrastructure, and the application we chose was not originally a true microservice; it is an open-source system we modified into one.

In such an environment, when an application is modified and restarted, it is uncertain which node Kubernetes will schedule it to. That is why we sometimes see problems here and sometimes there: the logic for identifying the issue is correct, but the level is wrong. This is also the reason for the random issues that occurred in the previous lesson.
Therefore, following the performance analysis decision tree from Lesson 4, we still need a global-monitoring, targeted-monitoring mindset, and we need to find the correlations between counters. Then, when related counters show issues, we will know how they relate to one another.
I hope you can keep the logic of performance analysis in mind from this lesson.
Homework #
Finally, please give some thought to the following questions:
- How should the parameters in a performance script be designed?
- How would you locate the scenario when the TPS drops? Briefly describe your approach.
Remember to discuss and share your thoughts with me in the comments section. Each reflection will help you make further progress.
If you have gained something from reading this article, feel free to share it with your friends and learn together. See you in the next class!