18 Why Dynamic Parameterization Logic Is So Important: Confirming an Order from Shopping Cart Information #

Hello, I’m Gao Lou.

Today, let’s take a look at the performance of the interface for confirming orders with shopping cart information and identify areas that need optimization.

In this lesson, I will show you how to use method-level tracing to pinpoint problems caused by parameters. These parameters are not used directly by the interface under test; they take effect at deeper levels of the call chain.

We all understand straightforward, direct parameterization. But when a parameter produces new data that is then consumed by subsequent actions, you need to pay extra attention: the first layer of data may look perfectly fine while problems only surface in later steps. That is why we must watch how parameterized data changes as it flows through the business, that is, dynamically parameterized data.
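
To make this concrete, here is a minimal sketch in shell of the kind of chained flow this lesson deals with. The endpoint paths and response fields are hypothetical placeholders, not the real mall API; the point is only that the token produced by the login step is dynamic data on which every later request depends, so a defect in it shows up downstream rather than at the first step.

# Hypothetical endpoints and field names; only the chaining pattern matters here
BASE=http://gateway.example.com

# Step 1: log in and capture the token generated for this virtual user
TOKEN=$(curl -s -X POST "$BASE/sso/login" \
          -d "username=user001&password=123456" | jq -r '.data.token')

# Step 2: the same token is reused to add a product to this user's cart
curl -s -X POST "$BASE/cart/add" \
     -H "Authorization: Bearer $TOKEN" \
     -d "productId=26&quantity=1"

# Step 3: and again to confirm the order from the cart
curl -s -X POST "$BASE/order/confirm" \
     -H "Authorization: Bearer $TOKEN"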

In addition, I will also show you how to handle the complex issues that arise when an application runs on multiple nodes and a single node consumes too many resources.

Without further ado, let’s start our analysis today!

Scenario Running Data #

For the interface that confirms the order based on shopping cart information, the performance results of our first run are as follows:

From the graph, we can see that the response time increases as the load rises, while TPS only reaches a little over 160, which is on the low side. So we need to work out how to increase the TPS.

Please note that this is a typical performance issue where the TPS is low and the response time keeps increasing.

Following the RESAR performance analysis logic, let’s take a look at the architecture diagram for this interface.

View the Architecture Diagram #

As you can see, this interface involves multiple services, and the architecture diagram is more complex than the ones for other interfaces we’ve seen before.

Next, let’s analyze the response time.

Splitting Response Time #

  • Gateway:

  • Order:

  • Member:

  • Cart:

  • Portal:

From the above breakdown, it can be seen that Cart has the longest time consumption. Therefore, let’s analyze Cart first.

Let’s also click on the time consumption between Cart and MySQL to see what is happening:

The time between Cart and MySQL doesn’t seem to be long, so we don’t need to consider the SQL time consumption of the database.

Next, let’s analyze the Cart service with long response time.

Phase 1 #

Overall Analysis #

As usual, let’s take a look at the resource consumption at the worker level:

From the above chart, it seems that worker-3 is consuming more resources. Let’s check what services are running on worker-3.

[root@k8s-master-2 ~]# kubectl get pods -o wide | grep k8s-worker-3
cloud-nacos-registry-685b8957d7-vskb6       1/1     Running     0          2d11h   10.100.69.199    k8s-worker-3   <none>           <none>
cloud-redis-7f7db7f45c-t5g46                2/2     Running     0          2d8h    10.100.69.196    k8s-worker-3   <none>           <none>
elasticsearch-master-2                      1/1     Running     0          3h28m   10.100.69.209    k8s-worker-3   <none>           <none>
svc-mall-cart-558d787dc7-g6qgh              1/1     Running     0          2d11h   10.100.69.201    k8s-worker-3   <none>           <none>
svc-mall-order-fbfd8b57c-kbczh              1/1     Running     0          2d11h   10.100.69.202    k8s-worker-3   <none>           <none>
svc-mall-portal-846d9994f8-m7jbq            1/1     Running     0          38h     10.100.69.207    k8s-worker-3   <none>           <none>
svc-mall-search-c9c8bc847-h7sgv             1/1     Running     0          161m    10.100.69.210    k8s-worker-3   <none>           <none>
[root@k8s-master-2 ~]# 

We can see that quite a few services are running on worker-3. Which of them is consuming the most resources? Let’s go into worker-3 and check top:

[root@k8s-worker-3 ~]# top
top - 01:51:35 up 2 days, 12:18,  2 users,  load average: 19.48, 18.40, 17.07
Tasks: 319 total,   1 running, 318 sleeping,   0 stopped,   0 zombie
%Cpu0  : 68.6 us,  6.4 sy,  0.0 ni, 19.9 id,  0.0 wa,  0.0 hi,  5.1 si,  0.0 st
%Cpu1  : 66.7 us,  5.8 sy,  0.0 ni, 22.8 id,  0.0 wa,  0.0 hi,  4.8 si,  0.0 st
%Cpu2  : 66.4 us,  6.1 sy,  0.0 ni, 22.7 id,  0.0 wa,  0.0 hi,  4.7 si,  0.0 st
%Cpu3  : 65.7 us,  5.4 sy,  0.0 ni, 23.6 id,  0.0 wa,  0.0 hi,  5.4 si,  0.0 st
%Cpu4  : 66.6 us,  5.7 sy,  0.0 ni, 22.0 id,  0.0 wa,  0.0 hi,  5.7 si,  0.0 st
%Cpu5  : 67.6 us,  5.8 sy,  0.0 ni, 22.5 id,  0.0 wa,  0.0 hi,  4.1 si,  0.0 st
KiB Mem : 16265992 total,  2525940 free,  7015104 used,  6724948 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  8848464 avail Mem 
    
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                      
32216 root      20   0 8878548 658820  16980 S 280.5  4.1 375:31.82 java -Dapp.id=svc-mall-cart -javaagent:/opt/skywalking/agent/skywalking-agen+
32589 root      20   0 8839408 589196  15892 S  84.1  3.6 171:16.88 java -Dapp.id=svc-mall-order -javaagent:/opt/skywalking/agent/skywalking-age+
24119 root      20   0 8798548 549804  15892 S  65.9  3.4 115:52.74 java -Dapp.id=svc-mall-portal -javaagent:/opt/skywalking/agent/skywalking-ag+
 1089 root      20   0 2438956 105708  37776 S   6.3  0.6 248:21.71 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.co+
 5470 root      20   0 1154816  14992   1816 S   3.6  0.1  20:15.93 redis-server 0.0.0.0:6379

From the data above, the Cart service does appear to be consuming the most CPU. However, the six CPUs are not fully used yet (each still shows roughly 20% idle), so we should keep that in mind.

Let’s continue with the targeted analysis.

Targeted Analysis #

Since the Cart service is consuming more CPU, it makes sense to look at what threads are doing in the Cart service.

Most of these threads are in a green Runnable state, indicating they are quite busy. It’s possible that we have configured the number of threads too low. Let’s check the configuration:

server:
  port: 8086
  tomcat:
    accept-count: 1000
    threads:
      max: 20
      min-spare: 5
    max-connections: 500
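
As a quick sanity check before tracing methods, and assuming Arthas is attached to the Cart process (we use it for tracing later in this lesson anyway), the thread command can confirm whether these 20 Tomcat worker threads really are the busy ones:

# Inside the Arthas console attached to the svc-mall-cart JVM
thread -n 5                                   # the 5 busiest threads with their stacks
thread --all | grep http-nio-8086-exec        # states of the Tomcat worker threads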

Now that we know the thread configuration for the embedded Tomcat in Spring Boot, let’s analyze the methods currently being executed in Cart to see if our assumptions are correct:

When looking at this chart, pay attention to the positions where the lines are longer on the right side. These are the parts that take longer to execute:

  1. The MySQL-related calls take a long time overall. Although the Mysql/JDBI/PreparedStatement/execute span itself does not consume much time, the subsequent Balance/Promotion/Cart/CartPromotion spans consume quite a lot.
  2. The Promotionnew method itself takes a long time.

Since the slow spot is related to MySQL, let’s use mysqlreport to look at the overall state of MySQL:

__ Connections _________________________________________________________
Max used          152 of  151      %Max: 100.66
Total             540     0.0/s

It turns out the connections are exhausted: max_connections is still at the MySQL default of 151, and 152 were requested. Let’s quickly increase the maximum connections from 151 to 500.
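
For reference, this is roughly how that change is made. SET GLOBAL takes effect immediately for new connections but does not survive a restart, so the configuration file should be updated as well; the host and credentials below are placeholders.

# Raise the limit at runtime
mysql -h <mysql-host> -uroot -p -e "SET GLOBAL max_connections = 500;"

# Persist it across restarts in my.cnf under the [mysqld] section:
#   max_connections = 500

# Verify
mysql -h <mysql-host> -uroot -p -e "SHOW VARIABLES LIKE 'max_connections';"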

However, even after retesting, the response time did not change. So we need to continue tracing the methods on Cart.

Method-level tracing #

So, we had to dive into method-level tracing to see where the method “Promotionnew” that we are concerned about is slow.

Based on the previous invocation view, we can write the following tracing statement:

trace -E com.dunshan.mall.cart.controller.CartItemController listPromotionnew -n 5 -v --skipJDKMethod false '1==1'

And we obtained the following result:

    `---ts=2021-01-16 15:08:58;thread_name=http-nio-8086-exec-34;id=f8;is_daemon=true;priority=5;TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@56887c8f
        `---[97.827186ms] com.dunshan.mall.cart.service.imp.CartItemServiceImpl$$EnhancerBySpringCGLIB$$ac8f5a97:listPromotion()
            `---[97.750962ms] org.springframework.cglib.proxy.MethodInterceptor:intercept() #57
                `---[97.557484ms] com.dunshan.mall.cart.service.imp.CartItemServiceImpl:listPromotion()
                    +---[72.273747ms] com.dunshan.mall.cart.service.imp.CartItemServiceImpl:list() #166
                    +---[0.003516ms] cn.hutool.core.collection.CollUtil:isNotEmpty() #172
                    +---[0.004207ms] java.util.List:stream() #173
                    +---[0.003893ms] java.util.stream.Stream:filter() #57
                    +---[0.003018ms] java.util.stream.Collectors:toList() #57
                    +---[0.060052ms] java.util.stream.Stream:collect() #57
                    +---[0.002017ms] java.util.ArrayList:<init>() #177
                    +---[0.003013ms] org.springframework.util.CollectionUtils:isEmpty() #179
                    `---[25.152532ms] com.dunshan.mall.cart.feign.CartPromotionService:calcCartPromotion() #181

We can see that within the traced method com.dunshan.mall.cart.service.imp.CartItemServiceImpl:listPromotion(), two calls account for most of the time:

  1. com.dunshan.mall.cart.service.imp.CartItemServiceImpl:list()
  2. com.dunshan.mall.cart.feign.CartPromotionService:calcCartPromotion()

Tracing the List function #

We executed the following trace statement in Arthas to trace the com.dunshan.mall.cart.service.imp.CartItemServiceImpl:list function:

trace com.dunshan.mall.cart.service.imp.CartItemServiceImpl list -v -n 5 --skipJDKMethod false '1==1'

And we obtained the following result:

    `---ts=2021-01-16 15:19:45;thread_name=http-nio-8086-exec-65;id=23ce;is_daemon=true;priority=5;TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@56887c8f
        `---[70.158793ms] com.dunshan.mall.cart.service.imp.CartItemServiceImpl:list()
            +---[0.003501ms] com.dunshan.mall.model.OmsCartItemExample:<init>() #150
            +---[0.002642ms] com.dunshan.mall.model.OmsCartItemExample:createCriteria() #151
            +---[0.002932ms] com.dunshan.mall.model.OmsCartItemExample$Criteria:andDeleteStatusEqualTo() #57
            +---[0.00304ms] com.dunshan.mall.model.OmsCartItemExample$Criteria:andMemberIdEqualTo() #57
            `---[70.078976ms] com.dunshan.mall.mapper.OmsCartItemMapper:selectByExample() #152

After another round of admittedly tedious tracing, we can see that a select statement takes almost all of the time. The statement is:

select id, product_id, product_sku_id, member_id, quantity, price, product_pic, product_name,
       product_sub_title, product_sku_code, member_nickname, create_date, modify_date, delete_status,
       product_category_id, product_brand, product_sn, product_attr
  from oms_cart_item
 where (delete_status = ? and member_id = ?)

Why would such a simple select take so long? Let’s set that question aside for a moment and first check how much data the oms_cart_item table holds. When I connected to the database, I found 10,612 rows in oms_cart_item, which is not a large amount of data.

In addition, I checked the indexes: they exist, and the execution plan does use them. So we need to consider whether the problem lies in how much data this particular query returns. Let’s see how many rows the select statement actually fetches.

After filling in the bind parameters and executing the query myself, I found that a single member_id corresponds to more than 500 rows. Does that mean one person put more than 500 items in the cart at once?
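
The checks themselves are simple; something along these lines works, assuming the schema is named mall (host, credentials, and the sample bind values are placeholders):

mysql -h <mysql-host> -uroot -p mall -e "
  -- total volume: about 10,612 rows, not large
  SELECT COUNT(*) FROM oms_cart_item;

  -- the indexes exist and the execution plan uses them
  EXPLAIN SELECT * FROM oms_cart_item WHERE delete_status = 0 AND member_id = 1;

  -- but the per-member distribution is the real problem: some members have 500+ rows
  SELECT member_id, COUNT(*) AS items
    FROM oms_cart_item
   GROUP BY member_id
   ORDER BY items DESC
   LIMIT 10;"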

Since there are so many records for the same person in the shopping cart information, they must have been added during the execution of the performance script because each user’s shopping cart is initially empty. So, we need to investigate how the script adds items to the shopping cart.

The script for adding items to the shopping cart is simple: it is just a POST request carrying the product ID, with a Token in the HTTP request header to identify the user.

Here, we need to check whether the Token is being reused. The Token parametrization configuration in JMeter is as follows:

The configuration looks fine: the Token parameter is set not to be reused. And yet the data leaves only one possibility, namely that Tokens are being reused anyway.

After randomly checking several Tokens, I found that there are many duplicated Tokens. This explains why we see so many items in one person’s shopping cart.
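
A quick way to confirm this against the parameter file itself (the file name token.csv is hypothetical):

# How many Token values appear more than once in the JMeter parameter file?
sort token.csv | uniq -d | wc -l

# Show the most heavily duplicated values
sort token.csv | uniq -c | sort -rn | head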

However, this logic is worth pausing on. Think about it: we configured the parameterization not to reuse data, yet many Tokens are clearly being reused, which can only mean that the Token parameter file itself contains duplicate values.

What can we do about this? We have to clean up the Tokens so that no Token is reused when adding products to the shopping cart, which avoids piling far too many items into a single person’s cart. A rough cleanup sketch follows.
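
This is only a sketch, assuming a disposable test database and that the parameter file can simply be deduplicated or regenerated from fresh logins; the host and file names are placeholders.

# Deduplicate the Token parameter file (or, better, regenerate it from fresh logins)
sort -u token.csv > token_unique.csv && mv token_unique.csv token.csv

# Wipe the polluted cart data in the test schema (test environment only)
mysql -h <mysql-host> -uroot -p mall -e "TRUNCATE TABLE oms_cart_item;"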

What’s next? We have only one choice: clear all the data and create reasonable shopping cart data for every user. I spent several hours here creating more than 1.3 million rows of data. Now, let’s go back to the scenario:

As you can see, TPS has increased to 300!

This would normally count as a success, the kind where we could take a break and enjoy afternoon tea. However… things don’t always go as planned. While the scenario was running, I found another problem, which forced us into a second phase of analysis.

Phase 2 #

Scenario Running Data #

What’s the problem? I found this TPS curve in the stress running data:

As you can see, the TPS drops at regular intervals and then recovers, forming an obvious and rather large zigzag pattern.

What should we do? As always, we handle it with the RESAR performance analysis logic. We have already seen the architecture diagram earlier, so let’s go straight to analyzing the response time.

Analyzing the Response Time #

  • Gateway:

  • Order:

  • Cart:

  • Portal:

From the data above, every layer seems to have a time consumption issue; none of them performs well.

Global Analysis #

Let’s check the current global monitoring data. It can be seen that worker-3 has the highest CPU usage:

Therefore, let’s check which Pods are on worker-3:

[root@k8s-master-3 ~]# kubectl get pods -o wide | grep k8s-worker-3
cloud-nacos-registry-685b8957d7-vskb6       1/1     Running     0          3d7h    10.100.69.199    k8s-worker-3   <none>           <none>
cloud-redis-7f7db7f45c-t5g46                2/2     Running     1          3d4h    10.100.69.196    k8s-worker-3   <none>           <none>
elasticsearch-master-2                      1/1     Running     0          23h     10.100.69.209    k8s-worker-3   <none>           <none>
svc-mall-cart-79c667bf56-j76h6              1/1     Running     0          20h     10.100.69.213    k8s-worker-3   <none>           <none>
svc-mall-order-fbfd8b57c-kbczh              1/1     Running     0          3d7h    10.100.69.202    k8s-worker-3   <none>           <none>
svc-mall-portal-846d9994f8-m7jbq            1/1     Running     0          2d10h   10.100.69.207    k8s-worker-3   <none>           <none>
svc-mall-search-c9c8bc847-h7sgv             1/1     Running     0          23h     10.100.69.210    k8s-worker-3   <none>           <none>
[root@k8s-master-3 ~]#

Once again, quite a few services are running on worker-3.

Now let’s log in to worker-3 and look at top. What I actually want to see here is the process list, because I want to determine which service consumes the most resources before deciding which one to deal with.

[root@k8s-worker-3 ~]# top
top - 22:13:01 up 3 days,  8:39,  3 users,  load average: 40.34, 30.03, 18.02
Tasks: 326 total,   6 running, 320 sleeping,   0 stopped,   0 zombie
%Cpu0  : 74.5 us, 13.4 sy,  0.0 ni,  7.7 id,  0.0 wa,  0.0 hi,  4.4 si,  0.0 st
%Cpu1  : 66.3 us, 12.1 sy,  0.0 ni, 16.5 id,  0.0 wa,  0.0 hi,  4.7 si,  0.3 st
%Cpu2  : 49.7 us, 32.4 sy,  0.0 ni, 14.9 id,  0.0 wa,  0.0 hi,  2.7 si,  0.3 st
%Cpu3  : 73.2 us,  9.7 sy,  0.0 ni, 12.4 id,  0.0 wa,  0.0 hi,  4.7 si,  0.0 st
%Cpu4  : 76.4 us, 10.5 sy,  0.0 ni,  8.8 id,  0.0 wa,  0.0 hi,  4.1 si,  0.3 st
%Cpu5  : 62.4 us, 16.4 sy,  0.0 ni, 16.1 id,  0.0 wa,  0.0 hi,  4.7 si,  0.3 st
KiB Mem : 16265992 total,   211212 free,  9204800 used,  6849980 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  6650068 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                  
32485 root      20   0 8895760 700564  16896 S 101.6  4.3 723:03.52 java -Dapp.id=svc-mall-cart -javaagent:/opt/skywalking/agent/skywalking-agent.jar -Dskywalking.agent.service_name=svc-ma+
32589 root      20   0 8845576 778684  15896 S  93.6  4.8 427:04.44 java -Dapp.id=svc-mall-order -javaagent:/opt/skywalking/agent/skywalking-agent.jar -Dskywalking.agent.service_name=svc-m+
24119 root      20   0 8825208 600956  15888 S  67.9  3.7 262:00.83 java -Dapp.id=svc-mall-portal -javaagent:/opt/skywalking/agent/skywalking-agent.jar -Dskywalking.agent.service_name=svc-+
............

In the top output above, we mainly focus on the CPU-intensive processes. It is not hard to see that the Cart, Order, and Portal services, all of which take part in the confirm-order-from-cart business, sit on the same worker. At the same time, the load average on this 6C worker has reached 40, which means the CPU run queue is long.
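
A simple way to watch that run queue directly is vmstat; the r column shows how many tasks are runnable and competing for the CPUs, and a value far above the 6 cores here confirms heavy contention:

[root@k8s-worker-3 ~]# vmstat 1 5    # print stats every second, five times; watch the r column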

Targeted Analysis #

From a system perspective, a long CPU queue mainly comes from resource contention: threads are constantly being woken up, which we can see from the start_thread call stacks.

Now what we need to do is reduce the number of runnable threads competing for the CPUs.

How do we do that? There are two ways:

  1. Move the services apart, tidy them up one by one, divide and conquer.
  2. Reduce the number of threads in the services.

Both methods reduce resource contention, but with different trade-offs: the first is more reasonable but consumes more resources overall; the second reduces contention but increases response time. A sketch of the first option follows.
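
One hedged way to carry out the first option in Kubernetes is a nodeSelector on the Deployment: label the target node, patch the Deployment, and let the scheduler recreate the Pod there. The label key and target nodes below are illustrative.

# Label the target nodes
kubectl label node k8s-worker-7 app-dedicated=cart
kubectl label node k8s-worker-9 app-dedicated=order

# Pin the Cart Deployment to worker-7 (Order is handled the same way)
kubectl patch deployment svc-mall-cart \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"app-dedicated":"cart"}}}}}'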

As you may already have noticed, neither of these methods explains the unstable TPS. So why does the TPS drop and then recover? We don’t know yet, but following our global-then-targeted analysis approach, let’s first look at the resource consumption of worker-3:

During the same time period, I also checked the resource consumption of the other workers on the same physical machine and found that worker-8 looks a bit abnormal. Please note that my examination here still follows the logic described in Lesson 3 and the performance analysis decision tree described in Lesson 4, so I hope you don’t feel there is a leap. We are, in fact, still at the first-level counters of global monitoring.

Let’s take a closer look at the resource consumption of worker-8:

Now let’s take a look at the execution data of the stress scenario:

The resource utilization of worker-8 above does indeed spike at times. Given the resource contention on the same physical machine, let’s move Cart to worker-7 and Order to worker-9, and then look at the TPS:

Colorful, full of ups and downs; lovely, isn’t it… We don’t even care for the moment whether this increases TPS; the trend alone is enough to break one’s heart. Since this is the result, let’s fall back on the old method and split the response time again:

  • Gateway:

  • Order:

  • Cart:

  • Portal:

  • Member:

From the breakdown above, the Gateway consumes more time, which is strange, because there should be nothing wrong with the Gateway service itself. So let’s take a look at what else is running on the Gateway machine:

Oh, what is that at the top? It turns out to be an ES process consuming several CPUs. Seeing this, I remembered that a few days ago, in order to increase ES capacity, we added CPUs to the ES data and ES client nodes: they both consume a lot of CPU, and giving them only one CPU each seemed unfair, so we increased their CPU allocation to improve performance.

But I didn’t expect the ES data and ES client nodes to have such a big impact on the application.

I adjusted the CPU for ES back then because a search service in our architecture depends on it, and with only one core the search service’s TPS was low. I described this in Lesson 15; if it isn’t clear, you can go back and take another look.

At the same time, note that ES data and ES client are not single nodes but several. The consequence is that whenever any ES node consumes too many resources, it affects the worker it runs on, which in turn affects the entire physical machine hosting that worker.

Since the ES processes consume the most resources, what should we do? To validate the problem, I will first stop ES and see whether TPS increases. If it does, we will then consider how to restrict ES’s resources.
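
To take ES out of the picture temporarily, one option is to scale the StatefulSet seen earlier down to zero; this is a test-environment-only move, and the exact resource names and replica count depend on how ES was deployed.

# Temporarily stop the ES data nodes (test environment only)
kubectl scale statefulset elasticsearch-master --replicas=0

# Bring them back after the validation run (replica count as originally deployed)
kubectl scale statefulset elasticsearch-master --replicas=3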

After stopping ES, the TPS is as shown in the following figure:

Do you see it? TPS has doubled and no longer drops. Exactly what we hoped for!

So next, we need to consider limiting ES to a fixed worker so that it does not affect the current application.
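
One hedged way to do that is, again, node pinning: dedicate a node to ES and keep the application workloads off it. The label key, taint, and node name below are illustrative, and if the node is tainted the ES Pods also need a matching toleration.

# Dedicate a node to ES
kubectl label node k8s-worker-6 role=es

# Pin the ES StatefulSet to that node
kubectl patch statefulset elasticsearch-master \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"role":"es"}}}}}'

# Optionally keep other workloads off the node (ES Pods then need a toleration)
kubectl taint node k8s-worker-6 dedicated=es:NoSchedule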

Summary #

In this lesson, we had two stages of analysis.

In the first stage, we identified data issues. For performance, data is a very important foundational resource, and the correctness of the data directly affects the results of testing.

A question that often comes up among newcomers to the performance industry is: how much data should a performance script use? To me the answer is clear: in every performance scenario, the data used should be determined by the actual business logic, and only then can we be sure the results are valid.

In the second stage, we identified an interesting problem whose complexity lies in the overall architecture. We use KVM, Kubernetes, and Docker as the infrastructure, and the application we chose was not originally a true microservice system; it is an open-source system whose architecture we modified to turn it into a genuine microservice application.

In such an environment, when an application has an issue and is modified and restarted, it is uncertain which node Kubernetes will schedule it onto. That is why we sometimes see a problem in one place and sometimes in another: the logic for identifying the issue is the same each time, but it surfaces at a different location. This is also the reason for the random issues that occurred in the previous lesson.

Therefore, following the performance analysis decision tree from Lesson 4, we still need a global-monitoring-then-targeted-monitoring mindset, and we need to find the correlations between counters. That way, when related counters show problems, we know how they relate to one another.

I hope you can keep the logic of performance analysis in mind from this lesson.

Homework #

Finally, please give some thought to the following questions:

  1. How should the parameters in a performance script be designed?
  2. How would you go about locating the cause when TPS drops in a scenario? Briefly describe your approach.

Remember to discuss and share your thoughts with me in the comments section. Each reflection will help you make further progress.

If you have gained something from reading this article, feel free to share it with your friends and learn together. See you in the next class!