16 Adding Goods to the Shopping Cart: SQL Optimization and Stress-Tool Parameter Analysis #

Hello, I am Gao Lou.

In today’s lesson, I will use the add-to-cart interface as an example to walk you through SQL optimization and the analysis of parameters in the load testing tool.

When it comes to SQL optimization, many people tend to suspect that there is a problem with the SQL when they see high database resource utilization. Although this direction seems correct, it is often difficult to determine which exact SQL query is causing the problem and what kind of problem it is. Therefore, in this lesson, I will teach you how to quickly identify problematic SQL queries based on high resource utilization and make appropriate adjustments. In addition, you will learn how to handle database locks caused by improper use of parameters in load testing tools.

Now, let’s begin the analysis for this lesson.

Stress Data #

For the interface of adding products to the shopping cart, the performance results of our first run are as follows:

Looking at it, I feel like crying. Do you feel the same way? This chart shows more than one problem. Even with my limited professional experience, one glance with my third eye open tells me there are two problems here:

  • Even at its peak, the TPS is not high, only around 50;
  • During peak times, there are a lot of errors being generated.

So which problem is more important? Some people may say the errors obviously come first; isn’t it dizzying to see so many of them? If they really bother you that much, fine, you can handle the errors first.

However, in my opinion, it is also perfectly fine to address the low TPS first. Even though there are errors, only about 5% of the requests are failing, so there is no great rush.

Nevertheless, no matter which problem we pick first, we still need to follow the thinking process of the performance analysis decision tree.

Look at the Architecture Diagram #

The logic of this interface is clear: Pressure tool - Gateway - Cart - Member.

I plan to analyze the problem of low TPS and increased response time, which can be observed in the first half of the stress curve. Therefore, our analysis will start with breaking down the response time.

If you want to handle errors first in such a scenario, start by checking the logs. In fact, these errors are easy to deal with because they provide very clear directions.

Analysis Phase 1 #

Breakdown of Response Time #

This time let’s take a look at the screenshots.

  • User - Gateway:

  • Gateway - Cart:

  • Cart - Member:

  • Cart - MySQL:

  • Member - MySQL:

Based on the response time, we need to address MySQL first, specifically the SQL related to the Cart service, because the response time between Cart and MySQL is a bit long.

Global Analysis #

As usual, let’s take a look at the overall monitoring.

Since the CPU usage on worker-1 is high, let’s see what services are running on worker-1.

You may ask why the network download bandwidth is also highlighted in red when it has only just exceeded 100Mb. This comes down to how you interpret the counters. The network metric is red here only because of the threshold configured in the Grafana dashboard; if you don’t want it highlighted, set a higher threshold. Besides, for the network, slightly over 100Mb is not considered high.

Let’s see what’s on worker-1.

```
[root@k8s-master-2 ~]# kubectl get pods -o wide|grep k8s-worker-1
elasticsearch-data-1                        1/1     Running   1          11d     10.100.230.57    k8s-worker-1   <none>           <none>
elasticsearch-master-0                      1/1     Running   0          3d11h   10.100.230.60    k8s-worker-1   <none>           <none>
mysql-min-d564fc4df-vs7d6                   1/1     Running   0          22h     10.100.230.1     k8s-worker-1   <none>           <none>
[root@k8s-master-2 ~]#
```

As you can see, worker-1 is running not only MySQL but also ES data nodes, which consume a lot of network resources. However, the evidence does not currently point to them.

Since we saw earlier that the response time of MySQL is long, let’s continue to look at the data in the global monitoring on worker-1.

```
[root@k8s-worker-1 ~]# top
top - 23:08:21 up 3 days, 11:30,  5 users,  load average: 29.90, 28.54, 23.00
Tasks: 309 total,   1 running, 307 sleeping,   0 stopped,   1 zombie
%Cpu0  : 94.1 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  2.9 si,  2.9 st
%Cpu1  : 94.1 us,  2.9 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  2.9 si,  0.0 st
%Cpu2  : 90.9 us,  3.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  6.1 st
%Cpu3  : 89.7 us,  3.4 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  3.4 si,  3.4 st
%Cpu4  : 87.9 us,  6.1 sy,  0.0 ni,  3.0 id,  0.0 wa,  0.0 hi,  0.0 si,  3.0 st
%Cpu5  : 87.9 us,  9.1 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  3.0 st
KiB Mem : 16265992 total,  1176564 free,  8436112 used,  6653316 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  7422832 avail Mem 

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
21344 27        20   0 8222204 628452  12892 S 331.4  3.9 141:36.72 /opt/rh/rh-mysql57/root/usr/libexec/mysqld --defaults-file=/etc/my.cnf
5128 techstar  20   0 5917564   1.4g  21576 S 114.3  8.8 233:09.48 /usr/share/elasticsearch/jdk/bin/java -Xshare:auto -Des.networkaddress.cache+
5127 techstar  20   0   14.1g   3.5g  25756 S  40.0 22.8   1647:28 /usr/share/elasticsearch/jdk/bin/java -Xshare:auto -Des.networkaddress.cache+
1091 root      20   0 1145528 108228  29420 S  25.7  0.7 263:51.49 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
1078 root      20   0 2504364 106288  38808 S  14.3  0.7 429:13.57 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.co+
17108 root      20   0  164472   2656   1712 R  14.3  0.0   0:00.66 top  
```

From the data above, we can see that the MySQL process consumes a lot of CPU, which tells us our evidence chain is intact. Now that we have reached the database, what should we look at? MySQL’s own global monitoring data, of course. So I generated a mysqlreport and trimmed the parts we don’t need (otherwise the output would be far too long):

```
__ Questions ___________________________________________________________
Total         637.05k     8.0/s
  DMS         293.57k     3.7/s  %Total:  46.08
  Com_        235.02k     2.9/s           36.89
.............................
Slow 20 ms    119.50k     1.5/s           18.76  %DMS:  40.70  Log:
DMS           293.57k     3.7/s           46.08
  SELECT      224.80k     2.8/s           35.29         76.57
  UPDATE       51.86k     0.6/s            8.14         17.66
  INSERT       16.92k     0.2/s            2.66          5.76
  REPLACE           0       0/s            0.00          0.00
  DELETE            0       0/s            0.00          0.00
.............................

__ SELECT and Sort _____________________________________________________
Scan          137.84k     1.7/s %SELECT:  61.32
.............................
```

From the data above, we can see that in the "Total" section, DMS (Data Manipulation Statements) accounts for 46.08%. In the DMS section, SELECT accounts for 76.57%. Therefore, we should focus on SELECT statements for further analysis.
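By the way, if you don’t have mysqlreport at hand, the ratios above can be roughly cross-checked against MySQL’s own status counters, which is what the report aggregates. A minimal sketch (sample the counters twice and divide the difference by the interval to turn totals into per-second rates):

```
-- Raw counters behind the Questions/DMS section of mysqlreport.
SHOW GLOBAL STATUS LIKE 'Questions';
SHOW GLOBAL STATUS LIKE 'Com_select';
SHOW GLOBAL STATUS LIKE 'Com_insert';
SHOW GLOBAL STATUS LIKE 'Com_update';
SHOW GLOBAL STATUS LIKE 'Com_delete';

-- Counter behind the "Slow" line: how many queries exceeded long_query_time.
SHOW GLOBAL STATUS LIKE 'Slow_queries';
```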

Through the "Slow 20 ms" line, we can see that slow logs have also appeared. I set the slow log threshold quite low, only 20 ms, so you can see that 1.5 slow logs are being generated per second. The reason I set the slow log threshold relatively low is mainly to record slower SQL queries. However, in your application, you should set it based on the actual situation. It should not be set too high or too low, otherwise it will cause problems.

Targeted Analysis #

Now let's look at the slow logs. Remember, it is best to clear the slow log before the run, so that it only records slow SQL from the period of high load. Otherwise the report will be polluted by data from other periods.

```
[root@7dgroup1 gaolou]# pt-query-digest slow-query.log

# 7.2s user time, 70ms system time, 36.78M rss, 106.05M vsz
# Current date: Wed Dec 30 23:30:14 2020
# Hostname: 7dgroup1
# Files: slow-query.log
# Overall: 36.60k total, 7 unique, 89.06 QPS, 17.17x concurrency _________
# Time range: 2020-12-30T15:22:00 to 2020-12-30T15:28:51
# Attribute          total     min     max     avg     95%  stddev  median
# ============     ======= ======= ======= ======= ======= ======= =======
# Exec time          7055s    20ms      1s   193ms   501ms   160ms   128ms
# Lock time             7s       0    39ms   194us   247us   696us   125us
# Rows sent         35.45k       0       1    0.99    0.99    0.09    0.99
# Rows examine       2.33G       0 112.76k  66.71k 112.33k  46.50k 112.33k
# Query size        14.26M       6    1016  408.53  592.07  195.17  202.40

# Profile
# Rank Query ID                      Response time   Calls R/Call V/M   It
# ==== ============================= =============== ===== ====== ===== ==
#    1 0xB8BDB35AD896842FAC41202B... 5744.3322 81.4% 18420 0.3119  0.07 SELECT pms_sku_stock
#    2 0xC71984B4087F304BE41AC8F8... 1309.1841 18.6% 18138 0.0722  0.03 SELECT oms_cart_item
# MISC 0xMISC                           1.4979  0.0%    46 0.0326   0.0 <5 ITEMS>

# Query 1: 44.82 QPS, 13.98x concurrency, ID 0xB8BDB35AD896842FAC41202BB9C908E8 at byte 6504041
```

The above data shows the slow logs. When examining MySQL slow logs, it is best to clear the logs first so that only slow SQL queries during the high load period are recorded.
# This item is included in the report because it matches --limit.
# Scores: V/M = 0.07
# Time range: 2020-12-30T15:22:00 to 2020-12-30T15:28:51
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count         50   18420
# Exec time     81   5744s    76ms      1s   312ms   580ms   148ms   279ms
# Lock time     47      3s    70us    37ms   184us   224us   673us   119us
# Rows sent     50  17.99k       1       1       1       1       0       1
# Rows examine  85   1.98G 112.76k 112.76k 112.76k 112.76k       0 112.76k
# Query size    26   3.72M     212     212     212     212       0     212
# String:
# Hosts        10.100.5.54
# Users        reader
# Query_time distribution
#   1us
#  10us
# 100us
#   1ms
#  10ms  #
# 100ms  ################################################################
#    1s  #
#  10s+
# Tables
#    SHOW TABLE STATUS LIKE 'pms_sku_stock'\G
#    SHOW CREATE TABLE `pms_sku_stock`\G
# EXPLAIN /*!50100 PARTITIONS*/
select


    id, product_id, sku_code, price, stock, low_stock, pic, sale, promotion_price, lock_stock,
    sp_data

    from pms_sku_stock


     WHERE (  sku_code = '202008270027906' )\G


# Query 2: 44.13 QPS, 3.19x concurrency, ID 0xC71984B4087F304BE41AC8F82A88B245 at byte 20901845
# This item is included in the report because it matches --limit.
# Scores: V/M = 0.03
# Time range: 2020-12-30T15:22:00 to 2020-12-30T15:28:51
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count         49   18138
# Exec time     18   1309s    20ms   419ms    72ms   148ms    43ms    59ms
# Lock time     52      4s    76us    39ms   205us   260us   719us   138us
# Rows sent     49  17.45k       0       1    0.99    0.99    0.12    0.99
# Rows examine  14 356.31M  19.96k  20.22k  20.12k  19.40k       0  19.40k
# Query size    73  10.51M     604     610  607.81  592.07       0  592.07
# String:
# Hosts        10.100.5.54
# Users        reader
# Query_time distribution
#   1us
#  10us
# 100us
#   1ms
#  10ms  ################################################################
# 100ms  ##################
#    1s
#  10s+
# Tables
#    SHOW TABLE STATUS LIKE 'oms_cart_item'\G
#    SHOW CREATE TABLE `oms_cart_item`\G
# EXPLAIN /*!50100 PARTITIONS*/
select


    id, product_id, product_sku_id, member_id, quantity, price, product_pic, product_name,
    product_sub_title, product_sku_code, member_nickname, create_date, modify_date, delete_status,
    product_category_id, product_brand, product_sn, product_attr
    from oms_cart_item
     WHERE (  member_id = 381920
                       and product_id = 317
                  and delete_status = 0
                  and product_sku_id = 317 )\G
```

From the data above, our optimization direction is relatively clear: we need to optimize the two slowest SQL queries, one of which takes up 81.4% of the total time, and the other takes up 18.6%.

Let’s start with the slowest SQL query:

```
select
    id, product_id, sku_code, price, stock, low_stock, pic, sale, promotion_price, lock_stock,
    sp_data
from pms_sku_stock
where ( sku_code = '202008270027906' );
```

To understand where a statement is slow, we need to examine the execution plan:

In the execution plan, the type column shows ALL, which means the query is doing a full table scan and not using any index. Let’s check what indexes exist for the sku_code column.
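The execution-plan screenshots are not reproduced here, so here is roughly how the plan and the current index list can be checked yourself (the exact values will of course depend on your own data):

```
-- Check the execution plan of the slow query.
EXPLAIN
select id, product_id, sku_code, price, stock, low_stock, pic, sale, promotion_price, lock_stock,
       sp_data
from pms_sku_stock
where ( sku_code = '202008270027906' )\G

-- List the indexes currently defined on the table.
SHOW INDEX FROM pms_sku_stock;
```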

Inspecting the indexes, we see only a primary key index on the id column; nothing covers the sku_code column used in the WHERE condition. So let’s add an index on sku_code so the query can locate rows precisely instead of scanning the whole table:

```
ALTER TABLE pms_sku_stock ADD INDEX sku_code_index (sku_code);
```

After modifying the index, let’s take a look at the execution plan:

Now the type column shows ref, which indicates that the WHERE condition is using the new index. Let’s run the scenario again and see the effect:

From the results, we can see that the TPS has increased from 50 to above 150. The response time has also decreased from around 750ms to below 250ms. The improvement is significant.

After optimizing the first SQL query, let’s move on to the second SQL query. Similarly, let’s start with its execution plan:

The type column for this query is also ALL, indicating that its WHERE conditions are not using an index either. Since this query filters on several columns, let’s add a composite index so that the WHERE conditions can use it:

```
ALTER TABLE oms_cart_item ADD INDEX mix_index (member_id,product_id,product_sku_id);
```

After adding the composite index, the execution plan for this SQL query is as follows:
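As with the first query, the plan can be re-checked once the index is in place; a quick sketch (column list trimmed for brevity), where the type column should now show ref instead of ALL:

```
-- Verify that the new composite index is picked up for the cart query.
EXPLAIN
select id, product_id, product_sku_id, member_id, quantity, price
from oms_cart_item
where ( member_id = 381920
        and product_id = 317
        and delete_status = 0
        and product_sku_id = 317 )\G
```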

Once again, let’s run the scenario and see the effect of optimizing these two slowest SQL queries.

Optimization Effect #

The optimization effect is as follows:

The before and after comparison of the optimization is shown below:

I suggest that when writing the report, you use this type of comparison chart to clearly demonstrate the significant effect of the optimization.

Analysis Phase 2 #

Now we need to analyze the errors; we can’t just ignore them.

Stress Data #

Below is the corresponding error graph; I cropped the image a bit so you can see the trend more clearly:

As you can see, there are both successful and failed transactions, and the error rate climbs as the TPS increases. This point is important, so please keep it in mind.

Next, let’s break down the response time.

Breaking Down Response Time #

First, set the time range in SkyWalking:

Please note that when looking at the performance counters, the time windows on each tool must correspond.

  • User - Gateway:

  • Gateway - Cart:


  • Cart - Member:

  • Cart - MySQL:

  • Member - MySQL:

After listing all this information… we didn’t find anything.

You may be wondering how we found nothing, especially since the response time on the Cart side is long. Note that the problem we are analyzing now is the errors, not the response time. If the response time is long, then it is long; that is not today’s question. During analysis, always keep in mind which problem you are investigating and don’t drift away from it, otherwise you will end up confused.

Global Analysis #

Usually, our global analysis starts with resources, which means going through the performance analysis decision tree step by step. As we discussed in Lesson 4, you can check all the first-level counters.

However, in our analysis of this issue, we don’t need to go through all that trouble, because we already saw a large number of errors when looking at the pressure data. In order to analyze the errors, we need to know where they are occurring. Therefore, we can directly check the logs related to this. When looking at the logs, we found the following error messages:

```
2020-12-30 23:44:06.754 ERROR 1 --- [io-8086-exec-41] o.a.c.c.C.[.[.[/].[dispatcherServlet]    : Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is org.springframework.dao.DeadlockLoserDataAccessException: 
### Error updating database.  Cause: com.mysql.cj.jdbc.exceptions.MySQLTransactionRollbackException: Deadlock found when trying to get lock; try restarting transaction
### The error may involve com.dunshan.mall.mapper.OmsCartItemMapper.updateByPrimaryKey-Inline
### The error occurred while setting parameters
### SQL: update oms_cart_item     set product_id = ?,       product_sku_id = ?,       member_id = ?,       quantity = ?,       price = ?,       product_pic = ?,       product_name = ?,       product_sub_title = ?,       product_sku_code = ?,       member_nickname = ?,       create_date = ?,       modify_date = ?,       delete_status = ?,       product_category_id = ?,       product_brand = ?,       product_sn = ?,       product_attr = ?     where id = ?
### Cause: com.mysql.cj.jdbc.exceptions.MySQLTransactionRollbackException: Deadlock found when trying to get lock; try restarting transaction
; Deadlock found when trying to get lock; try restarting transaction; nested exception is com.mysql.cj.jdbc.exceptions.MySQLTransactionRollbackException: Deadlock found when trying to get lock] with root cause
...................................
```

This error clearly indicates a deadlock. But why did the deadlock occur?

In performance analysis, you need to remember that deadlocks are actually relatively easy to analyze. A lock only matters when there is contention, and a deadlock means two (or more) sessions each hold a lock the other one needs, so neither of them can move forward.
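If you want to see both sides of the deadlock from MySQL itself rather than from the application log, InnoDB keeps the most recently detected deadlock in its status output, and can optionally write every deadlock to the error log. A quick way to look:

```
-- The "LATEST DETECTED DEADLOCK" section shows the two transactions,
-- the statements they were running, and the locks they held and waited for.
SHOW ENGINE INNODB STATUS\G

-- Optionally record every deadlock in the MySQL error log, not just the latest one.
SET GLOBAL innodb_print_all_deadlocks = ON;
```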

Next, let’s conduct a targeted analysis to figure out why the deadlock occurred.

Targeted Analysis #

First, let’s find the code corresponding to the “add to cart” business:

```
    /**
     * Add to cart
     * @param productSkuCode SKU code of the product
     * @param quantity Quantity of the product
     * @return
     */
    @Override
    public int addCart(String productSkuCode, Integer quantity) {
        .........................................
        OmsCartItem existCartItem = getCartItem(cartItem);
        if (existCartItem == null) {
            // No cart record for this member and product yet: insert a new row.
            cartItem.setCreateDate(new Date());
            count = cartItemMapper.insert(cartItem);
        } else {
            // The record already exists: accumulate the quantity and update the same row.
            cartItem.setModifyDate(new Date());
            existCartItem.setQuantity(existCartItem.getQuantity() + cartItem.getQuantity());
            count = cartItemMapper.updateByPrimaryKey(existCartItem);
        }

        return count;
    }
```

The transactional declaration for this method is as follows:

```
@Transactional
int addCart(String productSkuCode, Integer quantity);
```

Given this logic, what could make adding items to the shopping cart deadlock? Look, there is an update in the code, and it corresponds to the update statement in the error log above. So if a deadlock occurs, there must be contention for the same record, and which record gets updated is determined by the member. In other words, multiple threads trying to update the shopping cart of the same member at the same time is the cause of the deadlock!

Since there is a member ID conflict, who provides the member information? Without a doubt, this member information must be passed from the script, so let’s check the script associated with it.

The corresponding script is as follows:

You see, there is a parameter productSkuCode which is used in 1000 rows of data.

The above image corresponds to the JMeter script as follows:

Let’s look at these three parameters in the JMeter script:

quotedData: false
recycle: true
stopThread: false

This means all our threads share the same 1000 rows of data and keep cycling through them, so the data is inevitably reused: if two or more threads pick up the same user data, they will update the same shopping cart, and that produces the conflict errors.
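One cheap way to confirm this from the database side (an indirect check, but telling): because addCart accumulates quantity on the existing row, reused member data shows up as a handful of cart records with abnormally large quantities. Something like this would expose them:

```
-- Members whose cart rows have been updated over and over by the script
-- float to the top with suspiciously large quantities.
select member_id, product_id, quantity, modify_date
from oms_cart_item
order by quantity desc
limit 10;
```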

Now let’s change the above three parameters:

quotedData: true
recycle: false
stopThread: true

This ensures that each thread can be assigned different data.

However, this creates another problem: handled this way, 1000 rows of data are no longer enough. What to do? There is no way around it; we have to prepare more user data. Once enough tokens have been generated, we can run the scenario again.

After a night spent generating data, the next day finally arrives.

Optimization Results #

We obtained the following result:

From the data, there are no more error messages, which is the result we expected.

Summary #

Now, let’s summarize this lesson.

“Hey, hold on. Don’t summarize yet, we haven’t solved all the problems. Look, we still have the issue of TPS dropping, right?”

“Young man, don’t rush. Rome wasn’t built in a day. We need to solve the problems one by one. I will address this issue in later lessons.”

In this lesson, we started from the low TPS and analyzed it all the way to the specific SQL. It may seem like a simple matter that can be resolved with just two simple indexes, and the logic isn’t complicated either. However, the analysis approach is crucial.

As for the second problem, we started from the failed transactions and tracked the deadlock information down in the logs. Most people can get that far; only experienced engineers, however, will immediately think of parameter conflicts in the stress tool.

Additionally, there is another key point here: parameterized data must match the real-world business scenario! Teacher Gao has emphasized this many times, and I hope you will remember it.

Homework #

Lastly, I have two questions for you to ponder:

  1. Besides using the methods mentioned in this lesson, do you have any other ways to quickly identify slow SQL queries?
  2. Can you draw the logic for the analysis in the second phase?

Remember to discuss and exchange your ideas with me in the comments section. Every thought process will help you move forward.

If you have gained something from reading this article, feel free to share it with your friends and learn and progress together. See you in the next lesson!