16 Adding Goods to the Shopping Cart: SQL Optimization and Parameter Analysis in Stress Tools #
Hello, I am Gao Lou.
In today’s lesson, I will talk to you about SQL optimization and parameter analysis in load testing tools using the example of the interface for adding products to the shopping cart.
When it comes to SQL optimization, many people tend to suspect that there is a problem with the SQL when they see high database resource utilization. Although this direction seems correct, it is often difficult to determine which exact SQL query is causing the problem and what kind of problem it is. Therefore, in this lesson, I will teach you how to quickly identify problematic SQL queries based on high resource utilization and make appropriate adjustments. In addition, you will learn how to handle database locks caused by improper use of parameters in load testing tools.
Now, let’s begin the analysis for this lesson.
Stress Data #
For the interface of adding products to the shopping cart, the performance results of our first run are as follows:
Looking at it, I feel like crying. Do you feel the same way? This chart shows more than one problem. With my limited professional experience, I took one hard look and sensed two problems here:
- Even at its peak, the TPS is not high, only around 50;
- During peak times, there are a lot of errors being generated.
So which problem is more important? Some may say we should obviously handle the errors first; doesn't seeing so many of them make your head spin? If you can't stand leaving errors alone, then by all means deal with them first.
However, in my opinion, it is also fine to address the low TPS first. Even though there are errors, they account for only about 5% of the requests, so there is no rush.
Nevertheless, whichever problem we pick, we still need to follow the performance analysis decision tree.
Look at the architecture diagram #
The logic of this interface is clear: Stress tool - Gateway - Cart - Member.
I plan to analyze the problem of low TPS and increased response time, which can be observed in the first half of the stress curve. Therefore, our analysis will start with breaking down the response time.
If you want to handle errors first in such a scenario, start by checking the logs. In fact, these errors are easy to deal with because they provide very clear directions.
Analysis Phase 1 #
Breakdown of Response Time #
This time let’s take a look at the screenshots.
- User - Gateway:
- Gateway - Cart:
- Cart - Member:
- Cart - MySQL:
- Member - MySQL:
Based on the response time, we need to address MySQL first, specifically the SQL related to the Cart service, because the response time between Cart and MySQL is a bit long.
Global Analysis #
As usual, let’s take a look at the overall monitoring.
Since the CPU usage on worker-1 is high, let’s see what services are running on worker-1.
You may ask why the network download bandwidth is also highlighted in red when it is only slightly over 100Mb. This comes down to how the counters are interpreted: the red highlight is simply a threshold configured in the Grafana dashboard. If you don't want it shown in red, set a higher threshold. For the network, slightly over 100Mb is not considered high.
Let’s see what’s on worker-1.
```
[root@k8s-master-2 ~]# kubectl get pods -o wide|grep k8s-worker-1
elasticsearch-data-1 1/1 Running 1 11d 10.100.230.57 k8s-worker-1 <none> <none>
elasticsearch-master-0 1/1 Running 0 3d11h 10.100.230.60 k8s-worker-1 <none> <none>
mysql-min-d564fc4df-vs7d6 1/1 Running 0 22h 10.100.230.1 k8s-worker-1 <none> <none>
[root@k8s-master-2 ~]#
```
As you can see, worker-1 runs not only MySQL but also ES data nodes, which consume a lot of network resources. However, the evidence does not currently point to them.
Since we saw earlier that the response time of MySQL is long, let’s continue to look at the data in the global monitoring on worker-1.
```
[root@k8s-worker-1 ~]# top
top - 23:08:21 up 3 days, 11:30, 5 users, load average: 29.90, 28.54, 23.00
Tasks: 309 total, 1 running, 307 sleeping, 0 stopped, 1 zombie
%Cpu0 : 94.1 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 2.9 si, 2.9 st
%Cpu1 : 94.1 us, 2.9 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 2.9 si, 0.0 st
%Cpu2 : 90.9 us, 3.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 6.1 st
%Cpu3 : 89.7 us, 3.4 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 3.4 si, 3.4 st
%Cpu4 : 87.9 us, 6.1 sy, 0.0 ni, 3.0 id, 0.0 wa, 0.0 hi, 0.0 si, 3.0 st
%Cpu5 : 87.9 us, 9.1 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 3.0 st
KiB Mem : 16265992 total, 1176564 free, 8436112 used, 6653316 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 7422832 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
21344 27 20 0 8222204 628452 12892 S 331.4 3.9 141:36.72 /opt/rh/rh-mysql57/root/usr/libexec/mysqld --defaults-file=/etc/my.cnf
5128 techstar 20 0 5917564 1.4g 21576 S 114.3 8.8 233:09.48 /usr/share/elasticsearch/jdk/bin/java -Xshare:auto -Des.networkaddress.cache+
5127 techstar 20 0 14.1g 3.5g 25756 S 40.0 22.8 1647:28 /usr/share/elasticsearch/jdk/bin/java -Xshare:auto -Des.networkaddress.cache+
1091 root 20 0 1145528 108228 29420 S 25.7 0.7 263:51.49 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
1078 root 20 0 2504364 106288 38808 S 14.3 0.7 429:13.57 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.co+
17108 root 20 0 164472 2656 1712 R 14.3 0.0 0:00.66 top
```
From the data above, we can see that the MySQL process consumes a lot of CPU, which indicates that our evidence chain is correct. Since we have reached the database, what should we look at? Of course, we should focus on MySQL's global monitoring. So, I printed the MySQL Report and filtered out some unnecessary data to obtain the following results (otherwise the content would be too long):
```
__ Questions ___________________________________________________________
Total 637.05k 8.0/s
DMS 293.57k 3.7/s %Total: 46.08
Com_ 235.02k 2.9/s 36.89
.............................
Slow 20 ms 119.50k 1.5/s 18.76 %DMS: 40.70 Log:
DMS 293.57k 3.7/s 46.08
SELECT 224.80k 2.8/s 35.29 76.57
UPDATE 51.86k 0.6/s 8.14 17.66
INSERT 16.92k 0.2/s 2.66 5.76
REPLACE 0 0/s 0.00 0.00
DELETE 0 0/s 0.00 0.00
.............................
__ SELECT and Sort _____________________________________________________
Scan 137.84k 1.7/s %SELECT: 61.32
.............................
```
From the data above, we can see that in the "Total" section, DMS (Data Manipulation Statements) accounts for 46.08%. In the DMS section, SELECT accounts for 76.57%. Therefore, we should focus on SELECT statements for further analysis.
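To see how these percentages fit together, here is a quick back-of-the-envelope check (the raw counts are copied from the mysqlreport output above; the "k" suffix means thousands):

```python
# Counts copied from the mysqlreport output above.
total_questions = 637_050   # Total
dms = 293_570               # DMS: SELECT + UPDATE + INSERT + REPLACE + DELETE
selects = 224_800           # SELECT

dms_pct_of_total = dms / total_questions * 100
select_pct_of_dms = selects / dms * 100

print(f"DMS share of Total:  {dms_pct_of_total:.2f}%")   # ~46.08
print(f"SELECT share of DMS: {select_pct_of_dms:.2f}%")  # ~76.57
```

Note that the SELECT percentage is relative to DMS, not to the total, which is why it can be higher than the DMS figure itself.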
From the "Slow 20 ms" line, we can see that slow queries have also appeared. I set the slow log threshold (MySQL's long_query_time) quite low, only 20 ms, which is why 1.5 slow queries are being logged per second. I set it low mainly to capture the relatively slower SQL. In your own application, you should set it according to the actual situation; it should be neither too high nor too low, otherwise it will cause problems.
Targeted Analysis #
Now let's look at the slow logs. Please remember, it is better to clear the logs before examining MySQL slow logs, so that the log only records slow SQL queries from the period of high load. Otherwise, there will be a lot of affected data.
```
[root@7dgroup1 gaolou]# pt-query-digest slow-query.log
# 7.2s user time, 70ms system time, 36.78M rss, 106.05M vsz
# Current date: Wed Dec 30 23:30:14 2020
# Hostname: 7dgroup1
# Files: slow-query.log
# Overall: 36.60k total, 7 unique, 89.06 QPS, 17.17x concurrency _________
# Time range: 2020-12-30T15:22:00 to 2020-12-30T15:28:51
# Attribute total min max avg 95% stddev median
# ============ ======= ======= ======= ======= ======= ======= =======
# Exec time 7055s 20ms 1s 193ms 501ms 160ms 128ms
# Lock time 7s 0 39ms 194us 247us 696us 125us
# Rows sent 35.45k 0 1 0.99 0.99 0.09 0.99
# Rows examine 2.33G 0 112.76k 66.71k 112.33k 46.50k 112.33k
# Query size 14.26M 6 1016 408.53 592.07 195.17 202.40
# Profile
# Rank Query ID Response time Calls R/Call V/M It
# ==== ============================= =============== ===== ====== ===== ==
# 1 0xB8BDB35AD896842FAC41202B... 5744.3322 81.4% 18420 0.3119 0.07 SELECT pms_sku_stock
# 2 0xC71984B4087F304BE41AC8F8... 1309.1841 18.6% 18138 0.0722 0.03 SELECT oms_cart_item
# MISC 0xMISC 1.4979 0.0% 46 0.0326 0.0 <5 ITEMS>
# Query 1: 44.82 QPS, 13.98x concurrency, ID 0xB8BDB35AD896842FAC41202BB9C908E8 at byte 6504041
```
```
# This item is included in the report because it matches --limit.
# Scores: V/M = 0.07
# Time range: 2020-12-30T15:22:00 to 2020-12-30T15:28:51
# Attribute pct total min max avg 95% stddev median
# ============ === ======= ======= ======= ======= ======= =======
# Count 50 18420
# Exec time 81 5744s 76ms 1s 312ms 580ms 148ms 279ms
# Lock time 47 3s 70us 37ms 184us 224us 673us 119us
# Rows sent 50 17.99k 1 1 1 1 0 1
# Rows examine 85 1.98G 112.76k 112.76k 112.76k 112.76k 0 112.76k
# Query size 26 3.72M 212 212 212 212 0 212
# String:
# Hosts 10.100.5.54
# Users reader
# Query_time distribution
# 1us
# 10us
# 100us
# 1ms
# 10ms #
# 100ms ################################################################
# 1s #
# 10s+
# Tables
# SHOW TABLE STATUS LIKE 'pms_sku_stock'\G
# SHOW CREATE TABLE `pms_sku_stock`\G
# EXPLAIN /*!50100 PARTITIONS*/
select
id, product_id, sku_code, price, stock, low_stock, pic, sale, promotion_price, lock_stock,
sp_data
from pms_sku_stock
WHERE ( sku_code = '202008270027906' )\G
# Query 2: 44.13 QPS, 3.19x concurrency, ID 0xC71984B4087F304BE41AC8F82A88B245 at byte 20901845
# This item is included in the report because it matches --limit.
# Scores: V/M = 0.03
# Time range: 2020-12-30T15:22:00 to 2020-12-30T15:28:51
# Attribute pct total min max avg 95% stddev median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count 49 18138
# Exec time 18 1309s 20ms 419ms 72ms 148ms 43ms 59ms
# Lock time 52 4s 76us 39ms 205us 260us 719us 138us
# Rows sent 49 17.45k 0 1 0.99 0.99 0.12 0.99
# Rows examine 14 356.31M 19.96k 20.22k 20.12k 19.40k 0 19.40k
# Query size 73 10.51M 604 610 607.81 592.07 0 592.07
# String:
# Hosts 10.100.5.54
# Users reader
# Query_time distribution
# 1us
# 10us
# 100us
# 1ms
# 10ms ################################################################
# 100ms ##################
# 1s
# 10s+
# Tables
# SHOW TABLE STATUS LIKE 'oms_cart_item'\G
# SHOW CREATE TABLE `oms_cart_item`\G
# EXPLAIN /*!50100 PARTITIONS*/
select
id, product_id, product_sku_id, member_id, quantity, price, product_pic, product_name,
product_sub_title, product_sku_code, member_nickname, create_date, modify_date, delete_status,
product_category_id, product_brand, product_sn, product_attr
from oms_cart_item
WHERE ( member_id = 381920
and product_id = 317
and delete_status = 0
and product_sku_id = 317 )\G
```
From the data above, our optimization direction is relatively clear: we need to optimize the two slowest SQL queries, one of which takes up 81.4% of the total time, and the other takes up 18.6%.
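One detail in the digest is worth internalizing before we look at the statements themselves: for the pms_sku_stock query, "Rows examine" is about 112.76k per call while "Rows sent" is only 1. A quick calculation of this ratio (values taken from the digest above) shows why it screams "full table scan":

```python
# Per-call figures for Query 1, taken from the pt-query-digest output above.
rows_examined = 112_760   # Rows examine (avg per call)
rows_sent = 1             # Rows sent (avg per call)

# An indexed point lookup would examine roughly as many rows as it returns.
# Examining ~112k rows to return a single row means MySQL is reading the
# whole table for every call.
ratio = rows_examined / rows_sent
print(ratio)
```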
Let’s start with the slowest SQL query:
```
select
id, product_id, sku_code, price, stock, low_stock, pic, sale, promotion_price, lock_stock,
sp_data
from pms_sku_stock
WHERE ( sku_code = '202008270027906' )\G
```
To understand where a statement is slow, we need to examine its execution plan:
In the execution plan, the `type` column has a value of `ALL`, indicating that the query did not use an index. Let's check the indexes covering the `sku_code` column.
Upon inspecting them, we see there is only a primary key index on the `id` column, which does not cover the `sku_code` column in the WHERE condition. So let's add an index on `sku_code` to enable a precise lookup instead of a full table scan:
```
ALTER TABLE pms_sku_stock ADD INDEX sku_code_index (sku_code);
```
After adding the index, let's look at the execution plan again:
Now the `type` column shows `ref`, indicating that the WHERE condition is using the index. Let's run the scenario again and see the effect:
From the results, we can see that the TPS has increased from 50 to above 150. The response time has also decreased from around 750ms to below 250ms. The improvement is significant.
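The shape of this improvement is exactly what Little's Law (concurrency = throughput x response time) predicts for a fixed number of pressure threads. A rough check using the approximate before/after numbers from this run (the implied thread count here is my inference, not a measured value):

```python
# Little's Law: N = X * R  (in-flight requests = TPS * response time)
before_tps, before_rt = 50, 0.750    # ~750 ms before the index
after_tps, after_rt = 150, 0.250     # ~250 ms after the index

print(before_tps * before_rt)  # implied concurrency before the fix
print(after_tps * after_rt)    # implied concurrency after the fix

# Both come out to ~37.5: the offered load didn't change; cutting the
# response time to a third is what let TPS triple.
```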
After optimizing the first SQL query, let's move on to the second one. Again, we start with its execution plan:
The `type` column still shows `ALL`, meaning the WHERE conditions are not using any index. Since this query has multiple WHERE conditions, let's add a composite index so that they can all be served by it:
```
ALTER TABLE oms_cart_item ADD INDEX mix_index (member_id,product_id,product_sku_id);
```
After adding the composite index, the execution plan for this SQL query is as follows:
Once again, let’s run the scenario and see the effect of optimizing these two slowest SQL queries.
Optimization Effect #
The optimization effect is as follows:
The before and after comparison of the optimization is shown below:
I suggest that when writing the report, you use this type of comparison chart to clearly demonstrate the significant effect of the optimization.
Analysis Phase 2 #
Now we need to analyze the errors; we can't just ignore them.
Stress Data #
Below is the corresponding error graph, I cropped the image a bit so that you can see the trend:
As you can see, there are both correct and incorrect TPS values, and the error rate increases as the TPS increases. This is important, so please remember it.
Next, let’s break down the response time.
Breaking Down Response Time #
First, set the time range in SkyWalking:
Please note that when looking at the performance counters, the time windows on each tool must correspond.
- User - Gateway:
- Gateway - Cart:
- Cart - Member:
- Cart - MySQL:
- Member - MySQL:
After listing all this information… we didn’t find anything.
You may be wondering why we didn’t find anything, especially considering that the response time on the Cart side is long. Here you need to note that the problem we are analyzing is errors, not response time. So if the response time is long, it’s just long. During analysis, you must always remember what problem you are investigating and not deviate from it, otherwise you will get confused.
Global Analysis #
Usually, our global analysis starts with resources, which means going through the performance analysis decision tree step by step. As we discussed in Lesson 4, you can check all the first-level counters.
However, in our analysis of this issue, we don’t need to go through all that trouble, because we already saw a large number of errors when looking at the pressure data. In order to analyze the errors, we need to know where they are occurring. Therefore, we can directly check the logs related to this. When looking at the logs, we found the following error messages:
```
2020-12-30 23:44:06.754 ERROR 1 --- [io-8086-exec-41] o.a.c.c.C.[.[.[/].[dispatcherServlet] : Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is org.springframework.dao.DeadlockLoserDataAccessException:
### Error updating database. Cause: com.mysql.cj.jdbc.exceptions.MySQLTransactionRollbackException: Deadlock found when trying to get lock; try restarting transaction
### The error may involve com.dunshan.mall.mapper.OmsCartItemMapper.updateByPrimaryKey-Inline
### The error occurred while setting parameters
### SQL: update oms_cart_item set product_id = ?, product_sku_id = ?, member_id = ?, quantity = ?, price = ?, product_pic = ?, product_name = ?, product_sub_title = ?, product_sku_code = ?, member_nickname = ?, create_date = ?, modify_date = ?, delete_status = ?, product_category_id = ?, product_brand = ?, product_sn = ?, product_attr = ? where id = ?
### Cause: com.mysql.cj.jdbc.exceptions.MySQLTransactionRollbackException: Deadlock found when trying to get lock; try restarting transaction
; Deadlock found when trying to get lock; try restarting transaction; nested exception is com.mysql.cj.jdbc.exceptions.MySQLTransactionRollbackException: Deadlock found when trying to get lock] with root cause
...................................
```
This error clearly indicates a deadlock. But why did the deadlock occur?
In performance analysis, remember that deadlocks are actually relatively easy to analyze: a lock implies contention, and a deadlock means each party holds a lock the other needs, so both are stuck waiting.
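To make "both parties are stuck" concrete, here is a toy version of the wait-for-graph cycle check that deadlock detectors such as InnoDB's perform. This is purely illustrative; the transaction names are made up and this is not InnoDB's actual implementation:

```python
def has_deadlock(wait_for):
    """wait_for maps each transaction to the transaction whose lock it is waiting on."""
    for start in wait_for:
        seen = set()
        txn = start
        while txn in wait_for:   # follow the chain of waiters
            if txn in seen:      # returned to a transaction already visited: a cycle
                return True
            seen.add(txn)
            txn = wait_for[txn]
    return False

# T1 holds a lock on a cart row and waits for T2; T2 waits for T1: deadlock.
print(has_deadlock({"T1": "T2", "T2": "T1"}))  # True
# T1 waits for T2, but T2 is not waiting on anyone: just blocking, no deadlock.
print(has_deadlock({"T1": "T2"}))              # False
```

When InnoDB finds such a cycle, it rolls back one of the transactions as the victim, which is exactly the "Deadlock found when trying to get lock; try restarting transaction" error in the log above.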
Next, let’s conduct a targeted analysis to figure out why the deadlock occurred.
Targeted Analysis #
First, let’s find the code corresponding to the “add to cart” business:
```
/**
 * Add to cart
 * @param productSkuCode SKU code of the product
 * @param quantity Quantity of the product
 * @return
 */
@Override
public int addCart(String productSkuCode, Integer quantity) {
    .........................................
    OmsCartItem existCartItem = getCartItem(cartItem);
    if (existCartItem == null) {
        cartItem.setCreateDate(new Date());
        count = cartItemMapper.insert(cartItem);
    } else {
        cartItem.setModifyDate(new Date());
        existCartItem.setQuantity(existCartItem.getQuantity() + cartItem.getQuantity());
        count = cartItemMapper.updateByPrimaryKey(existCartItem);
    }
    return count;
}
```
The method is declared transactional in the interface:
```
@Transactional
int addCart(String productSkuCode, Integer quantity);
```
Based on the relationship above, what could cause a deadlock when adding items to the cart? Look: there is an `updateByPrimaryKey` call in the code, which corresponds to the `update` statement in the log. So if a deadlock occurs, there must be a conflict on an ID, and that ID corresponds to the member. In other words, multiple threads updating the same member's shopping cart at the same time is the cause of the deadlock!
Since there is a member ID conflict, who provides the member information? Without a doubt, this member information must be passed from the script, so let’s check the script associated with it.
The corresponding script is as follows:
You see, there is a parameter, productSkuCode, which draws from 1000 rows of data.
The above image corresponds to the JMeter script as follows:
Let’s look at these three parameters in the JMeter script:
- quotedData: false
- recycle: true
- stopThread: false
This means that all threads share these 1000 rows of data and cycle through them continuously, which leads to duplicate data usage: if two or more threads use the same member's data, they will update the same shopping cart, resulting in conflict errors.
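A small simulation shows the pigeonhole problem behind recycle: true. The thread and iteration counts here are hypothetical, chosen only to illustrate the mechanism:

```python
# 1000 member rows in the CSV, shared and recycled across all threads.
rows = [f"member_{i}" for i in range(1000)]

threads, iterations = 100, 50   # hypothetical load profile
samples = threads * iterations  # 5000 requests in total

# With a shared, recycling cursor, the rows are effectively handed out
# round-robin to whichever thread asks next.
used = [rows[i % len(rows)] for i in range(samples)]

print(len(used))       # 5000 requests...
print(len(set(used)))  # ...but only 1000 distinct members, so every member id
                       # is reused ~5 times and concurrent updates to the same
                       # cart row become inevitable.
```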
Now let’s change the above three parameters:
- quotedData: true
- recycle: false
- stopThread: true
This ensures that each thread can be assigned different data.
However, another problem arises: handled this way, 1000 rows of data are no longer enough. What should we do? We have no choice but to generate more user data. After generating more tokens, we can run the scenario.
After letting the data generation run overnight, we continued the next day.
Optimization Results #
As a result, we obtained the following result:
From the data, there are no more error messages, which is a reasonable result.
Summary #
Now, let’s summarize this lesson.
“Hey, hold on. Don’t summarize yet, we haven’t solved all the problems. Look, we still have the issue of TPS dropping, right?”
“Young man, don’t rush. Rome wasn’t built in a day. We need to solve the problems one by one. I will address this issue in later lessons.”
In this lesson, we started from the low TPS and analyzed it all the way to the specific SQL. It may seem like a simple matter that can be resolved with just two simple indexes, and the logic isn’t complicated either. However, the analysis approach is crucial.
As for the second problem, we tracked down the deadlock information in the logs from the erroneous data. Most people should be able to do that. However, only experienced individuals would immediately think of parameter conflicts.
Additionally, there is another key point here: parameterized data must align with real-world scenarios! I have emphasized this many times, and I hope you will keep it in mind.
Homework #
Lastly, I have two questions for you to ponder:
- Besides using the methods mentioned in this lesson, do you have any other ways to quickly identify slow SQL queries?
- Can you draw the logic for the analysis in the second phase?
Remember to discuss and exchange your ideas with me in the comments section. Every thought process will help you move forward.
If you have gained something from reading this article, feel free to share it with your friends and learn and progress together. See you in the next lesson!