21 How to Analyze and Optimize a Fixed Technical Component: The Order List Query Before Payment #

Today, we will analyze the interface for querying the order list before payment.

In this lesson, I will show you the approach to analyzing and optimizing a fixed technical component. In other words, the component is not developed by us, but we still need to analyze and optimize it. What can we do?

Additionally, we will run into a situation where the database CPU is not fully used overall, yet a few cores are pegged. How should we proceed in that case? To find out, we will examine the thread stacks of the database process itself, which is a different route from going straight to the trx table.

Now, let’s delve into today’s content together.

Scene Execution Data #

Let’s start by looking at the performance results for the first run of the interface for querying the order list before payment:

From the data obtained during the execution of the scene, we can see that the TPS (transactions per second) for this interface was initially quite high, reaching over 800. However, the response time also increased, indicating that a bottleneck has occurred. By identifying the bottleneck, we can determine if there is room for optimization in this interface.

Before proceeding with the formal analysis, let’s take a look at the architecture diagram according to the analysis logic provided by Professor Gao.

Architecture Diagram #

This architecture diagram is very clear. We can see that the current interface logic is: Gateway - Order - Member. MySQL and Redis are also used in this architecture.

Next, let’s take a look at where the response time is being consumed.

Splitting Response Time #

  • Gateway: -

  • Order: -

  • Member: -

From the response-time split, the Gateway consumes the largest share of the time. Therefore, we start from the Gateway and analyze where its time goes.

Phase 1 #

Global Monitoring and Analysis #

Following the logic of "first look at global monitoring, then look at targeted monitoring," let's first examine the global monitoring of this interface:

After reviewing the global monitoring view, and since the Gateway has the longest response time, we need to determine which worker it is running on:

[root@k8s-master-2 ~]# kubectl get pods -o wide | grep gateway
gateway-mall-gateway-757659dbc9-tdwnm       1/1     Running     0          3d16h   10.100.79.96     k8s-worker-4   <none>           <none>
[root@k8s-master-2 ~]#

We can see that the Gateway service is on worker-4. Additionally, on the global monitoring graph, although the Gateway only consumes 70% of the CPU, it still has the longest response time. Therefore, we need to observe the thread status of the Gateway and see what it is processing.

Targeted Monitoring and Analysis #

When performing targeted monitoring, let’s first look at the CPU consumption of the threads:

From the above image, we can see that there are two important types of working threads in the Gateway: reactor-http-epoll and boundedElastic.
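If you prefer the command line to a monitoring screenshot, a rough equivalent is the sketch below (it assumes the pod name we found above and that top and pgrep are available inside the container):

kubectl exec gateway-mall-gateway-757659dbc9-tdwnm -- \
  sh -c 'top -H -b -n 1 -p "$(pgrep -f java | head -1)" | head -30'   # per-thread CPU of the gateway JVM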

In the official documentation, it is mentioned that the number of reactor-http-epoll threads should ideally match the number of CPU cores. We currently have 4 reactor-http-epoll threads while this worker has 6 cores, so we can add 2 more threads; adding more than that would not gain much. As for boundedElastic, it is a bounded elastic thread pool whose size defaults to 10 x the number of CPU cores, and there is no real tuning space there.

Let's also look at how time is distributed across the Gateway service's methods and check whether any method's share of the time is abnormally high, as shown in the following graph:

You can see that the current execution methods are all normal.

Now let's increase the number of threads to 6 and see whether the Gateway can use more CPU. If it uses more CPU and still has the longest response time, we will have no choice but to add CPUs. One way to apply the thread-count change is sketched below.
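Assuming the gateway is Spring Cloud Gateway on Reactor Netty (which the reactor-http-epoll thread names suggest), the event-loop thread count can be raised through the reactor.netty.ioWorkerCount system property. A minimal sketch, passed in via the container's startup options:

# Ask Reactor Netty for 6 event-loop (reactor-http-epoll) threads by adding a
# JVM system property to the gateway's startup options, e.g. JAVA_OPTS.
JAVA_OPTS="$JAVA_OPTS -Dreactor.netty.ioWorkerCount=6"

However you pass it (Dockerfile, Deployment env, ConfigMap), check the thread view afterwards to confirm that 6 reactor-http-epoll threads actually appear.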

Please note that in a performance project, do not easily suggest adding CPU. Only provide such suggestions after analyzing the logic and confirming that there is no other optimization space.

Optimization Effect #

Let’s look at the optimization effect:

Through regression testing, we can see that TPS has increased by roughly 50 in the latter half of the graph (the Gateway was restarted during the test, so the TPS before the restart can be treated as a warm-up phase). It is not a dramatic gain, but it still counts as an improvement.

Our optimization does not end here. While going through each worker, I also noticed a strange phenomenon: two CPUs on the database host show very high usage. Let's investigate further.

Phase 2 #

Global Monitoring and Analysis #

Since the optimization in the previous phase did not yield significant results, we need to start the analysis again. Let’s start with global monitoring:

Nothing stands out here, and there is no obvious heavy resource consumption on worker-1, where the database is located.

Please note that I often present global-monitoring data through this interface in my articles, but that does not mean it is all I look at. When nothing obviously wrong shows up in this view, I also run commands such as top and vmstat, which belong to the complete set of global-monitoring counters I keep mentioning. You need to carry that full set of first-level counters in your head to truly observe the global picture.
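For reference, the first-level OS counters can be pulled with nothing more than:

vmstat 1 5              # run queue, memory, swap, I/O, and the us/sy/id/wa CPU split
top -b -n 1 | head -15  # load average plus a per-process CPU/memory snapshot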

Looking at the top data on the worker where the database is located, we find the following:

bash-4.2$ top      
top - 09:57:43 up 3 days, 17:54,  0 users,  load average: 4.40, 3.57, 3.11
Tasks:  11 total,   1 running,   9 sleeping,   1 stopped,   0 zombie
%Cpu0  :  8.0 us,  4.7 sy,  0.0 ni, 84.3 id,  0.0 wa,  0.0 hi,  2.2 si,  0.7 st
%Cpu1  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  6.5 us,  4.4 sy,  0.0 ni, 85.5 id,  0.0 wa,  0.0 hi,  2.2 si,  1.5 st
%Cpu3  :  7.8 us,  5.7 sy,  0.0 ni, 83.7 id,  0.0 wa,  0.0 hi,  2.1 si,  0.7 st
%Cpu4  : 96.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  4.0 si,  0.0 st
%Cpu5  :  7.0 us,  4.0 sy,  0.0 ni, 84.9 id,  0.0 wa,  0.0 hi,  2.6 si,  1.5 st
KiB Mem : 16265992 total,  1203032 free,  6695156 used,  8367804 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  9050344 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                      
    1 mysql     20   0 8272536   4.7g  13196 S 248.8 30.5   6184:36 mysqld 

It is very obvious that two CPUs are running high while the rest are largely idle. Let's narrow the analysis down to the database.

Before that, let’s silently recite the phrase “As long as our thinking is clear, any problem is manageable” ten times, because keeping a clear mindset is very important.

Targeted Monitoring and Analysis #

We want to focus on the database, but only some of its CPUs are busy, so let's see what the database threads are actually doing. Starting from the process information above, we drill down to the thread level:

bash-4.2$ top -Hp 1
top - 09:56:40 up 3 days, 17:53,  0 users,  load average: 3.05, 3.30, 3.01
Threads:  92 total,   2 running,  90 sleeping,   0 stopped,   0 zombie
%Cpu0  :  5.4 us,  2.9 sy,  0.0 ni, 89.2 id,  0.0 wa,  0.0 hi,  2.2 si,  0.4 st
%Cpu1  : 99.7 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.3 st
%Cpu2  :  5.4 us,  3.2 sy,  0.0 ni, 88.2 id,  0.0 wa,  0.0 hi,  2.5 si,  0.7 st
%Cpu3  :  6.3 us,  4.2 sy,  0.0 ni, 87.0 id,  0.0 wa,  0.0 hi,  2.1 si,  0.4 st
%Cpu4  : 96.3 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  3.7 si,  0.0 st
%Cpu5  :  4.0 us,  2.5 sy,  0.0 ni, 91.0 id,  0.0 wa,  0.0 hi,  1.8 si,  0.7 st
KiB Mem : 16265992 total,  1205356 free,  6692736 used,  8367900 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  9052664 avail Mem 


  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                       
  311 mysql     20   0 8272536   4.7g  13196 R 99.9 30.5  18:20.34 mysqld                                                                        
  241 mysql     20   0 8272536   4.7g  13196 R 99.7 30.5   1906:40 mysqld                                                                        
  291 mysql     20   0 8272536   4.7g  13196 S  3.3 30.5  15:49.21 mysqld                                                                        
  319 mysql     20   0 8272536   4.7g  13196 S  3.0 30.5  11:50.34 mysqld                                                                        
  355 mysql     20   0 8272536   4.7g  13196 S  3.0 30.5  13:01.53 mysqld                                                                        
  265 mysql     20   0 8272536   4.7g  13196 S  2.7 30.5  18:17.48 mysqld                                                                        
  307 mysql     20   0 8272536   4.7g  13196 S  2.7 30.5  16:47.77 mysqld                                                                        
  328 mysql     20   0 8272536   4.7g  13196 S  2.7 30.5  15:34.92 mysqld                                                                        
  335 mysql     20   0 8272536   4.7g  13196 S  2.7 30.5   8:55.38 mysqld                                                                        
  316 mysql     20   0 8272536   4.7g  13196 S  2.3 30.5  14:38.68 mysqld                                                                        
  350 mysql     20   0 8272536   4.7g  13196 S  2.3 30.5  10:37.94 mysqld                                                                        
  233 mysql     20   0 8272536   4.7g  13196 S  2.0 30.5  14:19.32 mysqld                                                                        
  279 mysql     20   0 8272536   4.7g  13196 S  2.0 30.5  19:51.80 mysqld                                                                        
  318 mysql     20   0 8272536   4.7g  13196 S  2.0 30.5  11:34.62 mysqld                                                                        
  331 mysql     20   0 8272536   4.7g  13196 S  2.0 30.5  11:46.94 mysqld                                                                        
  375 mysql     20   0 8272536   4.7g  13196 S  2.0 30.5   1:29.22 mysqld                                                                        
  300 mysql     20   0 8272536   4.7g  13196 S  1.7 30.5  17:45.26 mysqld                                                                        
  380 mysql     20   0 8272536   4.7g  13196 S  1.7 30.5   1:24.32 mysqld            

You see, only two MySQL threads are using the CPU. At this point, you might think about checking the SQL! Although it could be a problem with the SQL, I still recommend finding the corresponding evidence.

Since MySQL is written in C and C++, let's use gstack (shipped with the gdb package) to print the stacks of these two MySQL threads and see which functions they are executing.
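A minimal sketch of how to capture them, assuming mysqld is PID 1 inside its container (as the top output above shows) and that the gdb package is installed there:

gstack 1 > /tmp/mysqld-stacks.txt                      # dump the stacks of all mysqld threads
grep -A 20 -E 'LWP (241|311)' /tmp/mysqld-stacks.txt   # keep only the two hot threads

After extracting the stacks of the two hot threads (LWP 241 and 311), we see the following information: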

Thread 59 (Thread 0x7f1d60174700 (LWP 241)):
#0  0x000055a431fefea9 in JOIN_CACHE::read_record_field(st_cache_field*, bool) ()
#1  0x000055a431ff01ca in JOIN_CACHE::read_some_record_fields() ()
#2  0x000055a431ff070f in JOIN_CACHE::get_record() ()
#3  0x000055a431ff2a92 in JOIN_CACHE_BNL::join_matching_records(bool) ()
#4  0x000055a431ff18f0 in JOIN_CACHE::join_records(bool) ()
#5  0x000055a431e397c0 in evaluate_join_record(JOIN*, QEP_TAB*) ()
#6  0x000055a431e3f1a5 in sub_select(JOIN*, QEP_TAB*, bool) ()
#7  0x000055a431e37a90 in JOIN::exec() ()
#8  0x000055a431eaa0ba in handle_query(THD*, LEX*, Query_result*, unsigned long long, unsigned long long) ()
#9  0x000055a43194760d in execute_sqlcom_select(THD*, TABLE_LIST*) ()
#10 0x000055a431e6accf in mysql_execute_command(THD*, bool) ()
#11 0x000055a431e6d455 in mysql_parse(THD*, Parser_state*) ()
#12 0x000055a431e6e3b6 in dispatch_command(THD*, COM_DATA const*, enum_server_command) ()
#13 0x000055a431e6fc00 in do_command(THD*) ()
#14 0x000055a431f33938 in handle_connection ()
#15 0x000055a4320e66d4 in pfs_spawn_thread ()
#16 0x00007f1e8f1fcdd5 in start_thread () from /lib64/libpthread.so.0
#17 0x00007f1e8d3cc02d in clone () from /lib64/libc.so.6

Thread 41 (Thread 0x7f1d585e0700 (LWP 311)):
#0  0x000055a4319dbe44 in Item_field::val_int() ()
#1  0x000055a4319fb839 in Arg_comparator::compare_int_signed() ()
#2  0x000055a4319fbd9b in Item_func_eq::val_int() ()
#3  0x000055a431ff24ab in JOIN_CACHE::check_match(unsigned char*) ()
#4  0x000055a431ff26ec in JOIN_CACHE::generate_full_extensions(unsigned char*) ()
#5  0x000055a431ff2ab4 in JOIN_CACHE_BNL::join_matching_records(bool) ()
#6  0x000055a431ff18f0 in JOIN_CACHE::join_records(bool) ()
#7  0x000055a431e397c0 in evaluate_join_record(JOIN*, QEP_TAB*) ()
#8  0x000055a431e3f1a5 in sub_select(JOIN*, QEP_TAB*, bool) ()
#9  0x000055a431e37a90 in JOIN::exec() ()
#10 0x000055a431eaa0ba in handle_query(THD*, LEX*, Query_result*, unsigned long long, unsigned long long) ()
#11 0x000055a43194760d in execute_sqlcom_select(THD*, TABLE_LIST*) ()
#12 0x000055a431e6accf in mysql_execute_command(THD*, bool) ()
#13 0x000055a431e6d455 in mysql_parse(THD*, Parser_state*) ()
#14 0x000055a431e6e3b6 in dispatch_command(THD*, COM_DATA const*, enum_server_command) ()
#15 0x000055a431e6fc00 in do_command(THD*) ()
#16 0x000055a431f33938 in handle_connection ()
#17 0x000055a4320e66d4 in pfs_spawn_thread ()
#18 0x00007f1e8f1fcdd5 in start_thread () from /lib64/libpthread.so.0
#19 0x00007f1e8d3cc02d in clone () from /lib64/libc.so.6

It is obvious that both stacks pass through execute_sqlcom_select, so both threads are running SELECT statements. Further up the stacks we can also see JOIN-related functions (JOIN::exec and the JOIN_CACHE_BNL block nested-loop join). Since these SELECTs are doing joins, we can go straight after the SQL statements.

Therefore, we will query the innodb_trx table directly to see whether there are any long-running SQL statements. You could also run show processlist, but to see the complete SQL I still recommend the trx table. Since we are using the default one-thread-per-connection thread_handling, operating-system threads correspond one-to-one to threads inside MySQL, so querying the trx table here will not mislead us. A sketch of the query follows.
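A minimal sketch of the check (the login options and the 10-second threshold are placeholders of mine):

mysql -uroot -p -e "
  SELECT trx_id, trx_state, trx_started,
         TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS runtime_s,
         trx_mysql_thread_id, trx_query
  FROM   information_schema.INNODB_TRX
  WHERE  TIMESTAMPDIFF(SECOND, trx_started, NOW()) > 10
  ORDER  BY runtime_s DESC\G"

On MySQL 5.7 and later, performance_schema.threads also exposes a THREAD_OS_ID column, so you can map the LWP numbers seen in top and gstack directly to MySQL connections if you want extra confirmation.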

Querying the innodb_trx table shows that these two SQL statements have been running for a long time; they are listed below:

-- sql1
SELECT
    count(*)
FROM
    oms_order o
LEFT JOIN oms_order_item ot ON o.id = ot.order_id
WHERE
    o.STATUS = 0
AND o.create_time < date_add(NOW(), INTERVAL - 120 MINUTE)
LIMIT 0, 1000


-- sql2:
SELECT
    o.id,
    o.order_sn,
    o.coupon_id,
    o.integration,
    o.member_id,
    o.use_integration,
    ot.id ot_id,
    ot.product_name ot_product_name,
    ot.product_sku_id ot_product_sku_id,
    ot.product_sku_code ot_product_sku_code,
    ot.product_quantity ot_product_quantity
FROM
    oms_order o
LEFT JOIN oms_order_item ot ON o.id = ot.order_id
WHERE
    o.STATUS = 0
AND o.create_time < date_add(NOW(), INTERVAL - 120 MINUTE)

As we have mentioned before, for slow SQL we need to look at the execution plan (in MySQL, if the plan alone is not conclusive, you can also look at the Profile information). The execution plans for these two statements are as follows:
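If you want to pull the plans yourself, a minimal sketch looks like this (the login options and the schema name mall are assumptions of mine; the statement is the second SQL above with a shortened select list):

mysql -uroot -p mall -e "
  EXPLAIN
  SELECT o.id, ot.id AS ot_id
  FROM oms_order o
  LEFT JOIN oms_order_item ot ON o.id = ot.order_id
  WHERE o.status = 0
    AND o.create_time < DATE_ADD(NOW(), INTERVAL -120 MINUTE)\G"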

They are both plain full table scans. That may feel a little anticlimactic, but we still need to find out why these two statements are being issued at all.

In fact, the interface for querying the order list before payment does not use these two SQL statements at all. So I went through the code and found the following:

@Scheduled(cron = "0 0/20 * ? * ?")
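// The cron expression above fires every 20 minutes (Spring cron fields: second minute hour day-of-month month day-of-week).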
private void cancelTimeOutOrder(){
    Integer count = portalOrderService.cancelTimeOutOrder();
    LOGGER.info("Cancel order and release locked inventory: {}",count);
}

Clearly, this is a scheduled task that runs every 20 minutes, and the problem is now clear: the task issues those two batch queries, which drive two CPUs to 100% and keep them there for a while.

For scheduled tasks like this, we need to pay special attention and separate them from real-time business traffic, so that batch processing and real-time processing do not compete for resources. If they have to share the same database, we should at least bound the amount of data each run queries so that the statements stay reasonable; a sketch of what that could look like follows.
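We are not applying any of this here (as noted below, it is not needed), but as an illustration, a sketch could look like the following; the index name, the batch size, and the schema name mall are assumptions of mine:

mysql -uroot -p mall -e "
  -- let the timeout filter walk an index instead of scanning the whole table
  ALTER TABLE oms_order ADD INDEX idx_status_create_time (status, create_time);
  -- and have the job pick up timed-out orders in bounded batches
  SELECT o.id
  FROM oms_order o
  WHERE o.status = 0
    AND o.create_time < DATE_ADD(NOW(), INTERVAL -120 MINUTE)
  LIMIT 500;"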

Since there are plenty of available CPUs in the database, this scheduled task did not have a significant impact on our TPS. There is no need to take any action here, but we should be mindful of separating them in the future.

Summary #

In this lesson, our optimization did not increase TPS by much, but I have documented the entire analysis process in detail because the technical details of this analysis differ from what we have seen before.

In the first stage of the analysis, we used the same analytical approach as before. The difference is that to optimize a fixed, off-the-shelf component, we need to understand its architecture and find its relevant performance parameters. In real performance projects we often don't have time to dig into such a component's internals and must make judgments quickly; if time allows, you can tinker with it at leisure.

Understanding the principles of a technical component is not as unattainable and unfathomable as imagined. As long as you patiently go through it, you will always grow.

In the second stage of the analysis, we traced the phenomenon of a few hot CPUs down to specific SQL statements. The process itself was simple, but it shows there is still plenty of optimization headroom in this system, such as master-slave separation and moving scheduled tasks into a separate service. The focus of our performance analysis, however, remains the analytical approach I have been instilling in you; I hope you keep it in mind.

Homework #

Lastly, I have two questions for you to think about:

  1. Why is it important to look at all the global monitoring counters?
  2. When there is high CPU usage, how can you pinpoint the specific problem area? What are your thoughts?

Remember to discuss and exchange your ideas with me in the comments section. Every thought will help you progress further.

If you have gained something from reading this article, feel free to share it with your friends and learn and progress together. See you in the next lesson!