
19 Generating Order Information, Part 1: JDBC Pool Optimization and Memory Overflow Analysis #

Hello, I’m Gao Lou (高楼).

In this lesson, let’s take a look at the benchmark results for the order generation interface.

You will see some recurring problems, such as locating slow SQL. Although the specific symptoms vary, the analysis logic stays the same, so I will go through those parts briefly. You will also run into some new issues, such as the JVM memory pressure that appears after enlarging the JDBC connection pool when large amounts of data are involved, and lock problems caused by batch jobs and real-time business running side by side. These are the problems we will focus on optimizing in this lesson.

Without further ado, let’s get started!

Scenario Execution Data #

For the order generation interface, the results of our first performance scenario execution are as follows:

From the results of the scenario execution, we can see that with 40 load threads, TPS only reached a little over 50 while the response time climbed to nearly 700 ms. Clearly, this is a very slow interface.

Judging from an interface like this, it seems we picked the right project: performance problems are everywhere.

Next, let’s analyze step by step.

Architecture Diagram #

As we have said several times before, the purpose of drawing an architecture diagram is to understand the analysis path. So, as usual, we list the architecture diagram here.

Due to the complexity of this interface, the architecture diagram looks a bit messy. So, I simplified it again:

The Order service is the core of this interface, so I filtered the diagram down to the services connected to Order; this way we can clearly see what it talks to.

Next, let’s break down the response time.

Analyzing Response Time #

Since we noticed that the response time was relatively long during the scenario run, we decided to use an APM tool to analyze it further:

  • Gateway:

From the above graph, we can see that the time spent on the Gateway is around 700ms, which aligns with the previous scenario data.

I would like to clarify that the sampling interval of this graph is one minute, so the curve may differ in small details from the TPS curve produced by the load-testing tool. This is not a concern; what matters is the overall trend.

  • Order:

As mentioned earlier, Order is the key service for producing order information, and its business logic is also quite complex. Therefore, we need to pay more attention to this service.

Based on the data, Order consumes around 350 ms, about half of the total response time, so it is the part we need to analyze in depth. The Gateway’s forwarding capability is also worth watching, even though it contains no business logic and only forwards requests. If the slow Gateway forwarding is caused by the large amount of data, then once we solve the Order problem, the Gateway issue should be resolved as well. So let’s analyze Order first.

Now, let’s proceed with the analysis.

Phase 1 #

Global Monitoring Analysis #

Let’s start with the global monitoring:

At first glance, nothing stands out: there is no obvious CPU, network, or I/O resource bottleneck.

In this situation, we need to pay attention to the constrained points along the entire call chain. What are those? For example, the various pools (connection pools and the like), locks in the thread stacks, database connections, and database locks. In short, one keyword sums them up: blocking.

Once we find and relieve the blocking points, the chain widens and the hardware resources can finally be put to use.

Of course, it’s also possible that after analyzing the entire chain, we find no blocking points, but still can’t utilize the resources. In this case, there is only one possibility: the analysis is not detailed enough. It’s because there might be too many potential blocking points, and we can only break them down step by step.

Targeted Monitoring Analysis #

As the saying goes, “Keep a decision tree in mind, so as not to get lost in analysis.” When it comes to targeted monitoring analysis, following the performance analysis decision tree emphasized in Lesson 4, let’s start by analyzing the Order service:

While analyzing the thread stack information of the Order service, I found a lot of this content in the stack:

"http-nio-8086-exec-421" Id=138560 WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@a268a48
    at sun.misc.Unsafe.park(Native Method)
    -  waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@a268a48
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at com.alibaba.druid.pool.DruidDataSource.takeLast(DruidDataSource.java:1899)
    at com.alibaba.druid.pool.DruidDataSource.getConnectionInternal(DruidDataSource.java:1460)
    at com.alibaba.druid.pool.DruidDataSource.getConnectionDirect(DruidDataSource.java:1255)
    at com.alibaba.druid.filter.FilterChainImpl.dataSource_connect(FilterChainImpl.java:5007)
    at com.alibaba.druid.filter.stat.StatFilter.dataSource_getConnection(StatFilter.java:680)
    at com.alibaba.druid.filter.FilterChainImpl.dataSource_connect(FilterChainImpl.java:5003)
    at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:1233)
    at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:1225)
    at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:90)
    ..........................
    at com.dunshan.mall.order.service.impl.PortalOrderServiceImpl$$EnhancerBySpringCGLIB$$f64f6aa2.generateOrder(<generated>)
    at com.dunshan.mall.order.controller.PortalOrderController.generateOrder$original$hak2sOst(PortalOrderController.java:48)
    at com.dunshan.mall.order.controller.PortalOrderController.generateOrder$original$hak2sOst$accessor$NTnIbuo7(PortalOrderController.java)
    at com.dunshan.mall.order.controller.PortalOrderController$auxiliary$MTWkGopH.call(Unknown Source)
    ..........................
    at com.dunshan.mall.order.controller.PortalOrderController.generateOrder(PortalOrderController.java)
    ..........................

As you can see, there are many getConnection frames in the stack, which means the Order service is waiting for database connections from the connection pool. So what we need to do is increase the size of the JDBC pool:

Original configuration:
      initial-size: 5 #initial size of the connection pool
      min-idle: 10 #minimum number of idle connections
      max-active: 20 #maximum number of connections
      
Modified configuration:
      initial-size: 20 #initial size of the connection pool
      min-idle: 10 #minimum number of idle connections
      max-active: 40 #maximum number of connections

As you can see, I did not enlarge the JDBC pool to a huge value in one go, mainly because I don’t want to burn too much CPU just maintaining the connection pool. My suggestion for any resource pool is the same: increase it gradually, check whether it has an effect, and keep going only if it does.
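To make the change and its verification concrete, here is a minimal Java sketch of the same Druid settings, together with a few of the runtime counters Druid exposes that can help confirm whether the pool really is the bottleneck. The URL and credentials are placeholders, this is an illustration rather than the project’s actual configuration code, and the exact getter names should be checked against the Druid version in use.

import com.alibaba.druid.pool.DruidDataSource;

public class DruidPoolSketch {
    public static void main(String[] args) {
        // Programmatic equivalent of the modified YAML keys above.
        DruidDataSource ds = new DruidDataSource();
        ds.setUrl("jdbc:mysql://mall-db:3306/mall"); // placeholder URL
        ds.setUsername("mall");                      // placeholder credentials
        ds.setPassword("secret");
        ds.setInitialSize(20);  // initial-size
        ds.setMinIdle(10);      // min-idle
        ds.setMaxActive(40);    // max-active

        // During a scenario run, counters like these show whether the pool itself is the
        // bottleneck: active connections pinned at max-active plus threads queuing for a
        // connection match the getConnection waits we saw in the thread stack.
        System.out.printf("active=%d/%d, idle=%d, threads waiting=%d%n",
                ds.getActiveCount(), ds.getMaxActive(),
                ds.getPoolingCount(), ds.getWaitThreadCount());
    }
}

In practice these numbers are easier to read from Druid’s StatFilter monitoring or JMX; the point is simply to confirm pool exhaustion with data before and after resizing, instead of guessing.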

After modifying the JDBC pool, let’s take a look at the execution data of the stress scenario:

From the data, we can see that TPS trended upward and at one point exceeded 150. Shortly afterward, however, it dropped, while the response time did not rise much. You can also see that TPS not only dropped but became intermittent and extremely unstable.

In addition, in the subsequent stress scenarios, errors appeared and the response time increased as well. Meanwhile, I checked the global monitoring and found no significant resource consumption. Since errors have shown up, there is nothing more to discuss: we have to deal with them!

Phase 2 #

Global Monitoring Analysis #

Since we modified the JDBC pool of the Order service earlier, let’s first check the health of the Order service now that new problems have appeared. Running top on the Order service node shows the following:

top - 01:28:17 up 19 days, 11:54,  3 users,  load average: 1.14, 1.73, 2.27
Tasks: 316 total,   1 running, 315 sleeping,   0 stopped,   0 zombie
%Cpu0  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  3.0 us,  2.7 sy,  0.0 ni, 93.6 id,  0.0 wa,  0.0 hi,  0.3 si,  0.3 st
%Cpu2  :  3.4 us,  3.4 sy,  0.0 ni, 93.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  3.7 us,  2.8 sy,  0.0 ni, 93.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  :  3.6 us,  2.1 sy,  0.0 ni, 93.6 id,  0.0 wa,  0.0 hi,  0.3 si,  0.3 st
%Cpu5  :  2.8 us,  1.8 sy,  0.0 ni, 95.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 16265992 total,  2229060 free,  9794944 used,  4241988 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  6052732 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                  
29349 root      20   0 8836040   4.3g  16828 S  99.7 27.4  20:51.90 java                                                     
 1089 root      20   0 2574864  98144  23788 S   6.6  0.6   2066:38 kubelet 

Unfortunately, the data here is not pretty. As you can see, one CPU is at 100% us! What is going on?

After entering the container, I used top -Hp and jstack -l 1 to inspect the process and found that the VM Thread was the one consuming CPU; this thread performs garbage collection (GC). In that case, let’s check the memory-reclamation status with jstat:

[root@svc-mall-order-7fbdd7b85f-ks828 /]# jstat -gcutil 1 1s
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT   
  0.00 100.00 100.00 100.00  94.86  93.15    168   28.822    93  652.664  681.486
  0.00 100.00 100.00 100.00  94.86  93.15    168   28.822    93  652.664  681.486
  0.00 100.00 100.00 100.00  94.86  93.15    168   28.822    93  652.664  681.486
  0.00 100.00 100.00 100.00  94.86  93.15    168   28.822    93  652.664  681.486
  0.00 100.00 100.00 100.00  94.86  93.15    168   28.822    93  652.664  681.486
  0.00 100.00 100.00 100.00  94.86  93.15    168   28.822    93  652.664  681.486
  0.00 100.00 100.00 100.00  94.86  93.15    168   28.822    93  652.664  681.486
  0.00 100.00 100.00 100.00  94.86  93.15    168   28.822    94  659.863  688.685
  0.00 100.00 100.00 100.00  94.86  93.15    168   28.822    94  659.863  688.685
  0.00 100.00 100.00 100.00  94.86  93.15    168   28.822    94  659.863  688.685
  0.00 100.00 100.00 100.00  94.86  93.15    168   28.822    94  659.863  688.685
  0.00 100.00 100.00 100.00  94.86  93.15    168   28.822    94  659.863  688.685
  0.00 100.00 100.00 100.00  94.86  93.15    168   28.822    94  659.863  688.685
  0.00 100.00 100.00 100.00  94.86  93.15    168   28.822    94  659.863  688.685
  0.00 100.00 100.00 100.00  94.86  93.15    168   28.822    95  667.472  696.294
  0.00 100.00 100.00 100.00  94.86  93.15    168   28.822    95  667.472  696.294
  0.00 100.00 100.00 100.00  94.86  93.15    168   28.822    95  667.472  696.294
  0.00 100.00 100.00 100.00  94.86  93.15    168   28.822    95  667.472  696.294
  0.00 100.00 100.00 100.00  94.86  93.15    168   28.822    95  667.472  696.294
  0.00 100.00 100.00 100.00  94.86  93.15    168   28.822    95  667.472  696.294
  0.00 100.00 100.00 100.00  94.86  93.15    168   28.822    95  667.472  696.294
  0.00 100.00 100.00 100.00  94.86  93.15    168   28.822    95  667.472  696.294
  0.00 100.00 100.00 100.00  94.85  93.14    168   28.822    96  674.816  703.638
  0.00 100.00 100.00 100.00  94.85  93.14    168   28.822    96  674.816  703.638
  0.00 100.00 100.00 100.00  94.85  93.14    168   28.822    96  674.816  703.638
  0.00 100.00 100.00 100.00  94.85  93.14    168   28.822    96  674.816  703.638
  0.00 100.00 100.00 100.00  94.85  93.14    168   28.822    96  674.816  703.638
  0.00 100.00 100.00 100.00  94.85  93.14    168   28.822    96  674.816  703.638
  0.00 100.00 100.00 100.00  94.85  93.14    168   28.822    96  674.816  703.638

From the above data, FullGCs are occurring repeatedly, but they are unable to reclaim memory, which is a serious problem.

Please note that for situations like this, the normal judgment logic is: even if a real-time business system triggers FullGC, it should still be able to reclaim memory back to a normal level. If the heap is simply too small, we can enlarge it. But if the usable heap keeps shrinking until even FullGC can no longer reclaim memory, then there is a real problem.

Therefore, for such issues, we need to analyze them from two perspectives:

  1. The memory is genuinely in use (still referenced by live objects), so FullGC cannot reclaim it.
  2. There is a memory leak, and it has already grown to the point where FullGC cannot reclaim the memory.

For targeted monitoring analysis, we need to consider these two aspects.
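To make the first case concrete, here is a deliberately contrived Java sketch (not code from the project): memory that FullGC cannot free simply because it is still referenced, for example an unbounded query result accumulated into a live collection. The class name and numbers are made up for illustration.

import java.util.ArrayList;
import java.util.List;

public class RetainedResultSketch {
    // Rows copied out of a result set and kept here stay strongly referenced,
    // so no number of FullGCs can release them. This is "memory in use",
    // not a classic leak, but the symptom (Old gen stuck at 100%) looks identical.
    private static final List<byte[]> rows = new ArrayList<>();

    public static void main(String[] args) {
        for (int i = 0; i < 4_000_000; i++) {
            rows.add(new byte[512]); // stands in for one row of an unbounded SELECT
        }
        // With a small heap this ends in OutOfMemoryError after repeated futile FullGCs.
        System.out.println("retained rows: " + rows.size());
    }
}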

Targeted Monitoring Analysis #

Since the memory is already full, let’s run jmap -histo:live 1 | head -n 50 to see which classes occupy the largest share of memory:

[root@svc-mall-order-7fbdd7b85f-ks828 /]# jmap -histo:live 1|head -n 50

 num     #instances         #bytes  class name
----------------------------------------------
  1:      74925020     2066475600  [B
  2:       2675397      513676056  [[B
  3:       2675385       85612320  com.mysql.cj.protocol.a.result.ByteArrayRow
  4:       2675386       42806176  com.mysql.cj.protocol.a.MysqlTextValueDecoder
  5:        246997       27488016  [C
  6:         80322       16243408  [Ljava.lang.Object;

Memory Analysis #

To analyze the memory, we can set aside the JDK’s own objects and focus on business-related ones. Looking past the byte arrays at the top of the list, the biggest consumers are com.mysql.cj.protocol classes, which are clearly related to SQL result sets. So we check the innodb_trx table to see whether any long-running SQL statements are in flight.
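For reference, the check can be expressed as a small JDBC query against information_schema.innodb_trx; this is only a sketch with placeholder connection details, and on the real system you would simply run the SELECT in a MySQL client:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class InnodbTrxCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; point this at the mall database.
        String url = "jdbc:mysql://mall-db:3306/information_schema";
        try (Connection conn = DriverManager.getConnection(url, "monitor", "secret");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(
                     "SELECT trx_id, trx_started, " +
                     "       TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS running_seconds, trx_query " +
                     "FROM information_schema.innodb_trx " +
                     "ORDER BY trx_started")) {
            while (rs.next()) {
                // A transaction that has been running for many seconds during a load test
                // is the first suspect for the slow SQL we are hunting.
                System.out.printf("%s | %ss | %s%n",
                        rs.getString("trx_id"), rs.getLong("running_seconds"), rs.getString("trx_query"));
            }
        }
    }
}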

During the investigation, we found the following SQL query:

select id, member_id, coupon_id, order_sn, create_time, member_username, total_amount, pay_amount, freight_amount, promotion_amount, integration_amount, coupon_amount, discount_amount, pay_type, source_type, status, order_type, delivery_company, delivery_sn, auto_confirm_day, integration, growth, promotion_info, bill_type, bill_header, bill_content, bill_receiver_phone, bill_receiver_email, receiver_name, receiver_phone, receiver_post_code, receiver_province, receiver_city, receiver_region, receiver_detail_address, note, confirm_status, delete_status, use_integration, payment_time, delivery_time, receive_time, comment_time, modify_time
from oms_order
WHERE (id = 0 and status = 0 and delete_status = 0)

I then ran this statement and found that it matches 4,358,761 records. Clearly there is a problem in the code, so let’s find out where this SQL is called.

By reviewing the code, I found the following logic:

example.createCriteria().andIdEqualTo(orderId).andStatusEqualTo(0).andDeleteStatusEqualTo(0);
List<OmsOrder> cancelOrderList = orderMapper.selectByExample(example);

This code corresponds to the following select statement:

<select id="selectByExample" parameterType="com.dunshan.mall.model.OmsOrderExample" resultMap="BaseResultMap">
    select
    <if test="distinct">
        distinct
    </if>
    <include refid="Base_Column_List" />
    from oms_order
    <if test="_parameter != null">
        <include refid="Example_Where_Clause" />
    </if>
    <if test="orderByClause != null">
        order by ${orderByClause}
    </if>
</select>

This is a typical case of a query with no limit on the result set. A project written like this is fit for a demo at most; in a real production system it could cause considerable damage.

Here, we can simply modify the code to add a limit so that it does not fetch all of the matching data at once.
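One possible shape of the fix, assuming the project already uses the PageHelper pagination plugin (an assumption on my part; a hand-written LIMIT in the mapper XML works just as well), is to bound the query at the call site. The class, method, and package names below are illustrative:

import java.util.List;
import com.github.pagehelper.PageHelper;
import com.dunshan.mall.model.OmsOrder;
import com.dunshan.mall.model.OmsOrderExample;
import com.dunshan.mall.mapper.OmsOrderMapper; // assumed package for the generated mapper

public class CancelOrderQuerySketch {

    // Same criteria as the original code, but PageHelper appends "LIMIT 1" to the
    // generated SQL, so the result set can no longer flood the service's heap.
    public OmsOrder findCancelOrder(OmsOrderMapper orderMapper, Long orderId) {
        OmsOrderExample example = new OmsOrderExample();
        example.createCriteria().andIdEqualTo(orderId).andStatusEqualTo(0).andDeleteStatusEqualTo(0);
        PageHelper.startPage(1, 1); // only the very next query is paginated
        List<OmsOrder> list = orderMapper.selectByExample(example);
        return list.isEmpty() ? null : list.get(0);
    }
}

Whatever form the change takes, the point is that a lookup that can only ever need one row must not be allowed to return millions.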

Let’s see the optimization effect:

TPS improvement

As you can see, there are no more TPS interruptions, and the optimization clearly had an effect: the SQL no longer drags back so much data that it exhausts the memory.

However, the TPS has not increased much, so we need a third phase of analysis.

Phase 3 #

This time, we won’t look at the global monitoring data. Based on our previous experience, we will go straight to targeted monitoring analysis.

Targeted Monitoring Analysis #

Because we modified the SQL logic earlier, we check the innodb_trx table again while the scenario runs to see whether any slow SQL statements remain. As a result, we found the following statement:

Let’s take a look at its execution plan:

This is another typical full-table scan, and this time it appears in an update statement. Seeing this, don’t you feel the urge to drag the developers out and hold them to account?

Since generating order information is a complex process, we won’t rush to optimize this SQL statement. Instead, let’s analyze the entire slow log first.

Please note that when a project runs many scenarios, the data from different runs can interfere with each other and make our analysis inaccurate. So it is best to clear the slow log first. Personally, I always do this, simply because I don’t want to stare at messy data.

After clearing the slow log and re-executing the scenario, I pulled the slow log again and analyzed it with pt-query-digest (we covered this in Lesson 16; if you don’t remember, I suggest reviewing it), and obtained the following data:

# Profile
# Rank Query ID                     Response time   Calls R/Call   V/M   I
# ==== ============================ =============== ===== ======== ===== =
#    1 0x2D9130DB1449730048AA1B5... 1233.4054 70.5%     3 411.1351  2.73 UPDATE oms_order
#    2 0x68BC6C5F4E7FFFC7D17693A...  166.3178  9.5%  2677   0.0621  0.60 INSERT oms_order
#    3 0xB86E9CC7B0BA539BD447915...   91.3860  5.2%  1579   0.0579  0.01 SELECT ums_member
#    4 0x3135E50F729D62260977E0D...   61.9424  3.5%     4  15.4856  0.30 SELECT oms_order
#    5 0xAE72367CD45AD907195B3A2...   59.6041  3.4%     3  19.8680  0.13 SELECT oms_order
#    6 0x695C8FFDF15096AAE9DBFE2...   49.1613  2.8%  1237   0.0397  0.01 SELECT ums_member_receive_address
#    7 0xD732B16862C1BC710680BB9...   25.5382  1.5%   471   0.0542  0.01 SELECT oms_cart_item
# MISC 0xMISC                         63.2937  3.6%  1795   0.0353   0.0 <9 ITEMS>

From the above Profile information, we can see that the first statement consumed 70.5% of the total time, and the second statement consumed 9.5%. When solving performance issues, we are actually targeting these time-consuming SQL statements. We can temporarily ignore the SQL statements with short execution time.

Usually, in this case, you can focus on solving the first statement, and then conduct regression testing to evaluate the effect before deciding whether to address the second problem. Let me list these two complete SQL statements here:

1. UPDATE oms_order SET member_id = 260869, order_sn = '202102030100205526', create_time = '2021-02-03 01:05:56.0', member_username = '7dcmppdtest15176472465', total_amount = 0.00, pay_amount = 0.00, freight_amount = 0.00, promotion_amount = 0.00, integration_amount = 0.00, coupon_amount = 0.00, discount_amount = 0.00, pay_type = 0, source_type = 1, STATUS = 4, order_type = 0, auto_confirm_day = 15, integration = 0, growth = 0, promotion_info = '', receiver_name = '6mtf3', receiver_phone = '18551479920', receiver_post_code = '66343', receiver_province = 'Beijing', receiver_city = '7dGruop性能实战', receiver_region = '7dGruop性能实战区', receiver_detail_address = '3d16z吉地12号', confirm_status = 0, delete_status = 0 WHERE id = 0;


2. INSERT INTO oms_order (member_id, coupon_id, order_sn, create_time, member_username, total_amount, pay_amount, freight_amount, promotion_amount, integration_amount, coupon_amount, discount_amount, pay_type, source_type, status, order_type, delivery_company, delivery_sn, auto_confirm_day, integration, growth, promotion_info, bill_type, bill_header, bill_content, bill_receiver_phone, bill_receiver_email, receiver_name, receiver_phone, receiver_post_code, receiver_province, receiver_city, receiver_region, receiver_detail_address, note, confirm_status, delete_status, use_integration, payment_time, delivery_time, receive_time, comment_time, modify_time) VALUES (391265, null, '202102030100181755', '2021-02-03 01:01:03.741', '7dcmpdtest17793405657', 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, null, null, 15, 0, 0, '', null, null, null, null, null, 'belod', '15618648303', '93253', '北京', '7dGruop性能实战', '7dGruop性能实战区', 'hc9r1吉地12号', null, 0, 0, null, null, null, null, null);

Let’s start with the first statement. Although this update statement is not called often, it is particularly slow. Clearly it should not be a real-time API call, so let’s find out which business calls it. Looking at the statement, it updates rows where id = 0 in the WHERE clause, which looks like a batch job.

Now for the second statement. This insert is called many times and is clearly real-time transaction SQL. The usual way to speed up inserts is to batch them, and on the database side the bulk_insert_buffer_size parameter (default 8 MB) controls the buffer used for bulk inserts. I checked this parameter: it has not been tuned and is still at its default value.
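As a general illustration of insert batching on the application side (separate from the bulk_insert_buffer_size tuning above, and not the project’s actual change), JDBC batching combined with the MySQL driver’s rewriteBatchedStatements option collapses many single-row inserts into multi-row statements. Host, credentials, and the reduced column list are placeholders, and whether batching fits a per-user real-time order call is a business decision:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;

public class BatchInsertSketch {
    public static void main(String[] args) throws SQLException {
        // rewriteBatchedStatements=true lets the MySQL driver rewrite the batch into
        // multi-row INSERTs, which is usually far faster than row-by-row inserts.
        String url = "jdbc:mysql://mall-db:3306/mall?rewriteBatchedStatements=true"; // placeholder
        try (Connection conn = DriverManager.getConnection(url, "mall", "secret")) {
            conn.setAutoCommit(false);
            String sql = "INSERT INTO oms_order (member_id, order_sn, create_time, status) VALUES (?, ?, ?, ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                for (int i = 0; i < 1000; i++) {
                    ps.setLong(1, 260_000L + i);
                    ps.setString(2, "2021020301" + i);
                    ps.setTimestamp(3, new Timestamp(System.currentTimeMillis()));
                    ps.setInt(4, 0);
                    ps.addBatch();     // queue the row instead of executing it immediately
                }
                ps.executeBatch();     // send the whole batch in (almost) one round trip
            }
            conn.commit();
        }
    }
}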

One more point to note: in the production environment, since indexes need to be added to the order table, it is recommended to set up read-write separation (master/slave) so that update, insert, and select statements do not interfere with one another.

After analyzing these two SQL statements, let’s first find the source of the first SQL. By searching the code, we can see that this statement is called here:

orderMapper.updateByPrimaryKeySelective(cancelOrder);

However, note that this updateByPrimaryKeySelective call sits in a batch task, and batch tasks should be separated from real-time transactions. If you are on a performance team giving optimization advice to the architecture or development teams, you can make the following recommendations:

  1. Read-write separation;
  2. Separate batch business and real-time business.

Here, I simply separate out this batch job and stop calling it. In real business logic you cannot do that, of course; I do it only so we can observe the effect and the direction of the optimization in the steps that follow.

After making the above changes, the TPS is as follows:

Performance-wise, TPS has reached around 300 and the response time looks stable. We can finally move on to the normal performance-analysis stage, hahaha.

However, our work is not finished. As the graph above shows, a TPS of around 300 is still low given our overall system resources, so this interface still has room for optimization. We will continue with those optimizations in the next lesson.

Summary #

In this class, we conducted analysis and optimization in three stages.

In the first stage, we modified the JDBC pool. Although TPS showed an upward trend, new issues also emerged: TPS became very unstable, with intermittent interruptions.

In the second stage, we analyzed and located the problem of memory overflow, and optimized the memory issue. Although we clearly saw the effect of optimization on the TPS curve, it still did not reach the desired level.

In the third stage, we analyzed and located the SQL problems, and the path was quite natural: since we had already touched the SQL in the second stage, in the third stage we went straight to the database. The results are good, and TPS now shows a clearly normal trend. But keep in mind that when batch tasks and real-time transactions run in the same database and operate on the same table, you need to consider whether the architecture design is reasonable.

In conclusion, this lesson shows that when a SQL query drags a huge amount of data into the application’s memory, heap usage climbs; the growing heap drives more frequent GC, which in turn consumes more CPU.

Homework #

Finally, please think about two questions:

  1. When you see the JVM’s VM Thread consuming a lot of CPU, why should you go on to check memory consumption?
  2. Why do you need to query the innodb_trx table first when analyzing SQL issues in MySQL?

Remember to discuss and exchange your thoughts with me in the comments section. Every thought you have will help you make further progress.

If you have gained something from reading this article, feel free to share it with your friends and learn and progress together. See you in the next lesson!