17 Why the Checkout Parameters in the Shopping Cart Must Match the Real Business Characteristics #

Hello, I am Gao Lou.

Today, let’s take a look at the query shopping cart interface.

So far, this is the sixth interface we have analyzed. However, I want you to understand that we analyze each interface not to understand the logic of the interface itself, but to analyze different performance issues through benchmark testing of different interfaces and bring you more analysis cases.

Many people still ask questions like "What should we do when the underlying data does not match the production environment?" while executing performance scenarios. Actually, the answer is quite simple: if the underlying data does not match production, we simply cannot reproduce the problems that occur in production.

Therefore, in this lesson, you will see what specific impact it will have on TPS when the underlying data is unreasonable. From this, you will further understand why I have been emphasizing that the underlying data must conform to the logic of the production environment.

In addition, we will analyze another issue that may leave you frustrated: we analyze it for a long time, the logic seems perfectly reasonable, yet the result is unsatisfying. How should we deal with such a situation? I will leave that as a bit of suspense and go straight into today's analysis.

Pressure Data #

For the query shopping cart interface, it’s the same as before. Let’s first look at the performance scenario results from the first run. This is performance data that is heart-wrenching from the beginning:

As you can see, as the number of threads increases, the TPS only reaches 40, while the response time keeps increasing from the beginning.

What can be done about this? According to our RESAR performance analysis logic, the first step is still to look at the architecture diagram, followed by splitting the response time. Since the response time rises continuously, splitting it will be straightforward.

Architecture Diagram #

Before we dive into breaking down the response time, let’s take a look at the architecture diagram. At this stage, you only need to get a rough idea of the diagram, as we will come back to it multiple times later.

Stage 1 Analysis #

Breaking Down Response Time #

We have been repeatedly emphasizing that when performing performance analysis, the first step is to break down the response time.

When others ask me questions, they often describe it like this: TPS is low, response time is long, where is the bottleneck? When I see such a question, I usually ask in return: Where is the long response time? Then, the classic dialogue ending appears—“I don’t know.” I really want to help the other person solve the problem, but with such a description, I don’t have a clue where to start.

As someone doing performance analysis, how can you stop at "the response time is long"? You should at least tell others where it is slow. That is why I keep emphasizing the importance of drawing architecture diagrams: with a diagram, you have a way to break down the time, and you won't be working blind with nothing to go on.

When breaking down the response time, you also need to pick the right time window. In my experience, this depends on the trend of the response time. If it is consistently long, it's simple: any time period will do. But if it is sometimes long and sometimes short, you need to be careful and select the time window in the monitoring tool precisely.

Here, we chose the SkyWalking time period: 2021-01-02 13:53:00 - 2021-01-02 13:54:00. The specific breakdown of time is as follows:

  • User - Gateway:

  • Gateway:

  • Gateway - Cart:

  • Cart:

  • Cart - MySQL:

From the captured data above, you can clearly see that the slow response time is in the Cart service.

We need to be aware that some data-capture tools can show significant deviations due to the tools themselves. For example, in the SkyWalking data for the time window above, the average response time from Gateway to Cart is 829.25ms, yet inside the Cart service it is 984.50ms. Even within the same time window, the numbers deviate.

Every monitoring tool has performance-data deviations to some degree. Take docker stats, for example: I can hardly bear to look at its numbers. That is why we sometimes need to cross-check data from multiple tools.

Targeted Monitoring Analysis #

After breaking down the response time, instead of starting with a global analysis, we jump straight into targeted monitoring. Since we already know that the Cart service is the slow part of the queryCartItems interface, we can dive in and find the slow methods.

The calling method for this interface is as follows:

/**
 * Query cart items by member id
 *
 * @param memberId The member id
 * @return
 */
@Override
public List<OmsCartItem> list(Long memberId) {
    if (memberId == null) {
        return null;
    }
    OmsCartItemExample example = new OmsCartItemExample();
    example.createCriteria().andDeleteStatusEqualTo(0).andMemberIdEqualTo(memberId);
    return cartItemMapper.selectByExample(example);
}

From the above code, we know the method name, so we can directly use Arthas to trace this interface. The command is as follows:

trace com.dunshan.mall.cart.service.imp.CartItemServiceImpl list -v -n 5 --skipJDKMethod false '1==1'

As a result, we get the following information:

[arthas@1]$ trace com.dunshan.mall.cart.service.imp.CartItemServiceImpl list -v -n 5 --skipJDKMethod false '1==1'
Condition expression: 1==1 , result: true
`---ts=2021-01-02 14:59:53;thread_name=http-nio-8086-exec-556;id=10808;is_daemon=true;priority=5;TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@18c26588
    `---[999.018045ms] com.dunshan.mall.cart.service.imp.CartItemServiceImpl$$EnhancerBySpringCGLIB$$e110d1ef:list()
        `---[998.970849ms] org.springframework.cglib.proxy.MethodInterceptor:intercept() #57

Condition expression: 1==1 , result: true
`---ts=2021-01-02 14:59:54;thread_name=http-nio-8086-exec-513;id=107d3;is_daemon=true;priority=5;TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@18c26588
    `---[1095.593933ms] com.dunshan.mall.cart.service.imp.CartItemServiceImpl$$EnhancerBySpringCGLIB$$e110d1ef:list()
        `---[1095.502983ms] org.springframework.cglib.proxy.MethodInterceptor:intercept() #57


Condition expression: 1==1 , result: true
`---ts=2021-01-02 14:59:53;thread_name=http-nio-8086-exec-505;id=1078b;is_daemon=true;priority=5;TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@18c26588
    `---[2059.097767ms] com.dunshan.mall.cart.service.imp.CartItemServiceImpl$$EnhancerBySpringCGLIB$$e110d1ef:list()
        `---[2059.013275ms] org.springframework.cglib.proxy.MethodInterceptor:intercept() #57


Condition expression: 1==1 , result: true
`---ts=2021-01-02 14:59:54;thread_name=http-nio-8086-exec-541;id=107f6;is_daemon=true;priority=5;TCCL=org.springframework.boot.web.embedded.tomcat.TomcatEmbeddedWebappClassLoader@18c26588
    `---[1499.559298ms] com.dunshan.mall.cart.service.imp.CartItemServiceImpl$$EnhancerBySpringCGLIB$$e110d1ef:list()
        `---[1499.498896ms] org.springframework.cglib.proxy.MethodInterceptor:intercept() #

From the above data, we can see that the response time of the `list()` method is indeed long. Yet this interface is not complex at all; it is just a single select. The corresponding select statement in the Mapper is as follows:

```xml
<select id="selectByExample" parameterType="com.dunshan.mall.model.OmsCartItemExample" resultMap="BaseResultMap">
    select
    <if test="distinct">
      distinct
    </if>
    <include refid="Base_Column_List" />
    from oms_cart_item
    <if test="_parameter != null">
      <include refid="Example_Where_Clause" />
    </if>
    <if test="orderByClause != null">
      order by ${orderByClause}
    </if>
  </select>
```

This Mapper corresponds to the following SQL statement in the database:

```sql
SELECT id, product_id, product_sku_id, member_id, quantity, price, product_pic, product_name, product_sub_title, product_sku_code, member_nickname, create_date, modify_date, delete_status, product_category_id, product_brand, product_sn, product_attr FROM oms_cart_item WHERE (  delete_status = 0  AND member_id = 597427  )
```

Since the select statement takes a long time to execute, let's go to the database and, based on this SQL, look at the data histogram of the table it queries. The statement is as follows:

```sql
select member_id,count(*) from oms_cart_item_202101021530 GROUP BY 1 ORDER BY 2 DESC;
```

The result is as follows (only part of the histogram data is shown):

![](../images/7990806d6ff746ab8fadd59b72fbb520.jpg)

From the data above, we can see that a large number of rows have accumulated under a single member ID. Although the select statement filters by member ID, there is no pagination. This is the simplest and most direct kind of SQL problem, and the analysis is just as simple: when we see that a SQL statement executes slowly, we check its execution plan:

![](../images/693b7079766647cfa5cfd51842213b5e.jpg)
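For reference, the plan in the image can be reproduced with a plain EXPLAIN on the captured statement (SELECT * is used here only for brevity; the real query lists its columns explicitly):

```sql
-- Check how MySQL executes the slow query captured above
EXPLAIN
SELECT *
FROM oms_cart_item
WHERE delete_status = 0
  AND member_id = 597427;
```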

Since the type value above is ALL, the query performs a full table scan. So, based on the where condition of the SQL, we need to decide which index to create; and if the where condition matches many rows, the query also needs pagination. From this analysis, two solutions come to mind easily (a sketch follows the list):

1. Create an index: so that the query can locate the rows precisely.
2. Implement pagination: so that we do not return too much data to the frontend at once.
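Here is a minimal sketch of what those two changes could look like in SQL. The index name, column order, and page size are illustrative assumptions, not the project's actual DDL:

```sql
-- 1. An index that matches the WHERE condition (delete_status = 0 AND member_id = ?),
--    so the lookup no longer needs a full table scan.
CREATE INDEX idx_member_id_delete_status
    ON oms_cart_item (member_id, delete_status);

-- 2. Page the result instead of returning every row of the member in one shot.
--    A page size of 20 is just an example.
SELECT id, product_id, product_sku_id, member_id, quantity, price
FROM oms_cart_item
WHERE delete_status = 0
  AND member_id = 597427
ORDER BY id
LIMIT 20 OFFSET 0;
```

In the application, the page number and size would normally be passed down through the Mapper call rather than hard-coded in the SQL.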

Optimization Effect #

Although we call this the "optimization effect", it would be more accurate to call it the "validation effect", because both of the actions above aim to improve the query performance of this SQL, and more specifically, to reduce the amount of data it has to handle. So let's reduce the data directly and see whether our judgment is correct.

To verify our analysis process, I will first truncate the table and see if the response time improves. If it does, then the problem lies here.

But what if it doesn't? Then I can only go back to the corner and silently shed tears. If I can't find such a simple problem, I am not a qualified performance analyst.

Anyway, let's see the result:

![](../images/c235abc3383d434fa38db295311aafca.jpg)

We can see that TPS has increased a great deal, which is very pleasing for a continuously running scenario. It looks like I can stay in this line of work.

However, our analysis does not end here. Just when we thought everything was going well, another problem appeared during the stress test, which forced us into the second stage of analysis.

Stage 2 Analysis #

What exactly is the problem this time? Here are the specific details:

What? Is that the TPS curve? Has it dropped? And by that much? And this is still a continuously running scenario. Does this mean my career is over?

This problem is a bit more complicated. Still, from the curves it is obvious that the response time has increased and the TPS has dropped. So let's continue with our usual approach of splitting the response time; I won't repeat the details here.

By splitting the time, we know that the problem with long response time lies in the Gateway. Now, let’s start the analysis based on the RESAR performance analysis logic.

Global Monitoring Analysis #

From a system-level perspective, we can clearly see that all worker nodes are not under pressure.

Let’s take a look from the perspective of Pods:

You see, some Pods are using up to 100% of CPU. Let’s sort all Pods, and here are the results:

Although Pods such as node_exporter and ES are consuming quite a bit, the CPU usage of these high-consumption Pods is capped by their resource limits. At the same time, note that the Pods with high CPU usage do not include our application Pods, which means the CPU resources of our application nodes are far from fully used.

I wanted to see the specific memory consumption on the worker where the application is running during this time period. However, there is no data available for this period:

You can see that the data in the middle is missing: node_exporter stopped reporting for a while. There is nothing we can do about that, so we have to give up on checking the memory consumption on this worker.

In that case, let’s first check which worker the Gateway is on and also see how many Pods are on this worker. We do this because in the entire Kubernetes cluster, all namespaces use the resources of the worker hosts. So, from a resource usage perspective, we need to consider Pods in all namespaces.

All Pods in all namespaces on the worker where the application is located are as follows:

  • First, query the name of the worker node where the gateway is located:

    [root@k8s-master-2 ~]# kubectl get pods --all-namespaces -o wide | grep gateway
    default         gateway-mall-gateway-6567c8b49c-pc7rf   1/1   Running   0   15h     10.100.140.2     k8s-worker-2

  • Then, query all Pods on this worker:

    [root@k8s-master-2 ~]# kubectl get pods --all-namespaces -o wide | grep k8s-worker-2
    default         elasticsearch-client-1                  1/1   Running   4   20d     10.100.140.28    k8s-worker-2
    default         elasticsearch-data-2                    1/1   Running   0   4d2h    10.100.140.35    k8s-worker-2
    default         elasticsearch-master-2                  1/1   Running   4   20d     10.100.140.30    k8s-worker-2
    default         gateway-mall-gateway-6567c8b49c-pc7rf   1/1   Running   0   15h     10.100.140.2     k8s-worker-2
    kube-system     calico-node-rlhcc                       1/1   Running   0   2d5h    172.16.106.149   k8s-worker-2
    kube-system     coredns-59c898cd69-sfd9w                1/1   Running   4   36d     10.100.140.31    k8s-worker-2
    kube-system     kube-proxy-l8xf9                        1/1   Running   6   36d     172.16.106.149   k8s-worker-2
    monitoring      node-exporter-mjsmp                     2/2   Running   0   4d17h   172.16.106.149   k8s-worker-2
    nginx-ingress   nginx-ingress-nbhqc                     1/1   Running   0   5d19h   10.100.140.34    k8s-worker-2
    [root@k8s-master-2 ~]#

From the above results, we can see that there are 9 Pods on our worker node.

However, when we initially looked at the global resource information, we did not find that the resource usage of the entire worker node was high. This is because we have already limited the resources within the Pods. So let’s list the resource limits for each Pod:
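For example, the limits of a single Pod can be pulled out on the command line like this, using the Gateway Pod from the listing above:

    kubectl describe pod gateway-mall-gateway-6567c8b49c-pc7rf | grep -A 5 -E 'Limits|Requests'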

For those Pods that do not consume high resources, we will not look into them.

Since there are resource limitations, let’s turn our attention back to the Gateway.

Targeted Monitoring Analysis #

By looking at the link time, we can also see that it takes a long time on the Gateway:

But what is this sendRequest for? I don’t know.

Let’s conduct an experiment and see what the TPS is after skipping the Gateway.

As can be seen, when going through the Gateway, the TPS can only reach around 400; when bypassing the Gateway, the TPS can reach over 800. Therefore, the problem indeed lies with the Gateway.

At this point, there is a missing link that we need to check, which is the health status of the Java process inside the Kubernetes container. Because we have checked the worker and the worker’s Pod, we have reached the third level, which is the Java application inside the Pod.

Don't worry about this. Think about what a Java application has: essentially a heap and thread stacks. Let's print out the stack of the Gateway first.
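The jstat and jinfo sessions later in this lesson show that the JDK tools are available inside the Gateway image, so the stack can be dumped right in the container or via kubectl exec; the output path below is just an example:

    # inside the Gateway container; PID 1 is the Java process, as in the jstat session later
    jstack -l 1 > /tmp/gateway-stack.txt

    # or from a node with kubectl access
    kubectl exec gateway-mall-gateway-6567c8b49c-pc7rf -- jstack -l 1 > gateway-stack.txt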

From the stack, I couldn't find anything wrong; the whole state looks quite reasonable. Please note that I did not just look at one screenshot; I went through the entire stack. Since the CPU usage is not high, when reading the stack we mainly look for lock waits. From the image above, we can see there are no locks and all the waits are reasonable.

After examining the stack, the next step is to look at the heap. We need to find a way to extract the Java process heap from Kubernetes and take a look:

See! There is such a consistent relationship: the TPS and the Gateway’s GC trend are completely identical.

However, just looking at this is not specific enough, we need more detailed data. So, let’s go in and take a look at the GC status:

[root@gateway-mall-gateway-6567c8b49c-pc7rf /]# jstat -gcutil 1 1000 1000
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT   
  0.00  55.45  45.33  52.96  94.74  92.77  38427 1953.428    94  113.940 2067.368
 57.16   0.00  26.86  53.24  94.74  92.77  38428 1954.006    94  113.940 2067.946
  0.00  54.30  15.07  53.65  94.74  92.77  38429 1954.110    94  113.940 2068.050
 39.28   0.00  18.39  53.84  94.74  92.77  38430 1954.495    94  113.940 2068.435
 39.28   0.00  81.36  53.84  94.74  92.77  38430 1954.495    94  113.940 2068.435
  0.00  26.13  68.79  53.84  94.74  92.77  38431 1954.597    94  113.940 2068.537
 39.18   0.00  59.75  53.84  94.74  92.77  38432 1954.683    94  113.940 2068.624
  0.00  24.70  76.28  53.84  94.74  92.77  38433 1954.794    94  113.940 2068.734

Look: one YGC takes about 100ms, and there is roughly one YGC per second, so YGC accounts for about 10% of the elapsed time, which is a bit too much.

Since YGC is consuming a noticeable share of CPU time, let's consider tuning the Java GC parameters. First, let's look at the current Java parameters:

[root@gateway-mall-gateway-6567c8b49c-pc7rf /]# jinfo -flags 1
Attaching to process ID 1, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.242-b08
Non-default VM flags: -XX:CICompilerCount=2 -XX:InitialHeapSize=262144000 -XX:+ManagementServer -XX:MaxHeapSize=4164943872 -XX:MaxNewSize=1388314624 -XX:MinHeapDeltaBytes=196608 -XX:NewSize=87359488 -XX:OldSize=174784512 -XX:+UseCompressedClassPointers -XX:+UseCompressedOops 
Command line:  -Dapp.id=svc-mall-gateway -javaagent:/opt/skywalking/agent/skywalking-agent.jar -Dskywalking.agent.service_name=svc-mall-gateway -Dskywalking.collector.backend_service=skywalking-oap:11800 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=1100 -Dcom.sun.management.jmxremote.rmi.port=1100 -Djava.rmi.server.hostname=localhost -Dspring.profiles.active=prod -Djava.security.egd=file:/dev/./urandom
[root@gateway-mall-gateway-6567c8b49c-pc7rf /]# 

From the output above, you can see that no GC collector parameters were configured for this Java process on Kubernetes, so let's add the relevant parameters.

In the configuration below, I added the PrintGC-related parameters and enabled the ParNew collector:

[root@gateway-mall-gateway-6c6f486786-mnd6j /]# jinfo -flags 1
Attaching to process ID 1, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.261-b12
Non-default VM flags: -XX:CICompilerCount=2 -XX:CompressedClassSpaceSize=1065353216 -XX:+HeapDumpOnOutOfMemoryError -XX:InitialHeapSize=2147483648 -XX:+ManagementServer -XX:MaxHeapSize=2147483648 -XX:MaxMetaspaceSize=1073741824 -XX:MaxNewSize=1073741824 -XX:MetaspaceSize=1073741824 -XX:MinHeapDeltaBytes=196608 -XX:NewSize=1073741824 -XX:OldSize=1073741824 -XX:ParallelGCThreads=6 -XX:+PrintGC -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseParNewGC 
Command line:  -Dapp.id=svc-mall-gateway -javaagent:/opt/skywalking/agent/skywalking-agent.jar -Dskywalking.agent.service_name=svc-mall-gateway -Dskywalking.collector.backend_service=skywalking-oap:11800 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=1100 -Dcom.sun.management.jmxremote.rmi.port=1100 -Djava.rmi.server.hostname=localhost -Xms2g -Xmx2g -XX:MetaspaceSize=1g -XX:MaxMetaspaceSize=1g -Xmn1g -XX:+UseParNewGC -XX:ParallelGCThreads=6 -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCDetails -Xloggc:gc.log -Dspring.profiles.active=prod -Djava.security.egd=file:/dev/./urandom
[root@gateway-mall-gateway-6c6f486786-mnd6j /]#

Originally, I was hoping that ParNew could be of some use, but it didn’t have much effect.

Since adding parameters does not immediately yield results, we need to see what is being collected during YGC, and then decide where to start cleaning up the memory consumption of the Java process. So, let’s print the jmap histo information to see the changes in memory consumption of objects, as shown below:

[root@gateway-mall-gateway-6c6f486786-mnd6j /]# jmap -histo 1 | head -20


num     #instances         #bytes  class name
----------------------------------------------
   1:       2010270      124874960  [C
   2:        787127       91014984  [I
   3:        601333       42467920  [Ljava.lang.Object;
To summarize what we have found so far:

| Problem | Cause | Solution |
| --- | --- | --- |
| YGC is frequent, but object memory is reclaimed well | Objects are created and destroyed quickly | Look at the delta of object creation and destruction |
| YGC takes a relatively long time | No specific culprit located yet | Analyze the GC log further |
| ACPI cache size error | The kernel and the BIOS fail to negotiate consistently | Upgrade the kernel and load the relevant modules |
| TPS has not dropped | Unknown | Analyze the problem further |

Given the current situation, further analysis is still needed before we can find the solution.

From the above data, it can be inferred that the decrease in TPS is closely related to Ingress. Let’s take a look at the logs of Ingress:

root@nginx-ingress-m9htx:/var/log/nginx# ls -lrt
total 0
lrwxrwxrwx 1 root root 12 Sep 10  2019 stream-access.log -> /proc/1/fd/1
lrwxrwxrwx 1 root root 12 Sep 10  2019 error.log -> /proc/1/fd/2
lrwxrwxrwx 1 root root 12 Sep 10  2019 access.log -> /proc/1/fd/1
root@nginx-ingress-m9htx:/proc/1/fd# ls -lrt
total 0
lrwx------ 1 root root 64 Jan  7 18:00 7 -> 'socket:[211552647]'
lrwx------ 1 root root 64 Jan  7 18:00 4 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jan  7 18:00 3 -> 'socket:[211552615]'
l-wx------ 1 root root 64 Jan  7 18:00 2 -> 'pipe:[211548854]'
l-wx------ 1 root root 64 Jan  7 18:00 1 -> 'pipe:[211548853]'
lrwx------ 1 root root 64 Jan  7 18:00 0 -> /dev/null
root@nginx-ingress-m9htx:/proc/1/fd# find ./ -inum 212815739
root@nginx-ingress-m9htx:/proc/1/fd# find ./ -inum 212815740

How frustrating! As you can see, the logs are symlinked straight to the standard output and standard error of process 1, so they only exist in the container's log stream. Fine, let's trace the logs through the Kubernetes management tool instead. However, the result is empty. Now what?
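For reference, since the logs go to the container's stdout and stderr, the equivalent command-line check would be something like the following, with the Pod name and namespace taken from the listings above:

    kubectl logs -n nginx-ingress nginx-ingress-m9htx
    kubectl logs -n nginx-ingress nginx-ingress-m9htx --previous   # the previous container instance, if it restarted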

From the following image, we can also see that when the pressure goes through this Ingress, errors inevitably occur, and the greater the pressure, the more errors occur.

However, even at this point we have no other logs to analyze. There is not much left to do except check the Ingress version, and it turns out that a newer version is available.

To avoid stepping into some pitfalls of Ingress itself, I switched its version from 1.5.5 to 1.9.1, and got the following result:

As you can see from the image, there are no errors anymore, so it seems that those errors were caused by the Ingress version.

However, even so, we still haven’t solved the issue of TPS dropping. You might say that the TPS in the image above did not drop, right? Well, that is just an illusion. In the above scenario, we only conducted the test for the purpose of verifying the issues with Ingress, so the execution time was not long.

Please note that we still have not solved the TPS-drop problem mentioned earlier. There may be two problems here: one is Ingress, and the other may lie somewhere else that we have not yet verified. So we need to go back to the main line and keep analyzing.

Back to the main line #

After all the twists and turns, do you feel dizzy? When we are deeply involved in some technical details, we must stay alert.

Based on my experience, at this point it helps to draw the architecture diagram again, on paper. Having drawn one earlier does not mean we can skip this step: redrawing the diagram helps us organize our thoughts, and this time we draw it in more detail:

After sorting out my thoughts, I use segmented testing to determine which layer the problem is tied to: since the Cart service is normally called through the Gateway, here I call the Cart service directly, bypassing the Gateway; I also skip Ingress and expose the service directly through a NodePort to see whether TPS still drops (a sketch of the NodePort step follows).
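A sketch of exposing the Cart service through a NodePort for this experiment. The Service name svc-mall-cart is my assumption based on the Gateway's naming (svc-mall-gateway); substitute the real Service name in your cluster:

    kubectl patch svc svc-mall-cart -p '{"spec":{"type":"NodePort"}}'   # hypothetical Service name
    kubectl get svc svc-mall-cart                                       # note the allocated node port, then send the load to <workerIP>:<nodePort>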

First, I directly push the pressure into the NodePort of the cart service and get the following result:

This means that the Cart service itself can cause the TPS drop, and the drop does not seem to follow any regular pattern.

So I modified the Tomcat parameters, increased the thread and connection counts, and tried again. Why adjust it this way? Because when checking the health of the application threads, I noticed that the Tomcat threads in Spring Boot were very busy (an illustrative configuration follows).
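As a rough illustration, in a Spring Boot service these Tomcat settings live in the application configuration. The values below are made-up examples, and note that in Spring Boot 2.3+ the thread properties are spelled server.tomcat.threads.max and server.tomcat.threads.min-spare instead:

    # illustrative values only
    server.tomcat.max-threads=500
    server.tomcat.max-connections=10000
    server.tomcat.accept-count=1000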

After verifying this several times, I found that TPS did not drop.

To further verify whether TPS was really tied to the Tomcat thread and connection settings, I changed the configuration back and re-ran the scenario to see if the Tomcat parameters were the issue.

The result is that the TPS drop issue did not reoccur!

I was so frustrated that I had to go eat some spicy hot pot to vent. I had already seen that the TPS drop was related to garbage collection (GC); furthermore, in the GC analysis I found that HashMap$Node objects in the servlet layer were being created and reclaimed rapidly, which suggested that YGC costs more as the pressure grows. That is why I adjusted the Tomcat-related parameters. Yet now, under the same pressure, the problem cannot be reproduced, which is truly exasperating.

Intermittent problems like this are quite difficult to pin down. I suspect the apparent stability of TPS may be related to a restart in the middle. Speaking of restarts, they really are the ultimate trick of our trade.

Since the problem cannot be reproduced and the logs from when it occurred are no longer available, we have no choice but to let it go for now.

Although our analysis of this problem was logical from beginning to end, we still have not found the root cause. If it is an intermittent problem, that means we failed to capture its cause at the moment it occurred.

For a project, if an intermittent problem has an unacceptable impact on the business, we must spend real effort on solving it; if the impact is minor, it can be set aside for the moment. However, every problem has a deterministic cause: what looks random is, at the technical level, entirely inevitable.

So, what is the cause of this problem? Here, I will leave it as a suspense, because if we continue analyzing it, this lesson will be too long, and you may also feel tired. We will continue the analysis in the next lesson.

Summary #

In this lesson, we discussed performance analysis in two stages.

The first stage is relatively simple, which is about query optimization. For queries, it is best to have precise lookups during real-time transactions. If range queries are required, pagination is necessary.

However, for large range queries, it not only puts pressure on the network but also significantly affects various layers such as applications and databases. Therefore, when range queries occur, we must make good technological choices. When the business requires such range queries, you can consider switching components, such as using big data solutions.

The second stage is a bit troublesome, although we spent a lot of time and effort, in the end, we did not find the root cause. However, our analysis direction and approach are correct.

Random-appearing issues like this often occur in real projects as well. In our analysis, we may eventually find that it is a very simple problem, which can be quite frustrating. We will explain the root cause of this problem in the next lesson.

Anyway, in this lesson, we have still provided a complete description of the analysis logic, hoping to give you some comprehensive thoughts.

Homework #

Finally, please take some time to think about the following questions:

  1. In real-time trading, how can we quickly identify performance issues caused by the amount of data in the database? What is the evidence chain for targeted analysis?
  2. How can high CPU usage be traced to performance issues caused by GC efficiency?

Remember to discuss and exchange your ideas with me in the comments section. Each thought you have will help you progress further.

If you have gained something from reading this article, feel free to share it with your friends and learn and grow together. See you next lesson!