
24 Capacity Scenarios, Part 1: Index Optimization and How to Handle Uneven Distribution of Kubernetes Resources #

Hello, I’m Gao Lou.

As we know, the purpose of capacity scenarios is to answer the question of “what is the maximum capacity we can achieve online?” This requires us to be very rigorous in both designing and executing them, which also means that running a capacity scenario is a challenging process, as you will experience for yourself in this lesson.

Today, we will focus on two issues: index optimization and uneven Kubernetes scheduling. Regarding index optimization, you may wonder why we still need to look at index issues when we have already been through the baseline scenarios. That is a fair question, and it shows that capacity scenarios really are different from baseline scenarios: once the businesses run together, they start to affect one another.

The problem of uneven Kubernetes scheduling results in multiple Pods being crowded onto the same worker. This kind of issue typically does not surface until the capacity scenario, so I hope this case can serve as a reference for you when dealing with similar problems.

In addition, we will also look at what to do when the response time keeps increasing under stable pressure. This kind of problem is common, but the specific points of failure are different each time. This time, you will see a concrete case.

Alright, let’s get started!

Scenario Execution Data #

First Execution #

I have to admit that I was quite nervous when running the capacity scenario for the first time.

First, we carefully set the business mix of the capacity scenario, that is, the business ratios we discussed in Lesson 5. Then we configured the corresponding parameterization and correlations. After that, we executed the capacity scenario and obtained the following information:

A bit overwhelming, isn’t it?! But “deal with challenges and find solutions” is our motto, right? Since there are errors, let’s solve them first.

Let’s take a look at the error message:

It seems two scripts have problems. After running these two scripts separately, I found that their parameterized data was configured incorrectly: each interface used to run as a standalone script, but now they all run together, so the parameterized data had to be reconfigured. I won’t go into detail about such a simple error; as long as you are careful, you will find the fix quickly.

After fixing this simple script error, let’s execute the capacity scenario again:

Setting aside how the performance looks, you can see that there are no error messages now, right? Very well, let’s move on to the other issues in the capacity scenario.

Second Execution #

After solving the first problem, we ran the scenario for a while and saw the following scenario execution data:

The data doesn’t look good; there are still errors. Let’s see what happened:

Judging from the data, the error messages should not be script-related; some component had a problem that caused all the scripts to stop running properly.

By checking the Pod view of the global monitoring, we noticed the following phenomenon:

Recall that, to prevent Elasticsearch from affecting other services, we set its resource limit quite strictly (only 1C). As analyzed in Lesson 15, scaling out the Elasticsearch Pods did improve its performance, but I reverted that change so it would not interfere with the subsequent optimization. For now, let’s disable the business that depends on Elasticsearch, namely the product query, and run the capacity scenario again.
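By the way, if you want to confirm the limits a Pod is actually running with, a quick check like the following works (a minimal sketch; filtering the Pod list by the name “elasticsearch” is my assumption and depends on how it was deployed):

# List the Elasticsearch Pods and the workers they run on (name filter is an assumption)
kubectl get pods -o wide | grep -i elasticsearch

# Show the CPU/memory requests and limits actually applied to one of those Pods
kubectl describe pod <elasticsearch-pod-name> | grep -A 6 "Limits"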

Third Execution #

Let’s take a look at the results of the third execution:

There are still a few errors. Let’s first see where the errors are, and then deal with the long response times.

Here, let me explain one point. For errors, we can start by reading the error message and checking whether the script has a problem. However, since we are already in the capacity scenario, a script problem would have shown up back in the baseline scenarios. So at this point, it is more reasonable to go straight to the application logs.
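In this Kubernetes deployment, pulling a service’s recent errors is straightforward (a sketch; the deployment names follow the svc-mall-* pattern visible in the top output later in this lesson, and svc-mall-cart / svc-mall-member in particular are my assumptions):

# Tail the most recent log lines of the Cart and Member services and filter for errors
kubectl logs --tail=200 deploy/svc-mall-cart | grep -i error
kubectl logs --tail=200 deploy/svc-mall-member | grep -i error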

So, by checking the logs layer by layer, we noticed the Cart information:

2021-01-26 00:20:43.585 ERROR 1 --- [o-8086-exec-669] o.a.c.c.C.[.[.[/].[dispatcherServlet]    : Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is feign.RetryableException: timeout executing GET http://mall-member/sso/feign/info] with root cause
java.net.SocketException: Socket closed

Clearly, the Cart service is hitting a timeout on a remote call (a Feign call), and the call goes to the Member service.

Since it is an error caused by the Cart service calling the Member service, let’s now look at the logs of the Member service:

2021-01-26 00:20:46.094 ERROR 1 --- [o-8083-exec-308] o.a.c.c.C.[.[.[/].[dispatcherServlet]    : Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is feign.RetryableException: timeout executing POST http://mall-auth/oauth/token?password=123456&grant_type=password&client_secret=123456&client_id=portal-app&username=7dcmtest13657257045] with root cause
java.net.SocketTimeoutException: timeout

Look, the Member service reports a timeout when calling the Auth service. However, when I checked the Auth service, its log was clean and there were no errors at all.

Based on my experience, the error may be due to a momentary problem that resulted in a read timeout. We’ll have to wait for it to happen again before dealing with it.

Since we haven’t found the specific problem for the error message yet, let’s continue running the scenario. Let’s see the results:

As you can see, there are indeed a few errors. Overall they are just a handful of read timeouts, which is normal in performance testing. Moreover, some Java applications in the current deployment are still single instances, so this situation is hard to avoid entirely.

However, the figure above also shows a more obvious problem: the TPS keeps decreasing. We need to solve this problem first.

Phase 1 Analysis #

Directed Monitoring Analysis #

Although the overall TPS is not high, our first problem is that the TPS will continue to decrease under sustained pressure.

Since the capacity scenario we are running covers multiple businesses, we need to determine which business is driving the TPS trend shown in the graph. In a capacity scenario where the scripts run in serial, when one script runs into this kind of problem, all the businesses slow down with it, even those that are not slow themselves, and the overall TPS is dragged down.

However, for the business interfaces without any problems, their response time will not be affected. The business interfaces with longer response times naturally become our main focus.

Through investigating each business interface one by one, I found that the response times of three businesses are increasing as the scenario continues:


These three businesses are “Query Order List before Payment”, “Payment Order Information”, and “Query Order Details after Payment”. In the baseline scenario analysis, however, they did not exhibit this behavior, which means these are new issues that appear only in the capacity scenario.

Since these three business scripts all have this issue, let’s continue to analyze where the time delay specifically occurs. Let’s start with the simplest business chain, the “paySuccess” (Payment Order Information) interface.

We already discussed the architecture of this interface in Lesson 22: JMeter - Gateway - Order - MySQL. (If other technical components were involved, the diagram would be more complex; since we have not seen their impact yet, I won’t include them here for now.)

By working through the global monitoring counters of the performance analysis decision tree one by one, I found a SQL statement that keeps getting slower:

SELECT id, member_id, coupon_id, order_sn, create_time, member_username, total_amount, pay_amount, freight_amount, promotion_amount, integration_amount, coupon_amount, discount_amount, pay_type, source_type, STATUS, order_type, delivery_company, delivery_sn, auto_confirm_day, integration, growth, promotion_info, bill_type, bill_header, bill_content, bill_receiver_phone, bill_receiver_email, receiver_name, receiver_phone, receiver_post_code, receiver_province, receiver_city, receiver_region, receiver_detail_address, note, confirm_status, delete_status, use_integration, payment_time, delivery_time, receive_time, comment_time, modify_time
FROM oms_order
WHERE (delete_status = 0 AND member_id = 277673)
ORDER BY create_time DESC LIMIT 4, 1;

Next, let’s take a look at the indexes for this SELECT statement:

Huh, why are there two possible indexes? Let’s take a closer look, and indeed, there are two indexes on the same column.

Let’s delete one of them for now. Note that deleting the duplicate index is not meant to solve the performance problem; it is just a quick bit of housekeeping along the way.
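As a sketch of that housekeeping (the database name mall and the index name idx_member_id_2 are assumptions; use the names that SHOW INDEX actually reports):

# List all indexes on the order table to spot two indexes on the same column
mysql -u root -p mall -e "SHOW INDEX FROM oms_order;"

# Drop the redundant one (index name is hypothetical, taken from the output above)
mysql -u root -p mall -e "ALTER TABLE oms_order DROP INDEX idx_member_id_2;"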

Now the index data is clear: as the capacity scenario keeps running, this SQL statement scans more and more rows. Since this SELECT keeps slowing down, let’s check the distribution (the histogram) of the column it filters on. This is what I found:

You can see that a single user has a great many orders, which means the orders are being piled up under the same users. So let’s check whether the user data in the parameterization file is duplicated as well.
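Here is one way to check both sides (a sketch; the database name mall, the file name member_token.csv, and the assumption that member_id is the first column are all mine, so adjust them to your own names):

# How many orders does each member have? A heavy skew points to duplicated users.
mysql -u root -p mall -e "SELECT member_id, COUNT(*) AS order_cnt FROM oms_order GROUP BY member_id ORDER BY order_cnt DESC LIMIT 10;"

# Count duplicated member_ids in the parameter file (first column assumed)
awk -F, '{print $1}' member_token.csv | sort | uniq -c | sort -rn | head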

Through investigation, we found that there is indeed a large amount of duplicate data in the parameter file. Let’s sort out the parameterization logic of the business scripts:

Since the order data is duplicated, tracing backward, it must come from duplicated shopping-cart data; tracing further back, it should be the member_id behind the user token that is duplicated. After checking all the business scripts, we found that the culprit is the token generated by the script that obtains the user parameters:

And the data in this script uses such a parameter file:

This parameter file contains only 100 records, and there are a large number of duplicate member_ids among them. Now the problem is clear: the Member data we prepared back in the data generation phase was already flawed, and that is what produced the large amount of duplicate order data.

Now that we have identified the cause of the problem, we need to regenerate the data. At this point, we must make sure to use unique member_ids for generating order data and check for duplicates in the generated data at each step.

After some tweaks, the TPS is as follows:

This optimization effect is very good. By solving the problem of parameterization duplication, we have addressed the issue of generating a large number of orders under the same user, thus resolving the problem of the increasing response time that we observed earlier.

For this parameterization issue, the difficulty lies in the fact that parameterization is done during login, and the parameters used in the order script rely on the data generated in the login script. However, due to the duplication in the previous parameterization, subsequent parameters are also duplicated. For such problems, we need to patiently trace the source of the parameterized data.

At this point, we have solved the problem of increasing response time, but our optimization is not over. As you can see from the response time window in the graph, some interfaces have long response times despite the not-so-high overall TPS. Therefore, we need to conduct phase 2 analysis.

Phase 2 Analysis #

Following our performance analysis logic, the next steps are to split the response time and to examine the global monitoring data; both steps need to be done.

As you may have noticed, when analyzing, sometimes we need to split response time, and sometimes we just look at global monitoring data. This is because when we can see the problem from the global monitoring data, we can proceed with the analysis without splitting the response time. We just need to follow the seven-step analysis method and continue with the analysis.

If we cannot see obvious resource consumption from the global monitoring data, and at the same time, the response time keeps increasing, then we must split the response time to analyze it accurately.

Global Monitoring Analysis #

After viewing the global monitoring data, the resource utilization of the database is as follows:

As you can see, the CPU resource utilization of worker-1 is relatively high. Let’s go into this machine and take a look at the top command:

Obviously, several CPUs have reached 100% utilization, and the process table shows that the MySQL process consumes the most CPU. From here the analysis logic is clear (it was already described in Lesson 16).

Directed Monitoring Analysis #

This is a typical high-CPU problem in the database, and it is relatively simple to handle: look at MySQL’s global monitoring data, find the SQL, check its execution plan, and create the missing index. The problematic SQL is as follows:

Similarly, the “ALL” value in the “type” column tells us that this SQL is doing a full table scan. After creating an index on the corresponding table, the execution plan is as follows:

The “type” column now shows “ref”, which tells us that the new index is in place and being used.
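Since the screenshots are not reproduced here, the routine looks roughly like this (a sketch; the table sms_coupon_history and the column member_id are hypothetical stand-ins for the actual table and filter column in the slow SQL):

# Confirm the full table scan: "type" shows ALL in the execution plan
mysql -u root -p mall -e "EXPLAIN SELECT * FROM sms_coupon_history WHERE member_id = 277673\G"

# Add an index on the filter column, then re-check: "type" should change to ref
mysql -u root -p mall -e "ALTER TABLE sms_coupon_history ADD INDEX idx_member_id (member_id);"
mysql -u root -p mall -e "EXPLAIN SELECT * FROM sms_coupon_history WHERE member_id = 277673\G"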

Next, we run the capacity scenario again and get the following global monitoring graph:

As you can see, the CPU utilization of worker-1 has decreased.

Let’s take a look at the results of running the scenario:

Hmm, the effect is pretty good; this is what an index can do for you. After adding the index, TPS increases significantly. I have made this point repeatedly in previous lessons, so I won’t dwell on it here.

You may ask: why didn’t we add this index back in the baseline scenario? Didn’t the interface hit a bottleneck there because of the missing index?

To be honest, I had the same question. Think about it: this SQL is used when generating orders and is related to coupons. After I dropped the index and re-ran the generate-order interface on its own, TPS could still reach the peak of the earlier baseline scenario. So the problem with this SQL did not show up in the single-interface scenario; it only appears in the mixed capacity scenario.

I thought that after adding the index, TPS would be normal for a while. However, unexpectedly, the next day when I ran the scenario, I still saw the results that I didn’t want to see. We will analyze the specific problem in the next phase.

This project is full of challenges at every step. However, it is precisely because of this challenging project that we have more opportunities for analysis. Now let’s see what exactly went wrong.

Phase 3 Analysis #

Global Monitoring Analysis #

After executing the capacity scenario continuously, these are the results of global monitoring:

Note that this does not mean we don’t need to break down response times. I have already reviewed the breakdown data of response times and found that the majority of time is consumed by the Order service. Since we have already performed multiple breakdowns of response times in previous lessons, I will not provide specific data here.

From the first view of the global monitoring data, we can already see which nodes have high resource consumption. According to the data, the CPU usage of worker-3 has climbed above 90%, that of worker-4 above 70%, and the TPS keeps dropping as a result. Let’s analyze this further.

Directed Monitoring Analysis #

Let’s enter worker-3 and execute top to see which services are running on it:

top - 23:21:36 up 11 days,  5:37,  5 users,  load average: 40.53, 49.79, 53.30
Tasks: 335 total,   1 running, 332 sleeping,   2 stopped,   0 zombie
%Cpu(s): 82.5 us,  8.7 sy,  0.0 ni, 3.0 id,  0.0 wa,  0.0 hi,  3.2 si,  2.6 st
KiB Mem : 16265984 total,  2802952 free,  7089284 used,  6373748 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  8759980 avail Mem 


  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                 
26573 root      20   0 8930960 837008  15792 S  231.5  5.1 113:46.65 java -Dapp.id=svc-mall-order -javaagent:/opt/skywalking/agent/skywalking-agent.jar -Dskywalking.agent.s+
26973 root      20   0 8920512 810820  15776 S  173.7  5.0 112:24.54 java -Dapp.id=svc-mall-order -javaagent:/opt/skywalking/agent/skywalking-agent.jar -Dskywalking.agent.s+
24386 root      20   0 8864356 702676  15764 S  98.7  4.3 295:33.69 java -Dapp.id=svc-mall-portal -javaagent:/opt/skywalking/agent/skywalking-agent.jar -Dskywalking.agent.+
17778 root      20   0 8982272 803984  16888 S  97.4  4.9 375:15.37 java -Dapp.id=svc-mall-portal -javaagent:/opt/skywalking/agent/skywalking-agent.jar -Dskywalking.agent.+
 1087 root      20   0 2574160 132160  31928 S  25.6  0.8   1637:21 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubern+
25589 root      20   0 8839392 585348  15772 S  20.8  3.6 160:58.44 java -Dapp.id=svc-mall-auth -javaagent:/opt/skywalking/agent/skywalking-agent.jar -Dskywalking.agent.se+
 1095 root      20   0  998512  86168  13100 S   6.5  0.5 837:37.56 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock                                  
29226 root      20   0 8906120 881632  13700 S   5.8  5.4 760:36.91 java -Dapp.id=svc-mall-search -javaagent:/opt/skywalking/agent/skywalking-agent.jar -Dskywalking.agent.+
28206 root      20   0 7960552 341564  15700 S   4.9  2.1  66:28.23 java -Dapp.id=svc-mall-search -javaagent:/opt/skywalking/agent/skywalking-agent.jar -Dskywalking.agent.+
 9844 root      20   0 1632416  47092  16676 S   2.9  0.3 559:35.51 calico-node -felix                                                                                      
 9646 polkitd   20   0 4327012  97744   4752 S   2.6  0.6  25:26.93 /usr/local/lib/erlang/

As you can see, all kinds of application services are running on this one machine, which is far from a reasonable deployment architecture.

As mentioned earlier, the response-time breakdown showed that the Order service consumes the most time. Since Order is the application that currently needs the most resources, let’s first move services such as Auth and Portal off this worker, as sketched below.
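One way to do this in Kubernetes is to pin those deployments to another worker with a nodeSelector (a sketch; the deployment names follow the svc-mall-* pattern from the top output, and worker-2 as the target node is my assumption):

# See which Pods currently sit on worker-3
kubectl get pods -o wide --field-selector spec.nodeName=worker-3

# Pin Auth and Portal to another worker so that Order keeps worker-3's CPU
kubectl patch deployment svc-mall-auth -p '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/hostname":"worker-2"}}}}}'
kubectl patch deployment svc-mall-portal -p '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/hostname":"worker-2"}}}}}'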

After moving those services away, let’s see how the scenario execution data looks and then decide whether further directed monitoring analysis is needed.

To ensure that our bottleneck judgment is correct, let’s take a look at the global monitoring data:

Now let’s check the data of the scenario execution:

As you can see, the TPS is still quite normal, right? It seems that we don’t need to continue with directed analysis. However, are these results really normal? Can we end our capacity scenario? To find out what happens next, stay tuned for the next breakdown.

Summary #

The analysis process in this lesson was quite bumpy. Just getting the scenario to run took three attempts, which shows that even after the baseline scenarios pass, the capacity scenario is still not easy to run.

In the first phase of the analysis, we mainly tackled the problem of TPS dropping over time while the number of pressure threads stayed the same. Please note that when the pressure remains constant, a stable TPS is what we should expect.

In the second phase of the analysis, we identified the problem of an unreasonable index. This kind of problem is relatively common and easy to locate and optimize.

In the third phase of the analysis, we solved the problem of excessive concentration of resources. I performed a reallocation of resources, and it seems to be working fine now.

We need to make clear judgments about resource allocation. In Kubernetes, balanced resource allocation usually relies on the Kubernetes scheduler. However, a scheduling decision is made only at the moment a Pod is scheduled, and the resource consumption that shows up under sustained pressure may no longer match that decision. Therefore, we may need to adjust Pod placement while the pressure is running; a quick way to spot the imbalance is sketched below.
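A minimal sketch of checking for this imbalance during a run (kubectl top requires metrics-server, which is an assumption about this cluster):

# Compare the actual CPU/memory usage across workers while the pressure is on
kubectl top nodes

# Compare what the scheduler thinks is allocated on a hot node
kubectl describe node worker-3 | grep -A 8 "Allocated resources"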

Homework #

That’s all for today’s content. Finally, I have two questions for you to ponder:

  1. Why is it unreasonable for TPS to keep decreasing and response time to keep increasing under constant pressure?
  2. When resources are too concentrated, how can you locate the impact between Pods? Do you have any different methods from what was discussed in this lesson?

Remember to discuss and share your thoughts with me in the comments section. Every bit of thinking will take you further.

If you found this lesson helpful, feel free to share it with your friends and learn together. See you in the next lecture!