23 What Are the Key Factors in Determining the Failure of Capacity Scenarios

Hello, I am Gao Lou.

Starting with this lesson, we begin analyzing capacity scenarios.

If you ask a performance engineer today to design a capacity scenario, they may not know how to do it, and you may feel uncertain yourself. A capacity scenario has so many prerequisites that many people do not know where to start.

Although we described the general content of capacity scenarios in Lesson 5, it was not detailed enough and may not have garnered enough attention from you. However, we cannot underestimate the importance of capacity scenarios. Therefore, in this lesson, I will show you the key points in capacity scenarios to help you design and execute them more effectively.

Capacity Scenario Objectives #

First of all, it is important to remember that capacity scenarios must have objectives. Without objectives, a capacity scenario will not have a clear endpoint. Therefore, in my lecture on performance solutions, I specifically mentioned the objectives of a performance project.

Regarding capacity scenarios, the objective I wrote is to “achieve the optimal operational state of the system.” However, such an objective is not specific enough in a project. If you want to be more specific, you need to clearly state the maximum capacity of a system.

Please note that in many performance projects, the lack of specific capacity scenario objectives, or the presence of “rogue requirements that only provide a maximum TPS without a business model,” leaves the execution of capacity scenarios without direction. This is not acceptable.

So how can we define specific objectives for a capacity scenario? Let me give you an example. For instance, if the system capacity requirement is to reach 1000 TPS, this 1000 TPS is an overall value that will be further broken down into individual businesses using the corresponding business model. This “1000 TPS” is the overall indicator and objective of the capacity scenario. And the corresponding business model should look like this (see example table below):

However, having only proportions without specific indicators is not adequate because there is no standard for completion. Therefore, we need to provide an accurate definition for the capacity objective as well (see example table below):

In this way, we not only have proportions but also optimization and project completion objectives. This makes the capacity scenario objectives very specific.
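As a minimal sketch of the breakdown described above, the following splits an overall target of 1000 TPS across a business model. The business names and percentages are hypothetical and invented for illustration; they are not the figures from this course.

```python
# Hypothetical business model: names and percentages are illustrative only.
BUSINESS_MODEL_PCT = {
    "login": 10,
    "browse_product": 40,
    "add_to_cart": 20,
    "place_order": 20,
    "pay": 10,
}


def break_down_tps(total_tps, model_pct):
    """Split an overall TPS target into per-business targets by percentage."""
    if sum(model_pct.values()) != 100:
        raise ValueError("business percentages must add up to 100")
    return {name: total_tps * pct / 100 for name, pct in model_pct.items()}


if __name__ == "__main__":
    # 1000 TPS is the overall target used as the example in the text.
    for name, tps in break_down_tps(1000, BUSINESS_MODEL_PCT).items():
        print(f"{name:<15} {tps:>7.1f} TPS")
```

Each per-business value then becomes a concrete target that the scenario results can be checked against.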

Business Model #

Just now we talked about how capacity scenarios need to be executed under a specific business model, and we also provided an example of a business model in the previous paragraph.

Regarding the business model, let me make it clear to you: when we perform capacity scenarios, we need to include all concurrent businesses (you can combine interfaces to form businesses). In a business system, only when all business interfaces are included can it be considered a true capacity scenario.

Moreover, the proportions of the business model should conform to real business scenarios in production (please refer to the “Business Model and Performance Metrics” section in Lesson 5 for details). This means that in a capacity scenario, the business model must cover both the business peak and the peak resource utilization seen in the production environment.

I have seen some people run a performance test on each interface separately and then consider the performance project complete. This is not reasonable, because the interfaces in a system run in parallel rather than strictly one after another.

Because business interfaces run in parallel, they have both an execution order and a proportional relationship to one another. This brings us to the source of the business model.

As we mentioned in Lesson 6, to extract a business model that matches the real business scenario, the normal logic is to first collect statistics on the business in the production environment and provide the corresponding business proportions, then set the business proportions in the load test tool, and ensure that the proportions in the result match the statistics.

Here is the corresponding flow chart, which I hope you still remember:

Please remember that the reason we strive to make every business model match the production model is so that, when we run capacity scenarios, we can give a clear answer to the question “can the system support the production business peak?” So the source of the business model is crucial.
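The statistics step in that flow can be sketched roughly as follows. This is only an illustration: it assumes you have already extracted one business name per request from the production logs for the peak window, and the sample data is hypothetical.

```python
from collections import Counter


def business_proportions(business_names):
    """Return each business's share (in percent) of all recorded requests."""
    counts = Counter(business_names)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no requests to analyze")
    return {name: 100.0 * n / total for name, n in counts.items()}


# Hypothetical sample: one entry per request observed during the peak window.
sample = ["browse"] * 6 + ["order"] * 3 + ["pay"] * 1

if __name__ == "__main__":
    for name, pct in business_proportions(sample).items():
        print(f"{name}: {pct:.1f}%")
```

The resulting percentages are what you would then configure in the load test tool.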

Some people ask: what should I do if I can’t obtain production data for statistics? I think this is really not a technical issue. If you don’t have the necessary permissions, report it to your superiors. If the company’s systems do not support it, coordinate with the relevant people in the company to find a solution. If you complain without making any effort, you simply haven’t tried hard enough. And if you have tried hard and still can’t get results, it means that the relevant people in your company do not understand performance well enough.

In the performance industry, the idea that “the business model should be derived from the production environment” still hasn’t taken deep root in every company, which is a tragedy for the industry.

Some people say, what if my system is new and there is no production data? Well, there are still solutions.

First, every project exists because of business requirements. If there are similar business systems in the market, you can use their data as a reference. The business personnel on a project usually have experience with similar types of business, so they can provide it. Remember, they are only giving you the business requirements; the technical requirements still need to be refined by the relevant personnel in the performance project.

If there are no similar business systems in the market and no data to refer to, I suggest you try my approach: in the projects I have been involved in, there is usually a trial operation phase. Because we cannot be sure what the system will be like after it goes live, we first run a trial operation for a period of time in order to make the necessary adjustments. The data from the trial operation can be used for reference.

And if you say that your system has no similar system to refer to and no trial operation phase either, what then? Then all I can say is: apply whatever load you like. It makes no difference which load you apply or whether it is appropriate.

There are now some tools available that can replicate production traffic, and one of their goals is to solve the problem of consistency between business models and production. This is an effort being made in the industry. Regardless of the methods we use, whether it’s traffic replication or business model statistics, they are all reasonable ideas and approaches.

In addition to this, when the business model needs to be configured in the capacity scenario, another question that is often asked is how to use the load test tool to implement specific business proportions. Regarding this, you can find the answer in Lesson 5: if you are using JMeter, you can use the Throughput Controller to control the business proportions. Of course, if the load test tool you are using has a similar feature, it can also be used.

To confirm that the business model was actually realized in the capacity scenario, there is one more crucial step: after the scenario finishes, compare the proportions in the results with the proportions in the business model. If they are consistent, the capacity scenario is valid; if they are not, you need to run it again. Do not skip this step.
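A minimal sketch of that comparison is shown below. The function names, the sample numbers, and the 1-percentage-point tolerance are all assumptions made for illustration; choose a tolerance that fits your own project.

```python
def proportion_deviations(expected_pct, actual_counts):
    """Return actual minus expected share, in percentage points, per business."""
    total = sum(actual_counts.values())
    if total == 0:
        raise ValueError("the scenario produced no transactions")
    deviations = {}
    for name, expected in expected_pct.items():
        actual = 100.0 * actual_counts.get(name, 0) / total
        deviations[name] = actual - expected
    return deviations


def model_matches(expected_pct, actual_counts, tolerance_pct=1.0):
    """True when every business stays within the (assumed) tolerance."""
    deviations = proportion_deviations(expected_pct, actual_counts)
    return all(abs(d) <= tolerance_pct for d in deviations.values())


if __name__ == "__main__":
    # Hypothetical model and transaction counts taken from a scenario report.
    expected = {"browse": 60, "order": 30, "pay": 10}
    actual = {"browse": 6010, "order": 2990, "pay": 1000}
    print(proportion_deviations(expected, actual))
    print("valid scenario:", model_matches(expected, actual))
```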

Data Volume #

Regarding the data volume in capacity scenarios, I hope you remember: when using parameters in capacity scenarios, try to avoid creating data artificially. This is because we use all the interfaces in capacity scenarios, and these interfaces have logical relationships with each other. Therefore, I recommend that you pass parameters based on the business logic rather than creating artificial data. If it is really impossible to achieve, then consider creating data.

At the same time, we need to consider the data volume required in capacity scenarios. Please refer to the data preparation section in Lesson 5 and the specific descriptions in Lesson 7. Here, I would like to emphasize a few key points:

First, the parameterized data in capacity scenarios must be consistent with how real users behave in production and with the production data volume. I have emphasized this many times before, but some people still ask me: can we use a small amount of parameterized data to generate production-level pressure? Here, I firmly say again: no!

Second, the base data should ideally match production in volume. If it has to be reduced, the reduction must be worked out by calculation rather than guessed. How do you check the reduction? I suggest a benchmark scenario comparison: under the same level of pressure and data volume, compare the resource usage and the various counters in the test environment with those in the production environment to see the differences.

Regarding data, these are my two main pieces of advice. I hope you can remember them.
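The benchmark comparison of counters can be sketched as below. The counter names and values are hypothetical, and this sketch assumes both sets of counters were collected under the same pressure level.

```python
def counter_differences(production, test):
    """Relative difference (%) of each counter that exists in both environments.

    Positive values mean the test environment reads higher than production.
    Counters whose production value is 0 are skipped to avoid dividing by zero.
    """
    diffs = {}
    for name in sorted(production.keys() & test.keys()):
        prod_value = production[name]
        if prod_value == 0:
            continue
        diffs[name] = 100.0 * (test[name] - prod_value) / prod_value
    return diffs


if __name__ == "__main__":
    # Hypothetical counters captured in both environments.
    production = {"cpu_pct": 50.0, "io_wait_pct": 4.0, "tps": 200.0}
    test = {"cpu_pct": 55.0, "io_wait_pct": 4.0, "tps": 190.0}
    for name, diff in counter_differences(production, test).items():
        print(f"{name}: {diff:+.1f}%")
```

Large differences suggest that the reduced data volume or the environment does not yet represent production well enough.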

Monitoring Design #

We have already described in detail the strategies for global and targeted monitoring design in Lesson 9, so I won’t repeat them here. However, I want to emphasize a few points that I hope will catch your attention:

First, all counters in global monitoring should correspond to the project-level performance analysis decision tree. This point is easy to understand but not necessarily easy to implement, as each monitoring tool may not be comprehensive or may have different design principles.

If your company has a fixed system that keeps evolving, it is best to have your own monitoring platform design philosophy. As mentioned in Lesson 9, global monitoring comes from architectural analysis, and with this logic, you can understand why there is such a strong correlation between performance and architecture.

Second, targeted monitoring should only be done after identifying performance bottlenecks. This is because there are too many branches, and it’s basically impossible to accurately guess the direction of targeted monitoring from the start.

Next, don’t worry too much about choosing a monitoring tool. As long as it can accurately collect counter values, any tool will do.

Lastly, in capacity scenarios, all business components involved should have “segmented-layered” coverage in global monitoring.

In the specific process of performance analysis, you will find that our understanding of counters and our judgment of directions are closely related. For example, when we see high “us cpu”, we naturally think about looking at the user-level application stack to see what code is being executed; when we see high “sy cpu,” we immediately think about investigating who is calling the syscalls.

Therefore, the difficulty in monitoring lies in understanding counters, and understanding counters requires understanding the principles of technical components.

When you see the value of a counter but don’t know what to do next, it means you haven’t understood that counter, and you need to acquire the relevant knowledge. It is like someone who cannot cook asking whether the eggs go in after the water boils or while it is still cold; you can imagine how disastrous the result would be if they went into cold water, right?
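The link between a counter and the next analysis step can be written down explicitly, which is essentially what a decision tree does. The sketch below covers only the two CPU examples mentioned above; the 50% threshold is an arbitrary assumption for illustration, not a recommended value.

```python
# Next analysis step for each CPU counter, taken from the two examples above.
NEXT_STEP = {
    "us": "inspect the user-level application stack to see what code is running",
    "sy": "find out who is issuing the system calls",
}


def next_steps(cpu_pct, threshold=50.0):
    """Return the suggested next steps for counters at or above the threshold.

    cpu_pct maps counter names ("us", "sy") to percentages.
    The threshold is a hypothetical value chosen only for this sketch.
    """
    return [NEXT_STEP[name] for name in ("us", "sy")
            if cpu_pct.get(name, 0.0) >= threshold]


if __name__ == "__main__":
    for step in next_steps({"us": 85.0, "sy": 5.0}):
        print(step)
```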

Pressure Strategy #

Let’s take a look at the pressure strategy in the capacity scenario. In the same Lesson 5, I emphasized that there are two key words in the pressure strategy for performance scenarios: incremental and continuous. These two points must be achieved in the capacity scenario.

Some may ask: continuous is easy to understand, but what does incremental mean in practice? We covered this with the benchmark scenarios in Lesson 10, where I also worked through the calculations for the pressure threads, so I won’t repeat it here. If you have similar questions, I suggest you review that lesson in detail.

I want to remind you that the pressure threads in the capacity scenario may not increase in obvious steps, but they must increase gradually. Why? Think about it: in production, can the number of real users jump straight to its peak? Obviously, this is not realistic!

So, what would the request trend look like when we view it on the server if we achieve incremental and continuous pressure? Let me show you a chart:

From this chart, you can see that the pressure is continuously increasing, which is what it looks like in production.
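As a rough illustration of an incremental ramp-up, the sketch below builds a stepped thread schedule. The function name and the sample numbers are hypothetical; the actual step size and duration should come from the thread calculations for your own scenario.

```python
def ramp_up_schedule(max_threads, step_threads, step_seconds):
    """Return (start_second, active_threads) pairs for a stepped ramp-up."""
    if max_threads <= 0 or step_threads <= 0 or step_seconds <= 0:
        raise ValueError("all arguments must be positive")
    schedule = []
    threads = 0
    second = 0
    while threads < max_threads:
        # Add one step of threads, never exceeding the target.
        threads = min(threads + step_threads, max_threads)
        schedule.append((second, threads))
        second += step_seconds
    return schedule


if __name__ == "__main__":
    # Hypothetical numbers: reach 100 threads in steps of 30 every 60 seconds.
    for start, threads in ramp_up_schedule(100, 30, 60):
        print(f"t={start:>4}s  threads={threads}")
```

Smaller steps give a smoother curve; the point is only that the load never jumps straight to the peak.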

Of course, if you want to ramp up the pressure quickly to find bottlenecks, you can apply high pressure directly. However, such a scenario cannot be used to draw final conclusions. We only use it during the performance analysis process.

In the pressure strategy of the capacity scenario, there is a concept that is often discussed: the rendezvous point (sometimes rendered as “collection point”). Many people believe that it is very useful, but in my opinion, it does not match the real production scenario.

So, do we need rendezvous points?

Some would say: isn’t it reasonable to use rendezvous points to simulate multiple users operating at the same moment? Then I have a question: where does the rendezvous actually happen? On the server? Obviously not. Requests can only be gathered on the pressure tool’s side. And after those gathered requests have gone through CPU contention, network interrupts, network transfer speed, and protocol conversion, do they still arrive at the server together?

To take a step back: even if they did arrive at the server together, could the server process more requests than its own capacity allows? Obviously not.

Words alone are not enough, so let me show you another chart:

This chart shows the statistics from the server’s request logs after rendezvous points were added to the same scenario, refined to a 20-millisecond granularity. You can see that the request trend is actually intermittent, right? This shows exactly how rendezvous points behave on the server side.

Now that you know the answer, you can carefully consider, in light of your own project, whether you need rendezvous points.
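If you want to run the same check on your own server logs, the 20-millisecond bucketing can be sketched as follows. It assumes you have already parsed the request timestamps into integer milliseconds; the sample values are hypothetical.

```python
from collections import Counter


def requests_per_bucket(timestamps_ms, bucket_ms=20):
    """Count requests per time bucket, including empty buckets in between.

    timestamps_ms: integer request arrival times in milliseconds.
    Returns {bucket_start_ms: request_count} in ascending bucket order.
    """
    counts = Counter((ts // bucket_ms) * bucket_ms for ts in timestamps_ms)
    if not counts:
        return {}
    first, last = min(counts), max(counts)
    # Counter returns 0 for missing keys, so gaps show up as empty buckets.
    return {start: counts[start]
            for start in range(first, last + bucket_ms, bucket_ms)}


if __name__ == "__main__":
    # Hypothetical arrival times taken from a server access log.
    arrivals = [0, 5, 19, 45, 61, 62, 63, 130]
    for start, count in requests_per_bucket(arrivals).items():
        print(f"{start:>5} ms  {'#' * count}")
```

Keeping the empty buckets matters here, because the gaps are exactly what makes an intermittent request pattern visible.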

Startup Conditions #

In Lesson 5, we mentioned startup conditions. In many enterprises, these startup conditions are just items in documents that are ignored during actual project execution.

Startup conditions are actually related to project management. In the performance projects I have experienced, if the startup conditions are not taken into account, the performance team will bear the blame for consuming too much time.

Think about it: if the startup conditions cannot be met (for example, the version is not fixed, functionality is unfinished, the architecture differs from production, or the environments differ), you end up testing and changing at the same time, which is very time-consuming and has to be repeated again and again. This not only increases the cost of performance projects but also reduces the value of performance personnel, and it exposes chaotic project management.

My suggestion is to read the startup conditions carefully. During project execution, whoever is responsible for a problem should bear the blame for it. If you are willing to take the blame yourself, then feel free to ignore those items.

Coordination and Organization #

Finally, let’s look at coordination and organization in capacity scenarios. In some large projects, multiple project teams may be involved in a capacity scenario. Therefore, it is important to determine the contact person for each project in advance, to avoid situations where the scenario is running but no one is watching the system.

A good performance project manager should, as far as possible, keep performance execution personnel from having to coordinate resources and chase the resolution of performance issues themselves. In terms of both authority and practicality, it is hard for execution personnel to coordinate with other project teams, and they may get no support when they try.

Summary #

After reading this, do you feel that capacity scenarios are quite complex? Indeed, they are.

Capacity scenarios require attention not only to technology but also to management. I have described them at such length because they are extremely important for performance projects. The end-to-end load testing that is so widely discussed today is really just one specific kind of capacity scenario.

Therefore, in this lesson, I have reiterated the important aspects of capacity scenarios, hoping that you will attach importance to them. Additionally, I would like to remind you that even if all the benchmark scenarios perform well, it does not guarantee smooth capacity scenarios. So, be prepared for that mentally.

Homework #

That’s all for today’s content. Here are two questions for you to think about:

If you were asked to design a monitoring strategy for a capacity scenario, what would you do? Please describe your design logic.
How is the pressure strategy for capacity scenarios designed in your project? Please describe your own pressure strategy design logic.

Remember to discuss and exchange your thoughts with me in the comment section. Each thought process will help you make further progress.

If you found this lesson helpful, feel free to share it with your friends and learn and improve together. See you in the next session!