30 How to Determine the Production System Configuration #

Hello, I'm Gao Lou.

Configuring the production system has always been the operations team's job; within the scope of performance "testing", it has seemingly had nothing to do with our "testing" work.

However, in the first lesson, I emphasized that in my RESAR performance engineering concept, performance engineering should take into account the operations phase. This seemingly small change actually expands the scope of work for the performance team, which is not easy to execute, especially for companies where the operations and performance “testing” teams are seriously disconnected.

Let's set aside whether the performance "testing" team can provide the desired configuration for production; many performance "testing" teams may not even know what the current production configuration is. Faced with this situation, I believe that if we keep confining ourselves to the "testing" team, we will certainly not be able to contribute anything here.

Let’s think about the goals of performance projects, and it will be easy to understand this point. Usually, when setting goals, we say: ensure the normal operation of the online system.

This goal seems to be achieved in a performance project, but how is it actually done in the current performance industry? If you are a performance “testing” engineer, have you ever seen what the production environment looks like? Have you ever gotten any data? Have you analyzed any performance parameters? Even worse, you may not have even seen the actual machines. In such a situation, a performance project can only find obvious software performance bottlenecks in the system.

The capacity of a system as a whole is not determined by the software alone, but by a whole chain of hardware and software components: the hardware environment, network, storage, load balancing, firewalls, and so on. If the performance team does not understand these, it cannot be expected to provide any production configuration.

When we move this problem to the production environment, experienced members of the operations team may be able to provide reasonable performance parameter configurations. However, are these configurations aligned with the current business goals? Most likely, the operations team will go live first and then fine-tune and calibrate the parameters afterward, which means the system is unstable from the moment it goes live.

Therefore, in my opinion, it is most reasonable for the performance team to provide the performance parameter configuration for the production environment.

Projecting Production Capacity #

Before determining the performance parameter configuration, we need to project the approximate production capacity. It doesn't need to be extremely accurate; a rough estimate such as "around 1000 TPS" will suffice. In essence, this is estimating the capacity level of the system.

As shown in this chart, we need to roughly estimate how many resources each service will consume at different capacity levels. Then we strive to balance resource usage and reduce costs.

People often ask questions like: how do we evaluate the capacity of a system? For example, if we obtain a machine configuration of 4C8G, how do we evaluate how many TPS this machine can achieve in a system we have tested?

In fact, we can start with the simplest method: benchmark testing.

A student once asked me how many TPS an 8C16G machine can achieve. I replied that I didn't know, because I wasn't clear about the business. If it's a business I haven't tested, then I have no experience data either. So I suggested that she do a benchmark test: even with just a simple CRUD service and no business logic, she could still find out how many TPS the machine could achieve.

Based on my experience, on my 2C4G machine, if I only run the simplest query interface without any business logic, achieving 1000 TPS (where one transaction is one API request) is not a problem.

The student was quite diligent and went back to test a simple service, then told me that the 8C16G machine could achieve three to four thousand TPS. This result is consistent with my experience: her environment was four times the size of mine, so the TPS she achieved was also about four times mine.

However, there is an obvious problem with this experiment: the example has no business logic at all. For a business system with real business logic, the maximum capacity depends on the complexity of that logic. Therefore, when I join a new project, I usually first look at the historical performance data before judging whether optimization is needed. For systems I am familiar with, once I know the hardware and software architecture, I can roughly form an expected target in my mind.

For unfamiliar systems, it is not difficult to obtain data on the maximum capacity, just perform a capacity scenario test.

Of course, for a production system we can make the corresponding judgment. Generally speaking, with 1000C of 2.5GHz CPU resources, we can determine the maximum capacity based on historical experience data; likewise, with 2000C of 2.5GHz CPU resources, we can determine how much TPS can be achieved. All of this can be worked out through capacity scenario analysis.

The reason I say "generally speaking" is that the maximum capacity is related to many details, such as how reasonable the architectural design is, how many production resources are allocated, and so on. Therefore, there is no standard configuration that fits every system.

Some may ask, after calculating the TPS through capacity scenario analysis, can we use queuing theory models to calculate the amount of server resources needed? This logic is indeed feasible, but it requires modeling first and sampling a large amount of data for calculations. This topic is extensive, and I won’t discuss it here, but you should know that there is such a direction.

In this lesson, I hope to make you understand the logic of obtaining reasonable configurations through practice.

Do you remember this performance analysis decision tree?

The components shown in the figure are the ones used in this course's example system. For each of them, we should provide a reasonable performance configuration.

So, what aspects does performance configuration mainly refer to? We need to look at it from the perspectives of hardware and software.

Hardware Configuration #

Hardware configuration is actually a large part of the content. Usually, we are limited by hardware resources in the testing environment. Therefore, we will calculate the approximate capacity in the following way:

  1. Obtain the hardware configuration of the production environment, as well as the resource utilization, TPS, and RT data under peak scenarios.
  2. Under the hardware configuration of the testing environment, calculate the resource utilization, TPS, and RT data under peak scenarios through capacity scenarios.
  3. Compare the data obtained from the first and second steps.

By following these three steps, we can determine the maximum TPS that the system can support in the production environment. If we create a simple example table, it would look like this:

This means that if the production environment uses 30% of 1000C while the capacity reaches 10000 TPS and the average response time is 0.1s, then in the testing environment, with 300C, we would need to reach 100% utilization to achieve the same 10000 TPS and 0.1s average response time.

Of course, there are many reasons to argue that this logic is not reasonable. For example, the most obvious problem is that when the CPU reaches 100%, the business system is obviously unstable, and the increase in TPS is not linear. Other hardware resource situations have not been taken into account here.

Indeed, this is obviously a very rough calculation process, and I am only giving you an example here. When you actually do the calculation, you can list all the relevant important resources. And this modeling process requires a large amount of sample data for analysis.

We use a table to roughly model and compare the TPS generated by different environments:

If our testing environment has 300C resources and the utilization is 30%, and I still want to guarantee an average response time of 0.1 seconds, then the TPS should be 3000. This is the simplest proportional method.
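To make this proportional method explicit, the ratio behind both examples above can be written as one formula. This is only the rough linear model described here (C is the number of cores, U is the CPU utilization); real modeling would bring in more counters:

\[
\mathrm{TPS}_{\mathrm{test}} = \mathrm{TPS}_{\mathrm{prod}} \times \frac{C_{\mathrm{test}} \times U_{\mathrm{test}}}{C_{\mathrm{prod}} \times U_{\mathrm{prod}}}
\]
\[
10000 \times \frac{300 \times 100\%}{1000 \times 30\%} = 10000 \quad\text{and}\quad 10000 \times \frac{300 \times 30\%}{1000 \times 30\%} = 3000
\]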

However, many factors differ across hardware, so we need to build a model within the project. The factors to consider in modeling can only be obtained from the specific project, but they roughly include the following points:

  1. Hardware and software configuration;
  2. TPS and RT data for the production and testing environments;
  3. Resource utilization data for the production and testing environments (using the global counters in the performance decision tree).

Because each business system consumes resources differently (some are compute-intensive, others are IO-intensive), when comparing counters we naturally focus on the counters that are consumed the fastest.

After obtaining the above data, we can create the proportional model in the table above to calculate the maximum capacity in the testing environment.

But this data is still not complete because we also need to pay attention to the software configuration.

Software Configuration #

For software configuration, we also need to perform appropriate ratio calculations. Let’s expand the previous table:

If the testing environment matches production in hardware configuration but not in software configuration, as in the table below, how do we calculate the TPS and resource utilization of the testing environment?

Obviously, the data represented by the two question marks in the table would be different in this case. By performing calculations, you can see that in order to achieve 1000 TPS in the testing environment, the resource utilization can only reach 1/10 (which is 30C).
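Using the same rough linear model, the arithmetic behind the 30C figure is simply this (again, a simplification based on one counter, not a full model):

\[
C_{\mathrm{used}} = C_{\mathrm{prod}} \times U_{\mathrm{prod}} \times \frac{\mathrm{TPS}_{\mathrm{test}}}{\mathrm{TPS}_{\mathrm{prod}}} = 1000 \times 30\% \times \frac{1000}{10000} = 30\,\mathrm{C}
\]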

Of course, the actual modeling process is not that simple, and it cannot be completed with just one or two counters. In the actual modeling process, which counters should be included in the calculation? This involves all performance counters in the performance analysis decision tree. These counters will be related to the corresponding performance configuration. Therefore, we need to match the performance analysis decision tree and create a performance configuration tree.

Performance Configuration Tree #

In accordance with the performance analysis decision tree mentioned earlier, we will now draw a performance configuration tree.

Performance Analysis Decision Tree:

Performance Configuration Tree:

By comparing the two, I believe you have noticed that I added a “Primary Parameter Type” in the performance configuration tree. When expanded, we can see the following list:

In this list, the parameters included in “Hardware” and “Operating System” may appear to be the same, but the content we actually need to compare is different. For example, when it comes to the CPU, at the hardware level, we need to compare the model, frequency, number of cores/NUMA, and other information. However, at the software level, what we need to compare is the CPU utilization rate. Similar distinctions can be made for other performance parameters and counters.

Regarding application software, I have listed the most common parameters for comparison, which means that for each software technology component, we need to consider the configuration that needs to be extracted from these perspectives.

I want to clarify that the performance configuration tree described here captures general characteristics, so it cannot cover the configuration of every component exhaustively. For specific technology components, you need to adapt it flexibly. Taking MySQL as an example, for memory we typically look at "innodb_buffer_pool_size", while for Java microservices, memory configuration is typically expressed through JVM parameters.

Therefore, for each technology component in the performance configuration tree, we need to be more specific. Taking the most common Java microservice application as an example, the range we need to consider is shown in the following diagram:

Due to the large number of parameters, it is not possible to express them all in the diagram. I have directly replaced them with ellipses. For other technology components, we also need to list out the important configurations in a similar manner.
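To make this more concrete, here is a small sketch of what such a parameter list might look like for a Spring Boot microservice. The Tomcat keys are the ones used later in this lesson; the HikariCP keys and all the example values are commonly used options that I am adding purely for illustration, not the course's actual configuration:

server:
  tomcat:
    threads:
      max: 200                  # worker thread pool size (thread count)
      min-spare: 20
    accept-count: 1000          # waiting queue length
    max-connections: 500        # connection limit
    connection-timeout: 5000ms  # connection timeout
spring:
  datasource:
    hikari:
      maximum-pool-size: 20     # database connection pool size
      connection-timeout: 30000 # pool connection timeout, in milliseconds
# JVM memory and GC settings (heap size, collector choice) are passed as JVM options such as -Xms/-Xmx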

Here, I provide you with a table of common performance parameters for various systems. Additionally, I have included the complete performance configuration tree for your reference. You can click here to download the file, with a password of 4f6u.

Not all parameters in this file are necessarily related to performance. You only need to filter them based on the types I mentioned earlier, such as thread count, timeouts, queues, connections, caches, etc. Furthermore, I have marked the important parameters in red based on my work experience, but this is only for your reference. In your own project, you can list your own parameter list following the logic of the performance configuration tree.

With that said, we will move on to the next step: obtaining the specific configuration values of these parameters in the production environment.

How to Obtain Configuration Values #

The method of obtaining configuration values can be divided into two steps:

  1. Run the scenario;
  2. View the corresponding counters.

Now, let’s take the Order service as an example to see how to determine the configuration values for the relevant parameters.

Stress Scenario Data #

First, we execute the capacity scenario in the performance project to determine approximately how much TPS can be achieved.

In this scenario, you can see that at 30 stress threads, the TPS can reach about 800. However, as the stress increases, the TPS can reach 1000, but the response time also increases significantly.

Next, let’s analyze what kind of configuration this state requires.

Since there are too many configurations and confirming them is a very detailed job, it is unlikely that we will cover them all. However, I will tell you the logic for determining the configuration. In this way, you can determine the relevant performance parameters of each technical component in your own project based on this logic.

Application Service Thread Configuration #

Let’s first see what the current configuration of Order is:

server:
  port: 8086                # service port of the Order service
  tomcat:
    accept-count: 10000     # maximum queue length for incoming requests when all worker threads are busy
    threads:
      max: 200              # maximum number of worker threads
      min-spare: 20         # minimum number of idle worker threads kept alive
    max-connections: 500    # maximum number of connections the server accepts and processes at any time

Before the stress, the state of application threads is as follows:

After the stress, the state of application threads is as follows:

From these figures, we can see that the number of threads grows adaptively. Looking at the TPS curve at the point where the response time starts to increase, there are approximately 41 working threads. As the stress continues to increase, the TPS also rises, but the response time gradually gets longer. From the users' perspective, the system feels like it is gradually slowing down.

If we want to ensure that the response time of the system does not slow down due to an increase in the number of users in production, we can consider implementing rate limiting measures in this service.

As for the service threads we want to confirm in this lesson, we only need around 41 threads to support around 800 TPS. Therefore, the configured 200 threads are not needed.

At this point, we have determined a very important performance parameter: the number of threads. So, what should we configure it to?

At this point, we need to consider how much capacity we want the Order service to support. If it is acceptable for a node to provide 800 TPS and the corresponding response time is also stable, then we can set the number of threads slightly higher than 41, such as 45 or 50 threads.

You may wonder: why not simply keep the thread count at 200, which is far more than 41? This is not recommended. Consider peak traffic: when traffic is high, the response time of this service will keep growing until requests time out, which clearly gives users a worse experience.

A better approach is to give users a friendly prompt when this service cannot provide a stable response time. This not only ensures the quality of user access but also ensures the stability of the service.

Now, I will change the maximum number of threads to 50 in Nacos and publish the configuration:
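For reference, a sketch of the Tomcat section after this change, with the same keys as the original configuration and only the maximum thread count modified:

server:
  port: 8086
  tomcat:
    accept-count: 10000
    threads:
      max: 50        # reduced from 200 based on the roughly 41 working threads observed
      min-spare: 20
    max-connections: 500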

Then we restart the Order service. When restarting, we need to be careful: because we are using Kubernetes' automatic scheduling mechanism, we need to specify the node. If we don't, the Pod may be scheduled onto another worker after the restart, and we want to keep the two tests in the same environment as much as possible.

Let’s execute the scenario again:

The TPS has reached 1000, now let’s look at the number of threads:

The number of threads is exactly 50, which means that 50 threads can support 1000 TPS.
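This result also lines up with Little's Law, which relates concurrency, throughput, and response time. The following is my own back-of-the-envelope check rather than a separately measured number:

\[
N = \mathrm{TPS} \times RT \quad\Rightarrow\quad RT \approx \frac{N}{\mathrm{TPS}} = \frac{50}{1000} = 0.05\,\mathrm{s}
\]

An average service time of roughly 50ms is also consistent with the earlier observation that about 41 threads supported about 800 TPS (41 / 800 ≈ 0.051s).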

Timeout and Queue Configuration of Application Service #

For application services like Java, we also need to consider several other important performance configuration parameters, such as timeouts and queues, which are also listed in the configuration tree mentioned earlier. Now, while keeping 50 threads, let's change the queue length. The accept-count we saw above is 10000; to make the experiment meaningful, we reduce it directly to 1000 and then look at the effect in the stress scenario:

You see, we can still achieve 1000 TPS. Let's reduce the accept-count further. This time, let's be aggressive and cut it straight down to 10, hoping to trigger errors caused by an insufficient queue length. Let's see the result:

Huh, why isn’t there an error yet? Oh, it’s my carelessness. I didn’t set the timeout.

So let’s add a parameter called connection-timeout. In Spring Boot’s default Tomcat, the connection-timeout is 60s. Now, I’ll set it to 100ms directly because our Order service’s response time exceeds 100ms:
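At this point the relevant Tomcat section looks roughly like this (a sketch using the same keys as above; connection-timeout is the standard Spring Boot property for the embedded Tomcat and accepts a duration such as 100ms):

server:
  tomcat:
    threads:
      max: 50
      min-spare: 20
    accept-count: 10          # deliberately tiny queue for this experiment
    max-connections: 500
    connection-timeout: 100ms # below the Order service's response time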

Let’s execute the scenario again and see the result:

See, there's an error now. This shows that a queue length of 10 with a timeout of 100ms is too small to ensure that every request returns normally. Now, let's set the queue length to 100 and see:

You see, there are even more errors, which is consistent with our expectations: the queue is longer while the timeout is still short, so naturally more requests time out while waiting in the queue. In the curve above, we can also see that the number of errors has increased a lot.

So how should we configure the timeout? We need to increase it to a value greater than the maximum response time; only then can we avoid these errors.

In the results above, we can see that the response time is basically below 200ms, so let’s set the timeout to 200ms and see the result:

You see, there are fewer errors. This indicates that the timeout is an important parameter in performance optimization, and it is related to the queue length.

Let’s put the results of the previous scenarios in one graph and take a look:

With such a graph, we can clearly see the effect of different settings of thread pool (number of threads), timeout, and queue length.

Therefore, in this application, the key parameters we can set are:
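As a rough sketch of what that ends up looking like for the Order service: the thread count and timeout come from the scenarios above, while the queue length here is an assumed moderate value, since the exact figure from the original table is not reproduced in the text:

server:
  port: 8086
  tomcat:
    threads:
      max: 50                 # supports about 1000 TPS with a stable response time
      min-spare: 20
    accept-count: 100         # assumed moderate queue length
    max-connections: 500
    connection-timeout: 200ms # above the observed maximum response time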

With such a configuration, combined with measures such as flow control, fallback, and circuit breaking, we need to ensure that the requests to this service are within 1000 TPS.

If you want to make this system support more requests at the expense of response time, you can increase the above parameters. The specific value depends on whether you want the system to support more requests or to provide a better user experience.

Summary #

In this lesson, I provided a way to determine the configuration of a production system. The premise for doing this is that we have a clear capacity expectation of the environment to be tested. Once we have the capacity expectation and have optimized the system, we can determine the specific configuration values of each important performance parameter through these two steps:

  1. Initiate stress testing;
  2. Judge the specific configuration values of each important performance parameter based on monitoring and scenario execution data. I emphasize that this step requires great attention to detail, and many experiments need to be conducted.

Since there are many performance-related parameters, this requires us to determine each performance configuration in conjunction with the performance configuration list in the performance configuration tree. You might think that this is a very time-consuming and labor-intensive task. In fact, in a project, this step only needs to be done comprehensively once. In subsequent version changes, we only need to make updates based on the results of performance analysis. And in most projects, this kind of update does not involve major changes in parameters.

Homework #

Finally, please think about the following:

  1. Why is it necessary to determine performance parameter configuration in performance projects?
  2. How do we determine the performance parameters for databases and other technical components?

Remember to discuss and communicate your thoughts with me in the comment section. Every thought you have will help you make further progress.

If you have gained something from reading this article, feel free to share it with your friends for collective learning and progress. See you in the next lesson!