07 What Should the Performance Scenario Data Look Like #

Hello, I am Gao Lou (高楼).

In performance projects, performance data is a very important input resource. However, I often see people driving heavy load with a very small amount of data, which obviously does not match the real scenario. The results may look good, but they provide no valuable insight. So today we will discuss what performance scenario data should look like.

In RESAR performance engineering, the data used in scenarios needs to meet two aspects:

  • Firstly, the data needs to reflect the distribution of data in the real environment, so that we can simulate the corresponding IO operations;
  • Secondly, the data needs to reflect the inputs from real users, in order to truly simulate the user actions in the real environment.

These two aspects correspond to two types of data: base data and parameterized data. Let’s first look at base data.

Base Data #

In a typical online system architecture, the data used in the system is divided into two parts: static data (represented by red dots in the diagram) and dynamic data (represented by green dots in the diagram). These are the base data that we need to prepare in performance scenarios.

From this simple diagram, it is easy to see that without base data the system would be empty. A production system, however, is certainly not empty, so there must be enough data in it. If the data is not realistic, we cannot simulate what real data causes in production, such as memory usage, database IO capacity, and network throughput.

For static data, the most common problem is jumping to conclusions: the moment we assume it occupies a large amount of network bandwidth, we feel we must use a CDN; or we assume that not simulating static data will make the scenario unrealistic and leave our optimization results unsupported. In fact, where to store the data, and how to do so most reasonably and cost-effectively, needs to be weighed as a whole rather than blindly copying what others do.

I have seen official portals with barely any traffic whose technical plans nonetheless insisted on putting a handful of pictures on a CDN, just to show how advanced the architecture is.

I have also seen companies that consider the images on their websites very important and, out of limited technical understanding and a need for a sense of security, insist on keeping all of them on their own servers. The images are large to begin with, 3-4 MB each, so when users access them they naturally complain about slowness.

Neither of these extremes is desirable. When someone who does not understand the field dictates to the experts, the result is usually poor, because some non-experts believe that applying pressure is enough and do not really care about the details or the outcome. In my opinion, the most reasonable way to handle such issues is to analyze the business logic first and then decide how the technical architecture should be implemented.

We know that static data usually has two places where it can be stored: the web layer of the server and a CDN. For large systems with high traffic and network bandwidth requirements, the data must be placed in a CDN, and there is no other choice (of course, you can choose different CDN providers).

For some small business systems, where the number of users is small and the overall network traffic requirements are low, we can directly place the static data in the load balancer server (such as Nginx) or application server. After a user accesses it once, subsequent accesses can be made directly from the local cache, without putting much pressure on the system.

Now that we have finished discussing static data, let’s take a look at dynamic data. We need to analyze it carefully, because some dynamic data can be placed in a CDN.

Referring back to the previous diagram, when we don’t use any pre-warming, these dynamic data are stored in the database. When we use pre-warming, these data will be moved to the cache (of course, this also depends on the architectural design and code implementation), as shown in the following diagram:

image
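
To make the pre-warming idea a little more concrete, here is a minimal sketch of the pattern, assuming a reachable MySQL instance and a Redis instance accessed through Jedis (the same client used later in this lesson). The table name, key format, and connection details are placeholders, not this project's actual implementation; it simply copies hot product rows into the cache before the scenario starts.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import redis.clients.jedis.Jedis;

public class CacheWarmUp {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details -- point these at your own environment
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://127.0.0.1:3306/mall", "reader", "secret");
             Statement stmt = conn.createStatement();
             Jedis jedis = new Jedis("127.0.0.1", 6379)) {

            // Read the hot rows from the database ...
            ResultSet rs = stmt.executeQuery(
                "select id, name, price from pms_product limit 10000");

            // ... and push them into the cache so the first real request already hits Redis
            while (rs.next()) {
                jedis.set("product:" + rs.getLong("id"),
                          rs.getString("name") + "," + rs.getBigDecimal("price"));
            }
        }
    }
}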

Therefore, following this logic, we need to simulate the amount of data actually present in real-world scenarios; otherwise we will run into problems. When the simulated data volume differs significantly from the actual data volume, the effects on the database, cache, and so on will be different. Below, I’ve listed five points to analyze in more detail.

  • Difference in database pressure

Assuming that a production system has a user base of 1 million, when we conduct performance testing, it is difficult to generate production data directly, so we may use only 1000 or even fewer users to test the performance scenarios. So, what is the difference between a table with 1 million records and a table with 1000 records? Let’s see the actual operations.

Here there is a precondition: the same hardware environment, the same database, the same table structure, and the same indexes; the only difference between the two tables is the amount of data.

The two SQL statements are as follows:

select * from ob_tuning.temp1_1000 where id = '3959805';
select * from ob_tuning.temp2_100w where id = '3959805';

Because the amount of data in the two tables is different, the results are as follows:

As you can see, the query time for one table is 19 ms, while the query time for the other is 732 ms. Let’s also take a look at the operation details of each table.

Details of the first table (with a user count of 1,000):

Details of the second table (with a user count of 1 million):

Comparing the “executing” row in the two profiles, we can clearly see the difference: the CPU time required to execute this statement increases significantly with the data volume. So it is not hard to see that if you do not have enough base data in a performance scenario, the results will be tragic.
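
If you want to reproduce this comparison yourself, a minimal JDBC sketch along the following lines will do. The schema and table names come from the statements above, while the connection URL and credentials are placeholders for your own test database.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class QueryTimingCompare {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details -- point these at your own test database
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://127.0.0.1:3306/ob_tuning", "tester", "secret")) {
            timeQuery(conn, "select * from ob_tuning.temp1_1000 where id = ?");
            timeQuery(conn, "select * from ob_tuning.temp2_100w where id = ?");
        }
    }

    // Run one query and print how long it takes end to end
    private static void timeQuery(Connection conn, String sql) throws Exception {
        long start = System.currentTimeMillis();
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, "3959805");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // drain the result set so the full query cost is measured
                }
            }
        }
        System.out.println(sql + " -> " + (System.currentTimeMillis() - start) + " ms");
    }
}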

  • Difference in caching

There is a noticeable difference in caching based on the amount of data, as shown in the following figure:

In other words, the larger the data volume used in the scenario, the larger the required cache size.
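
As a rough, purely illustrative calculation (the 1 KB per-entry size below is an assumed average, not a measurement from this system):

\(Cache\ size \approx 1,000,000\ entries \times 1\ KB \approx 1\ GB\)

whereas 1,000 entries of the same size would need only about 1 MB, which is why an undersized data set tells you almost nothing about the real cache behavior.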

  • Difference in stress tool usage

The amount of data used in the stress tool not only affects the tool’s own memory requirements but also affects the results of the performance scenario. We will discuss this in detail in later lessons.

  • Difference in network

In fact, whether the data volume is large or small, and whether it sits in the cache or in the database, the network consumption between the client and the server is almost the same: as long as the data is not cached on the client side, it has to come from the server. So we consider that the size of the data volume makes little difference to the network pressure between client and server. If you are using a CDN, that calls for separate consideration.

  • Difference in application

If the data is not cached directly in the application, we likewise consider that there is no difference for the application itself. Whatever request comes in, the data still has to be fetched from the cache or the database, and the application’s own processing time (its self time) will not differ much; the same methods still have to execute. However, if your application holds data in its own in-process cache, there will be a difference: the larger the data volume, the more memory it requires.

Based on the above points, we can see that there are two important factors that directly impact the performance: the database and the cache.

What about indirect impacts? For example, if the database takes longer to execute, the synchronous calling application will inevitably require more application threads to process.

Let’s assume we have a 100 TPS (transactions per second) system, just focusing on the database time and ignoring other times. If the database execution takes 10 ms, the application only needs one thread to handle it. If the database takes 100 ms and we still want to achieve 100 TPS, the application needs to have 10 threads to process it simultaneously.
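
This is simply the relationship between concurrency, throughput, and response time (Little’s Law) applied to the application’s worker threads:

\(Threads = TPS \times Response\ time = 100 \times 0.1\ s = 10\)

With a 10 ms database time, the same formula gives 100 × 0.01 s = 1 thread.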

At the same time, the threads, queues, timeouts, and other settings along the entire call chain will change significantly under the influence of the data volume. Therefore, if we want to simulate the production environment, we must not cut corners when laying down the base data.

Parameterized Data #

With the analysis of the base data in place, parameterization becomes much clearer. Even so, determining the appropriate amount of parameterized data for a scenario is often the most challenging part of performance testing.

The amount of parameterized data needed depends on how long the scenario will run. During the scenario execution, we typically require two types of data: unique data and reusable data.

For unique data (such as user data), it is relatively easy to calculate the amount of parameterized data needed. For example, if a scenario runs for half an hour and has a TPS (Transactions Per Second) of 100, we would require a data volume of 180,000, as calculated below:

\(Data\ volume = 30\ min \times 60\ s \times 100\ TPS = 180,000\)

As for reusable data, we need to analyze how it is reused in the real business. For example, in an e-commerce system the products can be parameterized, since many people can buy the same product at the same time. Let’s assume that, on average, every 10 products are bought by 1,000 users. In that case, 180,000 users require 1,800 products:

\(Number\ of\ products = \frac{180,000\ users}{1,000\ users} \times 10\ products = 1,800\ products\)

The above calculations explain how to determine the amounts of unique and reusable data.

You may wonder what to do if the volume of parameterized data is too large for the load testing tool to handle comfortably. For example, with file-based parameterization in JMeter, a very large parameter file makes JMeter spend noticeably more time and memory on it. When a large volume of parameterized data is required, we can instead parameterize from a remote cache (such as Redis) or a database (such as MySQL).

  • Connecting to Redis for Parameterization

Method 1: Use BeanShell in JMeter to connect to Redis and retrieve data.

import redis.clients.jedis.Jedis;

// Connect to the Redis service (vars and log are objects provided by JMeter)
Jedis jedis = new Jedis("172.16.106.130", 30379);
log.info("Service is running: " + jedis.ping());
// Look up the token stored in Redis under the current username
// and expose it to the test plan as ${tokenredis}
String key = vars.get("username");
vars.put("tokenredis", jedis.get(key));
jedis.close();

Method 2: Use the Redis Data Set component.

Either of these methods lets you use Redis as the data source for parameterization.

  • Connecting to MySQL for Parameterization

Step 1: Create a JDBC Connection Configuration.

Configure the connection information, such as username and password:

Step 2: Create a JDBC Request.

Fetch the data using the JDBC Request:

Step 3: Reference the parameter using ${user_name}.

By completing these three steps, we have successfully parameterized the data using a database.
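
If you want to see what this boils down to outside of JMeter’s GUI components, here is a minimal plain-JDBC sketch. The ums_member table and username column are assumptions based on this project’s user table (the JMeter variable in step 3 is ${user_name}), and the connection details are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class JdbcParamLoader {
    public static void main(String[] args) throws Exception {
        List<String> userNames = new ArrayList<>();
        // Placeholder connection details -- the same information you would enter
        // in JMeter's JDBC Connection Configuration
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://127.0.0.1:3306/mall", "tester", "secret");
             Statement stmt = conn.createStatement();
             // Roughly the query a JDBC Request would run to feed the parameter
             ResultSet rs = stmt.executeQuery(
                 "select username from ums_member limit 100000")) {
            while (rs.next()) {
                userNames.add(rs.getString("username"));
            }
        }
        System.out.println("Loaded " + userNames.size() + " user names for parameterization");
    }
}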

With an understanding of the data required for RESAR Performance Engineering, let’s move on to discussing how to generate the data.

How to Generate Data? #

Since our e-commerce platform in this project is open source, the database is completely empty, and there is no existing data in the system. Therefore, although we have only implemented the main process of e-commerce, we still need a considerable amount of data. This data includes:

  • User data;
  • Address data;
  • Product data;
  • Order data.

Now let’s consider the specific data volume.

According to our performance plan in Lesson 5, the target login TPS is 150. Since the capacity scenario ramps up continuously, it takes about 20 minutes to reach the maximum (an empirical value that varies from scenario to scenario), and the scenario then continues for another 10 minutes, giving a total duration of roughly 30 minutes.

However, because the load ramps up continuously, we do not need 150 TPS from the start, and we do not yet know the maximum login TPS. Based on experience, login in the current hardware environment could easily reach 300 to 400 TPS even without caching.

If we calculate with a maximum of 400 TPS running for half an hour, the amount of data required is 400 TPS × 30 min × 60 s = 720,000. The number of users we create will be much larger than that: let’s start with 2 million users. Since the address data volume is bound to be larger than the user data volume, it will exceed 2 million as well.

First, let’s check the current database size to determine how much data we need to generate.

Database size

This amount of data is clearly too small to be sufficient. Now let’s explore how to generate such a large amount of data.

The data we generate mainly falls into two categories: user data and order data.

  • User Data

For user data, we need to understand the table structure because the generated data must comply with the business logic. Let’s first take a look at the table structures and data.

User table:

User table - User table data

Address table:

Address table - Address table data

Based on my experience, when generating data it is not a good idea to write stored procedures that insert directly into the database, unless you fully understand the relationships between the tables and are proficient at writing stored procedures. Otherwise you can end up with a messy database and have no choice but to patch the data directly in the tables, which is a very passive position to be in. Here I recommend generating data through API calls, as it is simpler and safer.

If you want to generate data using code, then you need to analyze the following.

Here, there is a corresponding relationship between the user table and the address table. You can see from the following code that the MemberID in the address table is the user ID.

@Override
public int add(UmsMemberReceiveAddress address) {
    // Bind the new address to the currently logged-in member
    UmsMember currentMember = memberService.getCurrentMember();
    address.setMemberId(currentMember.getId());
    return addressMapper.insert(address);
}

In fact, generating user data is equivalent to implementing the registration process. You can first analyze the code for user registration and directly use the registration code section. The specific calling code is as follows:

Calling code

At this point, you may wonder if you need to be concerned about the registration process when generating data. If we are calling the API to generate data, we do not need to worry about it. However, if we are writing code to generate data using multiple threads, then we need to understand the calling relationships between the APIs.

Let’s take a look at the middle part of the code to analyze the relationships:

API calling relationships

Since the passwords in the user table are encrypted, we can use the register user implementation class, as shown below:

@Override
public void register(String username, String password, String telephone, String authCode) {
    ...............................
    // Get default membership level and set it
    UmsMemberLevelExample levelExample = new UmsMemberLevelExample();
    levelExample.createCriteria().andDefaultStatusEqualTo(1);
    List<UmsMemberLevel> memberLevelList = memberLevelMapper.selectByExample(levelExample);
    if (!CollectionUtils.isEmpty(memberLevelList)) {
        umsMember.setMemberLevelId(memberLevelList.get(0).getId());
    }
    // Insert the user
    memberMapper.insert(umsMember);
    umsMember.setPassword(null);
}

After understanding the above content, we can directly write some code to generate user data. Please refer to Generate User Code.java for the specific code.
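
As a minimal sketch of the API-call approach, a single registration call could look like the following. The parameters follow the register(username, password, telephone, authCode) signature shown above; the http://127.0.0.1:8085/sso/register URL and the fixed authCode value are assumptions about the demo deployment, so adjust them to your own environment.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterOneUser {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Form fields match the register() signature shown above;
        // the URL and authCode value are assumptions, not project facts.
        String form = "username=perfuser_000001"
                + "&password=123456"
                + "&telephone=13900000001"
                + "&authCode=123456";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://127.0.0.1:8085/sso/register"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(form))
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}

Looping over this call with unique usernames and telephone numbers is then the core of the user-data generator.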

Once we have user data, we still need detailed information about the addresses of users placing orders. Only then can we complete the order process. So, next, we will analyze how to generate address data that can be used for placing orders.

  • User Addresses

First, we need to find the controller layer based on the user address resource path and check the calling relationships of the user address code, as shown below:

Controller layer

Then, find the key code for user address generation:

@Override
public int add(UmsMemberReceiveAddress address) {
    UmsMember currentMember = memberService.getCurrentMember();
    address.setMemberId(currentMember.getId());
    // Insert the address
    return addressMapper.insert(address);
}

From this code, we can observe the following information:

  • Calling the address API requires user login state to parse the user ID;
  • The user ID appears as MemberID in the address code;
  • The user ID is auto-incremented.

For a specific example, please refer to Generate User Address Code.java.
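
The driving side of that generation code is just a standard Java thread pool. Below is a minimal sketch; generateOneAddress() is a hypothetical stand-in for the per-record logic (logging in as one user and calling the add-address API), and the record and thread counts are only examples.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AddressDataGenerator {
    public static void main(String[] args) throws InterruptedException {
        int totalRecords = 2_000_000;   // target volume, matching the user count above
        int threads = 50;               // tune to what your machine and server can handle

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < totalRecords; i++) {
            final int index = i;
            // Each task generates one address record via the API
            pool.submit(() -> generateOneAddress(index));
        }

        pool.shutdown();
        pool.awaitTermination(2, TimeUnit.HOURS);
    }

    private static void generateOneAddress(int index) {
        // Placeholder: build the request for user #index and POST it,
        // in the same style as the registration call shown earlier.
    }
}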

By writing the above code and then starting a Java thread pool with multiple threads, we can quickly generate the basic data. Here is the time record for generating user address data (each computer may have different configurations, so the resulting data may vary):

Time record

By using the methods mentioned above, we finally generated the following data:

Data volume

The order data will be supplemented when we perform benchmark scenarios. Once we have all these data volumes, we will have enough foundational data for the capacity scenario.

Summary #

In this lesson, we learned what the data in performance scenarios should look like. There are many ways to generate data, so we don’t need to stick to one particular approach; the priority is to generate a sufficient amount of data quickly. In RESAR performance engineering, two types of data are required for performance scenarios: base data and parameterized data. The base data needs to meet the following three conditions:

  • Its volume must be equivalent to the production scale.
  • It must simulate the distribution of the production data.
  • It must be realistically usable.

Parameterized data needs to meet the following two conditions:

  • The amount of parameterized data needs to be sufficient.
  • It should reflect the input data of real users.

With this knowledge, we will avoid confusion when generating data.

Homework #

That’s all for today’s content. Finally, I’ll leave you with two questions to ponder:

  1. Why is it necessary to generate data that meets the production level?
  2. Why should we use input data that reflects real users when parameterizing?

Remember to discuss and exchange your thoughts with me in the comment section. Every thought will take you further.

If you found this lesson rewarding, feel free to share it with your friends and learn and progress together. See you in the next lecture!