
32 Pressure Testing - How to Design a Full-Path Pressure Testing Platform #

Hello, I am Tang Yang.

Over the past two lessons, we set up monitoring for both the server and the client. With the monitoring reports and a set of alert rules in place, you can track and resolve problems in the vertical e-commerce system in real time. However, you should not become complacent: monitoring can only detect problems that already exist in the system; it cannot reveal performance issues that may appear in the future.

Once your system’s traffic surges, for example during a peak event like “Double Eleven” (Singles’ Day), you may find yourself helpless in the face of performance problems. To prepare for this, you need to know which components or services become bottlenecks when traffic grows several-fold. This is when you need to conduct a full-link stress test.

So, what is a stress test? How do you perform a full-link stress test? These are the main topics of this lesson.

What is Stress Testing #

The term “stress testing” (also known as load testing) is something you have probably heard many times in industry discussions, and you may have even done stress testing during the development process of a project, so stress testing is not unfamiliar to you.

However, let me ask you: how do you usually conduct stress testing? Do you do what many students do? First, you set up a test environment that is functionally identical to the production environment and import or generate a batch of test data. Then, on another server, you start multiple threads that call the interface under test at the same time, usually with identical parameters (for example, when stress testing an API that retrieves product information, every request uses the same product ID). Finally, you count the access logs or check the monitoring system of the test environment, record the resulting queries per second (QPS), and call it a day.

Doing stress testing in this way is actually incorrect, with several main mistakes:

First, when performing stress testing, it is best to use actual production data and the production environment. This is because you cannot be sure if the test environment you set up will differ from the production environment and whether this difference will affect the results of the stress testing.
Secondly, during stress testing, simulated requests should not be used; instead, actual traffic from the production system should be used. You can copy the production traffic to the stress testing environment. This is because the access patterns of simulated traffic and production traffic are quite different, which can significantly affect the results of stress testing.

For example, when retrieving product information, the production traffic will retrieve data for different products, and some of these products’ data may be cached while others may not. If the stress test uses the same product ID, only the first request will miss the cache and subsequent requests will always hit the cache. In this case, the stress test data is not representative.
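The cache effect described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the cache starts empty and never evicts, and the "production-like" traffic is just uniformly random product IDs, which real traffic is not. The function name `cache_hit_ratio` and all numbers are made up for illustration.

```python
import random

def cache_hit_ratio(product_ids):
    """Replay product lookups against an initially empty cache and return the hit ratio."""
    cache = set()
    hits = 0
    for pid in product_ids:
        if pid in cache:
            hits += 1
        else:
            cache.add(pid)  # a miss loads the product into the cache
    return hits / len(product_ids)

# Simulated stress test: every request asks for the same product.
same_id = [42] * 10_000

# Stand-in for production traffic: requests spread over many products.
rng = random.Random(0)
varied_ids = [rng.randrange(100_000) for _ in range(10_000)]

print(cache_hit_ratio(same_id))     # 0.9999: only the first request misses
print(cache_hit_ratio(varied_ids))  # far lower: most products are requested for the first time
```

A test that hits the cache on almost every request mostly measures the cache, not the database and services behind it.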

Thirdly, do not generate traffic from a single server. That server can easily become the performance bottleneck itself, which lowers the QPS measured by the stress test and distorts the results. Moreover, to simulate user requests as realistically as possible, it is preferable to generate traffic from machines closer to the users, such as CDN nodes. If this is not possible, use machines in different data centers, so that the stress test results are as authentic as possible.

Many students make these mistakes mainly because they do not fully understand the concept of stress testing and think that calling the service interface concurrently from multiple threads counts as stress testing.

So what exactly is stress testing? Stress testing refers to testing that is conducted under high concurrency and heavy traffic. Through observing the performance of the system under peak loads, testers can identify performance vulnerabilities in the system.

Like monitoring, stress testing is a common way to discover problems in a system and is an important means of ensuring system availability and stability. During stress testing, it is not sufficient to test only a core module; the access layer, all backend services, databases, caches, message queues, middleware, as well as third-party service systems and their resources should all be included in the stress test. This is because, once user traffic increases, every component along the call chain will be hit by unpredictable heavy traffic. Therefore, they all need stress testing to find potential performance bottlenecks. This kind of stress testing that covers the entire call chain is also called “end-to-end stress testing”.

In internet projects, due to the rapid pace of feature iterations, the complexity of the system is also increasing. The newly added features and code may become new performance bottlenecks. Perhaps half a year ago, a single machine could handle 1000 requests per second during stress testing, but now it can only handle 800 requests per second. Therefore, stress testing should be conducted periodically as a routine means of ensuring system stability.

However, conducting an end-to-end stress test usually requires coordination among multiple teams, including DBAs, operations, service providers, middleware architects, and so on. The cost in human resources and communication is high. At the same time, if there is no good monitoring mechanism during stress testing, the test could have adverse effects on the live system. To solve these problems, we need to build an automated end-to-end stress testing platform to reduce costs and risks.

How to build a full-link load testing platform #

There are two key points to consider when building a full-link load testing platform.

The first point is traffic isolation. Since load testing is performed in a production environment, it is necessary to differentiate between load testing traffic and regular traffic. This allows for separate handling of load testing traffic.

The second point is risk control. It is important to minimize the impact of load testing on normal user access. Therefore, a typical full-link load testing platform should include the following modules:

Traffic construction and generation module;
Load testing data isolation module;
System health check and load testing traffic intervention module.

The overall architecture of the load testing platform can be drawn as follows:

[Figure: architecture diagram of the overall load testing platform]

To give you a clearer understanding of each module and help you design a full-link load testing platform suitable for your business, I will provide more detailed explanations for each module. Let’s start by looking at how load testing traffic is generated.

Generation of load testing data #

Generally, the entry traffic of our system comes from HTTP requests sent by clients. Therefore, during peak hours, we can copy this entry traffic and, after some traffic cleansing (such as filtering out invalid requests), store the data in NoSQL storage components such as HBase or MongoDB, or in cloud storage services like Amazon S3. We refer to this store as the traffic data factory.

In this way, when we need to perform load testing, we can retrieve the data from this factory, split it into multiple parts, and distribute the parts to multiple load testing nodes. Here, I would like to emphasize a few points that you need to pay special attention to.

Firstly, there are multiple ways to copy traffic. The simplest way is to directly copy the access logs of the load balancer server, and store the data as text in the traffic data factory. However, when initiating load testing, you would need to write parsing scripts for the access logs, which would increase the cost of load testing. Therefore, it is not recommended to use this method.

Another way is to use open source tools to copy traffic. Here, I recommend a lightweight traffic replication tool called GoReplay. It can capture the traffic on a specific port of the local machine, record it to files, and send it to the traffic data factory. During load testing, the tool can also replay the recorded traffic at an accelerated rate, which makes it possible to load test the production environment.

Secondly, as mentioned above, when distributing load testing traffic, it is important to ensure that the traffic distribution nodes are closer to users and at least not in the same data center as the service deployment nodes. This helps to maintain the authenticity of load testing data.

Additionally, we need to color the load testing traffic, that is, mark it with a load testing tag. In actual projects, I add a tag to the HTTP request header, such as “is stress test”. After the traffic is copied, the tag is added to the requests in batches before they are written into the traffic data factory.
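As a minimal sketch of the coloring and distribution steps, the following code tags a batch of recorded requests and then splits them round-robin across load testing nodes. The request shape (a dict with `method`, `path`, and `headers`), the header name `X-Stress-Test`, and the function names are all assumptions made for illustration; a real platform would operate on whatever format its traffic data factory stores.

```python
STRESS_HEADER = "X-Stress-Test"  # hypothetical header name, chosen for this sketch

def tag_requests(requests):
    """Return copies of the recorded requests with the load testing tag added."""
    tagged = []
    for request in requests:
        headers = dict(request["headers"])  # copy so the recording stays untouched
        headers[STRESS_HEADER] = "1"
        tagged.append({**request, "headers": headers})
    return tagged

def split_for_nodes(requests, node_count):
    """Distribute requests round-robin across the load testing nodes."""
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    shards = [[] for _ in range(node_count)]
    for index, request in enumerate(requests):
        shards[index % node_count].append(request)
    return shards

recorded = [
    {"method": "GET", "path": f"/product/{i}", "headers": {"User-Agent": "app"}}
    for i in range(10)
]
shards = split_for_nodes(tag_requests(recorded), 3)
print([len(shard) for shard in shards])  # [4, 3, 3]
```

Tagging before the data is written to the factory means every node replays traffic that is already marked, so no node can forget to add the tag.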

Data isolation #

While copying the load testing traffic, we also need to consider modifying the system to achieve isolation between the load testing traffic and the production traffic. This way, we can avoid the impact of load testing on the online system as much as possible. Generally speaking, there are two things we need to do.

Firstly, for read requests (also known as downstream traffic), we mock or specially handle the services or components that cannot be load tested. For example, in business development, we usually record user behavior based on requests: if a user requests the page of a certain product, we record that the product has been viewed once. This behavioral data is written into a separate big data log and transmitted to the data analysis department, which uses it to generate business reports for product or management decision-making.

During load testing, these behavioral data will definitely increase. For example, if the original number of page views for a product in a day is 100 million, it will become 1 billion after load testing. This will have an impact on the business reports and subsequent product direction decision-making. Therefore, we need to handle these user behaviors generated during load testing differently by excluding them from being recorded in the big data log.
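One way to picture this special handling is a guard in front of the behavior logger. The sketch below is an assumption-laden illustration: the header name `X-Stress-Test`, the in-memory list standing in for the big data log, and the function names are all hypothetical, and the headers are assumed to be already normalized into a plain dict.

```python
STRESS_HEADER = "X-Stress-Test"  # hypothetical header name, chosen for this sketch

def is_stress_request(headers):
    """Return True when the request carries the load testing tag."""
    return headers.get(STRESS_HEADER) == "1"

def record_product_view(headers, product_id, behavior_log):
    """Append a view event to the log unless the request is load testing traffic."""
    if is_stress_request(headers):
        return False  # skipped: must not pollute the business reports
    behavior_log.append({"event": "view", "product_id": product_id})
    return True

behavior_log = []  # stand-in for the big data log
record_product_view({}, 1001, behavior_log)                    # real user
record_product_view({STRESS_HEADER: "1"}, 1001, behavior_log)  # load testing traffic
print(len(behavior_log))  # 1
```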

Another example is that our system depends on some recommendation services to recommend products that users may be interested in. However, one characteristic of displaying these recommendations is that once a product has been displayed, it will not be recommended again. If your load testing traffic passes through these recommendation services, a large number of products will be requested by the load testing traffic, and online users will no longer see these products, thereby affecting the effectiveness of the recommendations.

Therefore, we need to mock these recommendation services so that requests without load testing tags pass through the recommendation services, while requests with load testing tags pass through the mock services. When building mock services, you need to pay attention to one thing: it is best to deploy these mock services in the same data center where the real services are located. This way, you can simulate the real service deployment structure as much as possible and improve the authenticity of the load testing results.
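The routing rule can be sketched as follows. Everything here is hypothetical: `RealRecommender` is a toy stand-in that remembers what each user has been shown, `MockRecommender` returns a fixed slice without touching any state, and `X-Stress-Test` is an assumed header name.

```python
STRESS_HEADER = "X-Stress-Test"  # hypothetical header name, chosen for this sketch

class RealRecommender:
    """Toy stand-in for the real service: a product is never shown twice to a user."""
    def __init__(self):
        self.shown = {}  # user_id -> set of products already displayed

    def recommend(self, user_id, candidates, limit=2):
        seen = self.shown.setdefault(user_id, set())
        fresh = [p for p in candidates if p not in seen][:limit]
        seen.update(fresh)
        return fresh

class MockRecommender:
    """Mock service: returns a fixed answer and records nothing."""
    def recommend(self, user_id, candidates, limit=2):
        return list(candidates[:limit])

def pick_recommender(headers, real, mock):
    """Send tagged requests to the mock and everything else to the real service."""
    return mock if headers.get(STRESS_HEADER) == "1" else real

real, mock = RealRecommender(), MockRecommender()
candidates = ["p1", "p2", "p3", "p4"]

pick_recommender({STRESS_HEADER: "1"}, real, mock).recommend("u1", candidates)
print(real.shown)  # {}: the load testing request left the real state untouched
print(pick_recommender({}, real, mock).recommend("u1", candidates))  # ['p1', 'p2']
```

Because replayed traffic carries the identities of real users, sending it to the real service would mark products as already shown for those users; the mock avoids that side effect.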

Secondly, for write requests (also known as upstream traffic), we write the data generated by the load testing traffic into a shadow database, which is a separate storage system completely isolated from the production data storage. For different types of storage, we use different methods to set up the shadow database:

If the data is stored in MySQL, we can create a set of databases and tables with the same structure as the production environment, in the same MySQL instance but in a different schema, and import the production data into it.

If the data is stored in Redis, we add a unified prefix to the data generated by the load testing traffic and store it in the same storage.

Some data may be stored in Elasticsearch, and for this part of the data, we can place it in a separate index.

With the special handling of downstream traffic and the addition of a shadow database for upstream traffic, we can achieve isolation of the load testing traffic.
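The three shadow storage rules above come down to choosing a different name when the request is tagged. A minimal sketch, assuming a `shadow_` / `shadow:` naming convention that is purely illustrative (the source does not prescribe one) and hypothetical helper names:

```python
def mysql_schema(schema, is_stress):
    """Same MySQL instance, different schema for load testing writes."""
    return f"shadow_{schema}" if is_stress else schema

def redis_key(key, is_stress):
    """Same Redis storage, unified prefix for load testing keys."""
    return f"shadow:{key}" if is_stress else key

def es_index(index, is_stress):
    """Separate Elasticsearch index for load testing documents."""
    return f"shadow_{index}" if is_stress else index

print(mysql_schema("shop", True))       # shadow_shop
print(redis_key("product:42", True))    # shadow:product:42
print(es_index("products", True))       # shadow_products
print(redis_key("product:42", False))   # product:42
```

Keeping the decision in one small layer means business code only has to pass along whether the request is tagged, rather than knowing the naming rules itself.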

How to implement stress testing #

After copying the online traffic and completing the modifications to the online system, we can start implementing stress testing. Before that, it is generally advisable to set a goal for the stress test, such as achieving a QPS (queries per second) of 200,000 for the overall system.

However, during the stress test, the request volume is not raised to 200,000 QPS immediately. Instead, the traffic is increased gradually in fixed steps (for example, 10,000 QPS at a time). After each increase, the system is left to run stably for a period of time so that its performance can be observed. If a bottleneck is found in a dependent service or component, the load testing traffic is reduced first, for example by falling back to the QPS of the previous step, to keep the services stable. The affected service or component is then scaled up, after which the traffic can be increased again.
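The ramp-up procedure can be sketched as a small loop. This is an illustration only: `ramp_up` and the `is_healthy` callback are hypothetical names, and the period of stable running at each level is reduced to a comment rather than an actual wait.

```python
def ramp_up(target_qps, step_qps, is_healthy):
    """Raise traffic step by step toward the target.

    On the first unhealthy step, fall back to the previous level.
    Returns (final_qps, reached_target).
    """
    current = 0
    while current < target_qps:
        candidate = min(current + step_qps, target_qps)
        # A real platform would hold `candidate` for a while here and
        # collect metrics before deciding whether the system is healthy.
        if not is_healthy(candidate):
            return current, False  # fall back to the last stable level
        current = candidate
    return current, True

# Assumed behavior for the demo: the system copes with up to 120,000 QPS.
print(ramp_up(200_000, 10_000, lambda qps: qps <= 120_000))  # (120000, False)
print(ramp_up(200_000, 10_000, lambda qps: True))            # (200000, True)
```

After the bottleneck found at the failing step has been scaled up, the same loop can be run again to continue toward the target.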

To reduce the manual effort during stress testing, it is advisable to develop a traffic monitoring component with performance thresholds set in advance. For example, the threshold for container CPU usage can be set to 60% to 70%, the upper limit of average response time to 1 second, the percentage of slow requests to 1%, and so on.

Once the system reaches these thresholds, the traffic monitoring component can detect it in a timely manner and notify the load testing traffic distribution component to reduce the load testing traffic. It can also send alarms to the development and operations teams for them to quickly identify and resolve performance bottlenecks before continuing the load testing.
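A minimal sketch of such a threshold check follows. The default limits mirror the example values mentioned above (70% CPU as the upper end of the range, 1 second average response time, 1% slow requests); the metric names, the `Thresholds` class, and the `adjust_qps` function are assumptions for illustration, and a real component would also send the alarm.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Thresholds:
    max_cpu_percent: float = 70.0
    max_avg_response_s: float = 1.0
    max_slow_ratio: float = 0.01

def breached(metrics, limits):
    """Return the names of the metrics that have reached their threshold."""
    checks = {
        "cpu_percent": limits.max_cpu_percent,
        "avg_response_s": limits.max_avg_response_s,
        "slow_ratio": limits.max_slow_ratio,
    }
    return [name for name, limit in checks.items() if metrics[name] >= limit]

def adjust_qps(current_qps, step_qps, metrics, limits=Thresholds()):
    """Reduce traffic by one step when any threshold is reached."""
    problems = breached(metrics, limits)
    if problems:
        return max(current_qps - step_qps, 0), problems
    return current_qps, problems

healthy = {"cpu_percent": 45.0, "avg_response_s": 0.2, "slow_ratio": 0.001}
overloaded = {"cpu_percent": 82.0, "avg_response_s": 0.4, "slow_ratio": 0.03}

print(adjust_qps(50_000, 10_000, healthy))     # (50000, [])
print(adjust_qps(50_000, 10_000, overloaded))  # (40000, ['cpu_percent', 'slow_ratio'])
```

Returning the list of breached metrics alongside the new traffic level gives the alarm something concrete to report to the development and operations teams.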

There have been many explorations in the industry regarding end-to-end load testing platforms. Some major companies such as Alibaba, JD.com, Meituan, and Weibo have developed their own end-to-end load testing platforms that suit their respective business needs. In my opinion, these load testing platforms are all based on similar principles, including traffic copying, traffic coloring and isolation, load injection, and monitoring and circuit breaking. These principles are consistent with the core ideas introduced in this course. Therefore, when considering developing your own end-to-end load testing platform that suits your project, you can also follow this mature approach.

Course Summary #

In this lesson, I introduced the common misunderstandings in stress testing and the process of building an automated end-to-end stress testing platform. The key points you need to understand are as follows:

Stress testing is an important means of discovering potential performance issues in a system, so it should be conducted in the production environment with real data.
The traffic for stress testing needs to be tagged so that stress testing data can be isolated from production data by mocking third-party dependencies and using shadow databases.
During stress testing, it is important to monitor system performance indicators in real time and raise alerts on them, and to promptly scale up bottleneck resources or services to avoid impacting the production environment.

This end-to-end stress testing system offers three benefits. Firstly, it helps us discover potential performance bottlenecks in the system, enabling us to prepare contingency plans in advance. Secondly, it can be used for capacity assessment and provides data to support it. Lastly, we can use it for contingency plan rehearsals: because stress tests are usually scheduled during low-traffic periods, we can degrade some services during the test to verify that the contingency plans work, with minimal impact on live users. Therefore, as your system’s traffic grows rapidly, it is important to consider building such an end-to-end stress testing platform to ensure the stability of your system.