01 Why Many Performance Testing Personnel Cannot Be Held Accountable for Performance Results

Hi, I am Gao Lou (高楼).

Today is the first lecture of our class, and I will introduce you to a comprehensive concept of performance - RESAR Performance Engineering. It is different from the logic of performance testing. What exactly is different? I will explain it to you in detail below. Additionally, during this process, I will also provide you with a comprehensive and systematic understanding of the work that needs to be done in performance engineering. I believe that this class will not only change your understanding of performance but also provide you with guidance throughout your performance project.

Being responsible for performance requires more than just “testing” #

In my column 《性能测试实战30讲》 (“Performance Testing in Practice: 30 Lectures”), I gave the following definition of performance testing:

Performance testing involves establishing performance metrics for the system, developing performance test models, devising performance test plans, formulating monitoring strategies, executing performance scenarios under specified conditions, analyzing and identifying performance bottlenecks for optimization, and finally evaluating whether the system’s performance metrics meet predetermined values.

Even now, I still believe this is the most reasonable description of “performance testing”. In fact, this concept has sparked some debates, mainly regarding whether the performance testing team should be responsible for bottleneck identification and optimization.

Setting aside the concept itself, let’s consider the following question: If the performance testing team does not identify and optimize bottlenecks, can they still provide answers like “there will be no performance issues in the production system after the system goes live”?

If not, then what is the use of the performance testing team? Are they only capable of finding basic technical issues? It’s like a patient going to the hospital for treatment, undergoing surgery and taking medication, then asking the doctor after all the trouble, “When will I recover?” If the doctor responds, “I don’t know!” Just imagine how the patient would feel, as if they encountered an incompetent doctor.

The same applies to performance testing. Our performance projects have two macro goals:

Identifying and optimizing performance bottlenecks in the system.
Meeting the business capacity requirements, ensuring the smooth operation of the online system.

I suggest you take a closer look at the second goal and think again about the question we discussed earlier: Should the performance testing team be responsible for bottleneck identification and optimization? Now, do you have an answer? Of course, it is necessary! However, in the current performance market, I see many performance teams unable to achieve even the first goal, let alone the second goal.

With the answer to this question in mind, let’s revisit the definition of “performance testing” provided earlier. For “performance testing” as such, that definition may well be sufficient. But for “performance” as a whole, is carrying out the tasks it lists enough? Not really.

From a holistic perspective of a performance project, after a system goes live and experiences normal business scenarios, there is one more thing we need to do: Retrieve the performance data from the live environment and compare it with the data collected during the performance testing process to see if what we did before aligns with the actual business scenarios. This comparison is referred to as performance benchmarking, which includes our performance model, performance metrics, and so on.

If there is no discrepancy after the comparison, it means the performance testing project was done very well. If there are discrepancies, then we need to correct them to make the next performance testing align better with the real system. Therefore, you can see that from a complete performance activity perspective, the definition of “performance testing” that we reviewed earlier lacks one aspect: performance benchmarking.

However, performance benchmarking cannot be considered part of the “testing” work itself (please note that I am saying: performance benchmarking is not part of the “testing” work, but it is still part of the performance team’s responsibilities).

This is why, however we stretch the concept of “testing,” whether by “shift left” or “shift right” (I have never understood the fuss over these labels; isn’t it all just doing the work?), or by “gray box” or “white box,” the word “testing” still confines the work to a specific time frame within a project. Whether the project is agile, lean, or waterfall, people generally still assume that business requirements come first, followed by product design, architecture design, development, testing, and operations.

Some may say that we can go live without testing. Yes, you can, as long as you can bear the risk. However, if your experience says that a tested system runs into problems one-tenth as often as an untested one, then testing is unlikely to be abandoned. This is perfectly reasonable logic.

To do performance work well and truly accomplish its two macro goals, we cannot limit ourselves to just “testing.” We cannot treat it as merely one step within a project; we need to approach “performance” from an engineering perspective.

Three Aspects of Responsibility for Performance Results #

At this point, we need to consider another question: If we limit testing to a specific timeframe of a project, does the testing team need to take responsibility for the entire production environment? I believe you already have the answer in your mind, and that answer is no, because the testing team is confined to a specific aspect.

However, I believe that many testing professionals still end up being blamed for performance issues.

If the worst consequence of a problem is just getting scolded by your boss and not losing your job, then it’s not a big deal. But if you are doing third-party testing, have you taken responsibility for the production environment? Are you willing to take responsibility? In my career, I have heard countless times about systems going live and experiencing performance issues, resulting in millions or even billions of dollars in economic losses. To be honest, as members of a performance team, we cannot afford to take on this responsibility.

As performance testers, if we want to take responsibility for performance results, we need support in at least three areas. Note that this only works if “testing” is not limited to a specific timeframe of a project.

Technical Details

First of all, our technical details need to be consistent with the production environment, such as software and hardware configurations, network architecture, basic data, test scenarios, and monitoring setup, among others. I will explain all of these in this column. Speaking of which, let me give you a reminder: in a performance team, your basic skills must be sufficient to support the project. This is a prerequisite.

Regarding the need for technical details to be consistent with the production environment, some may mention the concept of “end-to-end” in the production environment. What I want to say is, please don’t blindly apply concepts without being rational. In many systems, it is not possible to make such changes in the production environment for end-to-end testing, and the reason is simple: if your end-to-end testing causes a major incident in the production environment, it is not worth sacrificing an entire team just for that.

Scope of Work

To take responsibility for the overall system performance, the scope of work for performance testers needs to be expanded, and it should be expanded forward to include performance requirements.

Some may say that performance requirements are set by performance testers, right? That is blame we should not have to shoulder, because performance requirements are set jointly by business, architecture, development, testing, operations, and even non-technical leadership. If performance requirements were set by testing professionals alone, in my experience, there is a 99.99% chance that the performance project would turn out like this: it would focus only on finding basic technical bottlenecks, and some performance teams might not even manage that.

In addition to expanding forward to include performance requirements, the scope of performance testing also needs to expand backward to include the operations process. I am not saying that performance team members should participate in operations, but we should take the data from the operations process, compare it with our own results, and iterate on our performance implementation process.

Work Authority

Of course, as the scope of work expands, authority and responsibility should be aligned.

I have encountered many companies where their performance teams are in a weak and neglected position: externally, they cannot surpass architecture, development, and operations; internally, they lack technical confidence; upward, they follow the orders of their leaders; downward, oh, there is no one underneath, so they don’t need to be accountable.

Think about it, in such a situation, if you don’t even have the permissions to operate on a system or a database, how can you optimize its performance? For a system on which you don’t even have the authority to optimize, if it encounters performance problems, it definitely shouldn’t be your responsibility.

So what are the work authorities that we need? There are two: technical authority and command authority.

Technical authority is easy to understand and involves permissions like root access to host machines and DBA access to databases. Command authority, on the other hand, means that when we need someone to do something, we must be able to give the order. For example, if you ask the operations team to check production data and they respond with a dismissive look, you won’t be able to accomplish your task. The chain from the data we need to the results it produces must be unbroken; if any link in it is missing, we won’t be able to continue.

Of course, when we talk about having command authority, it doesn’t mean randomly giving orders to others for unrelated tasks, like saying “Come here, developers, give me a shoulder massage” or “Come here, operations team, buy me a coffee” … that’s just asking for trouble.

After discussing these aspects, how should performance testing be done? This leads us to the concept of “performance engineering”.

What is Performance Engineering? #

From “testing” to “engineering”, it may seem like a simple change in description, but in reality, it is a completely different logic for doing things.

Let me first provide a definition here - RESAR Performance Engineering.

What we usually refer to as performance engineering is the process of applying various technologies in IT to specific performance projects. On the other hand, the RESAR Performance Engineering that I mentioned is a more detailed description of the specific actions in the performance project, so that it can be implemented as concrete practices.

“RESAR Performance Engineering” is a name I defined myself, so you don’t need to search for it online, as you won’t find any results yet. I will now describe the process of RESAR Performance Engineering for you. Please note that we will not discuss what kind of development model (such as Agile, DevOps, etc.) to use because these are just different ways of organizing the process. We will temporarily set them aside and focus on what performance engineering actually involves.

Business Requirements

From the perspective of the entire project lifecycle, after having the business requirements, we need to start analyzing the business critical points where performance issues may arise, such as business paths, hot data, flash sales, real-time peak business, daily batch processing, etc. We then create a business model.

For a new system, we have to come up with a business model even if that means making one up on the spot; for an existing system, we can obtain it by measuring the volume of production transactions.
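
For an existing system, deriving the business model can be as simple as counting transactions by type over a peak window and converting the counts into proportions. The following is a minimal sketch of that idea, not the tooling used in this course; the transaction names and counts are hypothetical and stand in for whatever you extract from your own production logs.

```python
from collections import Counter


def business_model(transactions):
    """Return each transaction type's share of total volume, as a percentage."""
    counts = Counter(transactions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: round(100.0 * n / total, 2) for name, n in counts.items()}


# Hypothetical sample: transaction names taken from one peak hour of production logs.
sample = ["login"] * 500 + ["browse"] * 3000 + ["add_to_cart"] * 1000 + ["pay"] * 500

model = business_model(sample)
for name, share in sorted(model.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} {share:6.2f}%")
```

The resulting percentages are what a stress scenario would later need to reproduce.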

Project Initiation

After having the business requirements, the technical project initiation begins. At this stage, people with a mindset for performance architecture need to be involved in the project initiation phase to provide professional opinions in terms of technology selection and architectural design, in order to avoid potential performance issues in the future.

Specifically, this person with performance architecture thinking needs to do things related to performance and architecture, such as high availability, scalability, load balancing with SLB, TCP layer optimization, DNS optimization, CDN optimization, etc. To be more detailed, it involves configuring thread pools, connection pools, timeout settings, queue configuration, compression configuration, and other details for each component.

Once we have this content, we can start doing capacity evaluation, capacity modeling, and capacity simulation, among others.
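
One common starting point for capacity evaluation, which this lecture does not spell out, is Little's Law: the number of requests in flight equals throughput multiplied by average response time. The sketch below uses it to get a first estimate for a thread pool or connection pool size. The target numbers and the 20% headroom are assumptions chosen for illustration, and a real system would still need to be validated under load.

```python
import math


def required_concurrency(target_tps, avg_response_seconds, headroom=0.2):
    """Estimate in-flight requests via Little's Law (L = lambda * W), plus headroom."""
    if target_tps < 0 or avg_response_seconds < 0 or headroom < 0:
        raise ValueError("inputs must be non-negative")
    in_flight = target_tps * avg_response_seconds
    # Round first so floating-point noise does not push ceil() up by one.
    return math.ceil(round(in_flight * (1 + headroom), 6))


# Hypothetical target: 1000 TPS at 50 ms average response time, 20% headroom.
print(required_concurrency(1000, 0.05))
```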

Development

Next, we move on to the development phase. In this phase, after implementing a new feature, the performance team’s task is to list the execution time of each method and the memory consumption of objects when they are not under any pressure, so that corresponding calculations can be made during capacity scenarios.
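
As one illustration of collecting per-method execution time and memory consumption with no load applied, here is a sketch in Python using the standard library's `cProfile` and `tracemalloc`. The function `build_order` is a hypothetical stand-in for a newly implemented feature; on other platforms you would use the equivalent profiler for that runtime.

```python
import cProfile
import io
import pstats
import tracemalloc


def build_order(n):
    """Hypothetical business method standing in for a newly implemented feature."""
    return [{"id": i, "amount": i * 1.5} for i in range(n)]


def profile_once(func, *args):
    """Run func once with no load; return (result, timing report, peak bytes allocated)."""
    tracemalloc.start()
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()
    _, peak = tracemalloc.get_traced_memory()  # returns (current, peak) in bytes
    tracemalloc.stop()

    buf = io.StringIO()
    pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
    return result, buf.getvalue(), peak


orders, report, peak_bytes = profile_once(build_order, 10_000)
print(report)
print(f"peak memory while running: {peak_bytes} bytes")
```

Numbers gathered this way are single-call baselines; they feed the later capacity calculations but do not replace measurements under load.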

This is a tedious job, but we can use some tools for overall analysis instead of looking at each method and object individually. Usually, this step has a more general name in the academic field, and you may have heard of it - white box testing.

In practice, most people in the industry who perform white box testing care only about whether the functions work properly, and very few pay attention to performance. Moreover, this work is often done by development engineers. I am not going to debate developer self-testing here; we should not dismiss the conscientiousness of development engineers as a whole.

If we look at a typical project lifecycle, with capitalists squeezing every schedule, developers are already exhausted once the business functions are built. Do they still have time for these tasks? This is where the value of the performance team comes in: taking the code and performing performance analysis on it.

So it is not acceptable for members of a performance team to have no understanding of development. From the perspective of performance engineering, testing personnel need a certain level of development skill.

Testing

After having complete business functionality, we move on to the testing phase, where performance testing engineers can finally make their “formal” appearance.

In this phase, we need to verify all the performance-related work done previously, according to the sequence of baseline scenarios (single interface, single system capacity scenarios), capacity scenarios (peak, daily batch, flash sale, normal daily usage scenarios), stability scenarios, and exception scenarios.

Regarding whether exception scenarios should be included in performance, there has always been debate, but let me clarify that, in my concept, as long as it is a scenario that requires stress, it can be included as part of performance testing.

Operations and Maintenance

After the system is launched and goes into operation, we still need to compare the business data and performance monitoring data generated during the operation and maintenance process with the performance scenario results obtained earlier. If there are any problems identified, we need to make adjustments to the performance process and continue from the correction point.
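
The comparison described above can be sketched as a simple relative-deviation check. This is a minimal illustration under stated assumptions: the metric names, the numbers, and the 10% tolerance are all hypothetical, and the same function could be pointed at business proportions, TPS, or response times.

```python
def compare_to_production(test_metrics, prod_metrics, tolerance=0.10):
    """Flag metrics whose test value deviates from production by more than tolerance."""
    deviations = {}
    for name, prod_value in prod_metrics.items():
        if name not in test_metrics or prod_value == 0:
            continue  # nothing to compare against
        relative = (test_metrics[name] - prod_value) / prod_value
        if abs(relative) > tolerance:
            deviations[name] = round(relative, 4)
    return deviations


# Hypothetical numbers: share of each business in the scenario vs. in production (percent).
test_mix = {"login": 10.0, "browse": 50.0, "pay": 20.0}
prod_mix = {"login": 10.0, "browse": 60.0, "pay": 10.0}

print(compare_to_production(test_mix, prod_mix))
```

Any metric that shows up in the result marks a point where the performance model or scenario needs correcting before the next round.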

In the column “Performance Testing in Practice” (《性能测试实战30讲》), I summarized the concept of performance testing with the following diagram:

Based on what I just explained, I made some changes to this diagram: adding “Business and Architecture Analysis”, “Environment Preparation”, and “Production O&M” as three additional parts.

Now this diagram fully describes the process of RESAR Performance Engineering.

Having understood what performance engineering involves, let’s take a comprehensive look at “RESAR Performance Engineering”. The key points for implementing RESAR Performance Engineering are:

Analyzing business logic and technical architecture, creating performance models, formulating performance plans, preparing the application environment, and designing and implementing performance deployment monitoring.
Implementing stress scenarios that match the real business logic.
Building a performance analysis decision tree and obtaining performance counters for various components through monitoring.
Analyzing the counter data to find the root causes of performance bottlenecks and optimizing them.
Correcting scenarios based on performance data from the production environment.

Regarding performance engineering, you may have many questions now. Next, I will focus on discussing a few key points.

Communication cost is necessary #

First, let’s take a look at the process diagram of our performance project.

In the diagram, the steps for business and architecture analysis, performance requirements, performance modeling, and production operation may look relatively small. In fact, their workload is not small: they can cost you significant communication and operational effort and take up a long stretch of project time. However, these steps can be carried out in parallel with other work, so the overall project cycle will not be prolonged.

Some people may ask, “Doesn’t performance work consume a lot of energy? It takes time and money, so is it worth it if there are no visible results?” If you also have this confusion, you might as well take a look at the discussion we had earlier:

If the performance testing team does not identify bottlenecks and optimize them, can they provide an answer like “the production system will not have performance issues after going live”?

If not, then what is the purpose of this performance testing team? Is it just to find basic technical problems? It’s like a patient going to the hospital for treatment, having surgery and taking medication, and then asking the doctor, “When will I recover?” If the doctor says, “I don’t know!” Think about how the patient would feel, as if he encountered an incompetent doctor.

In fact, I am so conscious of communication and operational costs perhaps because I have worked in many “harsh” environments. Having been through so many of these situations, I am well aware that the time spent on communication is far greater than the time spent on the technical work. However, for a company, once these tasks have been carried out smoothly the first time, they take hardly any time afterward. The time you spend every day watching short videos and shopping would be enough to cover them.

Recently, I consulted for a company and used data from production operations for analysis and comparison. It took me less than two hours to work out the business model in production. The initial sampling cost is indeed higher, because we may need to build some platform tools to support the approach. Once the platform tools are in place, the subsequent operational part becomes relatively simple and takes little effort. The data is already there, and the performance team only needs to look at it.

However, in the client scenarios I have encountered, it is often very difficult for the performance team to obtain operational monitoring data. In addition, if your technical skills are not good and you are unable to articulate your thoughts, people will be more skeptical about your true intentions in requesting production data.

Therefore, if I see such a scenario when I first start a project, I will definitely not ask operations for data to analyze it myself. Instead, I will ask them for the resulting data and I will define a framework for them to provide data within that framework.

What if they can’t provide it? It’s okay, do you remember the trick used by elementary school students? Tell the teacher! After communication, if the leaders understand the purpose of these data, things will become easier. If operations think that all the log data in production is critical and core data, that’s fine too. Just ask them to assign someone from operations to accompany you every day. After working together and repeatedly providing performance data, they will eventually give you the necessary permissions, haha.

Actually, regardless of the position or background of the people involved in performance work, there is one thing you must remember. As a member of the performance team, when communicating with other teams, you must be precise and specific in posing questions, clearly explaining why things need to be done in a certain way and the costs, benefits, and drawbacks of doing or not doing it.

Is it necessary to do so?

Let me give you an example. Think about why we need to use the proportions of real-world business scenarios for testing models. The answer is very simple: if you don’t do this, the results of the testing will definitely not be able to answer questions about production capacity. If a business scenario accounts for 10% in production, but you set it as 20% in the performance scenario, it may yield completely different results.
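
To see why the proportions matter, consider a deliberately simplified, CPU-only model: each business has a fixed CPU cost per transaction, and the host can sustain only as many transactions as its cores can pay for. The per-transaction costs, the core count, and the two mixes below are hypothetical numbers chosen for illustration.

```python
def max_tps(mix_percent, cpu_ms_per_txn, cores):
    """Upper bound on TPS for a CPU-bound host, given a business mix in percent."""
    if sum(mix_percent.values()) != 100:
        raise ValueError("business mix must add up to 100 percent")
    # Weighted CPU cost of 100 "average" transactions, in milliseconds.
    cost_per_100 = sum(pct * cpu_ms_per_txn[name] for name, pct in mix_percent.items())
    if cost_per_100 <= 0:
        raise ValueError("total CPU cost must be positive")
    # Each core offers 1000 ms of CPU time per second.
    return cores * 1000 * 100 // cost_per_100


# Hypothetical CPU cost per transaction, in milliseconds.
cpu_cost = {"browse": 2, "pay": 20}

production_mix = {"browse": 90, "pay": 10}  # what production actually looks like
scenario_mix = {"browse": 80, "pay": 20}    # what the test scenario was set to

print(max_tps(production_mix, cpu_cost, cores=8))
print(max_tps(scenario_mix, cpu_cost, cores=8))
```

With these assumed numbers, doubling the share of the expensive business from 10% to 20% lowers the estimated ceiling from about 2105 TPS to about 1428 TPS, so a capacity conclusion drawn from the wrong mix would be off by roughly a third.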

In performance projects, there are many execution deviations caused by poor communication. Therefore, we must understand how each party wants to express their ideas and how to execute them in specific operational terms. This communication process is very, very important.

Who is responsible for driving performance engineering? #

I hope you understand that all the efforts I am making here are aimed at creating a complete loop for performance. Speaking of which, there is still a major bug, and that is who is responsible for driving performance engineering.

Performance engineering at the project level is definitely not something that can be driven by someone in a junior or lightweight position. Even with authorization it is not feasible, because it requires strong project management skills. Meanwhile, upper-level leaders often do not understand the technology and may not even see why such a fuss is made about performance (if you encounter this, you can put it this way: if they don’t care whether the system “dies” in production, then there is indeed no need for the fuss).

Therefore, performance engineering must be driven by someone of high position and influence. As for the specific tasks, they can be delegated to the person responsible for performance implementation.

“Performance Engineering” is different from those seemingly advanced concepts #

At this point, you may think that some concepts are very similar to what I call performance engineering, such as end-to-end testing. I expect someone will ask whether end-to-end should take center stage instead. I don’t mean to belittle end-to-end. But in a culture of “talking about cruise ships while building rafts”, many companies look at what the big companies are doing and simply follow suit, without carefully considering the costs and consequences.

In fact, most companies cannot afford the organizational cost of end-to-end. If you only do some technical transformations, and call it end-to-end just because you bypass the main process, then it is too shallow. Because technical transformations are not the key issue, the key issue is to get it up and running after the transformation.

The purpose of doing end-to-end in the online environment is to achieve the goal of simulating real business pressure scenarios by using the architecture, software and hardware environment, data, network structure, etc. in the production environment. If after a series of technical transformations, you only run 30% of the business pressure, then it is not worth it. If you run 100% or even higher business pressure, and the business model also matches the production environment, then congratulations, you have done something very valuable.

However, many systems are not like internet systems with a single main business process. When the business logic is highly complex, the cost of errors is simply more than an enterprise can bear. It takes little thought to see that running all kinds of tests in production like this is impossible. Therefore, end-to-end testing in production is only valuable for certain specific business scenarios. Companies that clearly don’t fit but push ahead anyway will come to their senses in a few years.

As for DevOps, I won’t go into it here because it is more of a technical management perspective. I also don’t want to discuss what “shift left” and “shift right” mean (I don’t see what the terms add; isn’t it just doing the work? Why dress it up?), because they sound like actively taking over someone else’s work, don’t they? Either you are doing the work you should be doing, or you are snatching someone else’s; neither framing sits well in terms of responsibilities. It reminds me of the term “full stack”, which I used to hear a lot and which is now just dinner-table conversation for me.

In terms of performance, I do not recommend using these concepts to define boundaries because the perspectives are different, and only the word “engineering” fits what I want to express. At this point, I hope you can understand that performance engineering should be a complete engineering-level activity for a system’s lifecycle.

Performance engineering is not something up in the air #

I would also like to emphasize that you should not use the guise of “engineering” to do “testing” work, nor should you make engineering sound like something lofty. It is completely dishonest to talk about theories in great detail without any practical implementation. I have seen too many so-called “experts” who can eloquently discuss theories but hide away when it comes to putting them into practice.

I once encountered an expert on a project who had an extremely arrogant demeanor when he first arrived at the worksite. After he finished talking, I calmly said, “Okay, let’s solve the problem then.” To my surprise, this guy replied, “Just search on Baidu!” At that moment, I felt the urge to buy a gun, even the kind with a grenade launcher.

Regarding the concepts of performance engineering, I will try my best to explain. Please note that the concepts I am sharing are not dry and abstract ideas, but rather practical ones that can be implemented. Furthermore, in the following courses, I will show you how these concepts are put into practice.

Summary #

Alright, that’s all for today’s class. Let me summarize for you.

In the current performance market, people focus too much on the word “testing,” yet a testing perspective cannot answer whether the system will “crash” in the online environment. Therefore, I analyze performance from the perspective of “engineering.” If a company can plan the performance process from the perspective of “engineering,” it will definitely go beyond the scope of “performance testing.” Moreover, only by starting from the perspective of “engineering” can we truly ensure the normal operation of a system’s business.

Speaking of which, let’s redefine performance engineering once again. Obviously, this is the focus of our class today.

Performance engineering refers to the analysis of business logic and technical architecture, creating performance models, developing performance plans, preparing application environments, designing and implementing performance deployment monitoring, implementing realistic business logic stress tests, using monitoring tools to obtain performance counters of each component, analyzing the data collected by the counters, identifying the root causes of performance bottlenecks and optimizing them, and finally adjusting the scenarios based on the performance data of the production environment.

Homework #

Finally, I would like you to think about two questions:

What is the difference between performance engineering and concepts such as end-to-end testing, DevOps, etc.?
Describe your understanding of RESAR performance engineering.

Feel free to discuss and exchange ideas with me in the comments section. Of course, you can also share this lesson with your friends around you. Their thoughts may give you even greater gains. See you in the next lesson!
