28 How to Determine the Scope and Logic Design of Abnormal Scenarios

Hello, I’m Gaolou.

In the field of performance, exceptional scenarios have always been a weak area. Everyone agrees that they should be tested, but few people know how to cover exceptional problems comprehensively.

The scope of exceptions is hard to pin down because so many topics are classified as “exceptions,” including high availability, reliability, and disaster recovery. Some companies classify this work as non-functional requirements instead, so their performance projects contain no exceptional scenarios at all.

In my RESAR performance engineering theory, exceptional scenarios are mandatory because they must be executed under a stress background, that is, while the system is under load.

Since exceptional scenarios need to be done, how should we approach them specifically? What issues should be tested to fully cover exceptional scenarios? This requires us to clarify two key points: the scope of exceptional scenarios and the design logic of exceptional scenarios.

Therefore, in this lesson, we will discuss how to determine the scope and design logic of exceptional scenarios.

Scope of Exceptional Scenarios

In the past, exceptional scenarios relied mainly on three basic test methods: host crash, network interruption, and application crash. Beyond these, there are some more specific operations from the host, network, and application perspectives, such as:

Host: power outage, reboot, shutdown, etc.;
Network: using the ifdown command to disable a network interface, simulating packet loss, delay, and retransmission, etc.;
Application: kill, stop, etc.
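As a rough illustration, the operations above can be collected into a small catalogue of fault-injection commands. The sketch below is not part of the original lesson. The interface name `eth0`, the service name `demo-service`, the process ID, the use of `tc` with `netem` for packet loss and delay, and the helper `build_fault_commands` are all assumptions made for the example. The function only builds command strings and never executes them.

```python
from typing import Dict, List


def build_fault_commands(
    interface: str = "eth0",        # hypothetical network interface name
    service: str = "demo-service",  # hypothetical systemd service name
    pid: int = 12345,               # hypothetical process ID of the application
) -> Dict[str, List[str]]:
    """Return example fault-injection commands grouped by perspective.

    The commands are returned as plain strings; nothing is executed here.
    """
    return {
        "host": [
            "reboot",
            "shutdown -h now",
        ],
        "network": [
            f"ifdown {interface}",
            # tc with netem adds artificial packet loss and delay on the interface
            f"tc qdisc add dev {interface} root netem loss 10% delay 100ms",
        ],
        "application": [
            f"kill -9 {pid}",
            f"systemctl stop {service}",
        ],
    }


if __name__ == "__main__":
    for perspective, commands in build_fault_commands().items():
        for command in commands:
            print(f"[{perspective}] {command}")
```

Keeping the catalogue as data makes it easy to review which operations a project intends to run before anyone touches a real host.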

These operations are still effective in today’s technical architectures, but there are now more specific operations to consider. As microservice applications have spread, additional layers have emerged, such as the virtual machine layer, the container layer, and the gateway layer. I have drawn a diagram here to roughly list the different perspectives of exceptional scenario testing:

Regarding the scope and timing of exceptional scenarios, there are two topics that have been debated:

Should exceptional scenarios be part of performance projects?
What should be included in exceptional scenarios?

As for the first question, my view is that whether the goal is code logic verification, functional verification, or performance verification, simulating real exceptions always produces a set of distinct sub-scenarios. Many companies in the testing market already work this way, which is a good sign. Because these exceptional scenarios must be executed under a stress background, they belong in performance projects.

If these scenarios were completed in another stage instead, the scripting, parameterization, and monitoring would all have to be repeated. And if different teams had to collaborate on them, the cost would clearly increase.

As for the second question, you may find it odd that I raise it at all, since the diagram above already lists everything that exceptional scenarios include. The reason is that the industry holds many different opinions. Some people think that exceptional scenarios should also cover high availability, reliability, scalability, stability, and so on.

In fact, we often have only a vague grasp of these terms: we feel we know what they mean but cannot pin down the key points. Take reliability as an example. In practice, what we understand by reliability is that a system runs without failure for a given time under given conditions. But then what is “stability”? We take stability to mean that a system runs without failure for a certain period of time.

Hey, don’t these two sound very similar? What is the difference between “reliability” and “stability”? In my opinion, stability is included in reliability.

With that said, you should understand that in the current technical market, although many people have proposed different perspectives, if we list the corresponding implementation steps of these perspectives, you will find that they can all be included in the diagram I just mentioned.

Therefore, please remember that for exceptional scenarios, it is enough to cover the contents of this diagram.

Design Logic for Exception Scenarios

Logically, the design of exception scenarios can be divided into two steps:

Analyze the architecture: List all the components in the technical architecture and analyze the potential points of failure.
List the exception scenarios: Design corresponding scenarios based on the analysis of the points of failure.
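The two steps above can be sketched as a simple expansion of components and their points of failure into scenarios. This is a minimal illustration under assumed names: the component list, the failure points, and the function `list_scenarios` are hypothetical and stand in for whatever your own architecture analysis produces.

```python
from typing import Dict, List, Tuple

# Step 1: components of a hypothetical architecture and their potential points of failure.
ARCHITECTURE: Dict[str, List[str]] = {
    "gateway": ["process killed", "network delay"],
    "order-service": ["container restarted", "process killed"],
    "mysql": ["host shutdown", "network interruption"],
}


def list_scenarios(architecture: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Step 2: turn every (component, failure point) pair into one scenario."""
    return [
        (component, failure)
        for component, failures in architecture.items()
        for failure in failures
    ]


if __name__ == "__main__":
    for component, failure in list_scenarios(ARCHITECTURE):
        print(f"{component}: {failure}")
```

This only covers the component level. Business logic exceptions would need entries written by hand for each business flow.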

On the surface, this logic does not seem complicated. If we only consider it at the component level, we can design universal exception scenarios. However, if we consider it from the perspective of business logic exceptions, there are no universal exception scenarios. We need to design different exception scenarios for different businesses.

However, in the field of performance, most people do not have a set pattern for designing exception scenarios. They usually rely on intuition. Even if you follow the above two steps to design exception scenarios, there is bound to be a question: Are all the exception scenarios covered?

To address this question, I suggest that you refer to the logic of the FMEA failure model in the design logic of exception scenarios, because FMEA is at least a set of logical design ideas that can provide us with a systematic approach.

FMEA is not widely used in the performance industry, and most people are not familiar with it. After I gained a deeper understanding of FMEA, I realized that it can be applied in the design of exception scenarios in performance projects as a method to analyze failure models. Therefore, the next question we need to discuss is how to apply FMEA to exception scenarios.

Let me give you a brief introduction to FMEA.

FMEA is a logical approach used primarily for analyzing the design and operation of complex systems. It was initially used in the design analysis of fighter aircraft control systems and was later widely applied in aerospace, automotive, medical, microelectronics, and other fields.

FMEA stands for Failure Mode and Effects Analysis (the variant that adds a criticality analysis is known as FMECA). It includes DFMEA, PFMEA, and FMEA-MSR:

DFMEA stands for Design FMEA, which analyzes potential failure modes during the design phase.
PFMEA stands for Process FMEA, which focuses on potential failures during the manufacturing or assembly processes.
FMEA-MSR stands for “FMEA for Monitoring and System Response,” which maintains functional safety by analyzing monitoring and system response (MSR).

The logic used in these three subdivisions of FMEA is consistent; they only differ in the stages and key points they focus on.

Currently, there are also attempts to apply FMEA in software systems within the IT industry. The most important aspect of FMEA is the table shown below, which you often come across online.

Let me explain the “RPN” in the table. It stands for Risk Priority Number, which is the product of Severity (S), Occurrence (O), and Detection (D). As for the other terms in the table, you can understand them by reading the words, so I won’t delve into them further here.
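To make the RPN calculation concrete, here is a small sketch that computes S × O × D for a few failure modes and sorts them by priority. The class `FailureMode`, the function `rank_by_rpn`, and the example ratings are hypothetical and only serve the illustration. The sketch assumes the usual FMEA rating scale of 1 to 10 for each factor.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FailureMode:
    name: str
    severity: int    # S, assumed scale 1-10
    occurrence: int  # O, assumed scale 1-10
    detection: int   # D, assumed scale 1-10

    def __post_init__(self) -> None:
        # Reject ratings outside the assumed 1-10 scale.
        for label in ("severity", "occurrence", "detection"):
            value = getattr(self, label)
            if not 1 <= value <= 10:
                raise ValueError(f"{label} must be between 1 and 10, got {value}")

    @property
    def rpn(self) -> int:
        """Risk Priority Number: the product of S, O, and D."""
        return self.severity * self.occurrence * self.detection


def rank_by_rpn(modes: List[FailureMode]) -> List[FailureMode]:
    """Return the failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)


if __name__ == "__main__":
    modes = [
        FailureMode("database host shutdown", severity=9, occurrence=2, detection=3),
        FailureMode("gateway process killed", severity=7, occurrence=4, detection=2),
    ]
    for mode in rank_by_rpn(modes):
        print(mode.rpn, mode.name)  # prints 56 for the gateway case, then 54
```

On a 1 to 10 scale the RPN ranges from 1 to 1000, so the number by itself only says which scenarios to look at first.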

Do you think it is difficult to apply this table to IT architecture? Actually, when implementing specific exception scenarios, we can make some changes to the table according to our understanding:

I added a “System” column to the table because some projects have multiple systems. Of course, you can also omit this column and name the entire table for a specific system. As for the other terms, I made appropriate adjustments without changing the original table structure.

Before we fill in this table, I want to explain one thing. In FMEA, Severity, Occurrence, and Detection all require their own rating scales, ranging from 1 to 10. Below, I will roughly outline the meanings of different levels for these three aspects. Please note that I am only describing relatively general content and trying not to be too specific to any particular business.

[Tables: rating scales for Severity (S), Occurrence (O), and Detection (D)]

For your own system, you do not necessarily have to strictly follow the classification in the table above, but you can still borrow the logic.

Now, let me give you an example of an abnormal use case to show you how severity, occurrence, and detection are applied:

Based on this table, you should know how to list the abnormal scenarios for your own system.

Please note that even if you want to use FMEA to design abnormal scenarios, the diagram I drew at the beginning of this lesson is still essential because it is one of the input conditions for this table. In other words, before filling out this table, we must have a clear understanding of what needs to be tested in abnormal scenarios, which is very important.

However, with this table and the 10 levels each of severity, occurrence, and detection, abnormal scenarios suddenly become complex, because there are so many possible RPN combinations. If we listed all the abnormal scenarios in the system, we would have to execute them one by one in order of their RPN values. Let’s say there are 10 scenarios with an RPN of 1000, 20 scenarios with an RPN between 900 and 1000, and so on. It would be a daunting task, wouldn’t it?

I remember chatting with a friend who has extensive IT experience about designing abnormal use cases. He said that if he were to design abnormal use cases, it would be no problem to come up with tens of thousands or even more. Then I said, “But what is the probability of these use cases actually occurring in production? If the system operates without encountering these situations until it reaches its end of life, what use do these cases serve?”

From this conversation, you can think about the following question: Do we really need to make our system’s abnormal scenarios so complex?

Of course not. In fact, we can simplify it, for example by reducing the number of levels. As mentioned earlier, severity, occurrence, and detection each have 10 levels in FMEA, but for the abnormal scenarios of a system we can define just three or four. If you feel that three or four levels are not enough, you can choose the number of levels to suit your own situation, depending on how important your system is.
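A quick calculation shows how much reducing the number of levels shrinks the problem. The sketch below counts the possible (S, O, D) rating triples and the distinct RPN values for a given number of levels. The function names are hypothetical and the code is only an arithmetic aid, not part of FMEA itself.

```python
from itertools import product


def count_combinations(levels: int) -> int:
    """Number of possible (S, O, D) rating triples for a given number of levels."""
    return levels ** 3


def distinct_rpn_values(levels: int) -> int:
    """Number of distinct RPN values (S * O * D) that can actually occur."""
    scale = range(1, levels + 1)
    return len({s * o * d for s, o, d in product(scale, repeat=3)})


if __name__ == "__main__":
    print(count_combinations(10))  # 1000 rating triples with 10 levels
    print(count_combinations(3))   # 27 rating triples with 3 levels
    print(distinct_rpn_values(3))  # 10 distinct RPN values with 3 levels
```

With three levels there are only 27 rating triples to reason about instead of 1000, which is far easier to define and agree on within a team.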

Overall, FMEA is a very comprehensive logic. The fourth edition of its white paper is over 130 pages long. If you are interested, you can take a look.

In fact, applying FMEA to the testing of abnormal scenarios is not complicated. What is complicated is how to define S, O, and D. Because during the specific definition, it is not as simple as listing three tables as I did earlier. It requires a detailed analysis based on the logic of the system.

Next, I need to make one point clear. When you put any methodology into practice, do not overuse it; use it within reason. Judging from the thinking of the foreigners I have worked with, they are very fond of creating research-oriented roles and stretching a single concept into something they can use to bluff others for a lifetime.

I remember there was a foreigner in a team I led, a young guy who was always working as a defect manager, chasing after the progress of bug fixes every day. One day, he came to me and said he wanted to resign. I asked him, “So what do you want to do then?” He said, “I want to do research.” I continued, “What do you want to research?” He replied, “I haven’t figured it out yet, but I want to do research.” When he said that, he seemed to think that research was something quite high-end. I smiled slightly and said, “Okay, go ahead then.”

I share this story not to suggest that FMEA is a methodology that has not been carefully thought through, but to say that we should stay level-headed when looking at foreign ideas. Whether one idea is superior to another only becomes clear with time, and we should keep our own capacity for complete, independent thinking.

Therefore, I suggest that when designing abnormal scenarios, you can refer to the logic in FMEA, eliminate the parts that are not applicable, and design failure models that are suitable for your own system. The description in this lesson is just to give you an idea because teaching people to fish is better than giving them fish, which is my original intention.

The above content is my comprehensive description of the design of abnormal scenarios. Please note that the key point is not FMEA, but the scope diagram of abnormal scenarios mentioned earlier.

However, discussing so much without practical application is not my style, so we still need specific operational examples. In the next lesson, we will use the diagram we drew at the beginning of this lesson to work through a few practical cases and see how abnormal scenarios are executed.

Summary

Different opinions exist regarding exceptional scenarios in the performance industry, and no one can convince anyone else. This results in each company having a different range of exceptional scenarios. In addition, there are different views on chaos testing and non-functional testing in the industry. Therefore, exceptional scenarios have never been fixed in performance projects.

In the RESAR performance engineering concept, performance personnel must take responsibility for exceptional scenarios that run under a stress background.

Through this part of the lesson, what I want to convey is the extent of the exceptional scenario scope and the logic of its design. With this knowledge, the coverage of exceptional scenarios will be comprehensive and orderly.

Homework

Finally, I have two questions for you to think about:

What exceptional scenarios have you designed? Please describe your design ideas.
What exceptional problems have you encountered? Please give some examples.

Remember to discuss and exchange your thoughts with me in the comment section. Every thought will help you advance further.

If you have gained something from reading this article, feel free to share it with your friends and learn and progress together. See you in the next lesson!