21 Traffic Replay: The Magic Weapon to Protect Business Technical Upgrades

Hello, I’m He Xiaofeng. In the previous lecture, we learned how the timing wheel is applied in RPC. Its core principle is “divide and conquer”, and it fits any scenario that needs to process a large number of timed tasks efficiently. The most representative example is request timeout detection under high concurrency.

After reviewing the key points of the previous lecture, we will now move on to today’s topic and take a look at the application of traffic replay in RPC.

If you often read technical articles, you have probably come across the term “traffic replay” more than once. Let me introduce it briefly. Here, “traffic” means all the requests an application receives within a certain period of time. We record every request sent to Application A by some means and then forward those requests, unchanged, to Application B, so that Application B receives exactly the same request parameters as Application A did. In effect, the requests that Application A received are issued again against Application B. We call this whole process “traffic replay”.

It’s like a football match tonight that I have no time to watch: I can record it and watch it whenever I like, and the footage is exactly the same as the live broadcast.

So, what can traffic replay be used for during system development?

What Can Traffic Replay Be Used For?

Personally, I find that being immersed in writing code and completing business features is the happy part of day-to-day development. What gives me a headache is the testing phase that follows once the code is finished.

In a team, we often develop multiple requirements in parallel, and while developing new requirements we may also have to refactor and split applications. It is then almost impossible to avoid modifying old logic, and every modification carries the risk that something was overlooked. If you are rigorous, you will run all the test cases in the project after development is complete and add test cases for the new functionality. Only when every test case passes can you feel at ease.

This approach usually works fine for small changes in business requirements. For major changes to an application, however, such as modifications to a lot of fundamental logic, verifying functionality with the existing test cases alone can hardly guarantee that nothing goes wrong after the application goes online. After all, the test cases we maintain ourselves cover far less than the real online environment does.

At this point, we turn to professional QA testers and hope they can add more test cases from a QA perspective. But because the code changes touch so much logic, it is hard to pin down a definite test scope. To be honest, the safest option is for QA to run regression tests on the entire project. This minimizes the chance of problems after going online, but it is still not foolproof, because the online environment is complex and its usage scenarios are hard to anticipate. It is also time-consuming.

This is why I find testing the biggest headache. Relying on traditional QA testing takes time, and the results are still not completely reliable. So, is there a more reliable and cost-effective solution?

When problems slip past traditional QA testing, the root cause is that the modified application behaves differently online than the original did. The goal of testing is precisely to ensure that the modified application behaves the same as the original. Test cases also try to simulate how the application behaves online, but cases that we enumerate and maintain by hand cannot cover all of its behaviors. The best approach, therefore, is to validate with online traffic. However, we cannot simply put the new application online to receive that traffic, because if the modified application has any problem, it may harm the business of its online callers.

There is another way. We can first save the request parameters and response results from a period of online operation, then send the same request parameters to the modified application and check whether its responses match the saved ones. This indirectly achieves the effect of testing with online traffic. With the online request parameters and response results in hand, and combined with the continuous integration process, we can verify modified code against online traffic at any time. It’s just like my recorded football match: whenever I want to watch it, I can take it out and watch it again.

How to Support Traffic Replay in RPC?

So, how can we implement traffic replay in practice?

There are many off-the-shelf options, such as TcpCopy and Nginx. However, to use these tools in production we still have to ask the operations team to install them on the application instances and configure them to our requirements, and the whole process is cumbersome and repetitive. Is there a better way, especially when the application already uses RPC?

As we have mentioned before, RPC is used to achieve communication between applications, which means that all requests and responses between applications will pass through RPC.

Since all requests go through RPC, can’t we easily obtain the input and output parameters of each request in RPC? After getting these parameters, we just need to record them and asynchronously send the recording results to a fixed location for storage. This completes the recording function in traffic replay.
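The recording step can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the implementation of any particular RPC framework: `TrafficRecord`, `TrafficRecorder`, `invoke`, and the `PriceService.total` method name are all hypothetical, and a plain Python list stands in for the fixed storage location. The point it shows is that the business call returns immediately while a background thread ships the record to storage.

```python
import queue
import threading
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class TrafficRecord:
    method: str   # name of the invoked method
    args: tuple   # input parameters of the request
    result: Any   # output returned to the caller


class TrafficRecorder:
    """Hypothetical recording hook on the provider side (illustration only)."""

    def __init__(self, sink: Callable[[TrafficRecord], None]) -> None:
        self._sink = sink               # where records are finally stored
        self._pending = queue.Queue()   # unbounded, so put() never blocks
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def invoke(self, method: str, handler: Callable[..., Any], *args: Any) -> Any:
        result = handler(*args)         # the real business call
        self._pending.put(TrafficRecord(method, args, result))
        return result                   # the caller never waits for storage

    def _drain(self) -> None:
        # Background thread: ship records to storage off the request path.
        while True:
            record = self._pending.get()
            if record is None:          # sentinel queued by close()
                return
            self._sink(record)

    def close(self) -> None:
        self._pending.put(None)
        self._worker.join()


stored = []                             # stands in for the fixed storage location
recorder = TrafficRecorder(stored.append)
total = recorder.invoke("PriceService.total", sum, [10, 20, 12])
recorder.close()                        # flush before inspecting the store
```

The unbounded queue keeps the sketch short. A production recorder would more likely bound the queue and drop records under pressure, so that recording can never slow down or exhaust the memory of the application being recorded.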

Once we have the real request parameters, the next step is to forward them to the application we want to regression test. In RPC, the application that receives requests is called the service provider. So we only need to simulate a caller that resends the recorded request parameters to the application under test, and then compare the recorded responses with the responses to the new requests. That completes request replay. The whole process is shown in the following figure:
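In code, the replay and comparison steps might look like the following minimal sketch. Everything in it is an assumption made for illustration: `RecordedCall`, `Mismatch`, `replay`, and `refactored_total` are hypothetical names, the provider is modeled as a plain dictionary of functions rather than a real network endpoint, and responses are compared with simple equality.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List


@dataclass
class RecordedCall:
    method: str     # method that was invoked online
    args: tuple     # recorded request parameters
    expected: Any   # recorded response of the original application


@dataclass
class Mismatch:
    call: RecordedCall
    actual: Any     # what the modified application returned (or raised)


def replay(records: Iterable[RecordedCall],
           provider: Dict[str, Callable[..., Any]]) -> List[Mismatch]:
    """Act as a simulated caller: resend each recorded request and diff the result."""
    mismatches = []
    for call in records:
        try:
            actual = provider[call.method](*call.args)
        except Exception as exc:   # a missing method or a crash also counts as a difference
            actual = exc
        if actual != call.expected:
            mismatches.append(Mismatch(call, actual))
    return mismatches


# Two requests recorded from the original application.
recorded = [
    RecordedCall("PriceService.total", ([10, 20],), 30),
    RecordedCall("PriceService.total", ([],), 0),
]


# The refactored provider under test; it mishandles the empty order.
def refactored_total(prices):
    return sum(prices) if prices else None


report = replay(recorded, {"PriceService.total": refactored_total})
for item in report:
    print(f"{item.call.method}{item.call.args}: "
          f"expected {item.call.expected!r}, got {item.actual!r}")
```

Plain equality is enough for this toy case. Real responses often carry timestamps or generated IDs, so a practical comparison would probably need to ignore or normalize such fields.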

Compared with other ready-made traffic replay solutions, replay that is built into RPC is more convenient to use. It also leaves room for customization, such as starting and stopping recording online, recording at the method level, and other specific requirements.
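As one possible shape for such customization, here is a minimal sketch of a runtime switch that turns recording on and off and restricts it to selected methods. The class `RecordingSwitch`, its methods, and the `OrderService` method names are hypothetical. In a real framework the switch would presumably be driven by a configuration center or management command, which is outside the scope of this sketch.

```python
import threading
from typing import Iterable


class RecordingSwitch:
    """Hypothetical runtime switch; the names are illustrative only."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._enabled = False
        self._methods = frozenset()   # empty means "record every method"

    def start(self, methods: Iterable[str] = ()) -> None:
        # Turn recording on, optionally limited to the given methods.
        with self._lock:
            self._methods = frozenset(methods)
            self._enabled = True

    def stop(self) -> None:
        with self._lock:
            self._enabled = False

    def should_record(self, method: str) -> bool:
        # Called on the request path before a record is created.
        with self._lock:
            if not self._enabled:
                return False
            return not self._methods or method in self._methods


switch = RecordingSwitch()
before_start = switch.should_record("OrderService.create")   # recording is off by default
switch.start(methods=["OrderService.create"])
selected = switch.should_record("OrderService.create")       # listed method
other = switch.should_record("OrderService.cancel")          # method not listed
switch.stop()
after_stop = switch.should_record("OrderService.create")
```

Keeping recording off by default means the feature costs nothing until someone explicitly asks for a recording window.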

Summary

Ensuring the stability of online applications is what our development team works toward every day, whether by upgrading the application architecture or by fixing existing problems. In reality, we must guarantee the stability of existing business while also delivering new business requirements quickly. As a result, application code changes frequently, each change may introduce new sources of instability, and this cycle never stops.

To make sure business behavior stays the same after an application upgrade, we mostly rely on existing test cases for validation, but this approach is only reliable up to a point. The most reliable approach is to use online traffic as the test cases: replay real requests from the online environment against the modified application. This not only shortens the overall release time but also compensates for the shortcomings of manually maintained test cases.

Once an application adopts RPC, all of its request traffic passes through RPC, so it is natural to support traffic replay inside RPC. Although this is not a core function of RPC itself, it lets RPC users upgrade their applications with more peace of mind.

Reflections after Class

In addition to using traffic replay to validate the modified application logic as mentioned earlier, what other meaningful things can we do with traffic replay?

Feel free to leave a message and share your thoughts with me. You are also welcome to share this article with your friends and invite them to join the learning process. See you in the next class!