17 Deployment Management, Low-Risk Deployment Strategies #

Hello, I’m Shi Xuefeng. Today I want to talk to you about deployment management.

In the annual State of DevOps Report, there are four key outcome metrics, two of which relate directly to deployment: deployment frequency and change failure rate. The other two are lead time for changes and mean time to recover from failures.

For DevOps, deployment is the final 100-meter sprint of software delivery. Only when the software actually reaches end users through deployment and release does the value of all the preceding work come to fruition.

Deployment and release are often used interchangeably, but strictly speaking they are two different practices. Deployment is a set of technical practices: taking the functional artifacts produced by development and testing (code, binary packages, configuration files, database changes, and so on) and installing them in a target environment such as development, staging, or production. The result of a deployment is a change to the servers, but that change is not necessarily visible to the outside world.

Release, on the other hand, is more of a business practice: formally making the deployed functionality visible and available to users. The timing of a release is usually driven by business demands, so deployment and release are often not synchronized. Take an e-commerce business as an example: if a new promotion has to go live at midnight and deployment and release were not separated, every server change would have to finish in the final second before midnight, which is clearly not feasible.
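
To make this separation concrete, here is a minimal sketch in Go, assuming a hypothetical feature flag (in practice it would come from a configuration center): the promotion code is deployed days in advance and stays dark, and the release is simply the flag flipping at midnight.

```go
package main

import (
	"fmt"
	"time"
)

// promotionEnabled simulates reading a feature flag. In a real system this
// value would come from a remote toggle or configuration-center service.
var promotionEnabled = false

// handleHomepage decides what the user sees. The new promotion code is
// already deployed, but it stays invisible until the flag is switched on.
func handleHomepage() string {
	if promotionEnabled {
		return "Midnight promotion: 50% off!" // released
	}
	return "Regular homepage" // deployed but not yet released
}

func main() {
	fmt.Println(handleHomepage()) // before midnight: regular page

	// At the release moment, flip the flag; no server change is needed.
	promotionEnabled = true
	fmt.Println(time.Now().Format("15:04"), handleHomepage())
}
```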

So, here’s a question for you to consider: does low-risk release mean ensuring that the changes made in this release are flawless before executing the release action?

In fact, even if it is never stated explicitly, many companies do work this way. Traditional software engineering also tries to guarantee the quality of the delivered product through layer upon layer of quality measures. A typical example is the V-model of testing: unit testing, integration testing, system testing, user acceptance testing, and assorted specialized tests, all aimed at finding as many issues as possible before release in order to ensure product quality.

So, does the DevOps model advocate the same quality thinking? I think that is a question worth debating.

In reality, as release frequency accelerates, the time left for testing keeps shrinking. At the same time, today's businesses are far more complex than they were ten years ago: every release spans multiple platforms such as PC, mobile, mini-programs, and H5, not to mention hundreds or even thousands of device models. Completing all the testing within the limited time is already a challenge. On top of that, many companies track the tester-to-developer ratio, which caps the growth of testing resources and in some cases shrinks them year after year.

Of course, you can improve testing efficiency through automation, but exhaustive testing is impossible in any case. So when DevOps promises both speed and quality, is that promise a lie?

Of course not. The core lies in the shift in quality thinking under the DevOps model. In short: accelerate the release pace while guaranteeing a certain level of quality, use low-risk release methods together with online testing and monitoring to surface problems as early as possible, and rely on the simplest possible means to recover quickly.

There are several key phrases here: “certain level of quality,” “low-risk release methods,” “online testing and monitoring,” and “quick recovery.” Let me explain each of them.

Certain Level of Quality #

How should we interpret this “certain level”? Software of different kinds naturally has different quality standards. For example, a friend of mine builds satellites, and their software has to be polished for years until it is essentially flawless, regardless of cost. Fast-paced Internet businesses, by contrast, accept by default that some problems will occur. So, within a defined testing scope and coverage, the release can go ahead as long as the serious issues are fixed; lower-severity issues can be handled later through crowdtesting and canary (gray) release verification.

So, compared with defining a release quality standard, what matters more when rolling out DevOps is changing the team's quality mindset. Quality is no longer the responsibility of the testing team alone but of the entire delivery team. When a problem appears in production, the team works together to identify and fix it, and then reflects on how to avoid similar issues in the future, learning from failure.

Extending testing capability means two things. On the one hand, provide tools and platforms that make it easier for development teams to test their own work. On the other hand, strengthen practices such as online monitoring and event tracking (instrumentation), so that problems are exposed quickly in production and the data needed to analyze user behavior is collected. Together, these raise the overall release quality.

Low-Risk Deployment Methods #

Since deployment is an unavoidable, high-risk activity, we need ways to reduce that risk. Typical methods include blue-green deployment, canary release, and dark deployment.

1. Blue-green deployment

Blue-green deployment means preparing two identical environments for the application, one called blue and one called green; only one of them serves online traffic at any time. The names are used purely to tell the two environments apart. When a new version goes out, it is first deployed to the environment that is not serving traffic, where it undergoes pre-release verification. Once verified, it waits for the release window. At the designated release time, the routing that points to the live environment is switched over to the other environment, and the deployment is complete.

In general, the implementation cost of this method is relatively high: you maintain two identical environments, yet only one serves online traffic at any time. To reduce the waste, the idle environment can double as a pre-production environment for verifying new functionality before it goes live. In this mode, a common approach is to share a single database instance and rely on backward-compatible changes to support both application versions at once.

Image source: https://www.gocd.org/2017/07/25/blue-green-deployments.html
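
As an illustration of the routing switch, here is a minimal sketch in Go, assuming placeholder backend URLs: a reverse proxy whose upstream can be flipped atomically from the blue environment to the green one (and flipped back just as easily for a rollback).

```go
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
)

// blue and green are the two identical environments; only one serves traffic.
var (
	blue, _  = url.Parse("http://blue.internal:8080")  // currently live
	green, _ = url.Parse("http://green.internal:8080") // new version, verified
	active   atomic.Value                              // holds the *url.URL of the live environment
)

func main() {
	active.Store(blue)

	// All user traffic passes through this proxy to whichever environment is active.
	proxy := &httputil.ReverseProxy{
		Director: func(r *http.Request) {
			target := active.Load().(*url.URL)
			r.URL.Scheme = target.Scheme
			r.URL.Host = target.Host
		},
	}

	// Hitting /switch flips the route to green; a rollback is just flipping back.
	http.HandleFunc("/switch", func(w http.ResponseWriter, r *http.Request) {
		active.Store(green)
		w.Write([]byte("traffic now routed to green\n"))
	})
	http.Handle("/", proxy)

	http.ListenAndServe(":80", nil)
}
```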

2. Canary release

Canary release, also known as incremental or phased release, is more flexible and cost-effective compared to blue-green deployment. Therefore, it is a more widely adopted low-risk deployment method in enterprises.

Canary release can be implemented in several ways, the most typical being a rolling, batch-by-batch rollout of the application. When a new version is released, it is first deployed to a small percentage of nodes according to a pre-designed plan; users whose traffic lands on those nodes can then use the new functionality.

It is important to keep behavior consistent for the same user rather than bouncing them back and forth between the old and new functionality. There are various ways to achieve this, such as bucketing users into groups by user ID or by cookie.

After the new version passes validation on the first batch of nodes, it is gradually rolled out to more nodes, batch by batch, until every node has been deployed and the whole application runs the new version. Batch deployment is only one way to implement canary release; a configuration center and feature toggles allow more targeted canary strategies, for example releasing first to specific users, regions, or device types.
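
Here is a minimal sketch of the user-bucketing idea in Go; the 10% threshold and the hashing scheme are illustrative assumptions, and in practice the percentage would be read from a configuration center and raised batch by batch.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// canaryPercent is the share of users currently routed to the new version.
// In a real rollout it would come from a configuration center and be raised
// in stages, e.g. 1% -> 10% -> 50% -> 100%.
var canaryPercent uint32 = 10

// inCanary maps a user ID to a stable bucket in [0, 100). The same user
// always lands in the same bucket, so they never flip between versions.
func inCanary(userID string) bool {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return h.Sum32()%100 < canaryPercent
}

func main() {
	for _, u := range []string{"alice", "bob", "carol", "dave"} {
		if inCanary(u) {
			fmt.Println(u, "-> new version")
		} else {
			fmt.Println(u, "-> old version")
		}
	}
}
```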

For mobile applications, canary release is just as essential. Take an iOS application as an example. First, internal employees download and install the enterprise package to self-test and verify the new version. Once that passes, the new version is opened to a selected group of external users through Apple's official TestFlight platform. After the canary metrics are confirmed to meet expectations, the upgrade is rolled out to the full user base.

Many applications now deliver their pages dynamically, and feature toggles can control which users see which features.

Image source: https://www.gocd.org/2017/07/25/blue-green-deployments.html

3. Dark deployment

With the rise of A/B testing, dark deployment has become increasingly popular. Dark deployment means validating a change online without users being aware of it. A typical example is shipping the back end first: an interface containing the new functionality is released to production, but since no front-end entry points to it yet, users never call it directly. When users perform certain operations, the system duplicates their traffic in the background and redirects the copy to the newly deployed interface, to verify whether its responses and performance meet expectations.

For example, in an e-commerce scenario, when a user searches for a keyword, there are two algorithms in the background that provide two different sets of search results. Based on the user’s actual behavior, the system can verify which algorithm has a higher hit rate, thus achieving online functionality validation.

Image Source: https://www.gocd.org/2017/07/25/blue-green-deployments.html
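
The following sketch in Go illustrates the traffic-duplication idea; the endpoint names are made up for illustration. The user's request is served by the old code path, while a copy of the same query is sent asynchronously to the new interface purely to check its result and latency.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/url"
	"time"
)

// search serves the real user request with the old, proven implementation,
// and in the background shadows the same query to the dark-deployed interface.
func search(w http.ResponseWriter, r *http.Request) {
	q := r.URL.Query().Get("q")

	// The user always gets the old algorithm's result; their experience never changes.
	fmt.Fprintf(w, "results for %q from the old algorithm\n", q)

	// Duplicate the traffic to the new interface. Its result is only logged
	// and compared, never returned to the user.
	go func() {
		start := time.Now()
		resp, err := http.Get("http://new-search.internal:8080/v2/search?q=" + url.QueryEscape(q))
		if err != nil {
			log.Printf("dark deploy: new interface failed: %v", err)
			return
		}
		defer resp.Body.Close()
		log.Printf("dark deploy: new interface status=%d latency=%v",
			resp.StatusCode, time.Since(start))
	}()
}

func main() {
	http.HandleFunc("/search", search)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```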

Of these three low-risk deployment methods, blue-green deployment is the best way to improve system availability when the overall application is not large; the various hot-standby schemes are typical applications of the blue-green idea. For large-scale systems, weighing cost against benefit, canary release is clearly the most cost-effective approach. And if you want to run online experiments and collect real user feedback, dark deployment is a good choice.

Online Testing and Monitoring #

So how do you verify that these deployment patterns actually work as intended? The key lies in online testing and monitoring. In fact, DevOps brings a rather new idea: monitoring is itself a comprehensive form of testing.

You might ask: why test online? Isn't that very unsafe? The traditional practice is to spend a great deal of effort building a pre-production environment that simulates production as faithfully as possible in order to validate functionality. But anyone who has done testing knows that a test environment can never fully replace production: even after extensive regression testing in a test environment, all sorts of problems still appear once you go live.

There is an interesting analogy for the difference between the two: a test environment is like a zoo, where the animals all seem to live perfectly well; a production environment is like the wild, and you can never predict how zoo animals will behave once they are released into it, because they face a completely unknown world. There are many reasons for the difference, such as hardware and environment, user behavior and traffic volume, and dependent services, and each of these variables changes the combination of conditions the software faces.

So, since it is impossible to simulate all the scenarios that will be encountered after the release in advance, how can we do online validation? There are three common methods.

1. Use canary releases and user testing to observe user behavior and collect user data step by step, verifying whether the usability of the new version meets expectations.

One of the main practices here is event tracking (instrumentation). In Internet products, tracking is the most common method of product analysis and data collection, and one of the main inputs to data-driven decision-making. Its value lies in collecting multi-dimensional data, such as user behavior, product quality, and operational metrics, according to a collection and monitoring scheme designed in advance.

Large companies generally have their own tracking SDKs, which collect data automatically and let the collection granularity be configured to match the product design. For smaller companies, third-party analytics tools such as Umeng satisfy most needs.
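
As a rough illustration of what one tracked event might look like, here is a minimal sketch in Go; the field names and the collector URL are hypothetical, not part of any particular SDK.

```go
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// TrackingEvent is one instrumented data point; its fields mirror the kinds
// of dimensions described above: who did what, where, on which version, and when.
type TrackingEvent struct {
	UserID    string    `json:"user_id"`
	Event     string    `json:"event"`       // e.g. "order_submitted"
	Page      string    `json:"page"`        // where it happened
	Version   string    `json:"app_version"` // which release the user is on
	Timestamp time.Time `json:"timestamp"`
}

// report sends the event to the analytics collector (URL is a placeholder).
func report(e TrackingEvent) {
	body, _ := json.Marshal(e)
	resp, err := http.Post("http://analytics.internal/collect", "application/json",
		bytes.NewReader(body))
	if err != nil {
		log.Printf("tracking: %v", err)
		return
	}
	resp.Body.Close()
}

func main() {
	report(TrackingEvent{
		UserID: "alice", Event: "order_submitted",
		Page: "checkout", Version: "2.3.0-canary", Timestamp: time.Now(),
	})
}
```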

2. User feedback.

Besides automated data collection, user feedback is first-hand information about how the product is doing. There are many feedback channels, and most companies run a user-operations and public-sentiment monitoring system that automatically crawls the major channels for product mentions based on keywords and similar rules; once negative feedback is spotted, it is handled immediately.

3. Use online traffic for testing. I touched on this when discussing dark deployment; the most typical practice is traffic mirroring. Besides online A/B testing, the most common method is to replicate real user traffic from production and replay it, in real time or offline, against the pre-release environment for functional verification.

Beyond that, there are many more advanced traffic-mirroring techniques. For example, you can selectively filter the traffic, such as replaying only read-only query requests to validate a search interface. You can scale the traffic up or down by some multiple to stress-test a service. You can automatically compare the responses of the production service and the pre-release service to check that both versions behave consistently when given the same traffic. And the mirrored traffic can be saved offline, accumulating valuable data for analyzing rare, hard-to-reproduce user problems.
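
The response-comparison idea can be sketched roughly as below; the host names are placeholders and the sample paths stand in for mirrored production requests.

```go
package main

import (
	"io"
	"log"
	"net/http"
)

// fetch returns the response body of one backend for a given request path.
func fetch(base, path string) ([]byte, error) {
	resp, err := http.Get(base + path)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}

// compare replays one mirrored request against both environments and
// reports whether the two versions behave consistently.
func compare(path string) {
	prod, err1 := fetch("http://prod.internal", path)
	stage, err2 := fetch("http://staging.internal", path)
	if err1 != nil || err2 != nil {
		log.Printf("%s: fetch error prod=%v staging=%v", path, err1, err2)
		return
	}
	if string(prod) != string(stage) {
		log.Printf("%s: responses differ (%d vs %d bytes)", path, len(prod), len(stage))
		return
	}
	log.Printf("%s: consistent", path)
}

func main() {
	// In practice these paths would come from mirrored production traffic,
	// e.g. read-only search queries captured by a replay tool.
	for _, p := range []string{"/search?q=devops", "/search?q=deployment"} {
		compare(p)
	}
}
```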

At the tool level, I recommend using the open-source tool GoReplay. It is implemented in Go and works at the HTTP layer, requiring minimal system modifications. It also supports the functionalities I just mentioned.

Rapid Recovery #

Once a new version turns out to be unsatisfactory or to have severe defects, the crucial thing is to bring the situation back under control and resolve the problem quickly. Mean Time To Recovery (MTTR) is one of the four key DevOps metrics. In DevOps, confidence in quality comes not only from layer upon layer of quality gates and automated verification, but also from the ability to locate and fix problems fast.

MTTR can be broken down further into the time to detect the problem (MTTD), the time to identify and locate it (MTTI), and the time to repair and recover. After a failure occurs, the problem should first be analyzed and located against the service availability SLA and a clear remediation chosen. Here, a good online diagnostic tool is an enormous help in an emergency. For example, Arthas, an open-source tool from Alibaba, can monitor stack and JVM information in real time, watch method invocation parameters and return values, trace the time spent at each call node, inspect memory usage, and even decompile source code, making it an excellent problem-diagnosis tool.

After a preliminary analysis and location of the problem, there are two options: forward repair and rollback.

Forward repair means quickly modifying the code and releasing a new version; rollback means reverting the deployed application to the previous stable version. Either way, it tests the degree of automation of the deployment pipeline and the ability to roll back, which together are the best reflection of a team's release capability. At moments like this, you will wish your pipeline were as fast and as automated as possible.

Finally, one more point. You may have heard talk of “failure self-healing” at conferences: when a problem occurs, the system repairs itself automatically. It sounds a bit magical, but in practice the first step toward self-healing is to establish service degradation and fallback strategies. What do these two professional-sounding terms mean? Don't worry, an example will make them clear.

I have taken two screenshots of a shopping app for comparison. You can see the differences between them.

If you look carefully, you will find no fewer than 8 differences on this one page. This is service degradation at work: during traffic peaks, functions off the main path are temporarily disabled to guarantee the availability of the core business. A typical approach is to use feature toggles to switch certain features off, either manually or automatically.

Fallback strategies kick in under extreme conditions, such as an unresponsive service, a network interruption, or an exception from a downstream call, to keep the system from crashing outright. Common practices include serving cached data or a default fallback page, as well as front-end techniques such as skeleton screens.
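
To make degradation and fallback concrete, here is a minimal sketch in Go under assumed names: a non-critical recommendation widget is disabled by a toggle during peak load, and a downstream call that fails falls back to cached content instead of breaking the page.

```go
package main

import (
	"errors"
	"fmt"
)

// recommendationsEnabled is a degradation toggle; during traffic peaks it is
// switched off so the main purchase path keeps all the capacity.
var recommendationsEnabled = true

// cachedRecommendations is the fallback content used when the live call fails.
var cachedRecommendations = []string{"bestseller A", "bestseller B"}

// fetchLiveRecommendations simulates a downstream call that may fail or time out.
func fetchLiveRecommendations() ([]string, error) {
	return nil, errors.New("recommendation service timeout")
}

// recommendations applies degradation first, then fallback.
func recommendations() []string {
	if !recommendationsEnabled {
		return nil // degraded: feature temporarily hidden, the page still works
	}
	recs, err := fetchLiveRecommendations()
	if err != nil {
		return cachedRecommendations // fallback: stale but usable content
	}
	return recs
}

func main() {
	fmt.Println("normal:", recommendations())

	recommendationsEnabled = false // an operator or automatic rule degrades the feature
	fmt.Println("degraded:", recommendations())
}
```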

Summary #

In this lecture, I introduced the shift in quality mindset under the DevOps mode. The key is to maintain a certain level of quality while accelerating the release pace as much as possible. By utilizing low-risk release methods, online testing, and monitoring capabilities, problems can be detected early and resolved quickly using the simplest means.

Quality activities cost effort, and to sustain fast iterative releases, a certain number of issues slipping through is not the end of the world. What matters more is extending quality activities both earlier and later in the process and strengthening monitoring and testing in the production environment. Meanwhile, the three typical low-risk release methods cover the needs of different business scenarios. And when problems do occur, it is important not only to find and fix them quickly, but also to have mechanisms such as service degradation and fallback strategies prepared in advance so that the service keeps running.

Thought-Provoking Question #

What measures does your company take to ensure that deployment activities are secure and reliable?

Feel free to share your thoughts and answer in the comments section. Let’s discuss and learn together. If you found this article helpful, please feel free to share it with your friends.