30 Taking a Step Forward: Case Studies of Practical DevOps Transformation in Large-Scale Enterprises (Part 2) #

Hello, I’m Shi Xuefeng. Today, we will continue our discussion on the story of Microsoft’s DevOps transformation, picking up from where we left off in the previous lecture.

People often wonder which team should be responsible for an enterprise’s DevOps transformation. Should a brand new DevOps team be formed? With this question in mind, let us take a look at how Microsoft approached it.

1ES #

Microsoft has a special team called 1ES, short for One Engineering System, that is, a single unified set of engineering systems. As the name suggests, Microsoft maintains one unified engineering capability platform to support the development and delivery of its internal products. That's right: the 1ES team consists of nearly 200 engineers. As an organization-level R&D efficiency team, their goal is to improve internal delivery efficiency through a complete set of shared engineering capability platforms.

The responsibilities of the 1ES team go beyond building a common tool platform. They also cover cultural transformation, introducing the latest engineering methods, process improvement, security and compliance checks, internal R&D efficiency consulting, and promoting best practices within the engineering teams. They can be considered a "full-function" enterprise R&D efficiency and productivity team. According to 2018 figures, nearly 100,000 users collaborate on the platform provided by 1ES.

The current situation in China, however, is that many companies are only just starting to pay attention to R&D efficiency. Even where people are responsible for similar work, they are mostly scattered across individual business units, which makes it hard to form a combined force. Only a handful of companies have an enterprise-level, unified R&D efficiency team comparable to Microsoft's 1ES, let alone a unified engineering capability platform. I have seen one large company with more than 1,700 internal tool platforms; it is hard to imagine how much redundant development and wasted resource lies behind a number like that.

So, was Microsoft's 1ES team simply born with this kind of authority? It's not quite that simple.

In fact, the history of the 1ES team can be traced back to 2014. At that time, Microsoft's newly appointed CEO Satya Nadella attached great importance to building R&D capability and was committed to giving R&D teams the best tools. As a result, every department at Microsoft bought whatever tool platform it preferred for its own needs, which led to huge differences in tools, processes, and maturity across the company. These differences, in turn, hindered sharing and collaboration between teams and made internal transfers extremely costly, because engineers had to start from scratch whenever they moved to a new team.

To solve this problem, the 1ES team picked three focus areas: work plan management, version control, and build capability. They first identified the teams that were not yet on the company's unified tools and then drove adoption top-down. The core concept behind this approach is "Use what we ship to ship what we use", that is, the internal R&D teams build and ship their own products using the very same tools that Microsoft ships to external customers.

You may have noticed that these three areas are all part of the main path of software delivery. Requirement and task management, version control, and build systems are all core systems. When you want to establish a unified efficiency platform, it is most important to focus on the core systems in the main path.

Regarding “how to extend a complete solution based on the core systems”, I recommend you read this blog by GitHub, which illustrates how they think about this problem.

Over the next few years, the 1ES team pushed VSTS (now known as Azure DevOps) to become the standard tool platform within Microsoft, and the platform’s user base grew from a few thousand people at the beginning to over 100,000.

It is this refinement over roughly 150 iterations since 2010 that made Azure DevOps the outstanding product it is today. In terms of design philosophy, features, and user experience, Microsoft's Azure DevOps platform is arguably second to none in the industry.

Continuous Delivery #

Continuous delivery is a core part of DevOps transformation, made possible by the unified engineering capability platform provided by 1ES. So, to what extent has Microsoft achieved continuous delivery?

According to data from March 2019, Microsoft deploys 82,000 times and creates 28,000 work items every day, and each month handles 440,000 pull requests, 4.6 million builds, and 2.4 million commits.

Every one of these numbers is astonishing, and together they reflect Microsoft's outstanding engineering capability.

Microsoft’s Continuous Delivery

So, how did Microsoft get to where it is today? Let’s first look at the most important and challenging part of DevOps, which is testing, and see how Microsoft manages to complete 60,000 test cases in 6 minutes.

In fact, as early as 2014, Microsoft was running into the same testing problems as most companies: tests took too long, they failed frequently, mainline quality was unreliable, and quality at the end of an iteration fell far short of the release bar.

How serious were these problems? Let me give you some numbers, and you will understand.

  • Automated tests took 22 hours each day.
  • Full functional automated tests took up to 2 days.
  • Only 60% of P0 level test cases executed successfully.
  • In the past 8 years, there was not a single day when all automated tests passed.

Not only that, team members had significant disagreements about unit testing: developers were reluctant to spend time writing unit tests; the team didn’t believe that unit tests could replace functional tests; and they even had conflicts in their philosophy regarding the use of mocks.

History often repeats itself. At my previous company, developers always found reasons to convince you that they didn't need to write unit tests, or cited heavy reliance on external services and all kinds of environment issues as reasons the unit tests could never be finished.

Microsoft’s solution was to end this meaningless debate and move forward to achieve the expected goals. They started by advancing the areas with consensus and reorganized their internal testing model, as shown in the diagram below:

Microsoft’s Test Model

  • L0 level: Unit tests with no external dependencies. They run automatically on every code merge request, each finishing in under 60 ms.
  • L1 level: Unit tests with external dependencies; these generally run in under 2 seconds, averaging around 400 ms (a sketch contrasting L0 and L1 tests follows this list).
  • L2 level: These are interface-based functional tests executed in the pre-production environment.
  • L3 level: These are production-level online tests executed in the production environment.

After clarifying the overall strategy, the team began to transform testing activities. The entire transformation process can be divided into four stages:

Stage 1: Starting with L0/L1 Level Tests

In this stage, the effort went into reducing the cost of running L0/L1 tests and into writing high-quality test cases.

Based on my experience implementing unit testing in enterprises, aside from the debate over "whether we should write unit tests at all," the biggest controversy is how to divide responsibilities. In terms of the work involved, it covers several aspects: tool and framework selection, drafting and publishing the rules, building the tool platform, and measuring and visualizing the data.

To accelerate the adoption of unit testing, I suggest that in the early stage your own development and test engineers, or experienced DevOps engineers, select the tools and frameworks and validate them in pilot projects. Next, the development team should finalize the rules: how unit tests are to be written, how the tool environment is configured, and so on. On the platform side, start building the capabilities unit testing needs, so that developers only have to write test code while the platform takes care of execution, data analysis, and reporting (a minimal sketch of this platform side follows). Finally, roll it out across teams and keep iterating on the rules and tools. During this stage, try not to add new daily test cases.
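As a rough illustration of "the platform handles execution, analysis, and reporting," here is a minimal Python sketch: it runs the L0 suite through pytest, parses the JUnit XML report, prints a summary, and gates the pipeline step on the result. The marker, file names, and exit behavior are assumptions for illustration, not any real platform's design.

```python
# Minimal sketch of a platform-side unit test runner: execute, analyze, report.
import subprocess
import sys
import xml.etree.ElementTree as ET

REPORT = "unit-test-report.xml"  # hypothetical report location

# Run only the fast L0 suite and emit a machine-readable JUnit XML report.
subprocess.run(
    ["pytest", "-m", "l0", f"--junitxml={REPORT}"],
    check=False,  # we inspect the report ourselves instead of raising here
)

totals = {"tests": 0, "failures": 0, "errors": 0, "skipped": 0, "time": 0.0}
for suite in ET.parse(REPORT).getroot().iter("testsuite"):
    totals["tests"] += int(suite.get("tests", 0))
    totals["failures"] += int(suite.get("failures", 0))
    totals["errors"] += int(suite.get("errors", 0))
    totals["skipped"] += int(suite.get("skipped", 0))
    totals["time"] += float(suite.get("time", 0.0))

passed = totals["tests"] - totals["failures"] - totals["errors"] - totals["skipped"]
rate = passed / totals["tests"] if totals["tests"] else 0.0
print(f"{passed}/{totals['tests']} passed ({rate:.1%}) in {totals['time']:.1f}s")

# Gate the merge request: fail this pipeline step if anything failed or errored.
sys.exit(1 if totals["failures"] or totals["errors"] else 0)
```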

Stage 2: Analyzing Existing Daily Test Cases

In this stage, the focus is on identifying the following:

  • Which test cases are outdated and can be deleted?
  • Which test cases can be shifted to L0/L1 level?
  • Which tests can be consolidated into specialized SDK-driven testing, like performance testing?

The goal of this step is to make the daily test suite as lean as possible and to speed up its execution. After all, a suite that takes dozens of hours per run leaves no chance for a second attempt when it fails. A rough sketch of how such an inventory might be triaged follows.
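The sketch below reads a hypothetical inventory of the daily tests and sorts them into the three buckets above. The CSV columns and thresholds are made up for illustration; in reality the data would come from the test platform's execution history.

```python
# Sketch: triage an inventory of daily tests into delete / shift / keep buckets.
import csv
from datetime import date, datetime

TODAY = date(2019, 3, 1)  # fixed reference date, for the example only

buckets = {"delete": [], "shift_to_l0_l1": [], "keep_daily": []}

with open("daily_test_inventory.csv", newline="") as f:
    for row in csv.DictReader(f):
        last_passed = datetime.strptime(row["last_passed"], "%Y-%m-%d").date()
        days_stale = (TODAY - last_passed).days
        has_external_deps = row["external_deps"].lower() == "yes"
        avg_minutes = float(row["avg_minutes"])

        if days_stale > 180:
            # Hasn't passed in half a year: likely outdated, candidate to delete.
            buckets["delete"].append(row["name"])
        elif not has_external_deps or avg_minutes < 1:
            # Fast or dependency-free: candidate to rewrite as an L0/L1 test.
            buckets["shift_to_l0_l1"].append(row["name"])
        else:
            buckets["keep_daily"].append(row["name"])

for bucket, names in buckets.items():
    print(f"{bucket}: {len(names)} tests")
```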

Stage 3: Transforming Daily Tests into L2 level Tests

API (interface) testing is a more cost-effective type of testing, so promoting API-driven automated testing improves execution efficiency while still covering the business.

In this stage, we need to build out the API test automation framework so that it supports multiple kinds of tests, written as code or as configuration, including verification across multiple APIs. In addition, we need centralized management of the system's APIs. On the one hand, this enables API governance; on the other hand, it strengthens API-based collaboration between development and testing by keeping every changed version of an API definition available online. As soon as the development team updates an API definition, the corresponding test cases and mock data can be updated in the same place, so the whole collaboration revolves around the APIs. A minimal example of such an API-level test is sketched below.
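Here is a minimal sketch of what an L2-level test could look like: it calls a deployed service over HTTP and checks the response contract rather than the implementation. The base URL, endpoint, and fields are hypothetical.

```python
# Sketch of an L2 (API-level) test run against a pre-production environment.
import requests

PREPROD_BASE_URL = "https://preprod.example.com/api"  # hypothetical


def test_get_work_item_contract():
    # Unlike L0/L1 tests, this calls the deployed service over the network.
    resp = requests.get(f"{PREPROD_BASE_URL}/work-items/42", timeout=5)

    assert resp.status_code == 200

    body = resp.json()
    # Contract check: the fields clients depend on must be present, even if
    # their values change from release to release.
    for field in ("id", "title", "state"):
        assert field in body
```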

Stage 4: Building L3 Level Tests

These are online tests that run in the production environment and diagnose the health of the system through the monitoring mechanisms. I covered this part in Lesson 17; if you don't remember, you can go back and review it. A minimal sketch of such a production probe follows.
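For completeness, here is a minimal sketch of such a production-level check: a synthetic probe that hits a health endpoint and reports latency. The URL and latency budget are hypothetical; in a real setup the result would feed the monitoring and alerting system.

```python
# Sketch of an L3 production probe: hit a health endpoint, report the result.
import time

import requests

HEALTH_URL = "https://service.example.com/healthz"  # hypothetical endpoint
LATENCY_BUDGET_SECONDS = 1.0                        # hypothetical threshold


def probe() -> bool:
    start = time.monotonic()
    try:
        resp = requests.get(HEALTH_URL, timeout=5)
    except requests.RequestException as exc:
        print(f"UNHEALTHY: request failed: {exc}")
        return False
    latency = time.monotonic() - start

    healthy = resp.status_code == 200 and latency <= LATENCY_BUDGET_SECONDS
    print(f"status={resp.status_code} latency={latency:.2f}s healthy={healthy}")
    return healthy


if __name__ == "__main__":
    probe()
```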

As the number of L0/L1 tests grew, all of them could be run automatically as part of every code merge request, and the L2-level API tests could be incorporated into the pipeline as well.

Through more than 40 iterations of sustained effort, supported by matching incentive mechanisms, the distribution of test types changed dramatically.

As you can see, the number of old daily tests fell steadily while the number of L0 tests kept climbing, and the number of L1/L2 tests eventually leveled off. Keep in mind that these 40-plus iterations took nearly three years to complete. If someone tells you they can get unit testing done in "3 months," you should not believe them.

Distribution of Testing

Continuous Deployment #

The ultimate goal of continuous delivery is continuous deployment. So, what has Microsoft done in terms of deployment?

First of all, Microsoft does not accept the notion of semi-automated deployment. In many organizations, a deployment is rarely completed in one go: some commands or steps are too infrequent, or have never been brought into the tooling, so someone still has to copy a command from somewhere and run it by hand.

People often ask, "Isn't it good enough that most of our operations are automated?" My answer is a simple test: "Could someone without the background, a non-specialist, complete this task?" Frankly, this sounds a bit philosophical, but if you build a full-blown platform and the outcome still depends on a few designated people to operate it, you have to question what the effort means and what future value it has.

I ran into a similar case on an earlier project. To handle configuration changes, the team built a very complex automated job. During review, however, we found that it did not cover every case and still ended with manual database operations. Since the cost of doing those database operations by hand was actually acceptable, automating for automation's sake was not worth it; it only wasted time and effort.

So, to meet the bar of "anyone can deploy," what you need is complete automation: embed every operation in the pipeline and keep it under version control so that each change is recorded; use the same toolchain to deploy to every environment; and distribute environment-specific configuration through a configuration center.

There are many benefits to doing this. On the one hand, it improves the robustness of the deployment tooling across environments and avoids the risks that come from differences in deployment methods or tools. On the other hand, deploying to a testing environment carries far less psychological pressure than deploying to production, and once everyone is comfortable with the deployment process in the testing environment, deploying to production becomes routine. A sketch of this kind of configuration-driven deployment follows.
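The sketch below shows the shape of this idea: the deployment steps are identical for every environment, and only the configuration, fetched from a configuration center, differs. The config-center URL, service name, and version are hypothetical.

```python
# Sketch: one deployment code path for all environments, config from a center.
import requests

CONFIG_CENTER = "https://config.example.com"  # hypothetical configuration center


def fetch_config(service: str, environment: str) -> dict:
    """Pull the configuration for one service in one environment."""
    resp = requests.get(f"{CONFIG_CENTER}/{service}/{environment}", timeout=5)
    resp.raise_for_status()
    return resp.json()


def deploy(service: str, version: str, environment: str) -> None:
    """The same deployment steps run for test, staging, and production."""
    config = fetch_config(service, environment)
    print(f"Deploying {service}:{version} to {environment} "
          f"with {len(config)} config entries")
    # ... pull the versioned artifact, render the config, roll out, health-check ...


if __name__ == "__main__":
    # The pipeline calls the identical code path for every environment.
    for env in ("test", "staging", "production"):
        deploy("work-item-service", "1.4.2", env)
```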

To achieve secure, low-risk deployment, Microsoft introduced the concept of "deployment rings," which you can think of as splitting the deployment activity into several stages. Every production deployment, even a configuration change, must go through the five-ring verification process, and there is no separate emergency channel. The five deployment rings are listed below, followed by a sketch of how ring-by-ring promotion might work:

  1. Canary (internal users)
  2. Small batch external users
  3. Large batch external users
  4. International users
  5. All other users
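Here is a sketch, under plenty of simplifying assumptions, of how ring-by-ring promotion might be orchestrated: the same build moves through the rings in order, each promotion waits for a health signal, and a person confirms each step. The ring names mirror the list above; the health check and approval are stubs, not Microsoft's actual tooling.

```python
# Sketch of progressive, ring-based rollout with stubbed health checks/approvals.
import time

RINGS = [
    "ring0-canary-internal",
    "ring1-small-external",
    "ring2-large-external",
    "ring3-international",
    "ring4-everyone-else",
]


def deploy_to_ring(version: str, ring: str) -> None:
    print(f"Deploying {version} to {ring} ...")


def ring_is_healthy(ring: str) -> bool:
    # In reality this would query monitoring for error rates, latency, etc.
    time.sleep(1)  # stand-in for a bake period
    return True


def approved_by_human(ring: str) -> bool:
    # Promotion is deliberately not fully automatic: a person confirms each
    # step, which keeps the blast radius of a bad release under control.
    return input(f"Promote beyond {ring}? [y/N] ").strip().lower() == "y"


def roll_out(version: str) -> None:
    for ring in RINGS:
        deploy_to_ring(version, ring)
        if not ring_is_healthy(ring):
            print(f"Halting rollout: {ring} is unhealthy.")
            return
        if ring != RINGS[-1] and not approved_by_human(ring):
            print("Rollout paused by the operator.")
            return
    print(f"{version} is fully rolled out.")


if __name__ == "__main__":
    roll_out("1.4.2")
```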

Each new version is deployed progressively, validated ring by ring, and gradually released to all users. There are several points here worth learning from.

1. Connecting CI/CD through pipelines

We can understand CI/CD in the following way:

  • The purpose of CI is to produce a package that can be used for deployment. That package might be a WAR, TAR, or EAR archive, or even a container image, depending on how the system is deployed.
  • The purpose of CD is to deploy this package to the production environment and release it to users.

Therefore, the link between CI and CD is the artifact repository: the pipeline moves deployment packages through it, and that movement is what promotes an artifact from one stage to the next, as sketched below. I have seen many large companies repackage before deployment, artificially splitting CI from CD, which is not a good practice.
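To illustrate the point, here is a sketch of CI and CD connected through an artifact repository: CI publishes one versioned package, and every later stage pulls exactly that package, verified by checksum, with no rebuilding. A local directory stands in for a real artifact repository; names and versions are hypothetical.

```python
# Sketch: promote the same artifact through all stages instead of repackaging.
import hashlib
import shutil
from pathlib import Path

ARTIFACT_REPO = Path("artifact-repo")  # stand-in for a real artifact repository


def checksum(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def ci_publish(package: Path, version: str) -> str:
    """CI step: store the build output under an immutable version."""
    ARTIFACT_REPO.mkdir(exist_ok=True)
    target = ARTIFACT_REPO / f"{package.stem}-{version}{package.suffix}"
    shutil.copy2(package, target)
    return checksum(target)


def cd_deploy(name: str, version: str, expected_sha: str, environment: str) -> None:
    """CD step: fetch the same artifact by version and verify it before deploying."""
    artifact = ARTIFACT_REPO / f"{name}-{version}.tar"
    assert checksum(artifact) == expected_sha, "artifact was rebuilt or modified"
    print(f"Deploying {artifact.name} to {environment}")


if __name__ == "__main__":
    built = Path("app.tar")
    built.write_bytes(b"fake build output")  # stand-in for a real CI build
    sha = ci_publish(built, "1.4.2")
    for env in ("test", "staging", "production"):
        cd_deploy("app", "1.4.2", sha, env)
```

The checksum check is simply a way of making the principle visible: the bits that reach production are exactly the bits that were tested, because nothing is repackaged along the way.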

2. Continuous deployment does not necessarily mean fully automated

We all know that continuous deployment capability is one of the best indicators of a company's DevOps maturity (Microsoft, for example, can deploy more than 80,000 times a day). So does that mean every change should flow through a fully automated process straight into production? Not necessarily.

Take a look at the panorama chart of Microsoft's delivery process below: every deployment in the CD stage requires a manual confirmation before it completes. The core idea behind this is to control the "blast radius."

Microsoft Development Panorama

Since failures can never be completely prevented, can we at least contain their impact? That is exactly what the "deployment ring" design is for, and it is also why an appropriate degree of manual control remains necessary.

So how do you confirm that a deployment is successful?

Microsoft has defined very detailed rules to guarantee the availability of its online services, the most important being that the state of the live service always takes top priority. You might think this should go without saying, but in practice internal tool teams often focus on shipping new features and neglect production alerts.

To put this principle into practice, it is not enough to say that the live service comes first; you also need concrete rules. For example, Microsoft's on-call engineers, the DRIs (Designated Responsible Individuals), must respond to incidents within 5 minutes on weekdays and within 15 minutes on weekends, and this is part of how individuals and teams are evaluated. In addition, weekly and monthly reports on the state of the online services, together with a detailed analysis of every incident, continuously reinforce the "live service first" mindset internally.

Conclusion #

In this case study, I walked through several key points of Microsoft's transformation: automated testing capability, a unified engineering platform and team, staged, ring-based continuous deployment, organizational change, team autonomy, and cultural transformation. These are among the hardest problems enterprises face in a real DevOps transformation. Has Microsoft's experience inspired you? Of course, doing DevOps well takes far more than just these points.

For the process of DevOps transformation, Microsoft’s philosophy is:

A journey of a thousand miles begins with a single sprint.

This is the counterpart of our saying, "A journey of a thousand miles begins with a single step." DevOps is not magic that takes effect overnight; it improves a little at a time, with everyone constantly asking, "What can I do for our DevOps effort?" Microsoft's testing transformation is a good example: the trend kept improving, step by step, until it reached today's state, where nearly 90,000 automated tests complete in about 10 minutes for every commit.

Microsoft has been committed to promoting DevOps and constantly sharing its experience in various forms. Just from this point alone, we can see Microsoft’s cultural transformation and its transition towards openness and open source. Let me share with you some resources on Microsoft’s DevOps transformation that you can refer to.

Resource 1: https://docs.microsoft.com/en-us/azure/devops/learn/devops-at-microsoft/

Resource 2: https://azure.microsoft.com/en-us/solutions/devops/devops-at-microsoft/

Do you remember the J-shaped curve of DevOps transformation that I mentioned in Lesson 6? In fact, both DevOps transformation and development efficiency improvement are long-term and tedious processes. What you need to do is to build your confidence, do the right things, and expect good things to happen naturally.

Thought Question #

Case studies are a great way to learn about DevOps. Through case studies, not only can you learn from others’ experiences, but you can also learn about the underlying design principles of systems. So, do you have any good ways to study using case studies? Can you share them?

Feel free to write your thoughts and answers in the comments section. Let’s discuss and learn together. If you find this article helpful, please feel free to share it with your friends.