29 Taking a Step Forward, Case Studies of DevOps Practical Transformation in Large-Scale Enterprises (Above) #

Hello, I am Shi Xuefeng.

The name “Step Forward” comes from a book called “Lean In: Women, Work, and the Will to Lead” by Sheryl Sandberg, the Chief Operating Officer of Facebook. In her book, she encourages women to step forward in the workplace, face challenges bravely, and pursue their life goals.

I chose this name as the title of the case study because DevOps transformation is not an easy task for a company. We also need the courage to take a small step forward and embrace this mission. Even if it is just a small change, it is an essential force in the transformation process that cannot be ignored.

In the final part of this column, I will use two posts to introduce Microsoft’s DevOps transformation story over the years, as well as my practical summary and experience in Chinese enterprises.

Today, let’s start with management practices and culture, and see how this traditional software giant, after losing its way in the mobile internet era, regained a leading position in the era of containers, cloud, and AI.

Microsoft’s DevOps transformation was not a sudden decision. With the rise of cloud services, user demand surged, which raised the bar for release cadence. The volume of requests shows this: the number of user requests in 2016 exceeded the total of the previous four years, and in 2017 it was twice the 2016 figure. Teams had to deliver at a much faster pace.

You should know that if you expect a 10% improvement in delivery capabilities, it is possible to achieve it through localized optimization within the existing organizational structure and process rules. However, if you want to achieve a 200% improvement, significant changes need to be made.

Establishing Feature Delivery Teams #

Microsoft’s previous organizational structure was similar to many companies, where teams were divided according to functions, such as project management, development, testing, and operations. Each team was relatively closed, and the problem of departmental silos greatly affected the efficiency of collaboration within the teams.

To relieve the delivery pressure, Microsoft first restructured its organization and merged the development and testing teams into a single engineering team. As a result, the dedicated tester role disappeared. The team now consisted of engineers who focused on development and engineers who focused on testing, and they worked with the product management team to move agile projects forward. In agile thinking, testing activities should be embedded in the development phase, and merging the two departments made it possible to build testing into development.

Although the development and testing teams merged together, the delivery work still relied on an independent operations team to complete. This created a problem: even if the development efficiency is high, it is meaningless if the operations capability cannot keep up.

Thus, Microsoft embarked on a second organizational transformation, with the core being to build feature delivery teams and empower them with autonomy.

The so-called feature delivery team is what we often refer to as a “cross-functional team.” In practice, this means transforming the horizontally organized functional organization into a vertically cross-functional organization. This team includes all roles necessary for successful feature delivery (such as product, development, testing, and operations), allowing for a closed-loop completion of the entire delivery process.

In this process, Microsoft introduced self-organizing teams. Instead of being assigned top-down by management, employees could freely choose which feature team they wanted to join. This way of forming teams gave everyone an opportunity to learn something new.

You may think that this approach would result in chaos in the organization. Highly skilled individuals would hope to work with other highly skilled individuals, so what about the remaining colleagues? In fact, I have seen a similar practice in a Chinese company, where they addressed this issue through a “mentorship” model.

For example, if a role depended on a specific skill that was scarce within the team, problems would arise for the remaining teams regardless of which team the member possessing that skill joined. Therefore, this company enforced a “senior mentoring junior” model, where experienced employees provided concentrated training to new team members, empowering them quickly. Moreover, these “mentor-apprentice relationships” would persist long-term, and new members could consult their mentors if they encountered any issues. Of course, new members would also undergo corresponding evaluation mechanisms. This model helps the company achieve the goal of mutual learning among internal members.

According to statistics, fewer than 20% of employees chose to change positions, but the approach gave 100% of employees the possibility of choice. For a company known for its bureaucracy and politics, this did a great deal to mobilize employee enthusiasm.

In fact, feature delivery teams have several notable characteristics:

  1. They have their own dedicated office space where everyone sits together and can communicate by shouting.
  2. They are typically composed of 10 to 12 team members.
  3. They have clear work goals and responsibilities.
  4. To ensure stability, team composition does not change for the following 12 to 18 months once formed.
  5. They have control over deploying features to production environments.
  6. They have team autonomy.

Interestingly, a large Chinese company also did something similar in the early stages of driving DevOps transformation. To accelerate the integration of development and operations, they split a large application operations team into various business lines.

Moreover, development teams began transitioning to full-stack and took on operations work. As operations work was handed over, the operations team was required to upgrade its own capabilities: it needed development skills so that it could keep building and optimizing operations tools that reduce the cost of both development and operations.

This process may sound easy, but in practice, it requires strong organizational execution and even requires endorsement from top management to implement such requirements from top-down. The transition process is painful for everyone, but only through such drastic changes does DevOps transformation become a reality, rather than just being said or done within a few small departments.

I often say, “It is not realistic to achieve enterprise DevOps transformation without changing processes.” Whether the organizational structure of the team needs to be adjusted or not depends on the efficiency of delivery.

The initial introduction and promotion stages of tools in the transformation have less impact on the organization, but when the transition enters the “deepwater area,” organizational change becomes a very realistic problem.

According to Conway’s Law, the structure of the system a team delivers mirrors the structure of the organization that builds it. Seen from another angle, the software delivery process also mirrors the current organizational structure. As long as there is an independent testing team, there will always be an independent testing phase. These separate phases are what create departmental silos and efficiency bottlenecks in internal collaboration, and they are all things that a DevOps transformation needs to consider.

Self-organizing Agile Teams #

Returning to the case, in order to promote the autonomy of feature delivery teams, Microsoft has made some adjustments in agile development planning.

First, they divided planning into four levels along different time dimensions.

  • Iteration dimension: set as a 3-week iteration;
  • Plan dimension: includes 3 iterations;
  • Season dimension: a 6-month period including two plan cycles;
  • Scenario dimension: a visionary picture lasting 18 months.
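
As a quick check on how the four planning levels nest, the arithmetic can be sketched in a few lines of Python. The constants come from the list above; the function names are illustrative only and are not part of any Microsoft tooling.

```python
# Cadence figures taken from the list above; all names are hypothetical.
ITERATION_WEEKS = 3        # one iteration lasts 3 weeks
ITERATIONS_PER_PLAN = 3    # one plan spans 3 iterations
PLANS_PER_SEASON = 2       # one season contains 2 plan cycles
SEASON_MONTHS = 6
SCENARIO_MONTHS = 18


def plan_weeks() -> int:
    """Length of one plan in weeks."""
    return ITERATION_WEEKS * ITERATIONS_PER_PLAN


def iterations_per_season() -> int:
    """Number of iterations that fit into one season's plan cycles."""
    return ITERATIONS_PER_PLAN * PLANS_PER_SEASON


def planned_weeks_per_season() -> int:
    """Weeks of a season that are covered by plan cycles."""
    return plan_weeks() * PLANS_PER_SEASON


def seasons_per_scenario() -> int:
    """Number of seasons covered by one scenario."""
    return SCENARIO_MONTHS // SEASON_MONTHS


if __name__ == "__main__":
    print("weeks per plan:", plan_weeks())
    print("iterations per season:", iterations_per_season())
    print("planned weeks per season:", planned_weeks_per_season())
    print("seasons per scenario:", seasons_per_scenario())
```

Two plan cycles cover 18 weeks, which is less than a full six months. The source does not say how the remaining time in a season is used, so the sketch leaves it unmodeled.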

Among them, management is responsible for the long-term goals, namely seasons and scenarios, which answer the question “where are we going?”. The short- to medium-term goals, namely iterations and plans, are decided by the self-organizing teams and answer the question “how do we get there?”.

The delivery rhythm is carried out according to iterations, with a portion of value output every 3 weeks. As the iterations progress, the plan goals are gradually updated and optimized, and feedback is given to the long-term planning for interaction and adjustment. In other words, the 6 to 18 month long-term plan is not set in stone. The team will make adjustments based on the delivery increments of each iteration and plan, as well as user feedback, establishing a “plan-deliver-learn” closed-loop path. This continuously calibrates the product goals and overall direction, ensuring the effectiveness of long-term planning and avoiding potential problems in deciding the future development path at the beginning of the project under the waterfall model.

After all, in this rapidly changing era, no one can guarantee that your plans are fixed and always valid.

Now, feature delivery teams have the autonomy to decide their iterations and plans. But don’t forget, every successful project requires the collaboration of hundreds or even thousands of people. So, how can we ensure the consistency of team goals and the coordination among team members?

Microsoft has introduced three practices: iteration emails, team communication, and experience reviews.

1. Iteration emails

At the beginning and end of each iteration, the team sends out iteration plan and status emails. These emails state which features were completed in this iteration and what the delivery plan is for the next one. To help other teams better understand the features of this iteration, the team also records a video of them and attaches it to the email. Links to the to-do list and the kanban board status are included as well.

2. Team communication

At the completion of each iteration, team members ask themselves three questions:

  • What are the contents of the next to-do list?
  • What are the accumulated technical debts and non-functional features?
  • What are the outstanding issues?

Each member of the team personally completes this task, not only to reduce information loss but more importantly, to establish a sense of ritual to help everyone arrange the iteration plan more rationally. After all, once the tasks are arranged, they must be completed on time.

3. Experience review

At the start of requirements analysis, user stories are used to describe, scenario by scenario and from the user’s point of view, the user’s current situation and the problem the feature is meant to solve. With this approach, members of different teams can all implement the functionality from the user’s perspective.

What’s particularly interesting is that Microsoft, in managing features, tries to maintain the connection to the original user requirements. They attach the original user requirements next to the feature.

Often, the tasks that developers work on are user requirements as translated by product staff, not the original user requirements. As a result, developers do not know what core problem they are supposed to solve. With the original user requirements attached, everyone can check during development, testing, and delivery whether the delivered functionality is what users actually want.

These changes have brought about a series of positive impacts.

First of all, the enthusiasm of team members has greatly increased. Because they see themselves as the primary responsible persons for user experience, they feel responsible for fixing and solving the actual problems that users face.

Secondly, the team no longer needs to wait for leadership’s planning. Within the overall project plan, they can make their own plans.

Finally, updates to the plan are driven by continuous learning. For example, the team adds event tracking to the features users rely on most, observes the usage data, and regularly reviews and responds to user feedback.

Continuous incremental delivery and continuous feedback are also the best means of ensuring the effectiveness of product requirements at present. After all, business agility is the origin of DevOps. If the business itself does not have a clear method of measuring requirements, even if it has the strongest ability for continuous delivery, it is as if “running blindfolded”. Therefore, to promote DevOps, both agile development practices and requirement value analysis are essential elements.

Microsoft’s measurement system, which exists to support effective feedback, is also worth mentioning. For Microsoft, obtaining real user behavior data is crucial. When building the indicator system, they mainly consider which indicators have a direct impact on the business, rather than measuring team output and individual output.

The indicators they use include the following aspects:

  • Usage dimension: user growth, user satisfaction, feature delivery status, etc.
  • Efficiency dimension: build time, self-testing time, deployment time, etc.
  • Health of live sites: time to locate errors, duration of user impact, duration of production issues, etc.

However, some indicators that are popular in Chinese companies, such as completion time, lines of code, and number of defects, are not included in performance assessment.

You may say that this is nothing special, but you have to know that Microsoft’s attention to users goes beyond that.

Let me give you a specific example. Typically, when we measure system availability, we look at the system as a whole, for example by ensuring that overall availability reaches four nines, meaning the system is available 99.99% of the time. However, Microsoft believes that availability should go further and be measured at the level of individual user accounts.

When we view the problem from the perspective of the overall system, the behavior of many individual users is masked by the overall data, i.e. “averaged out”. However, if we view the problem from the perspective of the user account, i.e. the perspective of each user, we will find that users have really encountered some problems.

For example, if the frequency of service unavailability under a certain account is relatively high, instead of waiting for users to complain online, it is better to proactively contact them in advance and help them solve the problem. In the email contacting the users, the team should not only clearly describe the objective situation observed by the team but also provide suggested solutions. If users are unable to locate and fix the problem by themselves, they can also contact the team through the contact information in the email to seek further assistance.
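
The difference between system-level and account-level availability can be illustrated with a minimal Python sketch. It assumes availability is computed from request success rates, which is a simplification chosen for this example; the article does not describe how Microsoft actually computes the figure, and the class, function names, and sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccountStats:
    """Hypothetical per-account request counters for one reporting window."""
    account_id: str
    total_requests: int
    failed_requests: int

    @property
    def availability(self) -> float:
        # Assumption: availability = share of successful requests.
        if self.total_requests == 0:
            return 1.0
        return 1 - self.failed_requests / self.total_requests


def overall_availability(stats: list[AccountStats]) -> float:
    """System-wide availability: every request is pooled together."""
    total = sum(s.total_requests for s in stats)
    failed = sum(s.failed_requests for s in stats)
    return 1.0 if total == 0 else 1 - failed / total


def accounts_below_target(stats: list[AccountStats],
                          target: float = 0.9999) -> list[str]:
    """Accounts whose own availability misses the target."""
    return [s.account_id for s in stats if s.availability < target]


# Made-up sample data: two large healthy accounts and one small unhealthy one.
sample = [
    AccountStats("acct-a", 1_000_000, 10),
    AccountStats("acct-b", 1_000_000, 20),
    AccountStats("acct-c", 1_000, 50),
]

if __name__ == "__main__":
    print(f"overall: {overall_availability(sample):.6f}")
    print("accounts to contact proactively:", accounts_below_target(sample))
```

In this made-up data set the pooled figure still meets the four-nines target, while the small account sees one request in twenty fail. That is the “averaged out” effect: only the per-account view surfaces the account the team should reach out to.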

Microsoft’s attention to users is not only reflected in the measurement of system availability but also in the aspect of feature toggles.

Feature toggles are a common technique for controlling whether a feature is visible to the outside at runtime. In Microsoft’s products, this technique is also widely used, but their feature toggles can be refined to the user level, i.e. users can be added to or removed from the list to control the visibility of the feature for each user.

In this way, if a new feature causes problems for specific users, the feature can simply be switched off for them, with no new deployment required. This not only helps to solve problems quickly but also provides a more fine-grained mechanism for experiments. Compared with progressive deployment, feature-based release is more flexible.
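
A user-level feature toggle of this kind can be sketched in Python as follows. This is an in-memory illustration only, under the assumption that a per-user override takes precedence over the global setting; the `FeatureToggles` class and its method names are hypothetical and do not reflect Microsoft’s actual implementation.

```python
from collections import defaultdict


class FeatureToggles:
    """Minimal in-memory toggle store with per-user overrides (illustrative)."""

    def __init__(self) -> None:
        self._global_on: set[str] = set()
        # feature name -> user ids explicitly allowed / explicitly blocked
        self._allowed: defaultdict[str, set[str]] = defaultdict(set)
        self._blocked: defaultdict[str, set[str]] = defaultdict(set)

    def enable_for_all(self, feature: str) -> None:
        self._global_on.add(feature)

    def disable_for_all(self, feature: str) -> None:
        self._global_on.discard(feature)

    def enable_for_user(self, feature: str, user_id: str) -> None:
        self._blocked[feature].discard(user_id)
        self._allowed[feature].add(user_id)

    def disable_for_user(self, feature: str, user_id: str) -> None:
        self._allowed[feature].discard(user_id)
        self._blocked[feature].add(user_id)

    def is_enabled(self, feature: str, user_id: str) -> bool:
        # Per-user overrides win over the global setting.
        if user_id in self._blocked.get(feature, ()):
            return False
        if user_id in self._allowed.get(feature, ()):
            return True
        return feature in self._global_on


if __name__ == "__main__":
    toggles = FeatureToggles()
    toggles.enable_for_all("new-dashboard")
    # A specific user is affected: switch the feature off for them at runtime.
    toggles.disable_for_user("new-dashboard", "user-42")
    print(toggles.is_enabled("new-dashboard", "user-42"))
    print(toggles.is_enabled("new-dashboard", "user-7"))
```

Because the check happens at runtime, removing a user from the list takes effect without a redeployment. A production system would typically keep this state in a shared configuration service rather than in process memory.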

Starting Team Transformation with Medium-sized Teams #

In terms of choosing teams for transformation, Microsoft’s experience has shown us that it is preferable to avoid starting with large teams.

In the process of DevOps transformation, the common mindset is to first handle the most core and largest teams within the organization. The idea is that once the most complex part is taken care of, the needs of other medium and small teams can also be met, and they will naturally catch up with the pace of transformation.

However, in reality, these large teams often have unique processes and specific requirements. They have a high degree of customization in terms of system tools and processes, making implementation more complex. In fact, for them, the priority of transformation work may not be the highest, and various requirements can lead to delays in transformation work. This is not a desirable situation for the transformation process.

Therefore, Microsoft adjusted their strategy and adopted a “middle-out” approach, focusing on medium-sized teams (40 to 100 people). These teams, due to their limited resources, have a strong demand for external support. Moreover, teams of this size can quickly assess the current situation and gather necessary information about their team, rather than guessing what they actually need.

By continuously making small improvements that help these teams perform better, the transformation team lets word spread internally, so that more teams proactively reach out and ask for help. This establishes an effective cycle of continuous improvement.

Summary #

Today, I introduced the first half of Microsoft’s DevOps transformation. Let’s summarize it briefly.

  • In order to meet the demand for rapid delivery, they broke down the original organizational structure and established cross-functional organizations focused on feature delivery.
  • Through team autonomy, they divided plans into short-term goals and long-term goals, and short-term goals (including iterations and plans) are determined autonomously by feature teams.
  • In terms of measurement, they focused more on business indicators. Moreover, for both system availability and feature toggles, they went down to the level of the individual user to safeguard each user’s experience.
  • In choosing transformation teams, they actively avoided the most complex teams, starting with medium-sized teams that they could grasp, accumulating successful experience, and then continuously spreading it.

Finally, I’d like to share my own thoughts. In the past two years, features have been increasingly in the spotlight in the field of DevOps. This is because a feature is a granularity of requirement that fits the DevOps principle of rapid delivery. As a result, major companies in the industry have many practices and ideas in areas such as feature-based requirement management, feature-based branching strategies, and feature-based release and value tracking strategies. For example, CloudBees released their SDM product this year, which is based on the feature dimension.

I believe that future DevOps will also develop in this direction. Building a complete set of development models based on feature development is worth our time and effort to think about.

Thought Question #

What are the effective processes, tools, and rules for feature-driven development and delivery?

Feel free to leave your thoughts and answers in the comments section. Let’s discuss and learn together. If you find this article helpful, please feel free to share it with your friends.