14 Built-in Quality: Lessons From Toyota and Amazon #

Hello, I’m Shi Xuefeng, and today I want to talk to you about a very important topic: built-in quality.

I have previously told you a story about a person at the end of an assembly line in an American car factory who uses a rubber hammer to check if the car doors are properly installed. I also mentioned that if a company relies on the “person with a hammer” to ensure quality, it indicates that there may be a problem with the process itself.

This viewpoint is not something I made up out of thin air. It comes from the classic quality management principles by Dr. W. Edwards Deming. The third principle states that quality should not depend on inspection, as inspection is both costly and unreliable. More importantly, inspection does not directly improve product quality; it only proves the existence of defects. The correct approach is to build quality into the entire process and demonstrate the effectiveness of the process through effective control measures.

Why is built-in quality so important? #

In traditional software development processes, the “hammer” for inspecting quality is often held by the testing team. They use a series of “hammers” to “beat” every aspect of the software product at the end of the software delivery process, in an attempt to find any potential issues.

The problem with this approach is that testing can only validate the product against its known design. There may be risks that even the developers are unaware of. For example, if the development team introduces a third-party library that contains defects, testing may miss the scenario entirely, and the defect surfaces later as a production incident.

Furthermore, because the purpose of testing is seen as finding more defects, some teams tie performance assessments directly to the number of defects submitted and fixed. This presupposes that the product already has defects. The testing team ends up hunting for issues for the sake of finding issues and chasing the development team to fix them, which sets development and testing against each other. That is the opposite of what DevOps advocates.

So, the correct approach to solving this problem is to focus on built-in quality!

Regarding built-in quality, there is a classic case of Toyota’s Andon system, also known as the Andon cord. Above Toyota’s car production line, there is a cord that employees can pull if they find a quality issue. This will activate the Andon system to notify the management and stop the production line to prevent defective products from flowing downstream.

In manufacturing, companies usually want production lines to run 24/7 to maximize output. Yet here any employee can stop the entire line with one pull. What was Toyota thinking?

In fact, the underlying idea is “Fail fast.” If a worker finds a defective product, but it requires multiple layers of approval to stop the production line, many defective products will flow downstream. Therefore, stopping the production line is not the goal; the goal is to detect and resolve problems promptly.

Once the Andon system is activated, relevant personnel such as management and quality control officers gather together to solve the problem and restore the production line as soon as possible. The experiences gained are then accumulated and integrated into the organization’s capabilities.

Built-in quality changes the fundamental way we look at product quality. The team's work is no longer about proving that the product has problems, but about making sure that it has none.

A few years ago, during my probation review at Huawei, I was asked: “What is Huawei’s view on quality?” The answer was simply “zero defects.” At the time, I didn’t understand it. No one is perfect, so how can a product have zero defects? Later I gradually understood that zero defects does not mean the number of bugs in the product is zero. It is a quality philosophy in which everyone manages quality and the organization builds a culture of quality. Every individual should strive to discover and resolve defects as early as possible in their own work.

To summarize, built-in quality has two core principles:

The earlier the problem is discovered, the lower the cost of fixing it.
Quality is the responsibility of every individual, not just the quality team.

After discussing the above, you should have a preliminary understanding of built-in quality. So, next, let me introduce you to the ideas, steps, common issues, and solutions related to implementing built-in quality.

Implementation approach for built-in quality #

Because the quality is meant to be built in, quality control has to be embedded in every stage of software delivery.

In the requirement phase, you can define clear admission rules for requirements. For example: Is the value of the requirement measured by an objective indicator? Has its technical feasibility been verified? Have its dependencies been fully evaluated? Is the description clear? Is the decomposition reasonable? Are the acceptance conditions clear?

Applying quality control at the requirement stage reduces the number of unreliable requirements. In many companies, “one-sentence requirements” and “boss requirements” are typical examples. Because communication is insufficient, developers build by guesswork, the result differs completely from what was intended, and the outcome is rework and waste.
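The admission rules above can be expressed as a simple automated checklist. The following is only a minimal sketch: the `Requirement` record, its field names, and the failure messages are all hypothetical, and it checks only whether each item has been filled in. Judging whether a description is actually clear or a decomposition is reasonable still needs a human reviewer.

```python
from dataclasses import dataclass


@dataclass
class Requirement:
    """Hypothetical requirement record; field names are illustrative."""
    title: str
    value_metric: str = ""              # how the value of the requirement is measured
    feasibility_verified: bool = False
    dependencies_evaluated: bool = False
    description: str = ""
    acceptance_criteria: tuple = ()


def admission_failures(req: Requirement) -> list:
    """Return the admission rules this requirement does not yet satisfy."""
    checks = [
        (bool(req.value_metric.strip()), "value metric is missing"),
        (req.feasibility_verified, "technical feasibility not verified"),
        (req.dependencies_evaluated, "dependencies not evaluated"),
        (bool(req.description.strip()), "description is missing"),
        (len(req.acceptance_criteria) > 0, "acceptance criteria are missing"),
    ]
    return [message for passed, message in checks if not passed]


# A typical "one-sentence requirement" fails every rule.
one_sentence = Requirement(title="Make the home page faster")
for problem in admission_failures(one_sentence):
    print(problem)
```

A requirement enters development only when the returned list is empty.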

In the development phase, code review and continuous integration are excellent practices for built-in quality. In code review, efforts should be made to confirm whether the coding matches the requirements and whether the business logic is clear. In addition, a series of automated checks can be used to verify coding styles, risks, security vulnerabilities, etc.

In the testing phase, various types of automated testing and manual exploratory testing can be used to cover security, performance, reliability, etc., to ensure product quality. In the deployment and release phase, various measures such as database monitoring, dangerous operation scanning, and online business monitoring can be added.

From a practical perspective, quality can be controlled in each stage. So, which stage should we prioritize?

According to the first principle of built-in quality, if defects are discovered and fixed as soon as the code is submitted, both the cost and the impact are minimal. If quality problems are discovered only after the product is launched, the team has to locate the problem, fix it, and release the software again, and the cost is many times higher.

Therefore, the development phase, as the origin of the entire software product, is the best choice for built-in quality. So, how should it be implemented specifically?

Implementation Steps for Built-in Quality #

Step 1: Choose the appropriate type of inspection

Taking code inspection during the continuous integration phase as an example, there are various types of inspections, such as unit testing, code style check, code defects and vulnerabilities check, security check, and so on. However, not all of them are necessary. Especially when starting to implement these inspections, it is not advisable to apply all of them at once, as it would hinder the development process.

A reasonable strategy is therefore to start with the inspection types that give the highest return on investment. For example, checking for code defects and vulnerabilities matters more than checking code style, because defects and vulnerabilities can cause incidents in production. For client-side projects, I recommend prioritizing scans with a static analyzer such as Infer. Code style issues should not be ignored, but they do not need to be enforced immediately.

Step 2: Define metrics and reach consensus

After determining the types of inspections, specific quality metrics need to be defined. Quality metrics consist of two levels: metric items and reference values. Let me explain them separately.

Metric items are the specific indicators adopted for a type of inspection. For example, unit test coverage can be measured by line, instruction, class, function, and so on. Which one should we use? Generally, this requires reaching a consensus with the head of the development team while taking industry practice into account. Line coverage, for example, is often a good choice for unit tests.

In addition, existing projects usually carry a large amount of technical debt when inspections are first enabled. I will elaborate on technical debt in the next lecture. Simply put, it is the accumulated backlog of known problems in the code that cannot all be paid off immediately. In such cases, it is more appropriate to choose dynamic metrics, such as incremental code coverage, which looks only at the newly changed code and does not require changes to the legacy code.
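To make incremental code coverage concrete, here is a minimal sketch of how it could be computed. It assumes you already have two inputs from your own tooling: the lines changed in a commit (for example, parsed from a diff) and the lines executed by the tests (from a coverage report). The function name and the data shapes are illustrative, not the API of any particular tool.

```python
def incremental_coverage(changed_lines, covered_lines):
    """Share of changed lines that the tests executed.

    Both arguments map a file path to a set of line numbers.
    Returns None when the change touches no lines, so a gate can skip it.
    """
    total = 0
    hit = 0
    for path, lines in changed_lines.items():
        total += len(lines)
        hit += len(lines & covered_lines.get(path, set()))
    return None if total == 0 else hit / total


# Hypothetical change: four lines in new code, one line in a legacy file.
changed = {"billing.py": {10, 11, 12, 40}, "legacy.py": {7}}
covered = {"billing.py": {10, 11, 12, 13}, "legacy.py": set()}

ratio = incremental_coverage(changed, covered)
print(f"{ratio:.0%}")  # 3 of the 5 changed lines are covered
```

Untouched legacy lines never enter the calculation, so old debt does not block new work while the new code is still held to the standard.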

After defining the metric items clearly, we need to define reference values. These reference values will directly affect whether the quality gate is triggered and the subsequent actions taken.

Let me briefly explain quality gate. The quality gate is like a security gate where inspections are performed. If the metrics do not meet the requirements, the gate will sound an alarm and prevent passage. It is similar to when traffic police check for drunk driving. If the alcohol content exceeds a certain threshold, an alarm will be triggered.

Defining reference values is an art. It is difficult to define values in a “one-size-fits-all” manner for different projects or even different modules within the same project. I recommend combining static metrics with dynamic metrics.

Static metrics are fixed values. For issues like defects and security, a zero tolerance approach is adopted, which means that they must be addressed if they exist. On the other hand, dynamic metrics mainly evaluate increments and trends. For example, if the baseline value is 100, you can define the reference value as less than or equal to 100, which means no increase is allowed. You can also define different reference values based on different issue severity levels, such as strict inspections for critical and blocking issues, with looser restrictions for others.
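The combination of static and dynamic reference values can be sketched as follows. This is an illustration under assumptions: the `ReferenceValue` type, the metric names, and the numbers are hypothetical, and a real platform would load both the rules and the baseline from its own storage.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReferenceValue:
    """Threshold for one metric item (names are illustrative)."""
    metric: str
    max_value: Optional[int] = None   # static: fixed upper bound
    no_increase: bool = False         # dynamic: must not exceed the baseline


def violations(rules, current, baseline):
    """Return a message for every rule whose reference value is exceeded."""
    found = []
    for rule in rules:
        value = current.get(rule.metric, 0)
        if rule.max_value is not None and value > rule.max_value:
            found.append(f"{rule.metric}: {value} > fixed limit {rule.max_value}")
        if rule.no_increase and value > baseline.get(rule.metric, 0):
            found.append(
                f"{rule.metric}: {value} > baseline {baseline.get(rule.metric, 0)}"
            )
    return found


rules = [
    ReferenceValue("blocker_issues", max_value=0),      # zero tolerance
    ReferenceValue("critical_issues", max_value=0),     # zero tolerance
    ReferenceValue("minor_issues", no_increase=True),   # trend only
]
baseline = {"minor_issues": 100}
current = {"blocker_issues": 0, "critical_issues": 1, "minor_issues": 98}

for message in violations(rules, current, baseline):
    print(message)
```

Here the single critical issue violates the zero-tolerance rule, while the 98 minor issues pass because they do not exceed the baseline of 100.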

Finally, it is important to reach a consensus with the development team on these metrics. In other words, the team must acknowledge and follow them. Therefore, when defining metrics, it is necessary to fully consider the suggestions and input from the team.

Step 3: Establish automated execution and inspection capabilities

Whether the company uses open source tools or self-developed tools, they need to support automated execution and inspection. Depending on when the inspection takes place, you can also integrate the quality gate into the testing platform and the release platform and feed the inspection results back from there.

Following the fail-fast principle, a quality gate should take effect as close as possible to the point where the metric data is generated. For example, the best time to check coding style is during development in the local IDE, and the next best is at check-in to the version control system, with immediate feedback. The worst is to wait until the final release to report the failure.

Modern continuous delivery pipeline platforms have the function of quality gates, and there are two common configurations and activation methods:

The first is to configure rules on the continuous delivery platform itself. Different metrics and reference values are combined into a set of rules, and the rules are associated with specific execution tasks. The advantage is that each subsystem that generates metric data only needs to report the data to the continuous delivery platform, and the platform alone decides whether the gate is passed. In addition, gates are usually configured by quality personnel, so a single, separate entry point reduces the configuration effort. The implementation logic is shown in the figure below:

[Figure: Configuration on Continuous Delivery Platform]

The second is to configure quality gates in each subsystem. For example, the gate metrics are configured on the UI automation testing platform, and when the continuous delivery platform calls the UI automation tests, the testing platform returns the gate verdict directly. If the check fails, the pipeline fails immediately.
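The first method, where the continuous delivery platform holds the rules and the subsystems only report data, might look like the following sketch. The task names, metric names, and thresholds in `RULE_SETS` are hypothetical, and `evaluate_gate` is an illustrative helper rather than the interface of any real pipeline product.

```python
import operator

# Comparison names a rule may use; kept small on purpose.
COMPARATORS = {"<=": operator.le, ">=": operator.ge, "==": operator.eq}

# Hypothetical rule sets, keyed by the pipeline task they are attached to.
# Each rule is (metric item, comparison, reference value).
RULE_SETS = {
    "unit-test": [("line_coverage", ">=", 0.80)],
    "static-scan": [("blocker_issues", "==", 0), ("critical_issues", "==", 0)],
}


def evaluate_gate(task, reported):
    """Decide pass/fail centrally from the metrics a subsystem reported.

    A metric that the subsystem did not report counts as a failure,
    so a silent tool cannot open the gate.
    """
    failed = []
    for metric, op, reference in RULE_SETS.get(task, []):
        value = reported.get(metric)
        if value is None or not COMPARATORS[op](value, reference):
            failed.append((metric, value, op, reference))
    return (not failed, failed)


# The unit test subsystem only reports its data; the platform judges it.
passed, failed = evaluate_gate("unit-test", {"line_coverage": 0.74})
print(passed, failed)
```

In this sketch a task with no rule set attached always passes, which is a design choice you may want to reverse for stricter pipelines.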

Step 4: Define problem handling method

Once the first three steps are complete, automated inspection is already running, and how the results are handled largely determines whether the quality gate is effective. Generally speaking, quality gates are mandatory: if an inspection metric is not met, the pipeline stops immediately and reports the failure.

In practice, a quality gate may produce several kinds of result, such as failure, warning, or manual confirmation. These need to be clearly defined when the rules are formulated. Warning thresholds and manual confirmation let you tighten quality control step by step, with continuous optimization as the goal.

In addition, you need to communicate the quality rules and gate standards to every member of the software delivery team, and make clear how people are notified and how failures are handled. Otherwise, if problems are detected but no one handles them, the gate becomes meaningless.
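The graded results described above can be modeled as a small classification step. This is a sketch under assumptions: the outcome names, the "lower is better" metric, and the three thresholds are all illustrative and would be agreed with the team in practice.

```python
from enum import Enum


class Outcome(Enum):
    PASS = "pass"
    WARNING = "warning"                # report, but let the pipeline continue
    MANUAL_CONFIRMATION = "manual"     # pause until someone approves
    FAILURE = "failure"                # stop the pipeline immediately


def classify(value, warn_at, confirm_at, fail_at):
    """Map a 'lower is better' metric onto a gate outcome.

    Thresholds are assumed to satisfy warn_at <= confirm_at <= fail_at.
    """
    if value >= fail_at:
        return Outcome.FAILURE
    if value >= confirm_at:
        return Outcome.MANUAL_CONFIRMATION
    if value >= warn_at:
        return Outcome.WARNING
    return Outcome.PASS


# New style issues introduced by a change, with illustrative thresholds.
for count in (0, 3, 12, 40):
    print(count, classify(count, warn_at=1, confirm_at=10, fail_at=30).value)
```

Lowering `fail_at` over time is one way to move a team gradually from warnings to strict enforcement.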

Step 5: Continuous optimization and improvement

Whether it is the inspection capability, the metrics, the reference values, or the handling methods, you only learn whether there are problems once the system is up and running. In the early stages of implementation, therefore, leave some flexibility to revise metric rules and adjust metric levels and reference values. The core goal is not to pass the quality gate but to improve quality.

Common Issues with Built-in Quality #

Built-in quality may sound simple, but executing it effectively can be challenging. So, what are some common issues? I have summarized a few common problems and suggested solutions in the table below. You can refer to it.

[Table: common issues with built-in quality and suggested solutions]

Lastly, I want to share a story about Amazon. In 2012, the Safety Monitor system was introduced at Amazon. If front-line customer service receives customer feedback or observes potential quality and safety risks in a product, they can send an alert email and mark the product as “unavailable for purchase,” effectively forcing its removal from the platform. Can customer service really take such actions without any approval? Isn’t there a risk of complaints from suppliers?

In fact, this is a true reflection of Amazon’s customer-centric philosophy and principles; everyone is responsible for the final quality, without exception. When employees realize they have been entrusted with such significant authority, everyone will do their best to ensure quality work. Even if there are occasional mistakes, they are valuable learning experiences within the team.

In a company, establishing rules for quality control or building a platform to enforce them is not the hardest part. The hard part shows up in practice: how many normal procedures end up going through special approvals? How many releases take the emergency path? And how many people say that quality control hinders business delivery?

Ultimately, before you talk about your principles and beliefs, ask yourself how much effort you are willing to invest in practicing them. I believe that reaching a consensus on this point is the real key to implementing built-in quality effectively.

Summary #

In this lecture, I have introduced the background and principles of built-in quality through two stories. The principles are to detect and fix problems early and that everyone is responsible for quality. Additionally, I have introduced the five common steps for implementing built-in quality. I hope you always remember that quality is built, not tested. By mastering built-in quality, you unlock the secret of achieving high efficiency and high quality in DevOps.

Thought-provoking Question #

Does your company enforce mandatory quality gates? Can you share some rules that you think have been effective?

Feel free to write your thoughts and answers in the comment section. Let’s discuss and learn together. If you find this article helpful, please feel free to share it with your friends.