
29 From Monthly to Daily: How to Speed Up Version Releases #

Do you remember the goals we set for continuous delivery? As I mentioned earlier, Taobao’s efficiency goal is “211”: a 2-week delivery cycle, a 1-week development cycle, and a 1-hour release window. For more agile products, we may accelerate even further, to one version per week. At such a pace, how can we ensure product quality? What measures can we take to further “speed up while ensuring quality”?

More broadly, a release is not limited to submitting an application to the market. With gray releases, A/B tests, operational activities, resource distribution, and more, release types keep multiplying and growing more complex. How can we build a robust release quality assurance system that keeps incidents out of production?

Gray Release #

When discussing release speed, we need to weigh both efficiency and quality. If we ignore delivery quality, releasing one version per day is easy; releasing every two weeks while holding to strict delivery quality is not. And when urgent security or stability issues arise, we need to be able to release within one hour.

As I mentioned in my column “How to Test Efficiently,” achieving “high quality and efficiency” releases requires a powerful testing platform and data validation platform.

Now let’s take a look at the important factors that affect the efficiency of version releases and my practical experience in improving version release speed.

1. APK Gray Release

The testing platform is responsible for running a battery of diagnostics on the release candidate, including monkey testing, performance testing (startup, memory, CPU, jank, etc.), competitor benchmarking, UI testing, weak-network testing, and more. However, even with cloud testing platforms that can drive dozens or even hundreds of devices at once, in-house testing still cannot cover every device model and user path.

To release new versions safely and stably, we first limit installation to a small set of trial users; this is what we call a gray release. The data validation platform is responsible for collecting application data from both the gray and online versions, including performance data, business data, user feedback, and external public opinion.

Therefore, the efficiency of gray release is primarily affected by the following two factors:

  • Testing efficiency. Although a gray release affects only a small number of users, we still need to ensure application quality to avoid user churn. The testing platform’s turnaround time is the first factor affecting release efficiency; we want to determine within one hour whether the candidate package meets the bar for going online.

  • Data validation efficiency. The comprehensiveness, timeliness, and accuracy of the data all affect how quickly we can evaluate the gray version and decide whether to halt the gray release, expand it to more users, or release to everyone (a simplified decision sketch follows this list). For core data, hourly or even minute-level real-time monitoring is necessary. For example, WeChat can evaluate performance data within 1 hour of release, while business data takes 24 hours to evaluate.
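To make this decision logic concrete, here is a minimal Kotlin sketch of the kind of rule a data validation platform might apply. All thresholds, field names, and actions are hypothetical illustrations, not taken from any real platform:

    // Hypothetical gray-release decision rule; thresholds are illustrative only.
    enum class GrayAction { HALT, EXPAND, FULL_RELEASE }

    data class GrayMetrics(
        val crashRateDelta: Double,  // crash rate vs. the online baseline, e.g. +0.001 = +0.1%
        val coreMetricDelta: Double, // relative change of a core business indicator
        val coverage: Int            // users already covered by the gray package
    )

    fun decide(m: GrayMetrics, targetCoverage: Int): GrayAction = when {
        m.crashRateDelta > 0.001    -> GrayAction.HALT         // stability regression: stop the rollout
        m.coreMetricDelta < -0.05   -> GrayAction.HALT         // core indicator drops over 5%: stop
        m.coverage < targetCoverage -> GrayAction.EXPAND       // healthy so far: widen the gray ring
        else                        -> GrayAction.FULL_RELEASE // target coverage reached and healthy
    }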

On the other hand, if we want a gray release to cover 10,000 users, how long will it take for enough users to download and install it? The capability of distribution channels also greatly affects gray release efficiency, and in China there are several major gray distribution channels.

Due to the lack of a unified app store in China, the efficiency of gray distribution channels is a very serious problem. Even for WeChat, without the “red dot notification” as a lever, the number of users reached per day of gray release may be under 100,000. In the international market there is Google Play, a unified app store where gray releases can be conducted through the Google Play beta track. However, the Google Play review time must be factored into the release schedule; review speed has improved compared to before, usually taking only one to two days.

Through gray release, we can collect performance and business data for the new version from a small user population in advance. However, it is not suitable for accurately evaluating business data, mainly because gray users are a biased sample: more active users are generally prioritized for the upgrade.

2. Dynamic Deployment

For gray releases, the biggest pain point in the whole process is still how quickly the gray package reaches users. Traditional gray releases also have a very serious flaw: they cannot be rolled back. “A released package is like spilt water”; if a critical problem occurs, it can cost us the gray-release users as well.

The Tinker dynamic deployment framework was designed to solve exactly this problem. We hoped Tinker could become a new way of releasing, replacing traditional gray releases and even formal version releases. Compared with traditional release methods, hotfixing has several unique advantages.

  • Fast. If we use the traditional release method, it would take WeChat 10 days to cover 50% of users. With hotfixing, more than 80% of users can be covered within one day, and over 95% within three days.

  • Rollbackable. When a patch has major issues, it can be rolled back in a timely manner, allowing users to return to the base version and minimizing losses.

To improve patch release efficiency, WeChat also built the TinkerBoots management platform. TinkerBoots supports rollout parameters such as user count and targeting conditions; for example, it can push a patch to just 10,000 users on specific Xiaomi models. The platform also connects with the data validation platform to automate controlled releases, monitoring changes in core indicators to safeguard release quality (a hypothetical sketch of such targeting follows below).
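For illustration only, here is a hypothetical sketch of how such targeting rules might be evaluated on the server side. None of these names, fields, or model strings come from the actual TinkerBoots platform:

    // Hypothetical patch-targeting rule in the spirit of what the text describes.
    data class PatchRule(
        val allowedModels: Set<String>, // e.g. specific Xiaomi models
        val maxUsers: Int               // e.g. 10_000
    )

    fun shouldReceivePatch(model: String, alreadyCovered: Int, rule: PatchRule): Boolean =
        model in rule.allowedModels && alreadyCovered < rule.maxUsers

    // Usage: roll the patch out to at most 10,000 users on two (made-up) models.
    val rule = PatchRule(setOf("MI 8", "Redmi Note 7"), maxUsers = 10_000)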

Tinker has been released for more than two years. Although hotfix technology can solve many problems, as the author of Tinker, I must admit that it has had some negative impact on Android development in China.

  • “Users are the best testers.” Many teams no longer trust pre-release testing. They reason that with dynamic deployment and rollback available, quality issues are nothing to fear; they can simply ship more patches.

  • Poor performance. As mentioned in the “High-Quality Development” section, hotfixing, componentization, and other cutting-edge technologies have a significant impact on application performance, especially on startup time.

Today it seems clear that hotfixing cannot replace releases; it is better suited to gray rollouts for a small number of users. Unless a major issue occurs, patches targeting all users should generally not be shipped.

Componentization has fallen back to modularization, and hotfixing has fallen back to gray release; these are compromises many large-scale apps in China have had to make to protect performance. To truly achieve “freedom of release,” we may need to force a change in development model, for example moving from componentization to the web, React Native/Weex, or mini-programs.

A/B Testing #

As I mentioned earlier, an APK gray release is an online verification of existing features; it is not suitable for accurately evaluating a feature’s impact on the business.

“If you are not running experiments, you are probably not growing!” - Sean Ellis

Sean Ellis is the father of “growth hacking” and its AARRR model. A key idea in growth hacking is A/B testing. Companies like Google and Facebook, and domestic companies like Toutiao and Kuaishou, are very enthusiastic about A/B testing, hoping to make scientific, data-driven business and product decisions.

So what exactly is A/B testing? How to use A/B testing correctly?

1. What is A/B Testing

Some of you may think A/B testing is not complicated: simply split the gray-release users into groups A and B, then compare the collected data to analyze the results.

In fact, the difficulty of A/B testing lies in how groups A and B are defined. We must ensure the test runs on homogeneous groups of people at the same time, so that differences in indicator data can be attributed to the product variant; only then can we pick the winning version for release and achieve data growth.

  • Homogeneous. The groups’ characteristic distributions must be consistent. For example, to verify a product variant’s impact on female users’ purchase intention, both the A and B versions should be tested on female users. Characteristics are not limited to gender; they also include country, city, usage frequency, age, occupation, new versus old users, and so on.

  • Same time. User behavior varies across time; during major holidays, for instance, activity rises. If version A runs during a holiday and version B outside of one, the comparison is obviously unfair to version B.

Therefore, achieving “homogeneous and same time” is not so simple. First, we need rich and precise user profiling: version, country, city, gender, age, preferences, and so on. We also need a powerful backend to handle test control, log processing, indicator calculation, statistical significance testing, and related work.

Once “homogeneous and same time” is achieved, we need to pick the significance indicator of the product variant, i.e. the goal the variant is meant to prove. For example, if we reword a popup to attract more clicks on its button, the button’s click-through rate is the significance indicator of this test (a minimal significance-test sketch follows below).
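For illustration, here is a minimal two-proportion z-test of the kind a statistics backend might run on such an indicator, comparing the click-through rates of groups A and B. This is the textbook formula, not any specific platform’s implementation; the normal CDF uses a standard polynomial approximation (Abramowitz & Stegun 26.2.17), and the numbers in main are made up:

    import kotlin.math.abs
    import kotlin.math.exp
    import kotlin.math.sqrt

    // Standard normal CDF via a classic polynomial approximation.
    fun phi(z: Double): Double {
        val t = 1.0 / (1.0 + 0.2316419 * abs(z))
        val d = 0.3989422804014327 * exp(-z * z / 2.0) // standard normal density
        val p = d * t * (0.319381530 + t * (-0.356563782 +
                t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))))
        return if (z >= 0) 1.0 - p else p
    }

    // Two-sided p-value for the difference between two click-through rates.
    fun twoProportionPValue(clicksA: Int, usersA: Int, clicksB: Int, usersB: Int): Double {
        val pA = clicksA.toDouble() / usersA
        val pB = clicksB.toDouble() / usersB
        val pooled = (clicksA + clicksB).toDouble() / (usersA + usersB)
        val se = sqrt(pooled * (1 - pooled) * (1.0 / usersA + 1.0 / usersB))
        val z = (pB - pA) / se
        return 2.0 * (1.0 - phi(abs(z)))
    }

    fun main() {
        // Made-up numbers: 980/10,000 clicks in A vs. 1,050/10,000 in B.
        val p = twoProportionPValue(980, 10_000, 1_050, 10_000)
        println("p-value = $p, significant at 0.05: ${p < 0.05}")
    }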

After having all these, can we start testing happily? No, you still need to think about these two questions first:

  • Traffic selection. How much traffic should this test receive? Allocate too little and the results may be inconclusive; allocate too much and the risk is higher. This is where a minimum sample size calculator comes in.

As an experimental statistical method, A/B testing rests on a good deal of statistical theory. The calculator requires the following inputs:

  • Baseline indicator value. The current value of the indicator you want to optimize. For example, if it fluctuates between 9% and 11% from day to day, enter 10%.

  • Minimum relative change. The smallest relative change you consider meaningful. If the indicator sits at 10% and you enter 5%, you are declaring absolute values within (9.5%, 10.5%) meaningless; even if the test version is better inside that band, you would not adopt it.

  • Statistical power (1-β). With power set to 90%, this can be loosely understood as: when versions A and B genuinely differ on the indicator, the test correctly detects the difference 90% of the time.

  • Significance level (α). The probability of declaring a significant difference when none exists, i.e. a false positive. Only when the observed difference exceeds the threshold implied by α do we call the result significant; 5% is a common choice.

  • Number of days selection. How many days must the A/B test run to be reliable? Generally it can be estimated with the following formula (a worked sketch combining this with the sample size calculation appears after this list).

    Number of days required for the test >= required sample size / (average daily traffic * traffic share of the test version)
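Putting the calculator inputs and the days formula together, here is a minimal sketch using the standard two-proportion sample size formula. The fixed z quantiles (α = 5%, two-sided; power = 90%) and all the example numbers are assumptions for illustration; a real calculator lets you choose them:

    import kotlin.math.ceil
    import kotlin.math.pow

    const val Z_ALPHA_HALF = 1.96 // z quantile for α = 0.05, two-sided
    const val Z_BETA = 1.2816     // z quantile for power 1-β = 0.90

    // Required sample size per group for a proportion-type indicator.
    fun minSampleSizePerGroup(baseline: Double, minRelativeChange: Double): Int {
        val p1 = baseline
        val p2 = baseline * (1 + minRelativeChange)
        val variance = p1 * (1 - p1) + p2 * (1 - p2)
        return ceil((Z_ALPHA_HALF + Z_BETA).pow(2) * variance / (p2 - p1).pow(2)).toInt()
    }

    // The days formula from the text.
    fun daysNeeded(samplePerGroup: Int, dailyTraffic: Int, testTrafficShare: Double): Int =
        ceil(samplePerGroup / (dailyTraffic * testTrafficShare)).toInt()

    fun main() {
        // Example from the text: baseline 10%, minimum relative change 5%.
        val n = minSampleSizePerGroup(0.10, 0.05) // roughly 77,000 users per group
        println("Users per group: $n")
        // Assume 100,000 daily active users with 10% of traffic in the test version.
        println("Days needed: ${daysNeeded(n, 100_000, 0.10)}")
    }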

2. How to Conduct A/B Testing

Although major companies all have comprehensive A/B testing platforms, scientifically designing an A/B test is not easy; it takes continuous learning and extensive hands-on practice.

First of all, every A/B test should be “premeditated”: we need clear expectations and a deliberate design for every aspect. Generally, before starting a test we should be able to answer questions such as: what is the hypothesis, which significance indicator will verify it, which population will be tested, how much traffic will be allocated, and how many days will the test run?

To verify that tracking, traffic allocation, and statistics are all correct, and to increase the credibility of the results, we also run an A/A test alongside the A/B test. A/A testing is the “twin brother” of A/B testing; some internet companies also call it an “idle test.” It is mainly used to evaluate whether the experiment itself is sound: two identically treated groups should show no significant difference. Generally, I recommend the A/A/B testing method.

So what are the A/B testing solutions for Android clients?
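One common client-side approach, shown here as a sketch under my own assumptions rather than any specific library’s API, is deterministic bucketing: hash the user ID together with the experiment name so that a user always lands in the same group without a server round-trip. The bucket boundaries below implement an A/A/B split:

    import java.security.MessageDigest

    // Map a (user, experiment) pair to a stable bucket in 0..99.
    fun bucket(userId: String, experiment: String): Int {
        val digest = MessageDigest.getInstance("MD5")
            .digest("$userId:$experiment".toByteArray())
        val h = ((digest[0].toInt() and 0xFF) shl 24) or
                ((digest[1].toInt() and 0xFF) shl 16) or
                ((digest[2].toInt() and 0xFF) shl 8) or
                (digest[3].toInt() and 0xFF)
        return Math.floorMod(h, 100)
    }

    // An A/A/B split: two identical control groups plus one treatment group.
    fun assignGroup(userId: String, experiment: String): String = when (bucket(userId, experiment)) {
        in 0..9   -> "A1"      // control group 1
        in 10..19 -> "A2"      // control group 2, for the A/A sanity check
        in 20..29 -> "B"       // treatment group
        else      -> "holdout" // users outside the experiment
    }

Because A1 and A2 receive identical treatment, any significant difference between them signals a problem with tracking, allocation, or statistics rather than with the product variant.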

To sum up, A/B testing can be described as “getting A/B test data is easy; getting reliable A/B test data is hard.” Indicator design, group selection, time design, and plan design all need careful consideration, along with plenty of thought and practice. For further reading, I recommend “5 Common Mistakes in Mobile App A/B Testing.”

Unified Release Platform #

I have mentioned many times in this column that despite extensive optimization, we are still limited by the various constraints of native development. In that case, we can consider breaking free of them: the web, React Native/Weex, mini-programs, and Flutter can all be solutions.

However, even if we switch to a new development model, the staging and release steps remain indispensable. On top of that, we still need to handle various operational activities, push notifications, and configuration distribution.

Many colleagues at big companies have felt the frustration of juggling multiple release platforms, where a moment of carelessness leads to an operational accident. So how should we standardize the release process and prevent distribution accidents?

1. Release Platform Architecture

Every application involves, to varying degrees, the release types mentioned above: version packages, patches, operational resources, configurations, push messages, and so on. Yet these releases are often scattered across separate platforms without unified management or standardized processes, making accidents from operational mistakes all too easy.

A unified release platform should centrally manage all of an application’s data distribution processes and establish strict processes for staged releases (a minimal sketch of such a release gate follows the list below).

  • Management. Every release must pass permission verification and be approved. “Who releases is who is responsible”: a strict accident grading system must be established, and personnel who cause accidents through negligence should be handled according to that grading.

  • Staging. Every release must go out in stages, gradually expanding its impact on users. Admittedly, some distributions are hard to stage-test, such as the much-discussed incident of “changing display styles for Christmas”; operational activities that take effect at a specific time are difficult to stage in the production environment.

  • Monitoring. The unified release platform needs to interface with the application’s “real-time data platform” so that remedial measures can be taken in a timely manner when problems occur.
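As a minimal sketch of such a gate, with all field names, stages, and checks invented for illustration, a release might only advance to its next stage when an approver is on record and the monitoring data looks healthy:

    // Hypothetical release ticket on a unified release platform.
    data class ReleaseTicket(
        val owner: String,            // "who releases, who is responsible"
        val approvedBy: String?,      // must be set before anything ships
        val stagePercents: List<Int>, // e.g. [1, 5, 20, 50, 100]
        val currentStage: Int         // index into stagePercents
    )

    fun canAdvance(ticket: ReleaseTicket, metricsHealthy: Boolean): Boolean {
        if (ticket.approvedBy == null) return false                // management: approval required
        if (!metricsHealthy) return false                          // monitoring: block on bad data
        return ticket.currentStage < ticket.stagePercents.size - 1 // staging: further rings remain
    }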

Running a business is hard enough already. If a splash-screen campaign is distributed and crashes the app for all users, the losses would be immeasurable. Standardized processes and regulations can prevent problems to some extent; monitoring is equally important, because it helps us spot problems promptly and take immediate containment measures.

2. Dealing with Operational Accidents

Every day a large number of releases of all types go out, and it can feel like walking on a knife’s edge. Problems with online services are inevitable, so what measures can we take to salvage the situation?

  • Startup Safety Protection. At the very least, we must ensure users can still start the application even when operational configurations are broken (see the sketch after this list).

  • Dynamic Deployment. If the application can start normally, we can fix the problem with a hot patch. But hot patching has its own limits: 3% to 5% of users fail to apply patches successfully, and patches do not take effect immediately.

  • Remote Control. The application should keep a “life-saving” command channel through which we can delete specified files, pull patches, or modify configurations. To guard against our own channel failing, these commands should be deliverable through both our own channel and vendor push channels. For example, if a distributed resource makes the app start to a blank screen, a command can delete the offending resource files.
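Here is a minimal sketch of the startup safety protection idea: count launches that never reach a healthy state, and once a threshold is hit, fall back to a safe mode that clears dynamically delivered patches and configuration. The preference keys, threshold, and file paths are all assumptions for illustration, not from any specific framework:

    import android.app.Application
    import android.content.SharedPreferences

    class SafeModeApplication : Application() {

        private lateinit var prefs: SharedPreferences

        override fun onCreate() {
            super.onCreate()
            prefs = getSharedPreferences("startup_guard", MODE_PRIVATE)

            val unhealthyStarts = prefs.getInt("pending_startups", 0)
            if (unhealthyStarts >= 3) {
                enterSafeMode() // three bad launches in a row: clear dynamic content
            }
            // Record that a launch has begun; cleared by markStartupHealthy().
            prefs.edit().putInt("pending_startups", unhealthyStarts + 1).apply()
        }

        // Call from the first activity once the UI has rendered successfully.
        fun markStartupHealthy() {
            prefs.edit().putInt("pending_startups", 0).apply()
        }

        private fun enterSafeMode() {
            // Illustrative paths: delete everything that was dynamically delivered.
            filesDir.resolve("patches").deleteRecursively()
            filesDir.resolve("remote_config").deleteRecursively()
            prefs.edit().putInt("pending_startups", 0).apply()
        }
    }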

“High-rise buildings rise from level ground.” It is difficult to build perfect internal systems such as testing platforms, release platforms, or data platforms in a short period of time. They evolve by optimizing countless small details internally. Therefore, we need to remain patient and constantly work towards improving organizational effectiveness.

Summary #

In the past, A/B testing used to be difficult and time-consuming. Every aspect of the testing process went through rigorous refinement. However, with the advent of powerful A/B testing systems, the cost of testing has been greatly reduced and the efficiency has been improved. As a result, many products and developers have started to abuse A/B testing or use it as an excuse to avoid thinking.

Whether it is developer-led performance A/B testing or product-led business A/B testing, the design often lacks rigorous scrutiny. It frequently takes multiple rounds of testing to reach a “conclusion,” and even then there is no guarantee it is reliable. So whether for A/B tests or everyday staged rollouts, have clear expectations and examine every step carefully. Nothing is more painful for everyone involved than discovering, after a test has shipped, that the experimental setup was unreasonable or that several data points were never collected, and then having to revise and rerun again and again.

On the other hand, our views on things are not set in stone. Even as the author of Tinker, I believe that it is just a product designed to solve specific needs at a certain stage. However, no matter how the development model changes, our pursuit of quality and efficiency will not change.

Homework #

Is your app a fan of A/B testing? Do you have any good or bad experiences with A/B testing? Feel free to leave a comment and discuss with me and other classmates.

Today’s homework is to think about the pain points and optimization opportunities in the gray release process of your product or company. Please write down your thoughts in the comments.

Feel free to click “Invite a Friend to Read” and share today’s content with your friends, inviting them to learn together. Finally, don’t forget to submit today’s homework in the comments. I have prepared generous “Study Booster Packages” for students who complete the homework seriously. Looking forward to progressing together with you.