07 Analysis of Test Results: Is Your Test Result Truly Reliable? #

Hello, I’m Bowen.

After setting the goals and hypotheses, determining the metrics, selecting the experimental units, and calculating the required sample size, we have finally reached the last step of A/B testing: analyzing the test results.

Before we officially begin, let me ask you a question: Can we start analyzing the test results as soon as we receive them? Definitely not. We can only analyze the results if we are certain that they are reliable. In fact, analyzing A/B test results is not difficult; what’s difficult is obtaining reliable results that can provide the business with accurate guidance.

Why do I say this? Next, I will use an example of a music app aiming to improve user upgrade rates to break down the factors that can make test results unreliable. Then, we will explore how to analyze them in detail.

Case Introduction #

In general, music apps have two revenue models. One is to provide free music with ads in the app to make money through advertisements. The other is to offer a paid subscription service to users, providing high-quality ad-free music.

Our music app adopts both revenue models, but in terms of long-term profitability and user experience, the paid subscription model is more favorable. Therefore, we plan to have a promotion targeting free users in the app before and after Singles’ Day to attract them to become paying users.

Now, we have two advertising messages. In order to verify which one is more effective through A/B testing, they will be respectively placed in the experimental group and the control group:

  • Control group advertising message: Unlimited listening to millions of songs without ads, upgrade now and enjoy a free trial for six months!
  • Experimental group advertising message: From now until November 15th, upgrade now and enjoy a free trial for six months!

Now, let’s complete the overall design plan for A/B testing.

  • Determine the goal: To convert more free users into paying users.
  • Hypothesis: Adding a countdown in the advertising message, which creates a sense of urgency, can increase the conversion rate of free users.
  • Determine the experimental unit: User ID of free users.
  • Experimental group/control group: Randomly assigned, 50%/50%.
  • Evaluation metric: User conversion rate = Number of users who click on the ad and upgrade / Number of users who see the ad.
  • Range of fluctuation for evaluation metrics: [1.86%, 2.14%].

With the A/B testing framework designed, we can implement the A/B test and wait for the results. When can we view the test results and stop the A/B test? This is the first issue to be addressed to ensure the credibility of the test results.

When can we view the test results? #

Do you remember the minimum sample size we calculated in the last class, the size required to reach a significant result? Dividing that total by the sample size we can collect each day gives the test duration:

The time required for an A/B test = Total sample size / Sample size obtained per day.

Combining this formula with the daily traffic of free users in the app, we can calculate that this test theoretically needs to run for 10 days.
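As a quick check, here is a minimal R sketch of this formula. The total sample size matches the roughly 630,000 users reported later in this lesson; the daily traffic figure is an assumption purely for illustration.

# Duration formula: time required = total sample size / sample size obtained per day
total_sample_size <- 630000   # minimum total sample size for both groups combined
daily_traffic     <- 63000    # free users who see the ad per day (illustrative assumption)

test_days <- ceiling(total_sample_size / daily_traffic)
test_days   # 10 days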

However, this formula is only derived in theory. In the actual practice of A/B testing, we need to consider factors such as the periodic variation of the metrics when determining the test duration.

If a metric exhibits strong periodic variation, such as significant differences between weekdays and weekends, then the test duration should include at least one full cycle to eliminate the impact of periodic variation on the metrics.

In the case of the music app, we found through historical data that there are more user upgrades on weekends compared to weekdays. This indicates that the evaluation metric of user upgrade rate exhibits periodic variation on a weekly basis. Therefore, our test should run for at least 7 days. Since we have already calculated that this test needs to run for 10 days based on the minimum sample size, which includes one full cycle, we can confidently set the test duration to 10 days.

Let me add one more thing: if the calculated test duration is shorter than one full cycle, it is safer to extend the test to cover at least one full cycle.

However, during the actual test, you may see a significant difference in the evaluation metrics before the expected time. Be careful here: ending the test prematurely can invalidate all the earlier work. Let me explain in more detail.

Let’s assume that the data analyst responsible for this test is doing an A/B test for the first time and is particularly excited. They are observing the experiment and calculating the test results every day. On the 6th day of the experiment (before reaching the expected sample size), they notice a significant difference in the evaluation metrics between the experimental group and the control group. The data analyst starts to wonder if the test has been successful ahead of schedule because they have achieved statistical significance before the estimated time.

The answer is, of course, negative.

On one hand, since the sample size is constantly changing, each observation of the test can be considered as a new experiment. According to statistical conventions, A/B tests generally have a 5% Type I error rate α, which means that on average, out of 100 repeated tests, you would get 5 false positive results of statistical significance.

This means that if we have more observations, the probability of observing false positive results of statistical significance will significantly increase. This is a manifestation of the multiple testing issue. I will explain the multiple testing issue in detail in the 9th class.
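To make this concrete, here is a minimal A/A simulation sketch in R; the conversion rate, sample sizes, and peeking schedule are all illustrative assumptions. Both groups share the same true rate, yet checking for significance at several interim points pushes the false-positive rate well above the nominal 5%.

set.seed(42)
# A/A simulation: both groups have the same true conversion rate (2%),
# but we "peek" at the p-value at several interim sample sizes.
n_sims  <- 1000
n_final <- 20000                          # final sample size per group (illustrative)
peeks   <- c(5000, 10000, 15000, 20000)   # interim looks

false_positive <- replicate(n_sims, {
  a <- rbinom(n_final, 1, 0.02)           # control conversions
  b <- rbinom(n_final, 1, 0.02)           # "experimental" conversions, same true rate
  any(sapply(peeks, function(n) {
    prop.test(c(sum(a[1:n]), sum(b[1:n])), c(n, n))$p.value < 0.05
  }))
})

mean(false_positive)   # noticeably higher than the nominal 5%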

On the other hand, prematurely observing statistically significant results means that the sample size has not reached the minimum estimated size. Hence, this so-called “statistically significant result” is highly likely to be a false positive. “False positive” refers to the situation where the two groups are actually the same, but the test result wrongly indicates significant differences between them.

Therefore, the data analyst should not end this test prematurely and still needs to continue observing the experiment.

But if the test has already run for 10 days and the sample size has reached the previously calculated size, can we start analyzing the results of the A/B test then?

The answer is still no.

As the saying goes, haste makes waste. In order to ensure that the experiment is carried out as designed without any bugs that could compromise the statistical integrity, we need to perform a sanity check before formally analyzing the experiment results, thus ensuring the accuracy of the results.

In the 3rd and 4th classes, we learned that in order to prevent bugs that could compromise the statistical integrity during the implementation process, we can use the guardrail metrics to ensure statistical quality. In this case, we can use the ratio of the sample sizes in the experimental and control groups, as well as the distribution of features in the experimental and control groups as the two guardrail metrics. This is the second issue we need to focus on to ensure the reliability of the test results.

Statistical Quality Assurance for A/B Testing #

Testing the Proportion of Experimental and Control Group Samples #

Our assumption is that the experimental and control groups each account for 50% of the total sample size. Now let’s see if there have been any changes during the experiment.

The proportion of samples in each group relative to the total sample size is also a probability and follows a binomial distribution. Therefore, the specific operation method (see relevant content on the fluctuation of indicators in Lesson 4) is as follows:

  • First, calculate the standard error using the formula \(\sqrt{\frac{p(1-p)}{n}}\).
  • Then, construct a 95% confidence interval centered around 0.5 (50%).
  • Finally, confirm whether the actual sample ratio of the two groups falls within the confidence interval.

If the actual proportion falls within the confidence interval, then even though the split is not exactly 50%/50%, it is close enough to count as normal fluctuation, and the sample sizes of the two groups meet expectations. Otherwise, it indicates a problem with the experiment. How do we confirm and solve such a problem? I will come back to that shortly.

Returning to our A/B test, we have 315,256 samples in the experimental group and 315,174 samples in the control group. Plugging these into the formula gives a standard error of 0.06%, so the 95% confidence interval is [50% - 1.96 × 0.06%, 50% + 1.96 × 0.06%] = [49.88%, 50.12%], which is the normal fluctuation range of the sample ratio. The actual split between the experimental and control groups is 50.01%/49.99%.
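Here is a minimal R sketch of this sanity check using the sample sizes above.

# Sanity check: is the observed split consistent with the designed 50%/50%?
n_exp  <- 315256
n_ctrl <- 315174
n_all  <- n_exp + n_ctrl

se <- sqrt(0.5 * (1 - 0.5) / n_all)        # standard error, about 0.06%
ci <- 0.5 + c(-1, 1) * qnorm(0.975) * se   # 95% CI around 50%, about [49.88%, 50.12%]
ratio_exp <- n_exp / n_all                 # observed experimental-group ratio, about 50.01%

ratio_exp >= ci[1] & ratio_exp <= ci[2]    # TRUE: within normal fluctuation
# Equivalently, prop.test(n_exp, n_all, p = 0.5) tests the same hypothesis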

As we can see, both sample ratios fall within the confidence interval, indicating normal fluctuations. This means that the sample sizes of the two groups meet the expectation of an equal split. The test for the proportion of experimental and control group samples has passed, and now we can proceed to test the distribution of features in the experimental and control groups.

Testing the Distribution of Features in the Experimental and Control Groups #

In A/B testing, the data in the experimental and control groups should be similar in order to make valid comparisons. We can use the comparison of feature distributions between the two groups to determine their similarity.

Commonly used features include basic information such as age, gender, location, or any features that may affect the evaluation indicators. For example, in the case of a music app, we can also examine the users’ normal activity levels. If there is a significant difference in the proportion of these features between the two groups, it indicates a problem with the experiment.
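As an illustration of such a check, the sketch below compares the operating-system mix of the two groups with a chi-squared test in R; the iOS/Android counts are hypothetical.

# Hypothetical iOS/Android counts for each group (rows: groups, columns: OS)
os_counts <- matrix(c(189100, 126156,    # experimental group
                      189050, 126124),   # control group
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("experimental", "control"), c("iOS", "Android")))

# A large p-value suggests the OS distribution is similar between the groups
chisq.test(os_counts)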

Once a problem is found in the validity test, do not rush to analyze the experimental results as they are likely to be inaccurate. What we need to do is find the cause of the problem, solve it, and then implement an improved A/B test.

There are two main methods for finding the cause:

  • Check the implementation process with engineers to see if there are any deviations or bugs specific to the implementation of the two groups.
  • Analyze the existing data from different dimensions to see if there is a bias in a specific dimension. Common dimensions include time (days), operating systems, device types, etc. For example, from the dimension of operating systems, check if there is a bias in the proportion of iOS and Android users between the two groups. If so, it indicates that the cause is related to the operating system.

Through data analysis, we find that the distribution of important features in these two groups is basically the same, indicating that the two groups of data are similar. This means that we have passed the validity test, and we can now analyze the results of the A/B test.

Finally, I want to emphasize that both of these validity tests are crucial for ensuring the quality of the experiment. If either of them fails, the experimental results will be inaccurate. Specifically, if the actual sample ratio of the experimental and control groups deviates from the designed ratio, we have a Sample Ratio Mismatch problem. If the feature distributions of the experimental and control groups are dissimilar, we may run into Simpson's Paradox. We will focus on these two types of problems in Lesson 11.

How to analyze the results of an A/B test? #

Analyzing the results of an A/B test mainly means comparing the evaluation metrics of the experimental group and the control group and checking whether the difference between them is significant. So, what does "significant" mean? It means using statistical methods to rule out random factors and show that the difference between the two groups is real rather than a coincidental fluctuation.

So, how do we do it specifically?

Firstly, we can calculate the relevant statistical values using hypothesis testing in statistics, and then analyze the test results. The most commonly used statistical values are p-value and confidence interval.

You may ask: there are various tests in hypothesis testing, so which one should we choose to calculate the p-value and confidence interval? We don't need the complex theory behind these tests here; we just need to be familiar with the usage scenarios of the three test methods most commonly used in practice:

  1. Z-test

When the evaluation indicator is a probability indicator (such as conversion rate, registration rate, etc.), the Z-test is generally used (sometimes referred to as the proportion test in A/B testing) to calculate the corresponding p-value and confidence interval.

  2. T-test

When the evaluation indicator is a mean indicator (such as average usage time, average usage frequency, etc.), and when it can be approximated to a normal distribution with a large sample size, the T-test is generally used to calculate the corresponding p-value and confidence interval.

  3. Bootstrapping

When the distribution of the evaluation indicator is complex and cannot be approximated by a normal distribution even with a large sample size (for example, percentile metrics such as the 70th percentile of user usage time, or a composite OEC), the Bootstrapping method is generally used to compute the p-value and confidence interval directly from their definitions (for specific methods, please refer to the relevant content on metric volatility in the third lesson).
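Since the first two cases are covered by ready-made functions in R (prop.test and t.test), here is a minimal bootstrapping sketch for the third case; the usage-time data are simulated purely for illustration.

set.seed(1)
# Simulated daily usage time (minutes); heavily skewed, so we compare the
# 70th percentile rather than the mean
usage_exp  <- rexp(5000, rate = 1 / 32)
usage_ctrl <- rexp(5000, rate = 1 / 30)

# Bootstrap the difference in the 70th percentile between the two groups
boot_diff <- replicate(2000, {
  quantile(sample(usage_exp,  replace = TRUE), 0.7) -
    quantile(sample(usage_ctrl, replace = TRUE), 0.7)
})

# 95% confidence interval of the difference; if it excludes 0, the difference
# is statistically significant
quantile(boot_diff, c(0.025, 0.975))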

Now we have obtained the following test results:

  • Experimental group: Sample size is 315,256, there are 7,566 upgraded users, and the upgrade rate is 2.4%.
  • Control group: Sample size is 315,174, there are 6,303 upgraded users, and the upgrade rate is 2.0%.

Because the fluctuation range of the evaluation indicator is [1.86%,2.14%], we can conclude that the upgrade rate of 2.4% in the experimental group is not within the normal range, and it is likely to be significantly different from the control group.

Next, we can analyze this test result using the p-value method and the confidence interval method to verify whether our hypothesis is correct.

P-Value Method #

First, we can use the P-value method with the help of some calculation tools. Common options include Python, R, and various online A/B-test calculators; choose whichever you prefer. Personally, I prefer to do the calculation in R:

# Two-sample proportion test: upgraded users and sample sizes for the experimental and control groups
results <- prop.test(x = c(7566, 6303), n = c(315256, 315174))

Since the evaluation metric, user upgrade rate, belongs to the probability category, we choose the function prop.test specifically designed for probability metrics.

Through calculation, we obtain a P-value < 2.2e-16.

According to statistical convention, we generally set the significance level α to 5% and compare the calculated P-value with it. When the P-value is less than 5%, the two groups have a significant difference; when the P-value is greater than 5%, they do not. If this concept is not clear to you, you can review the content on hypothesis testing in the second lecture.
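If you prefer to make the comparison in code rather than by reading the printed output, the p-value can be extracted from the prop.test result directly:

results$p.value          # < 2.2e-16
results$p.value < 0.05   # TRUE: the difference is statistically significant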

From the above results, we can see that the P-value is much smaller than 5% and close to 0, indicating that the two groups have a significant difference. This means that the experimental group’s advertising slogan can indeed increase the upgrade rate of free users.

Confidence Interval Method #

In the third lecture, we learned how to construct a confidence interval for metrics. Now we want to compare whether the evaluation metrics of the experimental group and the control group are significantly different, that is, whether the difference between the two is zero. In this case, we need to construct a confidence interval for the difference \(p_{\text{test}} - p_{\text{control}}\) of the two metrics.

We can also use software like Python and R to calculate the confidence interval specifically. Of course, you can also use the specific functions I introduced in the second lecture. Here, we will use the prop.test function in R again.

In fact, when we used this function to calculate the P-value above, R also reported the 95% confidence interval of the difference.

From the output, we can see that the 95% confidence interval is [0.0033, 0.0047].

Next, we need to compare whether the two metrics have a statistically significant difference, that is, whether this confidence interval includes 0.

We know that values within the confidence interval are normal fluctuations. If the confidence interval includes 0, it means that the difference between the two metrics may also be 0, indicating that the two metrics are equal. If the confidence interval does not include 0, it means that the difference between the two metrics is not 0, indicating that the two metrics are significantly different.
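This check, too, can be done directly on the prop.test result rather than by reading the printed output:

ci <- results$conf.int   # 95% CI of the difference, about [0.0033, 0.0047]
ci[1] > 0 | ci[2] < 0    # TRUE: the interval excludes 0, so the difference is significant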

Clearly, the confidence interval [0.0033, 0.0047] does not include 0, which means our test result is statistically significant. In business terms, compared with the control group's message ("Unlimited listening to millions of songs without ads, upgrade now and enjoy a free trial for six months!"), the experimental group's message with a sense of urgency ("From now until November 15th, upgrade now and enjoy a free trial for six months!") attracts more users to upgrade, which verifies our initial hypothesis.

Up to this point, we find that both the P-value method and the confidence interval method can be used to analyze the statistical significance of A/B test results. So, how do we choose between them in practical applications? Are there any differences between the two?

In fact, in most cases, these two methods are interchangeable, and choosing one is sufficient. However, if we need to consider the relationship between the post-implementation benefits and costs, we must choose the confidence interval method.

When considering the relationship between benefits and costs, in addition to meeting the statistical requirement of significance (the two metrics are not equal, i.e., the confidence interval of the difference does not include 0), we also need the result to be significant in business terms: the observed difference \(\delta\) must be at least the break-even difference \(\delta_{\text{breakeven}}\), and ideally the confidence interval of the difference should lie entirely above \(\delta_{\text{breakeven}}\).
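As a sketch of this benefit-versus-cost check, assume a hypothetical break-even difference of 0.3 percentage points (the real figure would come from the cost side of the business):

delta_breakeven <- 0.003                               # hypothetical break-even lift
ci <- results$conf.int                                 # about [0.0033, 0.0047]

statistically_significant <- ci[1] > 0 | ci[2] < 0     # interval excludes 0
practically_significant   <- ci[1] >= delta_breakeven  # lower bound clears break-even

statistically_significant; practically_significant     # both TRUE in this example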

Summary #

In this lesson, we mainly explained how to analyze results in A/B testing. Based on practical experience, I have summarized three main points for you:

  • Don’t be impatient. You must wait until you reach a sufficient sample size before analyzing the test results.
  • Before analyzing the results, you must conduct a sanity check to ensure the quality of the test. Otherwise, if bugs occur during the implementation process, all efforts will be in vain.
  • Based on the characteristics of the metrics and data, choose the correct analysis method to derive conclusions that can drive the business.

In the field of data, there is a saying: "Garbage in, garbage out." The same applies to A/B testing: if a test is not set up properly, or if problems occur during implementation, you will get incorrect results and conclusions, which can cause incalculable losses for the business.

Therefore, in the previous four lessons we talked about how to set up experiments, and today we spent a lot of time on how to ensure reliable results. All of this paves the way for "analyzing test results".

Well, today’s test for the music app has yielded significant results, and everyone is happy. But what if the results are not significant? We will discuss this issue in detail in Lesson 9!

Thought-provoking Questions #

What other guardrail metrics could be used in the sanity check to validate the test before analyzing the results? Why?

Please feel free to share your thoughts and answers in the comments, and let’s discuss and exchange ideas together. If you find this lesson helpful, please share it with your friends and invite them to learn together.