10 Common Misconceptions and Solutions: The Multiple Testing Problem and the Learning Effect #

Hello, I’m Bo Wei.

In the previous lesson, we talked about a common problem in A/B testing. Now, based on my years of experience in A/B testing, I have selected some misconceptions that are frequently encountered in practical business scenarios. These misconceptions mainly involve multiple testing issues, learning effects, Simpson’s paradox, and the independence of experimental and control groups.

These four misconceptions are really just common problems that come up in practical business scenarios. I emphasize in the title that they are misconceptions, however, because you are likely to hold mistaken assumptions about these issues.

Therefore, in the next two lessons, I will explain each issue following the structure of “why the problem arises - how it manifests in practice - how to solve it”. In other words, I will first delve into the causes of each problem, then use examples to analyze how the problem shows up in practice, and finally give the corresponding solutions.

After all, by understanding the principles of the problems first and then learning about their manifestations and solutions, not only can you achieve better learning outcomes, but you can also adapt and apply them flexibly according to the ever-changing business scenarios in practical applications.

Multiple Testing Problem #

The multiple testing problem, also known as the multiple comparison problem, refers to the increase in the type I error rate (α) and the impact on the accuracy of the results when comparing multiple tests simultaneously. I have mentioned this problem multiple times in the basic section, such as in the 4th lesson when discussing the benefits of A/B testing and in the 7th lesson when discussing when to check the test results.

Why is multiple testing a problem? #

To understand why multiple testing is a problem, we need to start with the type I error rate (α) (also known as the false positive rate or significance level, which is the predetermined value before the test, usually 5%). As I mentioned in the 2nd lesson, the type I error rate refers to the probability of inferring a difference between two groups when, in fact, the two indicators are the same, or the probability of obtaining a significant result by chance. Moreover, the convention in statistics is 5%.

5% may seem like a small probability event, but what if we simultaneously compare 20 tests? Just think about it, if the probability of a type I error occurring in each test is 5%, what is the probability of at least one type I error occurring in these 20 tests?

It is not easy to directly calculate the probability of this event, but we can calculate the complement of this event, which is the probability of no type I error occurring in these 20 tests, and then subtract this complement from 100%.

Let’s use P(A) to represent the probability of event A occurring. For a single test, P(type I error) = 5%, so P(no type I error) = 1 - 5% = 95%; therefore, P(no type I error in any of the 20 tests) = \((95\%)^{20}\).

In this way, we can calculate the probability:

  • P(at least one type I error occurring) = 100% - P(no type I error in 20 tests) = \(1-(1-5\%)^{20}\).

The probability of at least one type I error occurring is also called the FWER (Family-Wise Error Rate). The calculated probability here is about 64%. This means that when we run 20 comparisons simultaneously, the probability of at least one type I error appearing among these 20 results is 64%. See, isn’t that a considerable probability? In fact, as the number of tests increases, this probability gets higher and higher, as you can see from the graph below.

The blue line and orange line in the graph represent the changes in FWER when α is 5% and 1% respectively. Based on this graph, we can draw two conclusions:

  1. As the number of tests increases, FWER (the probability of at least one type I error occurring) increases significantly.

  2. When α is smaller, FWER is smaller and grows more slowly as the number of tests increases.

The first conclusion explains the problem brought by multiple testing. The second conclusion actually provides us with a potential solution: reducing α.

This means that when we compare multiple tests simultaneously, the probability of a type I error occurring (the FWER) increases; this is the multiple testing problem.
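To make this concrete, here is a minimal Python sketch (the helper name and the chosen test counts are mine, not from the lesson) that computes \(FWER = 1-(1-\alpha)^{n}\) for different numbers of tests; running it reproduces both conclusions above:

```python
def fwer(alpha: float, n_tests: int) -> float:
    """Family-wise error rate: the probability of at least one type I error
    across n_tests independent comparisons, each run at significance level alpha."""
    return 1 - (1 - alpha) ** n_tests

for alpha in (0.05, 0.01):
    for n in (1, 5, 10, 20, 50, 100):
        print(f"alpha={alpha:.0%}, tests={n:3d} -> FWER={fwer(alpha, n):.1%}")

# With alpha=5% and 20 tests this prints roughly 64.2%,
# matching the calculation in the text.
```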

When do we encounter the multiple testing problem? #

You might say: in practice I only run one test at a time, so I won’t run into this problem, right? Actually, that is not the case. The multiple testing problem is much more common in practice than you might think, and it mainly appears in four forms.

The first form is when there is more than one experimental group in an A/B test.

When we want to change more than one variable and have sufficient sample size, we don’t need to wait for the completion of one variable test before testing the next one. Instead, we can test these variables simultaneously by assigning them to different experimental groups.

Each experimental group changes only one variable, and the results are analyzed by comparing each experimental group with the common control group. This testing method is also called A/B/n testing. For example, if I want to improve the effectiveness of an advertisement by changing its content, background color, font size, etc., I would need three corresponding experimental groups, and I would compare each of them with the control group separately.

This means that three tests are conducted simultaneously, resulting in a multiple testing problem.

The second form is when there is more than one evaluation metric in an A/B test.

This is easy to understand: when analyzing the test results, we are actually comparing the evaluation metrics of the experimental group and the control group. If there are multiple evaluation metrics, multiple comparisons are made, leading to the multiple testing problem.

The third form is when you conduct segmentation analysis, breaking down the results of an A/B test by different dimensions.

When analyzing test results, depending on business needs, we may not be satisfied with only comparing the experimental group and control group as a whole.

For example, for a multinational company, many A/B tests are conducted simultaneously in multiple countries around the world. In this case, if we want to see the specific impact of the changes in the A/B test on different countries, we will analyze the results by country, comparing the metrics of the two groups within each country. In this situation, analyzing the test results for each country is one test, and multiple countries represent multiple tests.

The fourth form is when you continuously check the experimental results during the A/B test process.

I mentioned this situation in Lesson 7. Because the test is still ongoing and data keeps accumulating, the results will differ each time we check them. Each check counts as one test, which also leads to the multiple testing problem.

Now that we understand the manifestations of the multiple testing problem in practice, how do we solve it in practice?

How to solve the multiple testing problem? #

First, I want to mention in advance that the methods I’m about to introduce only apply to the first three forms. As for the solution to the fourth form, I have already discussed it in Lesson 7, which is to avoid checking the results prematurely while the A/B test is still ongoing. We must wait until the sample size requirement is met before calculating the results. Therefore, I will not go into detail on this here.

Given the ubiquity of the multiple testing problem, many statisticians have proposed their own solutions, which can be roughly classified into two categories:

  1. Keeping the p-values of each test unchanged and adjusting alpha.

  2. Keeping alpha unchanged and adjusting the p-values of each test.

Why do we make these two types of adjustments?

In Lesson 2, we introduced that when judging the significance of the hypothesis test results using p-values, we compare the calculated p-value with alpha. We consider the results significant when p-value < alpha.

Therefore, we either adjust alpha or adjust the p-values. As I mentioned earlier, reducing alpha is one way to solve the problem. The most commonly used method for adjusting alpha is the Bonferroni correction. It is quite simple: just replace alpha with alpha/n.

Here, n is the number of tests. For example, if alpha is 5% and we compare 20 tests, the adjusted alpha would be 5%/20 = 0.25%. The FWER then becomes \(1-(1-0.25\%)^{20} \approx 4.88\%\), which is close to our original alpha of 5%.

The Bonferroni correction is popular in A/B testing practice because it is simple to apply. However, it adjusts only alpha and applies the same one-size-fits-all cutoff to every p-value, which makes it rather conservative. It works reasonably well when the number of tests is small.

Based on practical experience, when the number of tests is large (e.g., hundreds of tests, which often happens in A/B testing when conducting segmentation analysis by different dimensions, such as markets for multinational companies), the Bonferroni correction can significantly increase the Type II error rate (beta). In this case, a better solution is to adjust the p-values by controlling the False Discovery Rate (FDR).

The principle of controlling FDR is quite complex, and I won’t go into detail here. Just remember that it refers to a class of methods, among which the most commonly used one is the Benjamini-Hochberg (BH) procedure. The BH procedure takes into account the magnitude of each p-value and makes different adjustments. The general adjustment approach involves sorting the calculated p-values in ascending order and adjusting different p-values based on their ranks. Finally, the adjusted p-values are compared with alpha.
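To give you a feel for how this works, here is a minimal sketch of the BH-style p-value adjustment (a common textbook formulation; the function name and sample p-values are mine, not from the lesson):

```python
import numpy as np

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values: sort the p-values in ascending
    order, scale each by n/rank, then enforce that the adjusted values
    are non-decreasing."""
    p = np.asarray(pvalues, dtype=float)
    n = len(p)
    order = np.argsort(p)                            # positions sorted by p-value
    scaled = p[order] * n / np.arange(1, n + 1)      # p * n / rank
    # Take the running minimum from the largest p-value downward
    # so the adjusted values stay monotone, and cap them at 1.
    monotone = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty_like(p)
    adjusted[order] = np.clip(monotone, 0, 1)
    return adjusted

print(bh_adjust([0.001, 0.012, 0.03, 0.04, 0.20]))
# adjusted values: 0.005, 0.03, 0.05, 0.05, 0.20
```

Each adjusted p-value can then be compared directly with alpha (for example, 5%).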

In practice, we usually use tools like Python for these calculations. The statsmodels package in Python provides a handy function called multipletests, which supports a variety of multiple-testing corrections, including the Bonferroni correction and the BH procedure discussed today. We only need to pass in the p-values and choose the correction method, and the function returns the adjusted p-values for us.
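As a quick usage sketch (the p-values below are made up purely for illustration):

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values, e.g., from a segmentation analysis across five markets.
pvalues = [0.001, 0.012, 0.03, 0.04, 0.20]

for method in ("bonferroni", "fdr_bh"):
    reject, p_adjusted, _, _ = multipletests(pvalues, alpha=0.05, method=method)
    print(method, p_adjusted.round(4), reject)
```

With the Bonferroni correction each p-value is simply multiplied by the number of tests (and capped at 1), while "fdr_bh" applies the rank-based BH adjustment shown above.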

To summarize, although the Bonferroni correction is simple, it is overly strict and conservative. Therefore, in practice, I would recommend using the BH procedure to adjust the p-values.

Now that we’ve finished discussing the multiple testing problem, let’s talk about another common issue in A/B testing—learning effect.

Learning Effect #

When we use an A/B test to evaluate an obvious change, such as a new interaction interface or new functionality for a website or product, the existing (old) users have usually grown accustomed to the previous interface and functionality, and it takes them some time to adapt to and learn the new ones. The behavior of old users during this learning and adaptation phase therefore often differs from usual, which is called the learning effect.

What are the manifestations of the learning effect in practice? #

Depending on the changes, old users have different reactions during the learning and adaptation period, which can generally be divided into two categories.

The first category is a positive reaction, also known as the novelty effect, which refers to the strong curiosity of old users towards change and their willingness to try it out.

For example, changing the color of a click button from a cool tone to a bright red color may temporarily increase metrics like click-through rates. However, when users have adapted to the new bright red color, the long-term metrics may return to the previous level.

The second category is a negative reaction, also known as change aversion. This refers to the confusion and even resistance that old users may experience towards change.

For example, in an e-commerce website that you frequently visit, the “add to cart” function used to be located at the top left of the screen. But after a change in the interface, the “add to cart” function is now located at the bottom right of the screen. In this case, you may need to spend some time looking for it on the screen, and you may even get frustrated and close the page without finding it, leading to a decrease in short-term metrics.

As you can imagine, these different reactions during the learning and adaptation period are generally short-term, and in the long run they gradually fade away. It is important to note, however, that these short-term learning effects can genuinely distort the results of an A/B test, making them look overly positive or overly negative. So, how can we detect the learning effect in time and remove the interference it causes?

How to detect the learning effect? #

In practice, there are mainly two methods that can be used to detect the learning effect.

The first method is to observe the changes in metrics of the experimental group over time (measured in days).

In the absence of a learning effect, the metrics of the experimental group should remain relatively stable over time.

However, when there is a learning effect, because it is a short-term effect that gradually fades, the metric of the experimental group (the group receiving the change) will change gradually over time until it stabilizes.

  • If it is a novelty effect, the metrics of the experimental group may initially show a rapid increase and then gradually decrease over time.
  • If it is change aversion, the metrics of the experimental group may initially show a rapid decrease and then gradually increase over time.

When using this method, note that although we observe how the experimental group’s metric changes over time, we should not run a statistical comparison between the experimental and control groups every day. If we compared the two groups daily, the multiple testing problem described earlier would arise. Remember: the two groups should only be compared once the required sample size has been reached and we analyze the final test results.
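As a rough illustration, here is a minimal pandas sketch of the first method. The file name and column names (user_id, date, group, converted) are assumptions I am making for the example, not something defined in the lesson:

```python
import pandas as pd

# Hypothetical experiment log with columns: user_id, date, group, converted
events = pd.read_csv("experiment_events.csv", parse_dates=["date"])

# Daily conversion rate of the experimental (treatment) group only --
# we are inspecting the trend, not running a daily significance test.
daily = (events[events["group"] == "treatment"]
         .groupby("date")["converted"]
         .mean())

print(daily)
# A spike that slowly decays suggests a novelty effect;
# a dip that slowly recovers suggests change aversion.
```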

The second method is to only compare the metrics of new users in the experimental and control groups.

The learning effect is caused by old users adapting to new changes, so for new users who log in for the first time during the experiment, the issue of “learning to adapt to new changes” does not exist. Therefore, we can first identify new users in both groups (if random assignment was used, the proportion of new users in both groups should be similar), and then calculate the metrics separately for new users in both groups, and finally compare these two metrics.

If we do not obtain significant results in the comparison of new users (with a sufficient sample size of new users), but obtain significant results in the overall comparison, it indicates that this change does not affect new users, but does affect old users, which likely indicates the presence of the learning effect.
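Here is a minimal sketch of the second method, continuing with the same hypothetical log and assuming it also has an is_new_user flag; the two-proportion z-test from statsmodels is just one reasonable choice for a conversion-type metric:

```python
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical log with columns: user_id, group, converted, is_new_user
events = pd.read_csv("experiment_events.csv")
new_users = events[events["is_new_user"].astype(bool)]

# Conversion counts and sample sizes for new users in each group.
counts = new_users.groupby("group")["converted"].agg(["sum", "count"])

stat, p_value = proportions_ztest(count=counts["sum"].values,
                                  nobs=counts["count"].values)
print(f"New-user comparison p-value: {p_value:.3f}")
# Not significant for new users but significant overall -> the change mainly
# affects old users, which points to a learning effect.
```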

In practice, we can use the above methods to detect the learning effect. However, to truly eliminate the influence of the learning effect and obtain accurate experimental results, it is still necessary to extend the testing period and compare the results of both groups after the learning effect in the experimental group has subsided.

Summary #

In today’s lesson, we focused on two common experimental pitfalls in A/B testing: multiple testing and the learning effect. I explained the principles behind these issues, various manifestations in practice, and their corresponding solutions in detail.

However, I would like to emphasize the problem of multiple testing. It presents itself in various forms, making it particularly common in A/B testing. When I first encountered A/B testing, I already knew about the existence of this problem, but at that time, I only knew that it would occur in A/B/n testing. Later on, I discovered that multiple testing issues can also arise when conducting segmentation analysis.

Fortunately, we discovered this issue in a timely manner, ensuring that our entire testing effort was not wasted. Looking back now, I realized that my knowledge about multiple testing was limited to its existence and a few of its manifestations. I was not clear on why this problem occurs and when it is likely to happen. As a result, I failed to recognize new manifestations of the problem when they appeared.

This is what I want to emphasize: knowing why the problem occurs, detecting it, and solving it are equally important.

Reflection Questions #

Based on your own experience, think about whether you have encountered multiple testing issues and the learning effect in A/B testing. How did you handle them at the time?

Feel free to share your learnings and in-depth thoughts in the comments. If today’s content helped answer some of your questions, please click “Share with Friend” and learn and grow together with them. Thank you for listening, and see you in the next lesson.