11 Common Mistakes and Solutions: Simpson’s Paradox and the Independence of Experimental and Control Groups #

Hello, I’m Bowei. In this lesson, we will continue to learn about common misconceptions and solutions in A/B testing.

Today, we will address the issues of Simpson’s Paradox and the independence of experimental and control groups, which are also common in A/B testing practices.

I have run into Simpson’s Paradox so many times that whenever I analyze the results of an A/B test by segment, I first check whether the split between the experimental and control groups within each segment matches the overall split between the two groups. Only then can I trust the experimental results.

As for violations of the independence of the experimental and control groups, I ran into them frequently early on when running A/B tests under fixed marketing budgets. The problem takes many forms, however, and shows up in many types of business, so each case needs its own analysis.

After hearing about my experiences, you may still not fully understand what these two problems are and how they affect A/B testing. Don’t worry, today I will deeply analyze them for you, guide you to identify them in practice, and solve them!

Simpson’s Paradox #

When you hear the concept of “Simpson’s Paradox,” you may feel a bit confused and unsure about what specific issue it refers to. So, let me first give you an example using a music app to explain what Simpson’s Paradox is and what it means in A/B testing.

A music app optimized its registration process for new users and ran an A/B test in its two major markets, Beijing and Shanghai, to verify whether the optimization increased the conversion rate.

Experimental group: Used the optimized registration process.
Control group: Used the original registration process.

In the experimental design, the sample was split evenly between the experimental and control groups. Once the experiment had run long enough to collect sufficient samples, the results showed a conversion rate of 1.44% for the experimental group and 2.02% for the control group.

This is quite surprising. Why is the conversion rate of the experimental group (using the optimized registration process) lower than that of the control group (using the original registration process)?

What’s even more surprising is that when analyzing these two markets separately, you will find that the conversion rate of the experimental group is higher than that of the control group in both Beijing and Shanghai.

After confirming that the data and calculations are correct, your instinctive reaction may be: Even with the same data, the results obtained from overall analysis and segmented analysis are completely opposite. How is this possible!

This is the famous Simpson’s Paradox from statistics. It means that when the composition of the groups being compared is uneven, comparing them as a whole and comparing them within each segment can lead to opposite conclusions. Mathematically, it can be stated more abstractly: even if \(\frac{a}{b}<\frac{A}{B}\) and \(\frac{c}{d}<\frac{C}{D}\), it is still possible that \(\frac{a+c}{b+d}>\frac{A+C}{B+D}\).
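To make the inequality concrete, here is a minimal Python sketch. The per-city counts are hypothetical, chosen only so that the overall rates reproduce the 1.44% and 2.02% quoted above; they are not the actual data from the music app. In terms of the formula, the lowercase letters play the role of the control group and the uppercase letters the experimental group.

```python
from fractions import Fraction

# Hypothetical per-city counts as (conversions, users). Not data from the lesson.
experimental = {"Beijing": (80, 8000), "Shanghai": (64, 2000)}
control = {"Beijing": (10, 2000), "Shanghai": (192, 8000)}


def rate(conversions, users):
    """Conversion rate as an exact fraction."""
    return Fraction(conversions, users)


def overall_rate(group):
    """Pooled conversion rate across all cities in one group."""
    total_conversions = sum(c for c, _ in group.values())
    total_users = sum(n for _, n in group.values())
    return Fraction(total_conversions, total_users)


# Within each city the experimental group wins...
for city in experimental:
    exp_rate = rate(*experimental[city])
    ctl_rate = rate(*control[city])
    print(f"{city}: experimental {float(exp_rate):.2%} vs control {float(ctl_rate):.2%}")

# ...but the pooled comparison points the other way.
print(f"Overall: experimental {float(overall_rate(experimental)):.2%} "
      f"vs control {float(overall_rate(control)):.2%}")
```

With these made-up counts, both groups have 10,000 users in total, the experimental group converts better in each city, and yet its pooled rate is lower, because most of its users come from the city with the lower conversion rate.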

It’s really strange that this counterintuitive phenomenon can be completely valid in mathematics. And this incredible phenomenon also occurs in A/B testing.

The reason is that, although the two groups were equal in size overall as the experimental design required, they were unevenly distributed across Beijing and Shanghai.
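A check for this kind of imbalance could be sketched as follows. This is only an illustration: the function names, the user counts, and the 0.01 tolerance are all assumptions, not part of the lesson.

```python
def segment_shares(users_by_segment):
    """Share of a group's users that falls into each segment."""
    total = sum(users_by_segment.values())
    return {segment: n / total for segment, n in users_by_segment.items()}


def mix_imbalance(experimental_users, control_users):
    """Largest absolute gap in segment share between the two groups."""
    exp = segment_shares(experimental_users)
    ctl = segment_shares(control_users)
    segments = set(exp) | set(ctl)
    return max(abs(exp.get(s, 0.0) - ctl.get(s, 0.0)) for s in segments)


# Hypothetical user counts: equal totals, but a very different city mix.
experimental_users = {"Beijing": 8000, "Shanghai": 2000}
control_users = {"Beijing": 2000, "Shanghai": 8000}

TOLERANCE = 0.01  # hypothetical threshold; pick one that suits your traffic
gap = mix_imbalance(experimental_users, control_users)
if gap > TOLERANCE:
    print(f"Segment mix differs by {gap:.0%}: pooled results may be misleading")
```

Here the gap is about 0.6, meaning Beijing makes up 80% of one group but only 20% of the other, which is exactly the situation in which pooled and segmented results can disagree.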

Experimental and control groups should be independent of each other #

Firstly, it should be noted that A/B testing has a prerequisite: the experimental units in the experimental and control groups should be independent of each other. This means that the behavior of the units in each group in the test is only influenced by the experience of their respective group and should not be affected by other groups. This prerequisite is also known as the Stable Unit Treatment Value Assumption (SUTVA).

Regarding the independence of the experimental and control groups, many people may think: what’s there to say about it? Since users are divided into two groups, the groups must be independent of each other. In practice, however, this is not always the case. Certain business scenarios cause us to break the independence between the two groups without intending to, which compromises the accuracy of the experimental results.

This is because A/B testing is fundamentally about causal inference. Only when the experimental and control groups are mutually independent and do not interfere with each other can a significant difference in the test results be attributed to the change applied to the experimental group. Otherwise, it is difficult to establish an accurate causal relationship.

You may not fully understand it yet, so let’s look at how the independence between the two groups of experimental units is compromised in specific business scenarios in A/B testing.

What are the manifestations of compromising the independence of the two groups? #

In A/B testing, the independence of the two groups is primarily compromised in three types of businesses: social networking/communication, sharing economy, and shared resources. Let me explain each of them to you.

The first type of business is social networking/communication.

This type of business mainly involves user interactions and information exchange, with typical examples including WeChat, Weibo, LinkedIn, voice/video communication, email, and so on.

In these businesses, network effects exist. Network effects mean that adjacent nodes in a network influence each other. If node A is in the experimental group and its adjacent node B is in the control group, then they are no longer independent.

Let me give you an example to help you understand. Suppose a social networking app has improved its recommendation algorithm for the information feed, aiming to increase user interactions by recommending more relevant content. Now, we want to test the effectiveness of this algorithm improvement through A/B testing.

Control group: Using the old algorithm.
Experimental group: Using the new and improved algorithm.
Evaluation metric: Average usage time of users.

In this case, we can make a hypothesis: users in the experimental group, like user A, experience the improved algorithm, see more content they like, spend more time on the app, and share more interesting content, resulting in increased interactions with friends. User B, who is a friend of user A, happens to be in the control group. When B sees the content and interactions shared by A, B may also spend more time browsing the app and participate in the interaction with A, even though B did not experience the improved algorithm.

This is a typical example of network effects in A/B testing. User A in the experimental group changes their usage behavior because of the change being tested, and this behavioral change is transmitted through the network to their friend B in the control group, affecting B’s usage behavior as well. This contradicts the experimental design, in which users in the control group should only experience the old algorithm. Both groups may then show increased metrics, so the measured difference between them no longer reflects the true effect of the change.
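The distortion can be illustrated with a deliberately simple model. Everything here is an assumption made for illustration: the 30-minute baseline, the 6-minute true effect, and the idea that a fixed share of the effect leaks to the average control user.

```python
def observed_difference(true_effect, spillover_share, baseline=30.0):
    """Difference in average usage time that the A/B test would report.

    true_effect: minutes added per user by the new algorithm (hypothetical).
    spillover_share: fraction (0 to 1) of that effect that leaks to the
        average control user through their friends (hypothetical).
    """
    experimental_mean = baseline + true_effect
    control_mean = baseline + spillover_share * true_effect
    return experimental_mean - control_mean


# No interference: the test recovers the full 6-minute effect.
print(observed_difference(true_effect=6.0, spillover_share=0.0))
# Half of the effect leaks to the control group: the test reports only 3 minutes.
print(observed_difference(true_effect=6.0, spillover_share=0.5))
```

In this toy model both groups improve, but the comparison between them shrinks as the spillover grows, so the test understates what the new algorithm actually does.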

The second type of business is the sharing economy.

The sharing economy is generally a two-sided market: the company provides a trading platform, and users participate on both the supply side and the demand side. Typical examples include Taobao, Didi, Uber, shared bicycles, shared rentals, and Airbnb. In this type of business, supply and demand are in dynamic equilibrium, so a change on one side inevitably causes changes on the other, which undermines the independence between the two groups in the experiment.

For example, when we use A/B testing to validate the effectiveness of different optimizations, we can often test only one optimization at a time. If we test an optimization on the demand side, we split the demand side into an experimental group and a control group. The optimization increases demand in the experimental group. With supply fixed, more supply flows to the experimental group and less is left for the control group. The control group’s user experience then gets worse, which further depresses its demand.

Let me give you an example. Let’s say a shared ride-hailing service has optimized the ride-hailing process on the app, and now we want to verify the effectiveness of this optimization through A/B testing.

Here, the experimental group uses the optimized process, while the control group uses the old process. The optimized process makes it more convenient for users in the experimental group to book a ride, so they take up more drivers. Since the total number of drivers is fixed, fewer drivers remain available to the control group. Users in the control group find it harder to book a ride and have a poorer experience. As a result, the A/B test may overestimate the effect of the process optimization.
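A toy model shows how large the overestimate can be. It assumes, purely for illustration, that a fixed ride capacity is split between the groups in proportion to their requests; the capacity and request numbers are made up.

```python
def completed_rides(requests_by_group, driver_capacity):
    """Split a fixed ride capacity across groups in proportion to requests."""
    total_requests = sum(requests_by_group.values())
    if total_requests == 0:
        return {group: 0.0 for group in requests_by_group}
    served = min(total_requests, driver_capacity)
    return {group: served * requests / total_requests
            for group, requests in requests_by_group.items()}


DRIVER_CAPACITY = 1800   # hypothetical rides the drivers can serve
OLD_REQUESTS = 1000      # hypothetical requests from 1,000 users, old process
NEW_REQUESTS = 1200      # hypothetical requests from 1,000 users, new process

# During the A/B test, both groups compete for the same drivers.
ab_test = completed_rides(
    {"experimental": NEW_REQUESTS, "control": OLD_REQUESTS}, DRIVER_CAPACITY)
measured_lift = ab_test["experimental"] / ab_test["control"] - 1

# What would happen if everyone used the old process or the new one.
all_old = completed_rides({"everyone": 2 * OLD_REQUESTS}, DRIVER_CAPACITY)
all_new = completed_rides({"everyone": 2 * NEW_REQUESTS}, DRIVER_CAPACITY)
launch_lift = all_new["everyone"] / all_old["everyone"] - 1

print(f"Lift measured in the A/B test: {measured_lift:.0%}")
print(f"Lift after a full launch:      {launch_lift:.0%}")
```

Under these assumptions the test reports a lift in completed rides of about 20%, while a full launch would change nothing because the drivers were already fully booked. The control group also completes fewer rides during the test than the 900 it would have had without the experiment.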

The third type of business is shared resources.

Some businesses share fixed resources or budgets between groups, the most common example being advertising and marketing.

Suppose we use A/B testing to compare different advertisements under a fixed marketing budget. Most online advertisements are paid for per click, so if the improved advertisement in the experimental group performs better and earns a higher click-through rate, the experimental group spends more of the budget. Because the total budget is fixed, that spending eats into the control group’s share, which hurts the performance of the control group’s advertisements. From this perspective, the A/B test may overestimate the performance of the experimental group’s advertisement.

How to avoid undermining the independence of the two groups? #

So, from the three types of businesses I just mentioned, you can see that in actual business scenarios, there are many manifestations and reasons for violating the independence of the two experimental groups. Therefore, there are different methods to solve this problem, but the general principle is to exclude interference between the two groups through different forms of separation. Specifically, there are four main methods of separation:

The first method is geographical separation.

This method is mainly applicable to offline services affected by geographical locations, such as shared transportation and shared rentals. Different regions usually do not have interference between each other in localized services, so they can be classified into different markets.

A common approach is, for example, to make users in Beijing the experimental group and users in Shanghai the control group, which rules out interference between the two groups. Note that the markets selected should be as similar as possible so that they are comparable. Similarity here includes, but is not limited to, how developed the business is locally, the economic situation, and the population distribution.

The second method is resource separation.

This method is mainly applicable to interference caused by resources shared between the two groups. In practice, the share of resources allocated to each group in the A/B test should match that group’s share of the sample. For example, when an A/B test compares the effectiveness of different advertisements, each group’s share of the advertising budget should equal its share of the sample. When the sample is split evenly between the two groups, the advertising budget should be split evenly as well. This way, one group’s spending cannot eat into the other group’s budget.
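The allocation rule can be sketched in a few lines of Python. The function name and the numbers are hypothetical.

```python
def allocate_budget(total_budget, sample_sizes):
    """Give each group a budget share equal to its share of the sample."""
    total_sample = sum(sample_sizes.values())
    return {group: total_budget * n / total_sample
            for group, n in sample_sizes.items()}


# Even split of the sample, so an even split of the budget.
print(allocate_budget(10_000, {"experimental": 5_000, "control": 5_000}))
# Uneven split of the sample: the budget follows the same ratio.
print(allocate_budget(10_000, {"experimental": 7_000, "control": 3_000}))
```

Each group then spends only from its own allocation, so a higher click-through rate in one group cannot drain the budget of the other.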

The third method is time separation.

This method is mainly applicable to changes that users do not easily notice, such as algorithm improvements. The idea is that the experimental group and the control group are the same set of users: for one period of time they receive the change, which gives them the experimental group experience, and for another period they do not, which gives them the control group experience.

It should be noted that the unit of this time period can be minutes, hours, or days. In this way, because all the experimental units belong to the same group at the same time, there is no interference between different groups. However, when using this method, it is important to note that user behavior may fluctuate at different times of the day or on weekdays/weekends. If there are periodic fluctuations, comparisons should ideally be made at the same stage in different cycles, such as comparing weekdays with weekdays and weekends with weekends, but not weekdays with weekends.
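The advice to compare like with like can be sketched as follows. This is a minimal illustration: the function names, the one-day time unit, and the usage-time numbers are all assumptions.

```python
from datetime import date
from statistics import mean


def day_type(d):
    """Classify a date as 'weekday' or 'weekend'."""
    return "weekend" if d.weekday() >= 5 else "weekday"


def effect_by_day_type(observations):
    """Experimental minus control mean, computed separately per day type.

    observations: iterable of (date, arm, metric_value), where arm is
    'experimental' or 'control'.
    """
    buckets = {}
    for d, arm, value in observations:
        buckets.setdefault((day_type(d), arm), []).append(value)
    effects = {}
    for kind in ("weekday", "weekend"):
        exp = buckets.get((kind, "experimental"))
        ctl = buckets.get((kind, "control"))
        if exp and ctl:  # only compare when both arms have data
            effects[kind] = mean(exp) - mean(ctl)
    return effects


# Hypothetical daily averages of usage time, in minutes.
observations = [
    (date(2024, 1, 1), "control", 30.0),        # Monday
    (date(2024, 1, 6), "control", 40.0),        # Saturday
    (date(2024, 1, 8), "experimental", 32.0),   # Monday
    (date(2024, 1, 13), "experimental", 42.0),  # Saturday
]
effects = effect_by_day_type(observations)
print(effects)
```

In this made-up data the change adds 2 minutes on both weekdays and weekends. Comparing the experimental Monday with the control Saturday would instead suggest an 8-minute drop, which is just the weekly cycle, not the effect of the change.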

The fourth method is separation through clustering.

This method is mainly suitable for social networking businesses. The connections between users in a social network are not uniformly distributed; some users are much more closely connected than others. We can therefore use modeling methods to group users into clusters based on how much they interact, so that each cluster contains closely connected users. We then treat each cluster as an experimental unit and randomly assign whole clusters to the experimental or control group, which reduces interference between the groups to some extent.

This method is more complex and difficult to implement, requiring support from data modeling and engineering teams. If interested, you can refer to the experiences of Google and LinkedIn in this area.
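Once the clusters exist, the assignment step itself is simple. The sketch below assumes the clusters have already been computed by some modeling step that is out of scope here; the function name, user IDs, and cluster IDs are hypothetical, and this is not a description of how Google or LinkedIn implement it.

```python
import random


def assign_clusters(user_to_cluster, seed=0):
    """Randomize at the cluster level, then give each user their cluster's arm."""
    clusters = sorted(set(user_to_cluster.values()))
    rng = random.Random(seed)  # seeded so the assignment is reproducible
    rng.shuffle(clusters)
    half = len(clusters) // 2
    arm_of_cluster = {
        cluster: ("experimental" if i < half else "control")
        for i, cluster in enumerate(clusters)
    }
    return {user: arm_of_cluster[cluster]
            for user, cluster in user_to_cluster.items()}


# Hypothetical mapping from user to a precomputed cluster of close friends.
user_to_cluster = {
    "alice": "c1", "bob": "c1",
    "carol": "c2", "dave": "c2",
    "erin": "c3", "frank": "c3",
    "grace": "c4", "heidi": "c4",
}
assignment = assign_clusters(user_to_cluster, seed=42)
print(assignment)
```

Because friends in the same cluster always land in the same group, most interactions stay within a group. Connections that cross clusters can still leak, which is why this only reduces interference rather than removing it.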

Summary #

In this lesson, I specifically explained two common experimental pitfalls: Simpson’s paradox and the independence of the experimental/control group. I also provided commonly used solutions in practice. I believe that through this lesson, you will be able to promptly identify and resolve these pitfalls in your own practice.

Furthermore, I have summarized the content of these two lessons into a single image, which is included in the document. You can save it for future review and reinforcement.

With this, we conclude our series on common experimental pitfalls. Through these two lessons, you have seen how complex and variable real business scenarios are, and how many potential pitfalls there are that you can only learn to spot through accumulated trial and error. This highlights the importance of practice in A/B testing.

If you can identify and resolve these common potential pitfalls in A/B testing in a timely manner, it not only helps to recover potential losses for the company and achieve sustainable growth, but also serves as an important indicator that distinguishes you from A/B testing beginners.

Thought Question #

Based on your own experience, think about whether you have encountered Simpson’s Paradox, or a violation of the independence of the experimental and control groups, in A/B testing. How did you handle it at the time?

Please share your thoughts and findings in the comments section so that we can learn and discuss together. If this series on common mistakes has helped answer some of your difficult questions, feel free to share the course with your colleagues and friends. See you in the next lesson.