Lesson 12: Situations Where A/B Testing Is Inappropriate #

Hello, I am Bowen.

We know that A/B testing is a powerful tool for companies to achieve continuous growth. However, no single method can solve every problem once and for all, and A/B testing is no exception.

A/B testing can answer most causal inference questions, but in some cases involving causal inference it is simply not applicable. In those situations, we need to find alternative approaches and methods.

Therefore, in today’s lesson, we will learn under what circumstances A/B testing is not suitable, and what corresponding solutions are available.

When is A/B testing not applicable? #

In practice, there are three main situations where A/B testing is not applicable:

When it is not possible to control the variables being tested #

A/B testing is a controlled experiment, and one of its prerequisites is that we must be able to control changes in the variable being tested, so that we can give the experimental group and the control group different user experiences. In some cases, however, we cannot control those changes.

You may wonder, are there such variables?

Of course there are. The variables we can control mostly lie on the product and business side; when it comes to users’ personal choices, we have no way to control them. After all, users have free will: all our marketing methods aim to persuade users, but the final choice rests with the user.

For example, suppose we want to understand how user behavior changes after switching from QQ Music to NetEase Cloud Music. The variable we want to test is the change of music app, but we cannot decide for users whether to switch apps, so we cannot achieve truly randomized groups.

You may say we can use marketing incentives, or even pay users, to get them to switch music apps. That is feasible in practice, but it introduces a new bias: users react differently to external incentives, so we would end up studying only the users who respond to incentives while ignoring those who do not. This selection bias makes the experimental results inaccurate.

When there are major events being released #

Major events refer to the release of new products/businesses or changes involving the product’s image, such as a new trademark or spokesperson. In these cases we often cannot conduct A/B testing, because for any major release we want as many users as possible to know about it, and we spend considerable marketing money to that end. With information spreading as fast as it does on today’s internet, it is practically impossible for a newly released product to reach only a small subset of users, even for small and medium-sized enterprises.

For example, Apple does not and cannot run a large-scale user A/B test ahead of its annual product launch to see how a new product performs before deciding whether to release it.

Another example: if a company wants to change its trademark, it cannot pre-group users and expose the new trademark to the experimental group and the old one to the control group. A trademark represents the image of a company or product; if users were grouped, multiple trademarks for the same product would circulate in the market at the same time, confusing users and undermining the product’s image.

When there are only a few users #

This is easy to understand: if we do not have enough traffic to reach the required sample size within a reasonable time, A/B testing is no longer applicable. This situation is relatively rare in the big-data-driven internet industry, so we will not delve into it here.
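Still, it is worth knowing how to check. A standard power calculation gives the required sample size, as in the minimal Python sketch below using statsmodels; the 10% baseline rate, the 12% target rate, and the other parameters are hypothetical placeholders, not numbers from this lesson.

```python
# Required sample size per group for a two-proportion z-test.
# All rates and parameters below are hypothetical placeholders.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect_size = proportion_effectsize(0.10, 0.12)  # Cohen's h for 10% -> 12%
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,              # significance level
    power=0.8,               # 1 - beta
    alternative="two-sided",
)
print(f"Required sample size per group: {n_per_group:.0f}")
# If daily traffic is far below 2 * n_per_group, the test would take
# too long to finish, and A/B testing is not practical.
```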

What are the alternatives when A/B testing is not applicable? #

When A/B testing is not applicable, we usually turn to two types of alternatives: non-experimental causal inference methods and user research methods. These offer new ways to pursue causal inference when an A/B test cannot be run.

Propensity Score Matching (PSM) #

Propensity Score Matching (PSM) is a commonly used non-experimental causal inference method, which I introduced in Lesson 9. Its essence is to use model-based methods to artificially (rather than randomly, as in an experiment) construct similar experimental and control groups from historical data, and then compare the two groups.

Here, I will explain in detail how PSM is applied in causal inference using a case study of a music app.

This music app has a subscription model with two options:

  • Personal Subscription: $10 per month for individual use only.
  • Family Subscription: $20 per month, with a maximum of 5 simultaneous users.

In addition, both personal and family subscriptions offer a 3-month free trial period for new users.

Through extensive data analysis, a data analyst found that the long-term retention rate (i.e., renewal rate) of family-subscription users is higher than that of personal-subscription users. This is easy to understand: a family subscription can be shared, so each subscription usually has more than one user, and the more users on a subscription, the less likely it is to be canceled, hence the higher long-term retention.

Based on this analysis, the data analyst recommended to the marketing manager to target personal subscription users with advertisements promoting the benefits of family subscriptions, encouraging them to upgrade.

However, the marketing manager had a different opinion: users who choose family subscriptions and users who choose personal subscriptions are fundamentally different. For example, their user profiles and usage behaviors have significant differences. In other words, it is not the upgrade itself that leads to improved user retention, but rather the fact that they are already different types of users, resulting in different retention rates.

To verify the marketing manager’s idea, the data analyst conducted a detailed analysis of the user profiles and usage behaviors of the two subscription options and found that, as the marketing manager said, users who upgraded from personal to family subscriptions were significantly different from those who did not upgrade. For example, upgraded users had a higher average age and longer usage time, among other differences.

By now, you probably understand that whether upgrading from personal to family subscriptions can improve user retention is actually a causal inference problem.

The data analyst’s viewpoint is that “upgrading from personal to family subscriptions” as the cause can lead to the result of “improved user retention”.

However, the marketing manager’s opinion is that many factors influence user retention, and in the upgrade scenario other factors cannot be ruled out, because upgrading is the user’s own choice. It is highly likely that users who upgrade and those who do not are inherently different, so comparing only the upgrade factor while ignoring the other dissimilar factors is not sufficient.

Both viewpoints seem reasonable. So how do we verify who is right?

The best way to verify causal inference is by conducting A/B testing! However, in this business scenario, since the decision to upgrade is the user’s autonomous choice and we cannot control it, we cannot conduct a randomly assigned experiment. Therefore, in this case, the non-experimental causal inference method, Propensity Score Matching (PSM), can be used. The specific method is as follows.

First, from historical data we select users who started their personal-subscription trial within the same time frame.

Among the users who continue to pay after the three-month trial, some remain personal subscribers while others upgrade to family subscriptions. Between these two naturally formed groups, we use PSM to match on user profiles, usage behaviors, and so on, selecting non-upgraded users who are similar to the upgraded users. We then compare long-term retention among these similar users:

[Image: matching upgraded users with similar non-upgraded users via PSM]
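To make the matching step concrete, here is a minimal PSM sketch in Python. The column names (age, daily_minutes, upgraded, retained_12m) and the logistic-regression-plus-nearest-neighbor approach are illustrative assumptions, not the exact pipeline used in this case study.

```python
# A minimal PSM sketch: estimate each user's propensity to upgrade with
# logistic regression, then match every upgraded user to the non-upgraded
# user with the closest score. All column names are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def psm_retention_effect(df: pd.DataFrame) -> float:
    covariates = ["age", "daily_minutes"]       # profile / behavior features
    treated = df["upgraded"].values == 1
    X = df[covariates].values

    # Step 1: propensity score = P(upgrade | covariates)
    ps = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]

    # Step 2: 1:1 nearest-neighbor matching (with replacement) on the score
    nn = NearestNeighbors(n_neighbors=1).fit(ps[~treated].reshape(-1, 1))
    _, idx = nn.kneighbors(ps[treated].reshape(-1, 1))

    # Step 3: compare long-term retention between the matched groups
    treated_retention = df.loc[treated, "retained_12m"].mean()
    matched_retention = df.loc[~treated, "retained_12m"].values[idx.ravel()].mean()
    return treated_retention - matched_retention
```

The returned difference is the matched estimate of the retention lift attributable to the upgrade. Note that 1:1 nearest-neighbor matching is only one of several matching strategies; caliper matching and stratification are common variants.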

Next, after conducting PSM, let’s compare the user profiles and usage behaviors of the matched personal-subscription and family-subscription users:

[Image: user profiles and usage behaviors of the two groups after PSM]

From the data, we can see that after PSM the non-upgraded and upgraded users have become very similar across these characteristics, so we can now make the comparison. Having controlled for the other features, the only remaining difference between the two groups is whether they upgraded; any difference in retention can therefore be attributed to the upgrade.
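A common way to quantify this similarity, continuing the hypothetical columns from the sketch above, is the standardized mean difference (SMD) of each covariate; as a rule of thumb, an SMD below roughly 0.1 indicates good balance.

```python
# Standardized mean difference (SMD) of one covariate between the
# upgraded group and a (matched or unmatched) non-upgraded group.
import numpy as np

def smd(treated_values: np.ndarray, control_values: np.ndarray) -> float:
    pooled_sd = np.sqrt(
        (treated_values.var(ddof=1) + control_values.var(ddof=1)) / 2
    )
    return abs(treated_values.mean() - control_values.mean()) / pooled_sd

# If PSM worked, the SMD of each covariate (e.g., age) computed on the
# matched groups should shrink toward zero compared with the raw groups.
```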

Finally, let’s look at the comparison results. In the graph below, the y-axis is the user retention rate and the x-axis is the month counted from the start of the trial. Since the trial lasts three months and no renewal is required during it, retention is 100% for those months, so we start measuring retention from the fourth month:

[Image: retention curves from month 4 onward, before and after PSM]

From the graph, we can see that without PSM, as the data analyst initially found, upgrading from a personal to a family subscription appears to increase the one-year retention rate by 28%. However, this result is not accurate, because the other factors have not been eliminated (as the marketing manager pointed out).

After conducting PSM, comparing upgraded users with similar non-upgraded users, we find that upgrading does improve user retention, but only by 13%. In other words, only a 13% increase in retention can actually be attributed to the upgrade itself.

Through PSM, we simulated a controlled experiment by eliminating the influence of other factors, thus obtaining an accurate estimate of the impact of upgrading from a personal to a family subscription on user retention.

User Research #

User research is suitable when A/B testing is not feasible, such as pre-evaluations before the launch of new products/businesses. In such cases, we can obtain information by directly or indirectly communicating with users to determine the impact of corresponding changes on them.

There are many methods of user research. Today, let’s focus on several commonly used ones: deep user experience research, focus groups, and surveys.

Deep user experience research extracts in-depth information from a small number of potential users. Two common techniques:

  • Eye-tracking research tracks users’ eye movements to understand their decision-making process, revealing the normal usage flow and any points where users stall or exit.
  • Diary studies rely on users’ self-recorded usage experiences and intentions to gather feedback.

Focus groups are guided discussions in which a moderator leads potential users through different topics, and feedback is synthesized from the opinions expressed. Because of the group-discussion format, a focus group can involve more users than deep user experience research, but fewer than a survey.

A survey uses pre-designed multiple-choice or open-ended questions to gather information on specific topics, such as users’ thoughts and feelings about a new product/business. The questions are compiled into a questionnaire and distributed to potential users face-to-face, online, or by telephone, and the responses from different users are aggregated into approximate feedback results.

[Image: user research methods compared by number of participants vs. depth of information per user]

From the graph, we can see that as we move from deep user experience research to focus groups and then to surveys, the number of participating users increases but the depth of information obtained from each user decreases. The choice of method therefore depends on how many potential users you can recruit, whether you have the necessary conditions and equipment (eye-tracking research, for instance, requires eye-tracking devices), and the depth of information you need.

Summary #

In today’s class, we discussed the limitations of A/B testing. We introduced the non-experimental causal inference method, Propensity Score Matching (PSM), through a case study, and gave a brief introduction to user research methods. In practice, the uncontrollable variable of user choice comes up often, and PSM is the main non-experimental causal inference alternative in that situation. User research is useful not only for evaluating new products/businesses but also for generating ideas for new metrics (such as the combined qualitative and quantitative method for determining metrics that I mentioned in Lesson 3).

From the beginning of this column through today’s class, we have covered the statistical principles, standard processes, and common problems and solutions of A/B testing in practice. The workplace is naturally the best place to apply these experiences and methodologies, but there is another practical opportunity: job interviews.

In the next two lessons, I will take you through some A/B testing questions commonly asked in interviews. I also recommend reviewing the questions you have been asked in past interviews and how you answered them, so that you get more out of the final two lessons.

Reflection Question #

Based on your own experience, think about whether you have encountered or experienced situations where causal inference analysis was desired but A/B testing was not applicable. Please explain the reasons and outcomes in detail.

Feel free to share your thoughts and reflections on this lesson in the comments section. I will provide feedback as soon as possible.