08 Case Series: Building a Standard A/B Test Framework from Scratch #

Hello, I’m Bowen.

After studying the previous lessons, I believe you have not only mastered the statistical principles of conducting A/B tests but also understand what a standardized A/B testing process looks like and the key points to note in each step.

Today’s lesson is not too difficult overall. We will walk through a case study on improving the retention rate of a music app to recap the statistical knowledge we have learned and the core steps of running an A/B test.

As you study this lesson, if there are still concepts you haven’t fully grasped, you can review them in a targeted way to fill in the gaps. And since the previous lessons covered a lot of ground, today’s case study will also help you organize your thoughts and clear your mind, so that you can absorb the knowledge in the advanced section more effectively.

Alright, let’s go through the process using the case study of the music app below.

Setting Goals and Hypotheses for A/B Testing Based on Business Issues #

In today’s case study, the product we’re looking at is a music app. Users can enjoy millions of songs without ads by paying a monthly fee. Beyond basic music playback, the product manager has also designed many convenient functions for this app: for example, users can add their favorite songs to a playlist, create different playlists, and even download songs for offline listening.

Through data analysis, the data scientists have found that users who use these convenient functions tend to have a higher retention rate than the average. This indicates that these convenient functions do help improve user retention. However, there has been a persistent problem bothering the team: although these functions are convenient and useful, their adoption rate has been low. Low adoption rate, in the long run, will inevitably affect user retention.

The team uncovered the reason through user surveys.

Because the app favors a clean, simple page design, these functions are usually hidden within each song’s feature list, so reaching them takes two taps: first on the feature list, and then on the specific function. On one hand, many users, especially new users, never discover these functions. On the other hand, the two-step access path makes for a poor user experience, leading users to unsubscribe over time.

So, our goal now is very clear: increase the adoption rate of product functions by users.

How can we increase this adoption rate? You might suggest displaying all the functions directly so that users can see them at a glance, thus improving their adoption rate. The product manager initially had this idea, but later realized that displaying all the functions would make the song interface look cluttered and worsen the user experience.

Since changing the interaction interface of the product was rejected, can we proactively inform users about how to use these functions?

For example, we could introduce each function to new users right after they register and log in. However, this idea was quickly rejected by the product manager, because new users do not need all of these functions immediately after logging in. This makes sense: without an actual need, new users barely react when they see these functions, so they usually skip the feature introduction on first login.

Previous A/B tests also confirmed this. Informing users only when they have a specific need for a certain function proved to be most effective.

Therefore, the team’s hypothesis is: informing users about the relevant function through a pop-up window only when they actually need it will increase the adoption rate of these functions. This avoids disturbing every new user while still reaching users with a specific need, striking a balance between the two.

Determining Evaluation Metrics for A/B Testing #

Once the goals and hypotheses are defined, we can begin defining the evaluation metrics.

The team is planning to start with an A/B test for the “Add Favorite Music to Playlist” feature to validate the above hypotheses.

Since we want to inform the user only when there is a need, we need a condition to trigger the notification. Our first task, then, is to define the trigger condition: we send the pop-up notification only when the user has never used this feature (there is no need to inform users who already know about it) and has listened to the same song x times (a signal that they particularly like that song).

After data scientists analyzed the data, the optimum value for x was determined to be 4. Therefore, the final trigger condition for the pop-up notification for this feature is:

  • The user has never used the “Add Favorite Music to Playlist” feature.
  • The user has listened to a particular song 4 times, and the pop-up will be triggered on the 5th play.

It should be noted that since the purpose of the pop-up is to inform the user and avoid repeated reminders, each user meeting the trigger condition should only receive the notification once; multiple triggers are not allowed.
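
To make the trigger condition concrete, here is a minimal Python sketch of how this logic might be implemented. The names and in-memory stores are illustrative assumptions, standing in for whatever event data a real system would use.

```python
# Minimal sketch of the pop-up trigger logic (illustrative only).
PLAY_THRESHOLD = 4  # x = 4: repeated plays of the same song

notified_users = set()    # users who have already received the pop-up
has_used_feature = set()  # users who have ever used "Add to Playlist"
play_counts = {}          # (user_id, song_id) -> number of plays


def on_song_play(user_id: str, song_id: str) -> bool:
    """Return True if the pop-up should be shown for this play event."""
    key = (user_id, song_id)
    play_counts[key] = play_counts.get(key, 0) + 1

    should_trigger = (
        user_id not in has_used_feature        # has never used the feature
        and user_id not in notified_users      # notify each user at most once
        and play_counts[key] > PLAY_THRESHOLD  # i.e. on the 5th play
    )
    if should_trigger:
        notified_users.add(user_id)            # prevent repeated reminders
    return should_trigger
```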

[Image: pop-up notification]

In this A/B test, users are randomly divided into an experimental group and a control group, with each group consisting of 50% of the users.

  • In the experimental group, if a user meets the trigger condition, the system will send them a pop-up notification (as shown in the above image).
  • In the control group, users will not receive the pop-up notification, regardless of whether they meet the trigger condition or not.

With the trigger condition and group assignment settled, let’s now define the evaluation metric:

Usage rate of the “Add Favorite Music to Playlist” feature = Total number of users who used the “Add Favorite Music to Playlist” feature / Total number of users in the experiment.

Clearly, this is a probability-based metric, i.e., the probability of users in the experiment using the “Add Favorite Music to Playlist” feature. However, in order to make our evaluation metric more specific and facilitate calculations, we need to clarify two questions.

The first question is, how do we define “users in the experiment”?

Since the pop-up is only triggered when certain conditions are met, not all users assigned to the experiment are actually affected. If we counted every assigned user, we would include unaffected users (those assigned to the experiment who never met the trigger condition), which would dilute the measured effect and reduce the sensitivity of the test. So, it is important to note that the “users in the experiment” here are the users who meet the trigger condition (the dashed-line section in the diagram below).

In the experimental group, these are the users who actually receive the pop-up notification; in the control group, they are the users who meet the trigger condition (users in the control group never receive the pop-up, whether they meet the condition or not).

[Image: users in the experiment]

The second question is: how do we determine the time window from when the pop-up is triggered to when the feature is finally used?

Since this A/B test aims to detect whether the pop-up increases the usage rate of the relevant feature, and each user triggers the pop-up at a different time, a unified time window needs to be defined in advance for measuring the usage rate, such as within x days after triggering. This standardization makes the metric clearer and more accurate.

The pop-up notification is time-sensitive: it is triggered exactly when the user has a need, so if a user is influenced by the pop-up and uses the relevant feature, they will do so shortly after seeing it. Therefore, we set x to 1, meaning we measure the usage rate within 1 day after triggering.

With these two questions clarified, our final definition of the evaluation metric is as follows:

Usage rate of the “Add Favorite Music to Playlist” feature = Total number of users who used the “Add Favorite Music to Playlist” feature within 1 day after meeting the trigger condition / Total number of users in the experiment who meet the trigger condition
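
As an illustration, here is a minimal pandas sketch of how this metric could be computed from a per-user table containing each user’s trigger time and (if any) first usage time after triggering. The column names and sample data are hypothetical.

```python
import pandas as pd

# Hypothetical per-user records for users who met the trigger condition.
users = pd.DataFrame({
    "user_id":      [1, 2, 3, 4],
    "trigger_ts":   pd.to_datetime(["2023-05-01 10:00", "2023-05-01 12:00",
                                    "2023-05-02 09:00", "2023-05-02 18:00"]),
    "first_use_ts": pd.to_datetime(["2023-05-01 15:00", None,
                                    "2023-05-04 09:00", "2023-05-02 20:00"]),
})

# Used the feature within 1 day of meeting the trigger condition
# (NaT means the feature was never used, which counts as False).
within_1_day = (users["first_use_ts"] - users["trigger_ts"]) <= pd.Timedelta(days=1)

usage_rate = within_1_day.sum() / len(users)
print(f"Usage rate within 1 day of triggering: {usage_rate:.1%}")  # 2 of 4 -> 50.0%
```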

Just having the specific definition of the evaluation metric is not enough. In order to understand our evaluation metric better and obtain accurate test results, we also need to examine the variability of this metric from a statistical perspective.

A retrospective analysis of historical data found that, on average, 2.0% of users use the relevant feature within 1 day after meeting the trigger condition. Applying the statistical formula, the 95% confidence interval for this metric works out to [1.82%, 2.18%]. This means that if, after the test, the metric values of both groups fall within this range, the difference between the two groups is within normal fluctuation and not significant.
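
For reference, this fluctuation range can be reproduced with the standard normal-approximation confidence interval for a proportion. The historical sample size is not given in the text, so the value below is only a placeholder chosen to illustrate the calculation.

```python
from math import sqrt
from scipy import stats

p = 0.02     # historical 1-day usage rate after triggering
n = 23_000   # placeholder sample size (not stated in the text)

z = stats.norm.ppf(0.975)               # ~1.96 for a 95% confidence interval
half_width = z * sqrt(p * (1 - p) / n)  # z times the standard error of a proportion

print(f"95% CI: [{p - half_width:.2%}, {p + half_width:.2%}]")
# -> roughly [1.82%, 2.18%], i.e. the fluctuation range quoted above
```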

Choosing the Experimental Unit #

After determining the evaluation metric for the A/B test, the next step is to choose the experimental unit.

Since the pop-up window in this experiment is a change visible to users, and the evaluation metric is defined at the user level, we choose the user as the experimental unit, identified by user ID, since users must be logged in to use the music service.

Calculating the Required Sample Size and Experiment Duration #

Next, we need to calculate the sample size required for the experiment. We first need to determine four statistical parameters:

  • Significance level (α).
  • Power (1 – β).
  • Pooled variance of the experimental and control groups, \(\sigma_{\text{pooled}}^{2}\).
  • Difference in evaluation metrics between the experimental and control groups, denoted as δ.

In A/B testing, the significance level is commonly set at 5% and the power at 80%, and we follow that convention here. For the difference in evaluation metrics, based on the fluctuation range calculated earlier, the difference needs to be at least 0.18% to stand out from normal fluctuation, so we use 0.2%. As for the pooled variance, since the metric is a probability, we can calculate it directly from the statistical formula for proportions.

With these statistical parameters determined, we find that the experimental and control groups each need at least 80,700 users who meet the trigger condition, about 161,400 users in total. Data analysis shows that roughly 17,000 users newly meet the trigger condition each day, so the experiment will take approximately 10 days to complete.
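
As a sketch of how these numbers come together, the snippet below applies the standard sample-size formula for comparing two proportions, assuming a 2.0% baseline rate and a 2.2% target rate (i.e. the 0.2% minimum detectable difference chosen above).

```python
import math
from scipy import stats

alpha, power = 0.05, 0.80      # significance level and statistical power
p_control = 0.020              # baseline 1-day usage rate
delta = 0.002                  # minimum detectable difference (0.2%)
p_treat = p_control + delta    # assumed rate in the experimental group

z_alpha = stats.norm.ppf(1 - alpha / 2)  # ~1.96
z_beta = stats.norm.ppf(power)           # ~0.84

# Pooled variance of two independent proportions
var_pooled = p_control * (1 - p_control) + p_treat * (1 - p_treat)

n_per_group = math.ceil((z_alpha + z_beta) ** 2 * var_pooled / delta ** 2)
total = 2 * n_per_group
days = math.ceil(total / 17_000)         # ~17,000 triggered users per day

print(n_per_group, total, days)
# -> roughly 80,700 per group, about 161,400 in total, and about 10 days
```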

With the design work for the entire A/B test complete, it’s time to launch the test, collect data, and analyze the results once the sample size reaches the required level.

Analysis of Test Results #

After more than a week of waiting, our sample size has finally reached the required level, and we can now analyze the final results. Before doing so, however, we need to make sure the A/B test was implemented according to our original design, to ensure the quality of the test. This is where sanity checks come in.

We use the two most common guardrail metrics for these checks:

  • Is the ratio of the sample sizes in the experimental/control group 50%/50%?
  • Is the distribution of user characteristics in the experimental/control group similar?

The analysis showed that this A/B test passed the sanity checks on both guardrail metrics, indicating that the experiment was implemented as designed.
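
As an example of how the first check might be run, the snippet below performs a sample-ratio-mismatch check with a chi-square goodness-of-fit test, using the group sizes reported in the results that follow. This is one common approach, not necessarily the exact method the team used.

```python
from scipy import stats

# Sample-ratio-mismatch (SRM) check against the designed 50%/50% split.
observed = [80_723, 80_689]           # experimental group, control group
expected = [sum(observed) / 2] * 2    # equal split under the design

chi2, p_value = stats.chisquare(observed, f_exp=expected)
print(f"SRM check p-value: {p_value:.3f}")  # well above 0.05 -> split looks as designed
```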

So, now let’s start analyzing the experimental results in detail.

  • Experimental group: The sample size is 80,723; 3,124 of these users used the feature within one day of meeting the trigger condition, a usage rate of 3.87%.
  • Control group: The sample size is 80,689; 1,598 of these users used the feature within one day of meeting the trigger condition, a usage rate of 1.98%.

Based on these results, we obtained a P-value close to 0, far smaller than 5%. We also calculated a 95% confidence interval for the difference in the evaluation metric between the two groups of [1.72%, 2.05%]. Since this interval does not include 0, the usage rates of the two groups are significantly different. In fact, the usage rate of the experimental group is almost twice that of the control group, showing that a pop-up reminder delivered when users need it really does work!
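
For reference, these figures can be reproduced with a standard two-proportion z-test. The snippet below is a sketch of that calculation using the reported counts.

```python
from math import sqrt
from scipy import stats

# Reported results
x_treat, n_treat = 3_124, 80_723   # experimental group
x_ctrl, n_ctrl = 1_598, 80_689     # control group

p_treat, p_ctrl = x_treat / n_treat, x_ctrl / n_ctrl
diff = p_treat - p_ctrl

# Pooled proportion and standard error for the z statistic
p_pool = (x_treat + x_ctrl) / (n_treat + n_ctrl)
se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_treat + 1 / n_ctrl))
z = diff / se_pool
p_value = 2 * (1 - stats.norm.cdf(abs(z)))

# Unpooled standard error for the 95% confidence interval of the difference
se = sqrt(p_treat * (1 - p_treat) / n_treat + p_ctrl * (1 - p_ctrl) / n_ctrl)
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

print(f"diff = {diff:.2%}, z = {z:.1f}, p-value = {p_value:.2g}")
print(f"95% CI for the difference: [{ci_low:.2%}, {ci_high:.2%}]")
# -> p-value effectively 0; CI roughly [1.72%, 2.05%], as reported above
```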

Encouraged by this result, the team decided to roll out the pop-up reminder for the “Add Favorite Music to Playlist” feature to all users who meet the trigger condition, and plans to run similar A/B tests for the other features to validate their effectiveness. If everything goes well, all of these pop-up reminders will be rolled out, which should improve user retention in the long run.

Summary

Through this case study, you must have gained a more specific and deeper understanding of the key steps in conducting A/B tests.

So, this concludes the Basic Course. Next, we will enter the Advanced Course.

In the Advanced Course, I will share more experience-based and methodological knowledge. For common problems that arise when running A/B tests, I will explain their causes and provide solutions. For questions that come up frequently in interviews, I will share problem-solving strategies drawn from my experience as an interviewer.

Finally, I want to emphasize that learning is always about repetition and continuity. The content of the Advanced Course is closely connected to the Basic Course, so while studying it, I hope you will keep reviewing and reflecting on what you have already learned. When the course ends, take another look at the Basic Course; I believe you will come away with the refreshing, rewarding feeling that “A/B testing is so simple.”

Discussion Questions

Think back to the A/B tests you have done or experienced before: did they follow these basic process steps? If any steps were missing, which ones were omitted, and why? If they included other steps, please share them with me.

If you have a clearer understanding of the process and steps of A/B testing after completing today’s case study, you are welcome to click on “Please read, my friend” and share today’s content with your colleagues and friends. Let’s learn and grow together. Alright, thank you for listening. See you in the Advanced Course.