09 If Test Results Are Not Significant: How to Improve #

Hello, I’m Bowei.

After studying the “Basics” section, you have grasped the overall process of A/B testing, which means you can now design an A/B test following that process. However, during actual implementation you will inevitably encounter problems, whether due to the complexity of the business, failure to strictly follow standardized procedures, or the properties of the data itself.

As I mentioned at the beginning of the course, A/B testing is highly practical: you need to be able to identify potential pitfalls and find the corresponding solutions. Therefore, in the next three lessons, I will share the experience I have accumulated and the pitfalls I have encountered, so that you can avoid these mistakes in practice.

Today, let’s start with a painful question. In Lesson 7, we learned how to obtain reliable test results and how to analyze them, and we successfully obtained a significant difference in the metrics between the control group and the experimental group. But you may have been pondering a question ever since: I designed my A/B test following this same process, so why are my results not significant? Can I conclude that “the metrics of the two groups are actually the same”?

In today’s lesson, we will delve into the question of what to do when test results are not significant, a problem that many people frequently encounter.

Why Are the Experimental Results Not Significant? #

First, we need to understand why the experimental results may not be significant. There are two possible reasons.

  • The variation in the A/B test has no effect, so the metrics of both groups are essentially the same.
  • The variation in the A/B test is effective, so the metrics of both groups are essentially different. However, due to the small magnitude of the change, the sensitivity of the test, also known as power, is insufficient to detect the difference in metrics between the two groups.

If it is the first reason, it proves that this variation has no effect on product/business optimization. In this case, we should consider abandoning this variation or testing new variations.

If it is the second reason, we can optimize and adjust from the A/B testing side. Specifically, by increasing power, we raise the probability that the test detects a true difference between the two groups. As discussed in Lesson 6, the higher the power, the more likely we are to detect a real difference between the experimental group and the control group. Therefore, if we increase power and still find that the results are not significant, we can conclude with confidence that the metrics of the two groups are essentially the same.

How can we increase power?

Let’s review the sample size estimation formula mentioned in Lesson 6:

\(n=\frac{\left(Z_{1-\frac{\alpha}{2}}+Z_{1-\beta}\right)^{2}}{\left(\frac{\delta}{\sigma_{\text{pooled}}}\right)^{2}}=\frac{\left(Z_{1-\frac{\alpha}{2}}+Z_{\text{power}}\right)^{2}}{\left(\frac{\delta}{\sigma_{\text{pooled}}}\right)^{2}}\)

Where:

  • \(Z_{1-\frac{\alpha}{2}}\) is the \(Z\) score corresponding to \(1-\frac{\alpha}{2}\).
  • \(Z_{\text{power}}\) is the \(Z\) score corresponding to power, i.e., \(1-\beta\).
  • \(\delta\) is the difference in the evaluation metric between the experimental group and the control group.
  • \(\sigma_{\text{pooled}}^{2}\) is the pooled variance of the experimental group and the control group.

From the formula, we can identify the two factors that affect power: sample size and variance. Specifically:

  • Sample size is positively related to power: increasing the sample size increases power.
  • Variance is negatively related to power: decreasing the variance increases power.

In practice, when it is feasible to obtain a larger sample size, one can choose to increase the sample size to increase power, which is relatively simple and easy to implement. If there are limitations on traffic or time, and it is not possible to obtain more samples, one can increase power by reducing variance.
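
To make this concrete, here is a minimal Python sketch of the Lesson 6 formula; the function names and example numbers are my own illustrations, not from the lesson. It computes the required sample size per group and shows that a larger sample or a smaller variance raises power:

```python
# A minimal sketch of the Lesson 6 sample size formula and the power it
# implies. All numbers below are made-up illustrations, not real test data.
from scipy.stats import norm

def sample_size(delta, sigma_pooled, alpha=0.05, power=0.8):
    """Required n per the formula n = ((Z_{1-a/2} + Z_power) / (delta/sigma))^2."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return ((z_alpha + z_power) * sigma_pooled / delta) ** 2

def achieved_power(n, delta, sigma_pooled, alpha=0.05):
    """Invert the same formula: the power we achieve with n samples."""
    z_alpha = norm.ppf(1 - alpha / 2)
    return norm.cdf(delta * n ** 0.5 / sigma_pooled - z_alpha)

print(sample_size(delta=0.02, sigma_pooled=0.3))             # n for 80% power
print(achieved_power(n=400, delta=0.02, sigma_pooled=0.3))   # baseline power
print(achieved_power(n=1600, delta=0.02, sigma_pooled=0.3))  # larger n, more power
print(achieved_power(n=400, delta=0.02, sigma_pooled=0.15))  # less variance, more power
```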

Next, I will explain six specific methods to increase power from the perspectives of increasing sample size and reducing variance.

How to Increase Power by Increasing Sample Size? #

In practice, there are three main methods used to increase sample size: extending the testing time, increasing the proportion of testing traffic to total traffic, and sharing the same control group among multiple tests.

Extending the Testing Time #

You are certainly familiar with extending the testing time, as I mentioned it when discussing sample size estimation in Lesson 6. The amount of traffic available for testing is fixed each day, so the longer the testing time, the larger the sample size. Therefore, if conditions permit, you can extend the testing time.

Increasing the Proportion of Testing Traffic to Total Traffic #

Let’s assume a product receives 10,000 units of traffic per day. If I want to run an A/B test, I won’t use 100% of the traffic; generally, I will use a certain percentage of the total, such as 10%. This percentage is the proportion of testing traffic to total traffic.

Why not use all of the traffic?

On one hand, A/B tests have a cost of trial and error. Although the probability is low, any changes we make during the test could potentially harm the business. Therefore, the less traffic used, the lower the cost of trial and error, and the safer it is.

On the other hand, in the era of big data, internet giants already have massive traffic, and any significant change they make to their products has the potential to become news.

For example, when testing whether to add a new feature, companies don’t want to reveal this feature to users during the testing phase to avoid causing user confusion. Therefore, they generally start with a very small proportion of traffic for A/B testing (e.g., 1%) and gradually distribute the changes from the A/B test to 100% of the traffic after obtaining significant results.

Therefore, while keeping the testing time constant, increasing the proportion of testing traffic to total traffic can also achieve the goal of increasing the sample size.

Sharing the Same Control Group Among Multiple Tests #

Sometimes, we may run multiple A/B tests simultaneously on the same product. For example, if we want to increase the click-through rate of push notifications, we might change the title, content, timing, and audience of the push notification.

For these four different influencing factors, each factor change is, in fact, an independent A/B test. In theory, we would need to design four experiments, each with its own experimental group and control group.

Suppose the total traffic currently available to us is 80,000. With four separate tests, that makes eight groups, so each group gets 10,000 units of traffic. However, you will notice that this uses traffic inefficiently, because each experiment’s control group is the same (the original push notification). If we instead merge the four control groups into one, we have four experimental groups and one control group, and each of the five groups gets 16,000 units of traffic.

As you can see, when simultaneously verifying multiple changes on the same basis, that is, running multiple A/B tests with the same control group, we can merge the control groups and reduce the number of groups, thereby increasing the sample size for each group. This type of testing is called A/B/n testing.

In summary, in practice:

  • If time permits, the most commonly used method is to extend the testing time because it is the easiest to implement.
  • If time is limited, you can prioritize increasing the proportion of testing traffic to total traffic to save time.
  • When multiple tests are conducted simultaneously, and the control groups are the same, you can combine multiple control groups.

Increasing the sample size to improve power is the most common method in practice. However, business scenarios vary, and although it is uncommon, sometimes it is genuinely impossible to obtain more samples, for example when time is tight and 100% of total traffic is already in use, yet the results are still not significant. In such cases, we have to increase power by reducing variance.

How to Improve Power by Reducing Variance? #

There are three commonly used methods to reduce variance in practice: reducing the variance of the metric itself, propensity score matching, and calculating metrics at the triggering stage.

Reducing the Variance of the Metric #

There are two ways to reduce the variance of a metric.

The first way: keep the original metric and reduce variance by removing outliers.

If the histogram of the metric shows obvious outliers in its distribution, we can remove them by setting a capping threshold.

For example, based on the distribution of the metric, we can keep only the range of values that covers 95% of the data and cap the outliers that make up the remaining 5%. Common metrics, such as average spending per user in e-commerce or average listening time per user in a music app, often have outliers caused by a small number of big spenders or music enthusiasts who listen all day, and these outliers inflate the variance.
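
As an illustration, here is a minimal sketch of capping at the 95th percentile; the data is synthetic, and in practice you would apply the cap to your real per-user metric:

```python
# A minimal sketch of removing outliers with a capping threshold.
# The per-user spend data is synthetic, just to show the variance reduction.
import numpy as np

rng = np.random.default_rng(42)
spend = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)  # heavy-tailed spend per user

cap = np.percentile(spend, 95)          # threshold covering 95% of the values
spend_capped = np.minimum(spend, cap)   # cap the top 5% at the threshold

print(f"variance before capping: {spend.var():,.1f}")
print(f"variance after capping:  {spend_capped.var():,.1f}")
```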

The second way: choose a metric with lower variance.

Metrics with a narrow range of values have smaller variance than those with a wide range. For example, the variance of the number of clickers is smaller than that of the number of clicks (one clicker can generate multiple clicks, so clicks outnumber clickers and span a wider range of values). The variance of the purchase rate is smaller than that of average spending per user (the purchase rate captures a binary event, buy or not buy, with only two possible values, while average spending per user can be any amount, with a theoretically unbounded range). Likewise, the variance of the listening rate is smaller than that of average listening time per user.

As we can see, for similar behaviors (such as buying, listening to music, or watching videos), probability metrics have smaller variance than mean metrics. Therefore, if we want to reduce variance, we can replace a mean metric with a probability metric that captures similar behavior; that is, modify the original metric and, while still meeting business requirements, choose one with a narrower range of values.
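
To see this concretely, here is a quick synthetic illustration (all numbers are made up): a binary purchase indicator has variance \(p(1-p)\), at most 0.25, while the spend metric it replaces can have arbitrarily large variance.

```python
# A minimal sketch comparing a probability metric with a mean metric.
# A binary indicator has variance p*(1-p), at most 0.25; spend is unbounded.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
bought = rng.random(n) < 0.1                               # binary: bought or not
spend = np.where(bought, rng.lognormal(4.0, 1.0, n), 0.0)  # spend, 0 for non-buyers

print(f"variance of purchase indicator: {bought.var():.3f}")
print(f"variance of spend per user:     {spend.var():,.1f}")
```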

Propensity Score Matching (PSM) #

Propensity Score Matching (PSM) is a method of causal inference used to address the issue of uneven distribution between the experimental group and the control group.

You may remember from Lesson 7 that we run a test of reasonableness before analyzing the results. So what is the relationship between the test of reasonableness and PSM?

Let me summarize. The test of reasonableness tells us whether the distributions of the two groups are similar, while PSM retrospectively extracts the parts of the two groups that are similar to each other. Simply put, the more similar the characteristics of the two groups, the smaller the variance.

The basic principle of PSM is to find, for each data point in one group, a matching data point in the other group that is similar to it, based on each data point’s propensity score. If you are not familiar with the propensity score, that’s okay; you only need to know this: the closer the propensity scores, the more similar the two data points. Here, a data point refers to one experimental unit in A/B testing.

The specific steps of PSM are as follows:

  1. First, put the various characteristics of each data point in the two groups to be matched (such as gender, age, geographic location, and product/service usage features) into a logistic regression model.
  2. Then, calculate the propensity score for each data point and match data points across the groups using methods such as nearest neighbor.
  3. Lastly, compare only the matched, similar parts of the two groups.
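
Here is a minimal from-scratch sketch of these three steps on synthetic data. The DataFrame, column names, and covariates are all hypothetical illustrations; in practice, the libraries mentioned below wrap these steps for you.

```python
# A minimal sketch of PSM on synthetic data; all column names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
n = 2_000
df = pd.DataFrame({
    "age": rng.integers(18, 60, n),
    "sessions": rng.poisson(5, n),      # a product-usage characteristic
    "group": rng.integers(0, 2, n),     # 1 = experimental, 0 = control
})
covariates = ["age", "sessions"]

# Step 1: logistic regression predicting group membership from characteristics.
model = LogisticRegression().fit(df[covariates], df["group"])
df["pscore"] = model.predict_proba(df[covariates])[:, 1]

# Step 2: for each experimental unit, find the control unit with the
# closest propensity score (nearest-neighbor matching).
treated = df[df["group"] == 1]
control = df[df["group"] == 0]
nn = NearestNeighbors(n_neighbors=1).fit(control[["pscore"]])
_, idx = nn.kneighbors(treated[["pscore"]])
matched_control = control.iloc[idx.ravel()]

# Step 3: compare the evaluation metric on `treated` vs `matched_control` only.
```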

The principle behind PSM is somewhat involved, and I have provided some reference links for you to explore. Fortunately, major programming languages have corresponding libraries, such as “pymatch” in Python and “Matching” in R, which make the implementation relatively easy.

In this part of propensity score matching, you only need to remember one conclusion: PSM can effectively reduce the variance of the two groups. By comparing the similar parts of the two groups after propensity score matching, we can examine whether the results are significant.

Calculating Metrics at the Triggering Stage #

In A/B testing, the process of randomly assigning experimental units to groups is called “assignment.” You should also know that in some A/B tests, such as the case in Lesson 8, the change being tested must meet certain conditions before it is triggered.

Therefore, from the perspective of assignment and triggering, A/B testing can be divided into two types.

  1. Changes do not require condition triggering. After being assigned to the experimental group, all users can experience the changes in A/B testing.

  2. Changes require condition triggering. Among all the users assigned to the experimental group, only those users who meet certain conditions will trigger the changes in A/B testing.

Most A/B tests in practice belong to the first type, and they are relatively easy to understand.

However, please note that the variance-reduction method discussed here applies only to the second type of A/B test. Simply put, when calculating metrics, we count only the users in each group who meet the triggering conditions, not all users assigned to each group. This may be a bit difficult to grasp, so let me give an example.

Do you remember the A/B test design from Lesson 8 that used a pop-up to inform users about the new “add favorite music to playlist” feature? In that design, not all users in the experimental group receive the pop-up reminder.

Therefore, in order to avoid disturbing irrelevant users and to send the pop-up to users who need this feature, we have predefined the triggering rules for the pop-up:

  1. The user has never used the “add favorite music to playlist” feature before.
  2. The user has listened to a particular song 4 times, and the pop-up will be triggered when they play it for the 5th time.

So when we calculate the usage rate of the “add favorite music to playlist” feature in the case study, it equals the number of users in each group who used the feature divided by the number of users in that group who met the pop-up triggering rules.

In other words, the denominator counts only the users who meet the pop-up triggering rules, not all the users assigned to each group. This is what we mean by calculating metrics at the triggering stage.

Here it is important to note that there may also be users in the control group who meet the pop-up triggering rules. Because they are in the control group, we do not show them the pop-up even when they meet the rules, but we still include them in the metric calculation, since they belong in the denominator of the evaluation metric.
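
In code, the trigger-stage calculation looks roughly like this (a toy sketch with hypothetical column names):

```python
# A minimal sketch of a trigger-stage metric: the denominator counts only
# the users who met the pop-up triggering rules, in both groups.
import pandas as pd

users = pd.DataFrame({
    "group":     ["exp", "exp", "exp", "ctrl", "ctrl", "ctrl"],
    "triggered": [True,  True,  False, True,   False,  True],
    "used":      [True,  False, False, False,  False,  True],
})

triggered = users[users["triggered"]]                  # drop non-triggered noise
usage_rate = triggered.groupby("group")["used"].mean()
print(usage_rate)  # feature usage rate per group at the triggering stage
```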

If you are familiar with data tracking, you may wonder: control group users who meet the pop-up triggering rules never actually trigger a pop-up, so there is no related record in the data. How do we record them?

There is a trick in the engineering implementation: for control group users who meet the triggering rules, we send an invisible pop-up of just one pixel. This leaves the relevant records in the data, making subsequent metric calculation easier, while ensuring the control group is not affected by the pop-up.
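
As a sketch of what this could look like on the engineering side (every function and event name here is a hypothetical stub, not a real API):

```python
# A minimal sketch of the one-pixel trick; every name here is a hypothetical
# stub. Real systems would call their own logging and rendering services.
def log_event(user_id: str, event: str, group: str) -> None:
    print(f"log: user={user_id} event={event} group={group}")  # stub logger

def render_popup(user_id: str) -> None:
    pass  # stub: show the visible feature pop-up

def render_invisible_pixel(user_id: str) -> None:
    pass  # stub: render a 1x1 invisible pop-up with no user-facing effect

def on_trigger(user_id: str, group: str) -> None:
    # Both groups log a trigger record, so the denominator is measurable.
    log_event(user_id, "popup_triggered", group)
    if group == "exp":
        render_popup(user_id)            # experimental group sees the pop-up
    else:
        render_invisible_pixel(user_id)  # control group sees nothing

on_trigger("user_42", "ctrl")  # leaves a record without showing anything
```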

By considering only the users who meet the pop-up triggering rules as the denominator in calculating evaluation metrics, we can exclude the noise in the data (users assigned to the experiment but did not trigger the pop-up), thereby reducing variance.

This type of A/B testing that requires triggering is more common in businesses with fixed user usage paths, such as e-commerce. In e-commerce, users generally have a clear multi-level usage path: entering the website/app -> browsing the product list -> viewing specific products -> adding to cart -> purchasing.

In A/B testing in e-commerce, users are usually assigned to the experimental group or the control group when they enter the website/app. If we are testing a new feature on the “shopping cart” page, only users who have entered the “shopping cart” page can trigger the changes in the A/B test.

Overall, cases where we improve power by reducing variance are not common (the usual approach is to increase the sample size). If you do encounter such a situation, the simple and quick method is to reduce the variance of the metric; if conditions permit, I recommend the more scientifically rigorous method, propensity score matching (PSM). And for A/B tests whose changes require triggering, remember to calculate the metrics at the triggering stage.

Summary #

In order to address the issue of non-significant results in A/B testing, in this lesson we mainly discussed six methods to improve the power of an A/B test. Based on their principles, I grouped them into two main types:

  • Increasing the sample size: extending the testing time, increasing the proportion of testing traffic to total traffic, and sharing the same control group among multiple tests.
  • Reducing the variance: reducing the variance of the metric itself, propensity score matching, and calculating metrics at the triggering stage.

You can flexibly apply these methods based on my introduction to each method and on when to choose which.

If after trying these methods, the test results are still not significant, it means that the metrics of the two groups are actually the same. In this case, we should abandon the changes in this A/B test and use other changes to optimize the business and product.

Lastly, I would like to emphasize that making a change that can truly improve the business is not easy. According to the experimental data disclosed by big companies like FLAG in the United States, the probability of getting truly significant results from A/B testing and implementing changes is less than one-third.

So don’t be discouraged. Although not every experiment will have significant results, you can learn new knowledge from each experiment (such as whether the change actually has an effect on the business), accumulate new methodologies, and discover potential issues in the business, data, or engineering. The growth in personal skills and the improvement of business processes are all very valuable.

Thought Question #

Have you encountered a situation in which you did not obtain significant results in an A/B test? How did you handle it at that time, and did you gain any valuable experience from the experiment?

Feel free to leave a comment and discuss in the comment section. You can also click “Share with a Friend” to share today’s content with your colleagues and friends, and learn and grow together. Alright, thank you for listening, and see you in the next lesson.