02 Statistical Fundamentals: A Deep Understanding of Hypothesis Testing in A/B Tests #

Hello, this is Bo Wei.

In the previous lesson about the statistical properties of A/B testing metrics, I briefly explained hypothesis testing with one sentence: selecting an appropriate test method to verify whether the hypothesis we propose in A/B testing is correct.

This statement is quite abstract, so in today’s lecture, we will delve into it to see what hypothesis testing is and how to use it to make inferences.

What is Hypothesis Testing? #

As the name suggests, hypothesis testing is about testing whether our proposed hypothesis is correct and whether it can stand up to the facts.

In statistics, it is difficult for us to obtain data about the entire population. However, we can obtain sample data and generate hypotheses about the population based on the sample data. So, what we refer to as hypothesis testing is actually the process of determining whether the hypothesis generated from the sample data holds true for the population (i.e., the facts).

In the context of A/B testing, a hypothesis generally refers to an inference about the size of metrics between the experimental group and the control group.

To help you understand hypothesis testing more vividly, in this lesson, I will start with a case study of a recommendation system and abstract the basic principles and related concepts of hypothesis testing from it, allowing you to learn theory through practice and apply theory to practice.

The recommendation system in a news app is an important component that can recommend content based on a user’s browsing history. Recently, the engineering team improved the algorithm of the recommendation system and wanted to verify the improvement through A/B testing.

The new algorithm is used in the experimental group, while the old algorithm is used in the control group. The effectiveness of the algorithm is represented by the click-through rate: the better the recommendation, the higher the click-through rate. Therefore, the hypothesis we propose is that the click-through rate of the experimental group (new algorithm) is higher than that of the control group (old algorithm).

You may be confused about whether the “hypothesis” we propose is the same as the “hypothesis” in hypothesis testing.

In fact, they are not exactly the same.

What are the “hypotheses” in hypothesis testing? #

Why is this the case? Because in hypothesis testing, the “hypotheses” refer to a pair of hypotheses: the null hypothesis and the alternative hypothesis, which are completely opposite to each other. In the context of A/B testing, the null hypothesis refers to the assumption that there is no difference between the metrics of the experimental group and the control group, while the alternative hypothesis refers to the assumption that there is a difference between the metrics of the experimental group and the control group.

To better understand the null hypothesis and the alternative hypothesis, let’s go back to the case of recommendation systems and convert the initial hypothesis into the null hypothesis and the alternative hypothesis in hypothesis testing:

  • The null hypothesis is that there is no difference in click-through rates between the experimental group and the control group.
  • The alternative hypothesis is that there is a difference in click-through rates between the experimental group and the control group.

You might wonder, wasn’t our initial hypothesis “the click-through rate of the experimental group is higher than that of the control group”? Why does the alternative hypothesis only state that there is a difference in click-through rates between the two groups, without specifying which one is larger or smaller?

To answer this question, we need to understand the concepts of one-tailed test and two-tailed test.

  • A one-tailed test specifies not only that the two objects being compared are different, but also explicitly states which one is larger or smaller, such as the metric of the experimental group being larger than that of the control group.
  • A two-tailed test specifies only that the two objects being compared are different, without explicitly stating which one is larger or smaller.
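To make this concrete, here is how the hypotheses could be written out for the click-through-rate case, using \(p_{exp}\) and \(p_{ctrl}\) as shorthand (introduced here) for the true click-through rates of the experimental group and the control group:

  • Null hypothesis: \(H_0: p_{exp} = p_{ctrl}\)
  • One-tailed alternative hypothesis: \(H_1: p_{exp} > p_{ctrl}\)
  • Two-tailed alternative hypothesis: \(H_1: p_{exp} \neq p_{ctrl}\)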

Going back to the initial hypothesis in the case of recommendation systems, we have already specified that the click-through rate of the experimental group is higher than that of the control group, so we should choose a one-tailed test. However, our alternative hypothesis has become a two-tailed test assumption that the click-through rates of the two groups are different. Why is that?

This is where theory and practice diverge, and it is also why the theory of A/B testing feels easy to grasp while problems keep coming up in practice. Let me give you the conclusion first and then explain why. The conclusion is: in A/B testing practice, a two-tailed test is recommended.

There are two main reasons why a two-tailed test is recommended.

The first reason is that a two-tailed test allows the data itself to play a greater role in decision-making.

When we use A/B testing in practice, we hope to use data to drive decision-making and minimize any subjective ideas that may interfere with the data. Therefore, a two-tailed test, which does not require us to explicitly state which one is larger or smaller, can better leverage the power of the data.

The second reason is that a two-tailed test helps us consider the positive and negative results of the changes comprehensively.

In practice, we expect that changes will result in improvements in the metrics. But what if the actual change in the metrics is exactly opposite to our expectation? This is where the advantages of a two-tailed test come into play. A two-tailed test can take into account both positive and negative results, which is closer to the complex reality. However, a one-tailed test can only apply to one of them, usually the positive effect we expect.

Therefore, it is precisely because we choose a two-tailed test that we only state in the alternative hypothesis that there is a difference between the two groups without specifying which one is larger or smaller.

What are the different types of “tests” in hypothesis testing and how should they be chosen? #

Now, we know that there are two types of “hypotheses” in hypothesis testing, the null hypothesis and the alternative hypothesis. But what about the “tests”?

In fact, there are many types of tests. Classified from the perspective of the “hypotheses”, we have the one-tailed test and the two-tailed test that we just discussed. Common “tests” can also be classified by the number of samples being compared: the one-sample test, the two-sample test, and the paired test. So, which test method should be chosen in practice?

The answer is: In A/B testing, the two-sample test should be used.

The reason is actually quite simple; once I explain what each of these tests is used for, you will understand.

  • When comparing two sets of sample data, a two-sample test should be used. For example, in an A/B test, comparing the experimental group and the control group.
  • When comparing one set of sample data with a specific value, a one-sample test should be used. For example, if I want to compare whether the average daily usage time of Geek Time users has reached 15 minutes, I can compare one set of sample data (the daily usage time of a sample of Geek Time users) with a specific value (15).
  • When comparing changes in the same set of sample data before and after a specific event, a paired test should be used. For example, I randomly select 1000 Geek Time users, give them a “10% off all columns” discount, and then compare their average daily usage time before receiving the discount with their average daily usage time one month after receiving it.
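To make the distinction more tangible, here is a minimal Python sketch of the three situations using SciPy's t-test functions (the data and numbers below are made up purely for illustration):

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two-sample test: compare the experimental group with the control group.
exp_time = rng.normal(16, 5, size=1000)    # daily usage time, experimental group
ctrl_time = rng.normal(15, 5, size=1000)   # daily usage time, control group
print(stats.ttest_ind(exp_time, ctrl_time))

# One-sample test: compare one group of users against a fixed value (15 minutes).
usage_time = rng.normal(15.5, 5, size=1000)
print(stats.ttest_1samp(usage_time, popmean=15))

# Paired test: the same 1000 users before and after receiving the discount.
before = rng.normal(15, 5, size=1000)
after = before + rng.normal(1, 2, size=1000)   # each value is paired with "before"
print(stats.ttest_rel(before, after))

The point here is only which function (and thus which kind of test) fits which comparison; the choice between a t-test and a Z-test is discussed next.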

At this point, you may ask, I have also heard of the T-test and the Z-test, how should these two tests be chosen in A/B testing?

The choice between the T-test and the Z-test mainly depends on the sample size and whether the population variance is known.

  • When the population variance is unknown, the T-test should be used.
  • When the population variance is known and the sample size is larger than 30, the Z-test should be used.

I have also created a chart for you, which will make it clear at a glance.

[Chart: choosing between the T-test and the Z-test, based on whether the population variance is known and on the sample size]

As for the practical application of these theories in A/B testing, one rule of thumb is: T-test is generally used for mean-type indicators, while Z-test (proportion test) is generally used for probability-type indicators.

Why is that?

As I mentioned in the previous lesson, in the case of a large sample size, mean-type indicators follow a normal distribution. The calculation of the population variance of a normal distribution requires knowing the values of each data point in the population, which is almost impossible in reality because we can only obtain sample data. Therefore, the population variance is unknown, and the T-test should be used.

On the other hand, proportion-type indicators follow a binomial distribution. The calculation of the population variance of a binomial distribution does not require knowing the values of each data point in the population and can be obtained from sample data. Moreover, the sample size in A/B testing is generally much larger than 30, so the Z-test should be used. Here, the term proportion test specifically refers to the Z-test used for probability-type indicators.
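For reference, a common textbook form of the two-sample proportion Z-test statistic is shown below (tools such as R's prop.test actually compute an equivalent chi-squared statistic, by default with a continuity correction, so hand calculations may differ slightly in the last digits):

\[ \hat{p} = \frac{x_1 + x_2}{n_1 + n_2}, \qquad Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\,(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \]

where \(x_1, x_2\) are the click counts, \(n_1, n_2\) are the sample sizes, and \(\hat{p}_1 = x_1/n_1\), \(\hat{p}_2 = x_2/n_2\) are the observed click-through rates of the two groups. Notice that the variance term \(\hat{p}(1-\hat{p})\) comes straight from the binomial distribution and is estimated from the sample itself, which is exactly why the Z-test applies here.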

After discussing various types of tests, let me summarize: For A/B testing, a two-tailed, two-sample proportion test (for probability-type indicators) or a T-test (for mean-type indicators) should be selected.

Now, returning to our case, since the click-through rate is a probability-type indicator, a two-tailed, two-sample proportion test should be chosen here.

How to Make Inferences Using Hypothesis Testing? #

After selecting the correct hypothesis and testing method, the next step is to test whether our hypothesis is correct. In A/B testing, this is the step of analyzing the test results.

Possible Results of A/B Testing #

Hypothesis testing can infer two results:

  1. Accept the null hypothesis and reject the alternative hypothesis, which means that the metrics of the experimental group and the control group are the same.

  2. Accept the alternative hypothesis and reject the null hypothesis, which means that the metrics of the experimental group and the control group are different.

However, please note that these two results are only inferred by hypothesis testing based on sample data and a series of statistical calculations. They do not represent the factual situation (population data). When considering the factual situation, there are four possible combinations of the inference result and the actual situation:

  • The test infers a difference, and the metrics are in fact different: the inference is correct.
  • The test infers no difference, and the metrics are in fact the same: the inference is correct.
  • The test infers a difference, but the metrics are in fact the same: a Type I error.
  • The test infers no difference, but the metrics are in fact different: a Type II error.

As you can see, the inference is correct only when the result of the hypothesis test matches the fact. Otherwise, one of two types of errors occurs.

Type I Error: Statistically, it is defined as rejecting the null hypothesis when the null hypothesis is actually true. In A/B testing, the null hypothesis is that the metrics of the two groups are the same. When the hypothesis test infers that the metrics of the two groups are different, but in fact the metrics of the two groups are the same, that is a Type I error. We call the situation where the metrics of the two groups are different a positive result, so a Type I error is also called a false positive.

The probability of a Type I error is denoted by α, also known as the significance level. “Significant” here means that the observed result would be a low-probability event if the null hypothesis were true. In statistics, events with an occurrence probability of less than 5% are considered low-probability events, meaning they are not easily observed by chance. Therefore, the significance level is generally set at 5%.

Type II Error: Statistically, it is defined as accepting the null hypothesis when the null hypothesis is actually false. In A/B testing, when the hypothesis test infers that the metrics of the two groups are the same, but in fact the metrics of the two groups are different, that is a Type II error. We call the situation where the metrics of the two groups are the same a negative result, so a Type II error is also called a false negative. The probability of a Type II error is denoted by β and is conventionally set at 20% in statistics.

The concepts of these two errors may sound complicated and difficult to understand. Let me explain them specifically using an example of nucleic acid testing for COVID-19.

In this example, the null hypothesis is: the tested person is healthy and does not carry the COVID-19 virus.

Carrying the COVID-19 virus is considered positive, and not carrying the virus is considered negative. If a healthy person is tested and the result shows that the person carries the COVID-19 virus, it is a Type I error, rejecting the null hypothesis that is actually true, which is a false positive. If a COVID-19 patient is tested and the result shows that the person does not carry the virus, it is a Type II error, accepting the null hypothesis that is actually false, which is a false negative.

Now that we understand the possible results of hypothesis testing, how can we obtain the test results through hypothesis testing?

There are two commonly used methods in practice: the P-value method and the Confidence Interval method.

P-value method #

In statistics, the P-value is the probability of observing the sample data, or data even more extreme, when the null hypothesis is true. In the context of A/B testing, the P-value is the probability of observing a difference between the experimental group's and the control group's metrics at least as large as the one in our sample, assuming the two groups' metrics are actually the same.

If that probability is small, for example less than 5%, it means the observed difference between the experimental group and the control group would be a rare event if the null hypothesis were true. Although such an event should seldom occur under the null hypothesis, we did observe it. Therefore, we reject the null hypothesis and accept the alternative hypothesis, concluding that the metrics of the two groups are different.

Conversely, if the P-value is large, for example 70%, the observed difference is quite likely to occur even when the null hypothesis is true. In that case, we accept the null hypothesis and reject the alternative hypothesis, concluding that the metrics of the two groups are the same.

In statistics, we compare the P-value with the significance level α. Since α is generally set at 5%, we compare the P-value with 5% to determine the results of hypothesis testing:

  • When the P-value is less than 5%, we reject the null hypothesis and accept the alternative hypothesis. We conclude that the metrics of the two groups are different, which is also known as a significant result.
  • When the P-value is greater than 5%, we accept the null hypothesis and reject the alternative hypothesis, concluding that the metrics of the two groups are the same, also known as a nonsignificant result.

As for the specific calculation of the p-value, I recommend using tools such as Python or R:

  • For a proportion test, you can use the proportions_ztest function in Python (from the statsmodels package) or the prop.test function in R.
  • For a t-test, you can use the ttest_ind function in Python (from scipy.stats) or the t.test function in R.
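As an illustration, here is a minimal Python sketch of both calls (the proportion example reuses the click counts from the recommendation-system case analyzed at the end of this lesson; the mean-type data are made up):

import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.proportion import proportions_ztest

# Proportion test (probability-type metric, e.g. click-through rate).
clicks = np.array([2440, 2089])      # clicks in the experimental and control groups
users = np.array([43578, 43524])     # sample sizes of the two groups
z_stat, p_value = proportions_ztest(count=clicks, nobs=users, alternative='two-sided')
print(z_stat, p_value)               # p-value on the order of 1e-7 for these data

# t-test (mean-type metric, e.g. average daily usage time); made-up data.
rng = np.random.default_rng(0)
exp_time = rng.normal(16, 5, size=2000)
ctrl_time = rng.normal(15, 5, size=2000)
t_stat, p_value = ttest_ind(exp_time, ctrl_time, equal_var=False)
print(t_stat, p_value)

Note that proportions_ztest does not apply a continuity correction, while R's prop.test does by default, so the two tools can report slightly different p-values for the same data.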

Confidence Interval Method #

A confidence interval is a range, usually reported together with a percentage, most commonly a 95% confidence interval. What does this mean? In statistics, it means that if we repeatedly drew samples and computed the interval each time, about 95% of those intervals would contain the true population value. This is called a 95% confidence interval.

The statistical definition of a confidence interval may not be easy to grasp intuitively, but you can think of it as describing the range of variability of a random variable: a 95% confidence interval is an interval that covers 95% of that range of variability.

The essence of A/B testing is to determine whether the metrics of the control group and the experimental group are equal. How do we determine this? The answer is to calculate the difference δ between the metrics of the experimental group and the control group. Because the metrics are random variables, their difference δ will also be a random variable with certain variability.

This means that we need to calculate the confidence interval of δ and then see if this interval includes 0. If it includes 0, it means that δ could be 0, indicating that the two groups have similar metrics; if it does not include 0, it means that the two groups have different metrics.

Regarding the specific calculation of the confidence interval, I also recommend using tools such as Python or R:

  • For a proportion test, you can use the proportion_confint function in Python (statsmodels) or the prop.test function in R.
  • For a t-test, you can use the tconfint_diff method in Python (on statsmodels' CompareMeans object) or the t.test function in R.

Now, let’s go back to the case of the recommendation system, and I will use both the p-value method and the confidence interval method to make judgments based on the A/B test results.

  • Experimental group (new recommendation algorithm): Sample size of 43,578, with 2,440 clicks, a click-through rate of 5.6%.
  • Control group (old recommendation algorithm): Sample size of 43,524, with 2,089 clicks, a click-through rate of 4.8%.

In this case, I will use the prop.test function in R to calculate the p-value and confidence interval:

prop.test(x = c(2440, 2089), n = c(43578, 43524), alternative = "two.sided", conf.level = 0.95)

Running this code, we obtain a p-value of \(1.167 \times 10^{-7}\), which is far smaller than 5% and essentially 0. Therefore, we reject the null hypothesis, accept the alternative hypothesis, and infer that the metrics of the experimental group and the control group are significantly different.

At the same time, we can also conclude that the 95% confidence interval of the difference δ between the two groups’ metrics is [0.005, 0.011], which does not include 0. Therefore, we can also infer that the metrics of the two groups are significantly different.
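If you are curious where an interval like this comes from, here is a rough cross-check using the standard normal-approximation formula for the difference between two proportions (a sketch, not the exact computation prop.test performs, since prop.test applies a continuity correction by default):

import numpy as np
from scipy.stats import norm

x1, n1 = 2440, 43578          # experimental group: clicks, sample size
x2, n2 = 2089, 43524          # control group: clicks, sample size
p1, p2 = x1 / n1, x2 / n2
delta = p1 - p2               # observed difference in click-through rates, about 0.008

# Standard error of the difference between two independent proportions (unpooled).
se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z = norm.ppf(0.975)           # critical value for a two-sided 95% confidence level
ci_low, ci_high = delta - z * se, delta + z * se
print(f"delta = {delta:.4f}, 95% CI = [{ci_low:.4f}, {ci_high:.4f}]")
# The interval lies entirely above 0, consistent with the [0.005, 0.011] from prop.test.

Because the interval does not include 0, we reach the same conclusion as with the p-value method: the two groups' click-through rates are significantly different.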

Summary #

In today’s class, we focused on the theoretical foundation of A/B testing - hypothesis testing. We learned about hypotheses, testing, and related statistical concepts. Just remember the following two points:

First, for A/B testing, choose a two-tailed, two-sample proportion test (for probability metrics) or a t-test (for mean metrics). This determines how you select the parameters for testing when analyzing the A/B test results, so it is important.

Second, in A/B testing practice, these statistical concepts will be used when calculating sample size, metric variability, and analyzing test results.

  • When calculating sample size, use the concepts of Type I/Type II errors and their probabilities α and β.
  • When calculating metric variability, use variance and confidence intervals.
  • When analyzing A/B test results, use various tests, confidence intervals, and p-values.

The concepts and knowledge about hypothesis testing in this class are somewhat fragmented. To facilitate your understanding and memory in the future, I have prepared the following mind map for you:

With this, we conclude our statistics section. By now, you should have a good grasp of the basic statistical knowledge required for A/B testing. The content of the first two classes is more theoretical and may be harder to digest, and rote memorization of theory alone rarely works well. So how can you master this theoretical knowledge? In my years of A/B testing practice, I have found that to truly understand and apply theory, you need to think and practice more on your own. Once you have some practical experience, you will naturally appreciate the benefits of the theory, and when you go back to review it, it will be much easier to understand.

Therefore, if there are any parts of today's content that you still don't understand, that's okay; don't create mental barriers for yourself, and feel free to set them aside for now. In the upcoming classes, I will apply the theory we learned today to solve problems encountered in A/B testing. You can revisit this theory as you go, or search for additional resources on your own. When you finish the entire course and come back to review these two theoretical lessons, you will find that the theory is actually quite simple.

Next, we will move on to the “Fundamentals” section and delve into the detailed process of A/B testing!

Reflection Questions #

The statistical concepts covered in this lesson are often heard but difficult to understand. Do you have any unique insights on these concepts in your study of statistics? Feel free to share your thoughts and ideas in the comment section. Let’s engage in an exchange and discussion together. If you feel that you have gained something from this lesson, please consider sharing the course with your colleagues or friends to progress together!