
06 Selection of Sample Size: Is More Samples Always Better? #

Hello, I’m Bowen.

We have covered a lot of the preparation for A/B testing so far: we have established our goals and metrics and selected the experimental units. So, can we proceed with the test now?

Hold on for a moment. We still need to address one final question before proceeding with the testing: What is the appropriate sample size?

Breaking the Misconception: More is Not Always Better in Sample Size #

If I ask you how many samples are appropriate for an A/B test, your first reaction would probably be that more is always better. The more samples, the more accurate the experimental results!

From a statistical point of view, this is indeed true: a larger sample is more representative. In practical business scenarios, however, a smaller sample size is often preferable.

Why is it so? Let me analyze it for you.

To understand this issue, you first need to know how long an A/B test should last. Let me give you a formula: Time Required for A/B Test = Total Sample Size / Sample Size Obtained per Day.

From this formula, you can see that a smaller sample size means a shorter experiment. In actual business scenarios, time is often the most valuable resource; after all, the whole value of rapid iteration lies in the word “fast”.

In addition, the purpose of conducting A/B tests is to verify whether a certain change can improve the product or business. Of course, it is also possible that a certain change may harm the product or business. Therefore, there is a certain cost of trial and error. In this case, the smaller the experimental scope and sample size, the lower the cost of trial and error.

You see, the demands of sample size in practice and theory are actually contradictory. So, we need to strike a balance between statistical theory and practical business scenarios: in A/B testing, we need to ensure that the sample size is large enough while keeping the experiment as short as possible.

So, how should we determine the sample size?

You might say that there are many online sample size calculators, so can’t we just use one of those? That is indeed one option, but have you ever considered whether these online calculators suit every A/B test? And if they don’t, how should we calculate the sample size ourselves?

In fact, we can only calculate the sample size correctly if we understand the principles behind sample size calculation.

Therefore, in this lesson, I will first familiarize you with the theoretical foundations of statistics, and then guide you through practical calculations, so that you can learn to calculate the sample size required for different types of evaluation indicators. Finally, I will provide you with a case study to help you master today’s content.

The Principles behind Sample Size Calculation #

Let’s get straight to the point. First, I will provide you with the formula for calculating sample size, and then I will explain it in detail:

\[ \mathrm{n}=\frac{\left(Z_{1-\frac{\alpha}{2}}+Z_{1-\beta}\right)^{2}}{\left(\frac{\delta}{\sigma_{\text{pooled}}}\right)^{2}}=\frac{\left(Z_{1-\frac{\alpha}{2}}+Z_{\text{power}}\right)^{2}}{\left(\frac{\delta}{\sigma_{\text{pooled}}}\right)^{2}} \]

Where:

  • \(Z_{1-\frac{\alpha}{2}}\) is the Z score corresponding to \(1-\frac{\alpha}{2}\).
  • \(Z_{\text{power}}\) is the Z score corresponding to Power.
  • \(\delta\) is the difference between the evaluation metrics of the experimental group and the control group.
  • \(\sigma_{\text{pooled}}^{2}\) is the pooled variance of the experimental group and the control group.

From the formula, we can see that the sample size is mainly determined by α, Power, δ, and \(\sigma_{\text{pooled}}^{2}\). We rely on these four factors to adjust the sample size. Now let’s discuss in detail how each factor affects the sample size n.

Among these four factors, I have already talked about α, δ, and \(\sigma_{\text{pooled}}^{2}\) in previous lessons. So before discussing how each factor affects the sample size n, let me first introduce what Power actually means.

Understanding Power #

Power is also known as Statistical Power. In the second lesson on basic statistics, I explained the Type II Error rate β; in statistical theory, Power = 1 − β.

Power is essentially a probability. In A/B testing, if the indicators of the experimental group and the control group are actually different, Power refers to the probability of detecting this difference through A/B testing.

This may still be somewhat abstract, but it’s okay. Power is indeed a difficult statistical concept to understand. I was also confused when I first encountered it. So, let me give you another example to help you understand Power.

The user registration rate of a social media app is low, and the product manager wants to increase the user registration rate by optimizing the user registration process. The user registration rate here is defined as the number of users who have completed the registration divided by the total number of users who initiated the registration, multiplied by 100%.

Now, we can use A/B testing to verify whether this optimization can indeed improve the user registration rate.

We divide the users into a control group and an experimental group, where:

  • The control group follows the normal user registration process: input basic personal information - verification via SMS/email - successful registration.
  • The experimental group, in addition to the normal user registration process, includes a third-party account login feature such as WeChat or Weibo. Users can register and log in with third-party accounts.

I believe you can guess without my explanation that the registration rate of the experimental group will definitely be higher than that of the control group because the experimental group simplifies the registration process for users. This means that the registration rates of these two groups are actually different.

Now, if the A/B test has a Power of 80%, it means that this A/B test has an 80% chance of accurately detecting the difference in registration rates between these two groups and obtaining statistically significant results. In other words, there is a 20% chance that this A/B test will mistakenly conclude that the registration rates of these two groups are the same.

As you can see, the larger the Power, the more accurate the A/B test is in detecting differences between the experimental and control groups (if they are actually different).

Let me give you another analogy. You can think of A/B testing as a radar detecting airborne objects. The sensitivity of a radar specialized in detecting small drones needs to be higher than that of a radar specialized in detecting large passenger planes. The smaller the object to be detected, the higher the sensitivity of the radar needed. In this case, the sensitivity of the radar is equivalent to the Power of A/B testing. The larger the Power, the better it can detect differences between the two groups.

So, think of Power as the sensitivity of A/B testing.
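To make this concrete, here is a minimal simulation sketch in R. The numbers are hypothetical choices of mine: a true registration rate of 60% in the control group, 65% in the experimental group, and 1,500 users per group. The function estimates Power as the fraction of simulated A/B tests that detect the true difference.

```r
# A minimal sketch of what Power means, with hypothetical numbers:
# true registration rates of 60% (control) vs 65% (experimental),
# and 1,500 users per group.
set.seed(42)

simulate_power <- function(p_control, p_test, n_per_group, n_sims = 5000) {
  detected <- replicate(n_sims, {
    x_control <- rbinom(1, n_per_group, p_control)
    x_test    <- rbinom(1, n_per_group, p_test)
    # Two-proportion test at the 5% significance level
    prop.test(c(x_test, x_control), c(n_per_group, n_per_group),
              correct = FALSE)$p.value < 0.05
  })
  mean(detected)  # fraction of simulated A/B tests that detect the difference
}

simulate_power(p_control = 0.60, p_test = 0.65, n_per_group = 1500)
# Should print a value near 0.8: this design detects the real difference
# about 80% of the time, i.e., its Power is about 80%.
```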

The Relationship between the Four Factors and the Sample Size n #

After understanding Power, let’s now look at the relationship between the four factors, α, Power, δ, and \(\sigma_{\text{pooled}}^{2}\), and the sample size n.

1. Significance Level α

The significance level is negatively related to the sample size: the smaller the significance level, the larger the sample size needed. This is not difficult to understand: the significance level is the Type I Error rate α, and a larger sample size is needed to reduce the rate of Type I Errors and obtain more accurate results.

2. Power (1 – β)

Power is positively related to the sample size: the larger the Power, the larger the sample size needed. A larger Power means a smaller Type II Error rate β. As with Type I Errors, reducing the rate of Type II Errors to obtain more accurate results requires a larger sample size.

3. Pooled Variance \(\\sigma\_{\\text {pooled}}^{2}\) of the Experimental Group and Control Group

Variance is directly proportional to the sample size: the larger the variance, the larger the sample size.

As mentioned earlier, variance is used to describe the variability of evaluation indicators. A larger variance indicates a larger range of fluctuations in the evaluation indicators, which means they are more unstable. This requires more samples to conduct experiments and obtain accurate results.

4. Difference between the Evaluation Indicators δ of the Experimental Group and Control Group

The difference is inversely related to the sample size: the smaller the difference, the larger the sample size (in fact, n is inversely proportional to \(\delta^{2}\)). This is because a smaller difference between the evaluation metrics of the experimental group and the control group is harder for the A/B test to detect, so a larger sample size is needed to maintain the desired Power and ensure accuracy.
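To see these four relationships in action, here is a small R sketch of the formula from the beginning of this section. The function name and example numbers are mine, chosen purely for illustration.

```r
# The sample size formula from this lesson, written as an R function.
sample_size <- function(delta, var_pooled, alpha = 0.05, power = 0.80) {
  z_alpha <- qnorm(1 - alpha / 2)  # Z score for the significance level
  z_power <- qnorm(power)          # Z score for Power
  (z_alpha + z_power)^2 / (delta / sqrt(var_pooled))^2
}

# Vary one factor at a time relative to the baseline and watch n move:
sample_size(delta = 0.05, var_pooled = 0.25)                # baseline, ~785
sample_size(delta = 0.05, var_pooled = 0.25, alpha = 0.01)  # smaller alpha, larger n
sample_size(delta = 0.05, var_pooled = 0.25, power = 0.90)  # larger Power, larger n
sample_size(delta = 0.05, var_pooled = 0.50)                # larger variance, larger n
sample_size(delta = 0.02, var_pooled = 0.25)                # smaller delta, larger n
```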

How to calculate sample size in practice? #

In practice, the majority of A/B tests follow statistical convention and set the significance level to the default 5% and Power to the default 80%. This fixes the Z scores in the formula: \(Z_{1-\frac{\alpha}{2}} = 1.96\) and \(Z_{\text{power}} = 0.84\), so \((1.96+0.84)^{2} \approx 7.84 \approx 8\). With two of the four factors (α and Power) fixed, the sample size mainly depends on the remaining two: the pooled variance \(\sigma_{\text{pooled}}^{2}\) of the experimental and control groups, and the difference \(\delta\) between the evaluation metrics of the two groups. The sample size formula can therefore be simplified as:

\[ \mathrm{n} \approx \frac{8 \sigma_{\text{pooled}}^{2}}{\delta^{2}} \]

Now, we can use this simplified formula to estimate the sample size.
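If you still have the sample_size() sketch from above in your R session, you can check where the constant 8 comes from; the numbers here are again just illustrative.

```r
# The simplified rule of thumb next to the full formula.
simple_n <- function(delta, var_pooled) 8 * var_pooled / delta^2

simple_n(delta = 0.05, var_pooled = 0.25)     # ~800
sample_size(delta = 0.05, var_pooled = 0.25)  # ~785; the small gap is just
# (1.96 + 0.84)^2 = 7.84 being rounded up to 8 in the simplified formula
```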

In this formula, the variance is an attribute of the data and represents the fluctuation of the data, while the difference between the evaluation metrics of the two groups is related to the variables in the A/B test and their impact on the evaluation metrics.

The above formula calculates the minimum sample size required to achieve statistical significance in an A/B test, assuming the pooled variance of the evaluation metrics of the two groups is \(\sigma_{\text{pooled}}^{2}\) and the difference between the evaluation metrics is \(\delta\).

Note that the emphasis here is on the word “minimum”. In theory, the larger the sample size, the better, with no upper limit. However, in practice, smaller sample sizes are preferred, so we need to find a balance. The sample size calculated from the formula is therefore the optimum that balances statistical accuracy against speed.

Once the sample size is calculated, the next step is to allocate the sample sizes for the control and experimental groups. This raises the question of how to allocate the sample sizes. In this regard, there is a common misunderstanding. Let’s analyze the issue of sample size allocation in detail.

The Sample Sizes of the Control and Experimental Groups Should Be Equal #

If the sample sizes of the experimental and control groups in an A/B test are equal, i.e., a 50%/50% split, then the total sample size (sum of the sample sizes of the experimental and control groups) is:

\(\mathrm{Total\ Sample\ Size} = 2 \times \mathrm{Sample\ Size\ per\ Group} \approx \frac{16 \sigma_{\text{pooled}}^{2}}{\delta^{2}}\)

You may ask, is it necessary for the sample sizes of the experimental and control groups to be equal?

Although having unequal sample sizes for the two groups is theoretically feasible and can be done in practice, I strongly advise against it. Let me explain in detail.

A common misconception is that if the sample size of the experimental group is larger and the sample size of the control group is smaller (e.g., an 80%/20% split), significant results can be obtained more quickly. In reality, the opposite is true: having unequal sample sizes for the two groups will actually prolong the test.

Why is that? Because the minimum sample size required to achieve statistical significance, as calculated above, applies to each group individually, not to the total. In other words, with an unequal split, the results can only become statistically significant once the smaller group reaches the minimum sample size. Making the experimental group ever larger does not help, because the bottleneck is the smaller control group.

Compared to an equal split of 50%/50%, an unequal split will result in two outcomes, both of which are unfavorable for the business:

  1. Decreased accuracy: If the same testing period is maintained, the sample size of the control group will be smaller, resulting in a lower power and decreased accuracy of the test results.
  2. Prolonged testing time: If the sample size of the control group is kept constant, more samples need to be collected by extending the testing time.

Therefore, only an equal split maximizes the size of the smaller group for a given total sample. This makes the best use of the total sample size and lets the A/B test run faster and more accurately. You can verify this with the simulation sketch below.
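This sketch is my own, using the 60% vs 68.2% registration rates from this lesson’s worked example. It holds the total sample size fixed and compares the empirical Power of a 50%/50% split against an 80%/20% split.

```r
# Comparing splits at a fixed total sample size and a fixed true effect.
set.seed(7)

split_power <- function(n_total, share_test, p_control = 0.60,
                        p_test = 0.682, n_sims = 5000) {
  n_test    <- round(n_total * share_test)
  n_control <- n_total - n_test
  detected <- replicate(n_sims, {
    x_test    <- rbinom(1, n_test, p_test)
    x_control <- rbinom(1, n_control, p_control)
    prop.test(c(x_test, x_control), c(n_test, n_control),
              correct = FALSE)$p.value < 0.05
  })
  mean(detected)  # empirical Power for this split
}

split_power(n_total = 1096, share_test = 0.5)  # equal split: roughly 0.8
split_power(n_total = 1096, share_test = 0.8)  # 80%/20%: noticeably lower,
                                               # around 0.6
```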

You may wonder: this sample size estimation happens before the A/B test is conducted, so without having run the experiment, how do we know the difference \(\delta\) between the evaluation metrics of the two groups?

Estimating the Difference \(\delta\) between the Evaluation Metrics of the Experimental and Control Groups #

Of course, we do not know the results in advance, but we can estimate the difference \(\delta\) between the evaluation metrics of the two groups using the following two methods.

The first method is to estimate it from the perspective of revenue and costs.

Any changes in business/products will incur costs, including but not limited to labor costs, time costs, maintenance costs, and opportunity costs. So, can the total revenue generated by the changes offset the costs and result in a positive net income?

For example, suppose we want to increase the user registration rate of an app by optimizing the registration process. Assume the optimization costs around 30,000 yuan (mainly labor and time), the registration rate before optimization is 60%, 100 people start the registration each day, and the average spending per new user is 10 yuan. If the registration rate rises to 70% after optimization, it will generate an additional 36,500 yuan of revenue in a year ((70% − 60%) × 100 × 10 × 365). The net income within a year is then positive, meaning the optimization not only recovers its costs but also turns a profit; a 10 percentage point lift would therefore be an ideal improvement.

Of course, when making the corresponding changes, we want a positive net income. So we usually calculate the difference at which we break even, denoted \(\delta_{\text{break-even}}\), and we hope that \(\delta \geq \delta_{\text{break-even}}\). In this example, \(\delta_{\text{break-even}}\) = 30,000 / (10 × 100 × 365) ≈ 8.2%, so we want the minimum difference \(\delta\) to be at least 8.2 percentage points.
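Here is the same break-even arithmetic as a tiny R snippet, using the numbers from the example (the variable names are mine):

```r
# Break-even difference = cost / (spend per user * registration starts * days)
cost           <- 30000  # one-off optimization cost, in yuan
spend_per_user <- 10     # average spending per newly registered user, in yuan
starts_per_day <- 100    # users who initiate registration each day
days           <- 365    # one-year horizon for counting revenue

delta_break_even <- cost / (spend_per_user * starts_per_day * days)
delta_break_even  # ~0.082, i.e., an 8.2 percentage point lift breaks even
```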

The second method applies when revenue and costs are hard to estimate: look for clues in historical data. Using the methods I introduced in Lesson 4 on calculating the variability of metrics, compute the average value and fluctuation range of the evaluation metric from historical data; this yields an approximate difference \(\delta\) between the evaluation metrics of the experimental and control groups. For example, if our evaluation metric is click-through rate, and historical data gives an average click-through rate of 5% with a fluctuation range of [3.5%, 6.5%], then our expected value for the experimental group should exceed this fluctuation range, say 7%. In that case, δ = 7% − 5% = 2%.

Calculating the Pooled Variance \(\sigma_{\text{pooled}}^{2}\) of the Experimental and Control Groups #

As for calculating the pooled variance \(\sigma_{\text{pooled}}^{2}\) of the two groups, it mainly involves selecting historical data and applying the appropriate statistical method for the type of evaluation metric. Evaluation metrics fall into two categories: probability metrics and mean metrics.

Probability metrics usually follow a binomial distribution, and the pooled variance is calculated as:

\[ \sigma_{\text{pooled}}^{2}=p_{\text{test}}\left(1-p_{\text{test}}\right)+p_{\text{control}}\left(1-p_{\text{control}}\right) \]

where \(p_{\text{control}}\) is the probability of the event occurring in the control group, which can be calculated from historical data, and \(p_{\text{test}}=p_{\text{control}}+\delta\) is the expected probability of the event occurring in the experimental group.

Mean metrics usually follow a normal distribution. When the sample size is large, by the central limit theorem, the pooled variance can be calculated as:

\[ \sigma_{\text{pooled}}^{2}=\frac{2 \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}}{n-1} \]

where:

  • n represents the size of the historical data sample.
  • \(x_{i}\) represents a specific value (e.g., the usage duration or purchase amount) of the i-th user in the historical data sample.
  • \(\bar{x}\) represents the average value (e.g., the average usage duration or purchase amount) of the users in the historical data sample.
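These two formulas translate directly into R. The function names below are mine, and the mean-metric data is a made-up toy vector:

```r
# Pooled variance for a probability metric (e.g., a registration rate):
pooled_var_prob <- function(p_control, delta) {
  p_test <- p_control + delta
  p_test * (1 - p_test) + p_control * (1 - p_control)
}
pooled_var_prob(p_control = 0.60, delta = 0.082)  # ~0.46

# Pooled variance for a mean metric, from a vector of historical values
# (e.g., per-user usage durations); this equals 2 * var(x):
pooled_var_mean <- function(x) {
  2 * sum((x - mean(x))^2) / (length(x) - 1)
}
pooled_var_mean(c(12, 9, 15, 11, 8, 14))  # toy historical data
```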

Alright, with that, we have covered all the core content for this lesson. However, in order to help you better understand the principles and calculation methods of these formulas, let me provide an example of how to calculate the sample size using the case of optimizing the registration process to increase the user registration rate.

Example Explanation #

We can use the formula mentioned earlier to calculate the sample size:

First, let’s calculate the difference δ between the evaluation metrics of the experimental group and the control group. In the earlier example of optimizing the user registration rate for an app, we estimated \(\delta_{\text{break-even}}\) = 8.2% based on cost and revenue considerations.

Next, let’s calculate \(\sigma_{\text{pooled}}^{2}\). Based on historical data, the registration rate is approximately 60% (\(p_{\text{control}}\)). Combined with \(\delta_{\text{break-even}}\) = 8.2%, we set the expected registration rate after changing the process to 68.2%. Then, using the formula for probability metrics, \(\sigma_{\text{pooled}}^{2}\) = 60% × (1 − 60%) + 68.2% × (1 − 68.2%) ≈ 0.46.

Finally, in the A/B test, we split the experimental group and the control group evenly, 50%/50%. Plugging the numbers into the simplified formula gives:

\[ \mathrm{n} \approx \frac{8 \sigma_{\text{pooled}}^{2}}{\delta^{2}} = \frac{8 \times 0.46}{0.082^{2}} \approx 548 \]

This gives us a minimum sample size of 548 per group, or about 1,096 in total, completing the calculation of the sample size.
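As a sanity check, base R’s built-in power.prop.test solves the same problem. Its answer differs slightly from our 548 because it pools the two proportions under the null hypothesis, but it should land in the same ballpark:

```r
# The n reported here is per group; expect a value close to
# (slightly below) our hand-calculated 548.
power.prop.test(p1 = 0.60, p2 = 0.682, sig.level = 0.05, power = 0.80)
```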

Do you remember the various online A/B test sample size calculators I mentioned earlier? If you examine these calculators carefully, you will notice that almost all of them require you to input the following four parameters:

  1. Baseline Conversion Rate.
  2. Minimum Detectable Lift or Optimized Version Conversion Rate.
  3. Confidence Level (1-α) or Significance Level α.
  4. Statistical Power (1-β).

You may have noticed that these parameters all describe probability metrics. Therefore, the current online sample size calculators can only calculate sample sizes for probability metrics, not for mean metrics. For mean metrics, I recommend calculating the sample size using the formulas, as sketched below.
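For example, suppose (purely hypothetically) that our evaluation metric is average session length, historical data shows a per-group standard deviation of 4 minutes, and we want to detect a 0.5-minute lift:

```r
# Mean metric sample size via the simplified formula.
sd_hist       <- 4              # per-group standard deviation (minutes)
delta         <- 0.5            # lift we want to detect (minutes)
sigma2_pooled <- 2 * sd_hist^2  # pooled variance = 2 * sigma^2
8 * sigma2_pooled / delta^2     # ~1024 per group

# Base R's power.t.test gives a comparable per-group n (slightly smaller,
# since it uses the exact constants rather than the rounded 8):
power.t.test(delta = 0.5, sd = 4, sig.level = 0.05, power = 0.80)
```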

To facilitate the calculation of sample sizes for various metrics in A/B testing, in the final lesson of this column, I will teach you how to create an online sample size calculator using R that can calculate sample sizes for both probability metrics and mean metrics. Stay tuned!

Summary #

In this lesson, we mainly learned how to determine the sample size required for an A/B test and understood the theoretical foundation behind it. The four factors that affect the sample size n can be summarized as follows, where an up arrow indicates an increase and a down arrow indicates a decrease:

  • Significance level α ↓ → sample size n ↑
  • Power (1 − β) ↑ → sample size n ↑
  • Pooled variance \(\sigma_{\text{pooled}}^{2}\) ↑ → sample size n ↑
  • Difference δ ↓ → sample size n ↑

Here I would like to emphasize again that the method introduced in this lesson for calculating the sample size of an A/B test is an estimation before the test. It is aimed at achieving the minimum sample size required for statistical significance in the A/B test results. Therefore, as long as the actual sample size is larger than the minimum sample size, it is sufficient. Of course, if business conditions allow, a larger sample size is always better.

Finally, I would like to remind you that when using online A/B test sample size calculators, we need to pay attention to the parameters we input. Most calculators require conversion rates, so they can only handle probability-based metrics. For probability-based metrics, an online calculator works fine. For other types of metrics, such as mean-based metrics, online calculators do not apply; we still need the formulas to calculate the minimum sample size, or you can follow along with me in the column’s final section and build an online sample size calculator that covers both probability-based and mean-based metrics.

Discussion Questions #

Have you ever used an online A/B test sample size calculator? Have you ever wondered why most of the online sample size calculators can only calculate probability metrics and not mean metrics?

Feel free to leave a comment and discuss in the comments section. You are also welcome to click on “Share with a Friend” to share today’s content with your colleagues and friends, and learn and grow together. Alright, thank you for listening, we’ll see you in the next lesson.