01 Statistical Fundamentals: Understanding the Statistical Properties of Metrics #

Hello, I am Bowen.

When studying and solving technical problems, we often hear the saying "know not only the what, but also the why." So, what is the "why" behind A/B testing? In my opinion, it is the calculation principles underneath: understanding why A/B testing is designed the way it is, and why best practices choose certain metrics and test methods.

Speaking of the calculation principles behind A/B testing, we first need to know that the theoretical foundation of A/B testing is hypothesis testing. It can be said that hypothesis testing runs through the entire process of A/B testing, from experimental design to analysis of test results.

If we were to explain "hypothesis testing" in one sentence, it would be: selecting a suitable test method to verify whether the hypothesis we propose in an A/B test is correct. For now, all you need to know is that the most important and core part of "hypothesis testing" is the "testing," because the choice of test method depends on the statistical properties of the metric.

In other words, understanding the statistical properties of metrics is a prerequisite for mastering hypothesis testing and A/B testing, and it is the first step in “knowing the cause and effect.”

As for the task of gaining a deeper understanding of and effectively using “hypothesis testing,” we will leave it for the next lesson.

What are the statistical properties of metrics? #

In practical business, the indicators we commonly use can be divided into two categories:

  • Mean-based indicators, such as the average usage duration of users, average purchase amount, average purchase frequency, and so on.
  • Probability-based indicators, such as the probability of user clicks (click-through rate), conversion probability (conversion rate), purchase probability (purchase rate), and so on.

Clearly, these indicators are used to represent user behavior. User behavior is highly random, which means that these indicators are variables composed of a series of random events, known as random variables in statistics.

“Random” means that they can take different values. For example, for a social media app, the daily usage time may be less than 1 hour for casual users, while it may be 4 or 5 hours or more for heavy users. So, how do we represent this in statistics?

Yes, we can use probability distributions to represent the probabilities and ranges of different values taken by random variables. Therefore, the statistical attributes of A/B testing indicators depend on the probability distributions they follow.

Here, let me give you the conclusions upfront: mean-based metrics follow a normal distribution when the sample size is large enough; probability-based metrics essentially follow a binomial distribution, but when the sample size is large enough, they can also be approximated by a normal distribution.

When you see these two conclusions, you may have many questions:

  • What is a normal distribution? What is a binomial distribution?
  • How large does the sample size need to be for it to be considered “large enough”?
  • Why can probability-based indicators follow both a binomial distribution and a normal distribution?

Don’t worry, I will answer these questions one by one.

Normal Distribution #

The normal distribution is the most common distribution in A/B testing metrics and is essential for calculating sample size and analyzing test results.

In statistics, if a random variable x follows a probability density function (PDF) of:

\[
f(x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}
\]

where

\[
\begin{aligned}
\mu &= \frac{x_{1}+x_{2}+\cdots+x_{n}}{n} \\
\sigma &= \sqrt{\frac{\sum_{i=1}^{n}\left(x_{i}-\mu\right)^{2}}{n}}
\end{aligned}
\]

then x follows a normal distribution.

In the equation, μ represents the mean (average) of x, σ represents the standard deviation of x, n represents the number of observations of x, and xi represents the value of the ith observation.

The histogram of a random variable x that follows a normal distribution is shown below:

A histogram is a visual representation of the distribution of a random variable: the x-axis shows the possible values of x, and the y-axis shows how frequently each value occurs. From the histogram, you can see that the further a value lies from the mean μ, the less likely it is to appear.

In addition to the mean μ, another important parameter, the standard deviation σ, is visible in the histogram and probability density function. σ measures the dispersion (deviation from the mean) of the random variable by calculating the difference between each value of the random variable and the mean μ.

Next, let’s see how the standard deviation σ affects the distribution of a random variable.

To understand this better, let's simulate a simple example in Python. We take a random variable x that follows a normal distribution with mean μ=0, set its standard deviation σ to 1.0, 2.0, 3.0, and 4.0 in turn, and plot the corresponding density curves. The Python code and the resulting plot are shown below:

from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt

## Create the figure
fig, ax = plt.subplots()

## Plot the normal density curve (mean μ=0) for each standard deviation σ
x = np.linspace(-10, 10, 100)
sigma = [1.0, 2.0, 3.0, 4.0]
for s in sigma:
    ax.plot(x, norm.pdf(x, loc=0, scale=s), label='σ=%.1f' % s)

## Label the axes, add the legend, and display the plot
ax.set_xlabel('x')
ax.set_ylabel('Density')
ax.set_title('Normal Distribution')
ax.legend(loc='best', frameon=True)
ax.set_ylim(0, 0.45)
ax.grid(True)
plt.show()

Looking at this plot, we can see the impact of the standard deviation σ on the distribution of a random variable more intuitively: the larger σ is, the more x deviates from the mean μ, the wider the range of values it takes, the greater its volatility, and the more spread out the distribution becomes.

Let’s use another everyday example to understand standard deviation. In a final exam, classes A and B both have an average score of 85 points. Class A’s scores range from 70 to 100, and the standard deviation calculated by the formula is 5 points; class B’s scores range from 50 to 100, with a standard deviation of 10 points. You can see that class A’s scores are spread over a narrower range, concentrated around 85 points, so its standard deviation is smaller.

Speaking of standard deviation, you should also think of another concept used to describe the degree of dispersion of random variables, which is variance. In fact, variance is the square of the standard deviation. Therefore, standard deviation σ and variance can be interchanged in representing the degree of dispersion.
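To make these three quantities concrete, here is a minimal Python sketch that computes the mean, standard deviation, and variance for two hypothetical sets of exam scores. The score lists are made up for illustration (they are not the actual class data described above), though both are constructed to average 85 points, with class B's scores more spread out.

import numpy as np

## Hypothetical exam scores for two classes (illustrative values only, both averaging 85)
scores_a = np.array([80, 82, 84, 85, 86, 88, 90])
scores_b = np.array([70, 75, 80, 85, 90, 95, 100])

for name, scores in [('Class A', scores_a), ('Class B', scores_b)]:
    mean = scores.mean()   # average score (μ)
    std = scores.std()     # standard deviation (σ); np.std divides by n, matching the formula above
    var = scores.var()     # variance = σ squared
    print(f'{name}: mean={mean:.1f}, std={std:.1f}, variance={var:.1f}')

Because variance is simply σ squared, either quantity tells the same story: the class with the wider spread of scores gets the larger standard deviation and the larger variance.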

With variance and standard deviation, we can describe the degree of dispersion of business indicators, but we still need one more step to calculate the range of fluctuations of business indicators (I will explain the specific calculation method in Lecture 4). This step is the z-score.

To explain the z-score, let’s introduce a special normal distribution, which is the standard normal distribution with the average μ=0 and the standard deviation σ=1.

The histogram of the standard normal distribution is shown below:

The horizontal axis here is the z-score, also called the standard score:

\[
z\text{-score} = \frac{x-\mu}{\sigma}
\]

In fact, any normal distribution can be standardized into a standard normal distribution. The process of standardization is to transform the random variable x into the z-score according to the above formula. Different values of the z-score represent how many standard deviations σ the value of x deviates from the average μ. For example, when the z-score is equal to 1, it means that the value deviates from the average by 1 standard deviation σ.
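As a quick illustration, here is a small Python sketch that standardizes a few values into z-scores. The mean and standard deviation used here are made-up numbers for demonstration only.

import numpy as np

## Hypothetical distribution parameters (for illustration only)
mu = 4.0      # mean μ
sigma = 0.5   # standard deviation σ

## A few raw values of x, e.g. daily usage hours
x = np.array([3.0, 4.0, 4.5, 5.5])

## Standardize: z-score = (x - μ) / σ
z = (x - mu) / sigma
print(z)      # [-2.  0.  1.  3.] -> how many σ each value is from the mean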

Let’s use an example of a business indicator in a social app to further strengthen our understanding of the normal distribution.

Now consider a social app for which we want to understand the probability distribution of users’ daily average usage time t. Based on existing data, we have plotted a histogram of the daily usage time of 10,000 users over one month:

It can be seen that the daily average usage time t of these 10,000 users falls mostly in the range of 3 to 5 hours, and the histogram has the bell shape characteristic of a normal distribution, indicating that the distribution of t can be approximated by a normal distribution.
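If you want to reproduce a chart like this yourself, here is a minimal simulation sketch: it draws 10,000 hypothetical daily usage times from a normal distribution centered at 4 hours (the mean and standard deviation are assumptions for illustration, not the app's real data) and plots their histogram.

import numpy as np
import matplotlib.pyplot as plt

## Simulate hypothetical daily usage times (hours) for 10,000 users
rng = np.random.default_rng(42)
t = rng.normal(loc=4.0, scale=0.5, size=10_000)   # assumed mean 4h, std 0.5h

## Plot the histogram; density=True puts the y-axis on a probability-density scale
fig, ax = plt.subplots()
ax.hist(t, bins=50, density=True)
ax.set_xlabel('Daily usage time t (hours)')
ax.set_ylabel('Density')
ax.set_title('Histogram of daily usage time')
plt.show()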

Central Limit Theorem #

This is actually a characteristic of mean-based metrics: when the sample size is large enough, their distribution tends toward a normal distribution. The theoretical basis behind this is the central limit theorem.

The mathematical proofs and reasoning processes of the central limit theorem are very complex, but don’t be afraid. We only need to understand its general principle: no matter what the probability distribution of the random variable is, as long as the sample size is large enough, the distribution of the average values of these samples will tend to be a normal distribution.

So, how large is this “large enough” sample size?

Conventionally, it is considered large enough when the sample size is more than 30. In the era of big data, our sample sizes can generally easily exceed this threshold, so mean-type indicators can be approximated as a normal distribution.
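To see the central limit theorem in action, here is a minimal simulation sketch: it repeatedly draws samples of size 30 from a clearly non-normal (exponential) distribution, computes each sample's mean, and plots a histogram of those means, which comes out approximately bell-shaped. The exponential distribution and its parameter are chosen purely for illustration.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

## Draw 5,000 samples of size 30 from a skewed (exponential) distribution
sample_size = 30
n_samples = 5_000
samples = rng.exponential(scale=2.0, size=(n_samples, sample_size))

## The means of the samples are approximately normally distributed
sample_means = samples.mean(axis=1)

fig, ax = plt.subplots()
ax.hist(sample_means, bins=50, density=True)
ax.set_xlabel('Sample mean')
ax.set_ylabel('Density')
ax.set_title('Distribution of sample means (n=30)')
plt.show()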

By now, we understand both the specific quantity required for “large enough” and what a normal distribution is. Next, we will learn about binomial distribution, and then we can understand why probability-type indicators can follow both a binomial distribution and a normal distribution.

Binomial Distribution #

For probability-based metrics, the user behavior in question has only two possible outcomes: either the event occurs or it doesn't. For example, the click-through rate describes the probability that users click on a specific piece of online content; a user either clicks or doesn't click, and there is no third outcome.

Such events with two possible outcomes are called binary events. Binary events are common in everyday life, such as when flipping a coin, where the possible outcomes are either heads or tails. Therefore, statistics has a specific term to describe the probability distribution of binary events, known as the binomial distribution.

Let’s continue with the example of a social app to learn more about the binomial distribution.

This social app has placed advertisements online to attract people to click on the ads and download the app. Now we want to see the distribution of app download rates based on data:

Download Rate = Number of users who downloaded the app through the advertisement / Number of users who saw the advertisement.

Since the outcome of a single binary event can only be either occur or not occur, with a probability of 100% or 0%, we need to aggregate the data to analyze download rates. Here, we’ll use minutes as an example, first calculating the download rate for each minute and then examining their probability distribution.

We have a month’s worth of ad-impression and download data, covering 43,200 minutes in total (60 * 24 * 30). Since we are interested in the per-minute download rate, we have 43,200 data points. Data analysis shows that, on average, 10 people see the advertisement each minute, and the per-minute download rates are mainly distributed between 0% and 30%.

The following chart shows the probability distribution of download rates per minute:

You might say that a probability is, in a sense, an average: the download rate here can be read as the “average number of downloads per user who sees the advertisement.” We already have 43,200 data points, far more than 30, so why doesn’t the distribution of download rates approximate a normal distribution, as the central limit theorem describes?

This is because, for the binomial distribution, the sample size that matters for the central limit theorem is the sample size used to calculate each probability. In the social app example, that sample size is 10, because on average only 10 people see the advertisement in any given minute, which falls short of the threshold of 30 mentioned above. Therefore, we need to increase this sample size to make the distribution of download rates approximately normal.

Increasing the sample size is straightforward. We can calculate the download rate per hour since on average, 600 people see the advertisement per hour. Thus, our sample size increases from 10 to 600. The following chart shows the probability distribution of download rates per hour:

Now, looking at this histogram, the distribution of download rates per hour does approximate a normal distribution, doesn’t it? The average download rate in the graph is approximately 10%.
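If you want to reproduce these two charts yourself, the sketch below simulates per-minute download rates (10 viewers per minute) and per-hour download rates (600 viewers per hour) from a binomial distribution with a 10% download probability, matching the numbers in this example. The per-minute histogram comes out visibly skewed, while the per-hour histogram looks approximately normal.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
p = 0.10          # average download probability (about 10%)

## Per-minute rates: 10 ad viewers per minute, over 43,200 minutes
rate_min = rng.binomial(n=10, p=p, size=43_200) / 10

## Per-hour rates: 600 ad viewers per hour, over 720 hours
rate_hr = rng.binomial(n=600, p=p, size=720) / 600

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(rate_min, bins=11, density=True)
ax1.set_title('Download rate per minute (n=10): skewed')
ax2.hist(rate_hr, bins=30, density=True)
ax2.set_title('Download rate per hour (n=600): ~normal')
for ax in (ax1, ax2):
    ax.set_xlabel('Download rate')
    ax.set_ylabel('Density')
plt.tight_layout()
plt.show()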

In the binomial distribution, there is an empirical formula derived from practice: min(np, n(1-p)) >= 5. Here, n represents the sample size and p represents the average probability.

This formula states that the smaller value between np or n(1-p) must be greater than or equal to 5. Only when the binomial distribution satisfies this formula can it be approximated as a normal distribution. This is a variant of the central limit theorem in the binomial distribution.

In our example, when calculating the probability distribution of download rates per minute, np = 10 * 10% = 1, which is less than 5, so it cannot be approximated as a normal distribution. When calculating the probability distribution of download rates per hour, np = 600 * 10% = 60, which is greater than or equal to 5, so it can be approximated as a normal distribution.

We can use this formula to quickly determine whether a probability-based metric can be approximated by a normal distribution. In practice, though, A/B tests usually have sample sizes large enough to easily satisfy this condition.
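To apply this rule of thumb quickly, here is a tiny helper function (the name is my own, not from any library):

def can_approximate_normal(n: int, p: float) -> bool:
    """Rule of thumb: a binomial(n, p) distribution is roughly normal
    when min(np, n(1-p)) >= 5."""
    return min(n * p, n * (1 - p)) >= 5

## The example from the text: per-minute vs. per-hour download rates at p = 10%
print(can_approximate_normal(10, 0.10))    # False: np = 1 < 5
print(can_approximate_normal(600, 0.10))   # True:  np = 60 >= 5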

Summary #

In today’s class, we primarily studied the prerequisites for A/B testing and hypothesis testing, which are the statistical properties of metrics. I have summarized them into one theorem, two distributions, and three concepts:

  1. One theorem: Central Limit Theorem.
  2. Two distributions: Normal distribution and Binomial distribution.
  3. Three concepts: Variance, standard deviation, and z-score.

There are many types of distributions for random variables in real life. Today, I focused on introducing the normal distribution and the binomial distribution, which correspond to the two most common types of business metrics: mean-based and probability-based.

Moreover, it’s important to know that with the Central Limit Theorem, we can approximate most of the metrics in business as a normal distribution. This is crucial because many important steps in A/B testing, such as calculating sample size and analyzing test results, are based on the assumption of a normal distribution for the metrics.

Additionally, you can use variance and standard deviation to understand the variability of business metrics. Combined with the z-score, you can calculate the range of fluctuation for business metrics. Only by understanding the range of fluctuation for metrics can we obtain more accurate test results.

In the next class, we will continue studying the statistical foundation of A/B testing, specifically hypothesis testing and related statistical concepts.

Thought-provoking question #

I was also confused when I first learned how the binomial distribution can be approximated by a normal distribution. Feel free to share any difficulties you ran into while learning the statistics behind A/B testing and how you resolved them.

Feel free to share your thoughts and ideas in the comments section. We can have a discussion and exchange ideas together. If you find it helpful, please share this course with your colleagues or friends so we can progress together!