04 Determination of Indicators_How to Choose from so Many Indicators #

Hello, I am Bowei.

In the last lesson, we learned several methods for determining evaluation metrics: quantitatively setting goals for different stages of the product or business, combining quantitative and qualitative methods, and borrowing from the experience of other companies in the industry. You may also have noticed that these methods are limited: they can only select a single evaluation metric, and they do not account for how the metric's volatility affects the accuracy of the results.

Today, we will go further to explore the methods for determining evaluation metrics in practical and complex business scenarios, as well as methods for calculating the volatility of metrics. Then, we will discuss how to determine guardrail metrics to ensure the reliability of A/B test results.

Establishing Overall Evaluation Criteria by Integrating Multiple Metrics #

In practical business scenarios, there are often multiple objectives, and even a single objective may have multiple important evaluation metrics that need to be considered in a comprehensive manner. For individual metrics, we can use the methods discussed in the previous lesson to determine their significance. However, when it comes to considering multiple metrics together, how should we proceed?

Let’s look at an example.

One important channel for Amazon to communicate with its users is email. Amazon has a platform specifically designed for sending emails to users, and it employs two methods to accurately target users:

  • Using users’ historical purchase data to model their personal preferences and sending recommendations generated by recommendation algorithms.
  • Amazon’s editorial team manually selects recommended products and sends them to users via email.

Once users are accurately targeted, Amazon faces another challenge: what metric should be used to measure the effectiveness of the emails?

You might think that sending emails to users is aimed at encouraging them to make purchases, so the revenue generated from the emails could be used as an evaluation metric.

In reality, that’s exactly what Amazon did initially. Their hypothesis was that sending more emails would generate additional revenue, and the evaluation metric was the revenue generated from the emails.

Based on this hypothesis, the following A/B test was designed:

  • In the experimental group, we send emails to users.
  • In the control group, we do not send any emails to users.

The result is predictable. The control group did not receive any emails, so there would be no revenue generated from emails. On the other hand, the users in the experimental group received many emails, resulting in significant revenue.

The fundamental reason for this result is that the metric itself is monotonically increasing. In other words, the more emails that are sent, the more users will click on them, and the more revenue will be generated from the emails. Therefore, if you want more revenue, you need to send more emails.

However, in reality, when users receive too many emails, they will consider them as spam and feel harassed. As a result, it affects their user experience, and they may choose to unsubscribe from the emails. Once a user unsubscribes, they will no longer receive any emails from Amazon in the future.

Using revenue generated from emails as an evaluation metric may optimize short-term revenue, but it fails to consider the long-term value of users. Once users unsubscribe because they feel harassed, Amazon loses the opportunity to market to them through email in the future. Therefore, revenue generated from emails is not a suitable evaluation metric. We need to weigh the benefits of sending emails against the potential losses, and integrate multiple metrics to establish an overall evaluation criterion (OEC).

So, how do we do that? We can calculate the OEC for each experimental/control group:

\[\mathrm{OEC}=\frac{\sum_{i}\text{Revenue}_{i}-S\times \text{Unsubscribe\_lifetime\_loss}}{n}\]

Let me explain this formula in more detail:

  • i represents each user.
  • S represents the number of unsubscribers in each group.
  • Unsubscribe_lifetime_loss represents the estimated loss caused by user unsubscribes.
  • n represents the sample size of each group.
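To make the formula concrete, here is a minimal Python sketch of how the OEC might be computed for one group. The function name, the per-user values, and the lifetime-loss figure are hypothetical placeholders for illustration, not Amazon's actual implementation.

```python
def group_oec(revenue_per_user, unsubscribed, unsubscribe_lifetime_loss):
    """Compute the OEC for one experimental/control group.

    revenue_per_user: email-attributed revenue, one entry per user i.
    unsubscribed: one boolean per user, True if that user unsubscribed.
    unsubscribe_lifetime_loss: estimated long-term loss per unsubscribe.
    """
    n = len(revenue_per_user)              # sample size of the group
    total_revenue = sum(revenue_per_user)  # sum of Revenue_i over all users
    s = sum(unsubscribed)                  # S: number of unsubscribers
    return (total_revenue - s * unsubscribe_lifetime_loss) / n

# Hypothetical numbers, for illustration only.
treatment_oec = group_oec([3.2, 0.0, 1.5, 0.0], [False, True, False, False], 20.0)
control_oec = group_oec([0.0, 0.0, 0.0, 0.0], [False, False, False, False], 20.0)
print(treatment_oec, control_oec)
```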

When Amazon implemented the OEC, they were surprised to find that more than half of the emails had a negative OEC, which indicates that sending more emails doesn’t always lead to positive returns.

After realizing the significant long-term losses caused by unsubscribes, Amazon improved its unsubscribe page by providing options to unsubscribe from specific categories of emails instead of unsubscribing from all Amazon emails. For example, users can choose to only unsubscribe from Amazon books, thus avoiding unsubscribing from all emails and reducing potential long-term losses.

From this analysis, we can see that when evaluating a multifaceted matter, it is crucial to consider multiple metrics in order to grasp the overall situation. This is the most obvious advantage of using an OEC. The most common type of OEC is one that combines the potential benefits and losses resulting from a change, just as Amazon did. It is important to note that the “loss” here could itself be a guardrail metric, which means an OEC may include guardrail metrics.

Furthermore, another benefit of using OEC is that it helps avoid the problem of multiple testing. If we compare different metrics individually without weighting them together, we may encounter the issue of multiple testing, which leads to inaccurate results in A/B testing. The multiple testing problem is a very common pitfall in A/B testing, and I will explain it in detail in the advanced section.

Having addressed the limitation of a single evaluation metric in complex A/B testing scenarios, we can move on to the final key point about evaluation metrics: volatility. In real business scenarios, the values of evaluation metrics fluctuate due to various factors, and ignoring this fact could lead to incorrect conclusions from the tests.

How to Measure the Volatility of Evaluation Indicators? #

Do you remember the example of the music app with the “add autoplay feature” from our last class?

Suppose that before this music app had the autoplay feature, the fluctuation range of the monthly user retention rate was [65%-70%]. In our A/B test, we found that the retention rate in the experimental group (with autoplay feature) was 69%, which is indeed higher than the retention rate of 66% in the control group (without autoplay feature).

So, is this result reliable? Did we achieve the goal of the A/B test? The answer is obviously no.

Although the data in the experimental group is better than that in the control group, all of these data are within the normal range of fluctuations. Therefore, the causal relationship between “adding autoplay feature” and “improving retention rate” was not established in this experiment, because the difference between these two groups of indicators may be just a normal fluctuation. However, if we do not know the volatility of the evaluation indicators and the normal range of fluctuations in advance, we may establish a wrong causal relationship.

So, how can we determine the normal range of fluctuations for evaluation indicators?

In statistics, the volatility of an indicator is usually characterized by its mean and standard deviation: the larger the standard deviation of the mean, the greater the volatility. Note that the standard deviation of the mean is also called the standard error. For the concept of standard deviation, you can review the statistics basics in the first lesson.

The normal range of fluctuations for evaluation indicators is called the confidence interval. So how do we calculate it?

In practice, there are two main methods to calculate the range of fluctuations: statistical formulas and practical experience.

First, calculate based on statistical formulas.

In statistics, the confidence interval is generally constructed using the following formula:

Confidence Interval = Sample Mean ± Z-score * Standard Error

According to the Central Limit Theorem, when the sample size is large enough, the sample mean approximately follows a normal distribution, which is why we use the z-score here. In general, we use a 95% confidence interval, which corresponds to a z-score of 1.96.

To illustrate the confidence interval concretely, let’s assume that the sample mean of an indicator is 50 and the standard error is 0.1, with the data following a normal distribution. Then the 95% confidence interval of this indicator would be [50 − 1.96 × 0.1, 50 + 1.96 × 0.1] ≈ [49.8, 50.2].
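As a quick check of the arithmetic, here is a small Python sketch that builds a 95% confidence interval from a given sample mean and standard error (the 1.96 z-score is hard-coded for the 95% level; the numbers simply reproduce the example above):

```python
def confidence_interval_95(sample_mean, standard_error):
    """95% confidence interval: sample mean +/- 1.96 * standard error."""
    z = 1.96
    return sample_mean - z * standard_error, sample_mean + z * standard_error

# Example from the text: sample mean = 50, standard error = 0.1.
low, high = confidence_interval_95(50, 0.1)
print(round(low, 1), round(high, 1))  # 49.8 50.2
```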

You may notice that when I used the above formula to calculate the confidence interval, I assumed a standard error. In practice, we need to calculate the standard error ourselves. Furthermore, calculating the standard error is a crucial step.

For simple indicators, mainly probability and mean-related ones, we can use statistical formulas to calculate the standard error.

For probability-related indicators, common ones include click-through rate, conversion rate, and purchase rate, etc.

These indicators generally follow a binomial distribution in statistics. In the case of a sufficiently large sample size, they can also be approximated to a normal distribution (concerning binomial distribution and normal distribution, you can review the relevant content in the first lesson).

Therefore, we can calculate the standard error of probability indicators using the following formula:

\[\text{Standard Error}=\sqrt{\frac{p(1-p)}{n}}\]

where p represents the probability of the event occurring and n is the sample size.
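A minimal Python sketch of this formula, using a hypothetical click-through example (the counts are made up for illustration):

```python
import math

def proportion_standard_error(successes, n):
    """Standard error of a probability-type indicator, e.g., click-through rate."""
    p = successes / n                  # observed probability of the event
    return math.sqrt(p * (1 - p) / n)

# Hypothetical example: 1,200 clicks out of 10,000 impressions.
print(proportion_standard_error(1200, 10_000))  # roughly 0.0032
```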

For mean-related indicators, common ones include average user usage time, average purchase amount, average purchase frequency, etc. According to the Central Limit Theorem, these indicators usually follow a normal distribution.

Therefore, we can calculate the standard error of mean-related indicators as follows:

\[\text{Standard Error}=\sqrt{\frac{s^{2}}{n}}=\sqrt{\frac{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}{n(n-1)}}\]

where s represents the sample standard deviation, n is the sample size, \(x_{i}\) represents the usage time or purchase amount, etc., of the i-th user, and \(\bar{x}\) represents the corresponding average.
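And a matching sketch for a mean-type indicator, again with made-up per-user usage times:

```python
import math
import statistics

def mean_standard_error(values):
    """Standard error of a mean-type indicator, e.g., average usage time."""
    n = len(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return s / math.sqrt(n)       # equivalent to sqrt(s^2 / n)

# Hypothetical per-user usage times in minutes.
usage_minutes = [12.0, 35.5, 8.2, 20.1, 27.4, 15.3]
print(mean_standard_error(usage_minutes))
```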

Second, determine based on practical experience.

In practical applications, some complex indicators may not follow a normal distribution or we may not know their distribution at all, making it difficult or even impossible to find the corresponding statistical formulas for calculation. In such cases, to obtain the volatility range of the evaluation indicators, we need to estimate it based on practical experience.

1. A/A test

We can run multiple A/A tests on samples of different sizes, calculate the indicator for each sample, and then sort the results in ascending order. After removing the smallest 2.5% and the largest 2.5% of the values, we get the 95% confidence interval.
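The procedure can be sketched roughly as follows. For simplicity, each A/A “test” is simulated by drawing a fresh sample of a fixed size from historical per-user values, which stands in for actually re-running the experiment; the function name and the numbers are illustrative only.

```python
import random

def aa_confidence_interval(metric_values, num_tests=1000, sample_size=500):
    """Empirical 95% CI from repeatedly simulated A/A tests.

    metric_values: historical per-user values of the indicator.
    Each iteration draws one A/A sample (without replacement) and records
    its mean; the middle 95% of those means forms the confidence interval.
    """
    means = []
    for _ in range(num_tests):
        sample = random.sample(metric_values, sample_size)  # one simulated A/A test
        means.append(sum(sample) / sample_size)
    means.sort()
    lower = means[int(0.025 * num_tests)]       # drop the smallest 2.5%
    upper = means[int(0.975 * num_tests) - 1]   # drop the largest 2.5%
    return lower, upper

# Hypothetical historical data: per-user purchase amounts.
history = [random.gauss(50, 10) for _ in range(100_000)]
print(aa_confidence_interval(history))
```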

2. Bootstrapping algorithm

We can first run a large sample A/A test, and then perform random sampling with replacement within this large sample. We extract samples of different sizes to calculate the indicators separately. Then, using the same process as the A/A test mentioned before, we arrange these indicators in ascending order. After removing the smallest 2.5% and largest 2.5% values, we obtain the 95% confidence interval.
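A minimal bootstrapping sketch along the same lines, where `aa_data` stands for the per-user values collected in the single large A/A test; the key difference is that sampling is done with replacement.

```python
import random

def bootstrap_confidence_interval(aa_data, num_resamples=1000, sample_size=500):
    """Empirical 95% CI by bootstrapping a single large A/A test."""
    means = []
    for _ in range(num_resamples):
        resample = random.choices(aa_data, k=sample_size)  # sampling WITH replacement
        means.append(sum(resample) / sample_size)
    means.sort()
    return (means[int(0.025 * num_resamples)],       # drop the smallest 2.5%
            means[int(0.975 * num_resamples) - 1])   # drop the largest 2.5%

# Hypothetical data from one large A/A test: per-user purchase amounts.
aa_data = [random.gauss(50, 10) for _ in range(100_000)]
print(bootstrap_confidence_interval(aa_data))
```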

In practical applications, Bootstrapping is more popular because it only requires running one A/A test, saving time and resources.

However, it is worth noting that even for simple indicators that follow a normal distribution and whose variance can be computed directly with statistical formulas, if you have the conditions and the time, I still recommend calculating the variance with both the statistical formula and Bootstrapping. If the two results differ significantly, more A/A tests need to be run. It is simply safer to verify the results from both angles.

With this, we have covered the methods for selecting evaluation indicators and the potential pitfalls around their volatility. Next, we will move on to the final part: selecting guardrail indicators to provide quality assurance for A/B testing.

Guardrail Metrics #

An A/B test usually changes only a specific part of the business, which the evaluation metrics capture, so we tend to focus on short-term changes and lose sight of the bigger picture of the business (such as long-term profitability and user experience) or of statistical reliability checks. Therefore, in practice, I recommend that every A/B test have corresponding guardrail metrics.

Next, let’s learn how to select guardrail metrics from two dimensions: business quality and statistical quality.

Business Quality Level #

At the business level, guardrail metrics ensure both user experience and profitability/user engagement. Therefore, the guardrail metrics we usually use are mainly three: latency, crash rate, and per capita metrics.

  1. Latency

Web page loading time, app response time, etc., are all guardrail metrics that represent latency. Adding product functionalities may increase the response time of web pages or apps, which can be perceived by users. At this time, it is necessary to add guardrail metrics that represent latency in A/B testing to minimize the impact on user experience while adding product functionalities (usually by optimizing the underlying code).

  2. Crash Rate

For different applications, whether on personal computers or mobile devices, crashes may occur due to CPU, memory, or other reasons, resulting in the sudden shutdown of the program.

Speaking of which, let me share an interesting story with you. When I was using MS Word to write the content of this lesson, the software crashed. The key was that I hadn’t saved it at that time, and I thought that my hours of effort would be wasted, making me very frustrated. Fortunately, MS Word has an auto-save feature.

You see, although the probability of crashes is not high, it seriously affects the user experience. Therefore, when testing new features of an application, especially for some major changes, crash rate is a good guardrail metric.

  3. Per Capita Metrics

Per capita metrics can be considered from two perspectives:

  • Revenue Perspective , such as per capita spending, per capita profit, etc.
  • User Engagement Perspective , such as per capita usage time, per capita usage frequency, etc.

Both perspectives are usually the goals pursued in actual business. Revenue perspective represents the profitability of products, and user engagement perspective represents user satisfaction. However, in specific A/B testing, we often only focus on the tested part of the product’s functionality, neglecting the overall situation.

For example, after optimizing the recommendation algorithm of an app store, the recommended content is more tailored to users’ preferences, which increases the click-through rate of recommended content. We focus on the evaluation metric of click-through rate increasing, and everyone is happy, right? Not really, because after analysis, we found that the proportion of free apps in the recommended content of this new algorithm has increased, which has led to a decrease in per capita spending and thus affected the overall revenue of the app store.

At this time, we can use per capita revenue as a guardrail metric and continue to optimize the recommendation algorithm.

Statistical Quality Level #

At the statistical level, the main goal is to eliminate bias as much as possible and make the experimental group and the control group as similar as possible, for example by checking the ratio of the sample sizes in the two groups and checking whether the distributions of features in the two groups are similar.

There are many reasons for bias, such as bugs in the random-grouping algorithm, insufficient sample size, or data delays in triggering the experimental conditions. However, most biases are caused by engineering problems in the specific implementation. These biases prevent us from obtaining accurate experimental results, and guardrail metrics are the tool we use to discover them!

1. Proportion of Sample Sizes in Experimental/Control Groups

When designing an A/B test, we pre-allocate the experimental group and the control group, usually with equal sample sizes. In other words, the expected ratio of the experimental group's sample size to the control group's is 1:1, i.e., the ratio equals 1. But sometimes, after the experiment, we find that the ratio is not equal to 1, or not even close to 1. This indicates that something went wrong during the implementation of the experiment, introducing bias between the experimental group and the control group.
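One common way to check this, sketched below, is a chi-square goodness-of-fit test of the observed group sizes against the expected 1:1 split; the counts are hypothetical, and a very small p-value signals a sample ratio mismatch.

```python
from scipy.stats import chisquare

# Hypothetical observed sample sizes after the experiment.
n_treatment, n_control = 50_912, 49_088
total = n_treatment + n_control

# Expected counts under the designed 1:1 split.
stat, p_value = chisquare(f_obs=[n_treatment, n_control], f_exp=[total / 2, total / 2])

# A very small p-value (e.g., below 0.001) suggests the ratio deviates from 1:1
# by more than chance alone, i.e., something went wrong in the assignment.
print(p_value)
```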

2. Distribution of Features in Experimental/Control Groups

A/B testing generally adopts random grouping to ensure that the two groups of experimental subjects are similar, so as to control other variables and only change the unique variable we care about (i.e., the cause of the A/B test).

For example, if users are treated as experimental units, when analyzing the data of the two groups after the experiment, the distribution of basic information such as age, gender, and location should be similar in the two groups, so as to eliminate biases. Otherwise, the experiment itself is problematic, and the results obtained are not reliable.
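For a categorical feature such as location, this check could be sketched as a chi-square test of independence between group assignment and the feature; the counts below are made up, and here a large p-value is the desired outcome.

```python
from scipy.stats import chi2_contingency

# Hypothetical user counts by location in each group.
#                   City A  City B  City C
treatment_counts = [4_020,  3_480,  2_500]
control_counts   = [3_980,  3_510,  2_490]

stat, p_value, dof, expected = chi2_contingency([treatment_counts, control_counts])

# Here a large p-value is what we want: the location distribution looks
# similar across the two groups, so this check passes.
print(p_value)
```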

Summary #

Today, we learned how to select evaluation indicators in complex business scenarios, how to handle their volatility, and how to select guardrail indicators.

  1. In cases where there are multiple indicators, we can combine them together to establish overall evaluation criteria, known as OEC. One thing to note here is that the units and magnitudes of different indicators may not be on the same scale. We need to normalize each indicator first so that their values fall within a certain range, such as [0, 1]. After that, we can combine them to eliminate the influence of indicator units/magnitudes.

  2. The normal fluctuation range of evaluation indicators is known as the confidence interval. Calculating the confidence interval is a key point. For indicators with complex distributions, I recommend using bootstrapping to calculate it. For probability or mean-based indicators that follow binomial or normal distributions, it is advisable to calculate it using both statistical formulas and bootstrapping.

  3. When selecting guardrail indicators in practice, we mainly consider two dimensions: business quality and statistical quality. The selectable guardrail indicators include network latency, crash rate, per capita indicators, the ratio of sample sizes between experimental and control groups, and the distribution of features in the experimental and control groups, etc.

Thought provoking question #

Have the A/B tests you have run at work included guardrail indicators? If so, which specific indicators were they, and what purpose did they serve?

Feel free to share your thoughts and ideas in the comments. We can discuss and learn together. If you find it helpful, please feel free to share this course with your colleagues or friends to progress together!