05 Selection of Experimental Units: What Kind of Experimental Units Are Appropriate?

Hello, I am Bowei.

In the previous lesson, we determined the goals, hypotheses, and various metrics of the experiment. Today, let’s talk about the third step of A/B testing: how to select appropriate experimental units.

As mentioned earlier, A/B testing is essentially a controlled experiment. And since it’s an experiment, we need experimental units. After all, only by determining the experimental units can we allocate samples (Assignment) at this unit level and decide which samples are in the treatment/test group and which samples are in the control group.

Speaking of experimental units, you may ask, what’s so difficult to understand about this? Aren’t the experimental units just users?

In fact, this is a very common misconception. Except when we are testing system performance, the experimental unit in the vast majority of cases is, strictly speaking, a user’s behavior rather than the user. That is because the changes we make to products, marketing, and business operations are ultimately meant to show whether user behavior changes in response.

So here’s the question: many units can represent user behavior. Should the unit be the user, each visit the user makes, or each page the user views?

In this lesson, we will learn about the commonly used experimental units and the three principles for selecting experimental units in practice.

What are the experimental units? #

Although many experimental units can characterize user behavior, they generally fall into three levels: user level, visit level, and page level.

User Level #

The user level refers to dividing the experimental group and the control group based on individual users as the smallest unit.
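To make user-level assignment concrete, here is a minimal sketch of one common way to do it: hash the unit ID and turn the hash into a bucket. The function name `assign_group`, the experiment label, and the 50/50 split are my own assumptions for illustration, not something prescribed by this lesson.

```python
import hashlib

def assign_group(unit_id: str, experiment: str = "demo_experiment",
                 treatment_share: float = 0.5) -> str:
    """Deterministically map a unit ID to 'treatment' or 'control'.

    The experiment name is mixed into the hash so that different
    experiments split the same IDs differently (an illustrative choice).
    """
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode("utf-8")).hexdigest()
    # Use the first 32 bits of the hash as a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"

# The same user ID always lands in the same group, however often we ask.
print(assign_group("user_42"), assign_group("user_42"))
```

Because the group depends only on the ID, a user keeps the same group across visits and pages as long as the ID itself stays stable.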

So, what specific data is included at the user level? In fact, there are mainly four types of IDs.

The first type is the user ID, which is the username, mobile number, email address, etc. used when the user registers or logs in.

This type of ID contains personal information and is stable. It does not change with the operating system and platform. The user ID generally corresponds one-to-one to the real user and is the most accurate ID representing the user.

The second type is the anonymous ID, which is generally the cookies when the user is browsing web pages.

Cookies are randomly generated when a user visits a web page and do not require registration or login. Note that cookies generated on iOS and Android are valid only within that operating system, and a cookie is also tied to the device and browser used for browsing. Cookies generally contain no personal information and can be deleted, so they identify a user less accurately than the user ID does.

The third type is the device ID, which is bound to the device and cannot be changed once the device is manufactured. Although the device ID cannot be erased, it cannot tell users apart when an Internet-connected device is shared with family or friends. Therefore, the accuracy of the device ID is lower than that of the user ID.

The fourth type is the IP address, which is related to the actual geographical location and the network being used.

Even if the same user browses the Internet on the same device, the IP address will differ from place to place. In addition, with large Internet service providers, many users often share the same IP address. Therefore, the IP address is the least accurate of the four and is generally considered only when the user ID, anonymous ID, and device ID are all unavailable.

These are the four experimental units at the user level. Their order of accuracy, from highest to lowest, is:

User ID > Anonymous ID (Cookies) / Device ID > IP address.

Why do I emphasize the accuracy of these four types of IDs? Because the higher the accuracy of the experimental unit, the higher the accuracy of the A/B test results.

Therefore, when we determine the user level as the experimental unit, if there is a user ID in the data, the user ID should be prioritized. If there is no user ID in the data, for example, if the user does not register and log in due to privacy concerns, or if the functionality of the tested web page does not require user registration and login, then the anonymous ID or device ID can be used. When none of these IDs are available, the IP address with the lowest accuracy can be considered.
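The fallback order described above can be sketched as a small lookup. The field names (`user_id`, `anonymous_id`, `device_id`, `ip_address`) are hypothetical, and since the lesson ranks the anonymous ID and the device ID as equals, checking the anonymous ID first is an arbitrary choice on my part.

```python
from typing import Optional, Tuple

# Priority from the lesson: user ID, then anonymous ID (cookie) or device ID,
# then IP address as the last resort. Field names are assumptions.
ID_PRIORITY = ("user_id", "anonymous_id", "device_id", "ip_address")

def pick_unit_id(record: dict) -> Optional[Tuple[str, str]]:
    """Return (id_type, id_value) for the most accurate ID present, or None."""
    for field in ID_PRIORITY:
        value = record.get(field)
        if value:  # skip missing, None, and empty values
            return field, value
    return None

logged_in = {"user_id": "u1", "ip_address": "203.0.113.7"}
anonymous = {"user_id": None, "device_id": "d9", "ip_address": "203.0.113.7"}
print(pick_unit_id(logged_in))
print(pick_unit_id(anonymous))
```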

Visit Level #

The visit level refers to considering each user’s visit as the smallest unit.

When we visit a website or app, there is a backend system that records our visit actions. So, how do we define the start and end of a visit?

The start of a visit is easy to understand: it is the moment we enter the website or app. The difficulty lies in defining the end of a visit. During a visit, we may click through different pages, scroll up and down or left and right, and then exit. It is also possible that we visit only briefly with little interaction, or switch to other pages or apps without ever exiting.

Therefore, considering the complexity of user visits, generally, if a user has no activity within 30 minutes on a website or app, the system considers that the visit has ended.
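As a rough illustration of the 30-minute rule, the sketch below splits one user’s event timestamps into visits. The function name and the sample timestamps are made up, and treating a gap of exactly 30 minutes as the start of a new visit is my assumption, because the rule above does not pin down the boundary case.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)

def split_into_visits(timestamps):
    """Group one user's event timestamps into visits by inactivity gap."""
    visits = []
    for ts in sorted(timestamps):
        # Continue the current visit only if the gap since the last event
        # is shorter than the timeout; otherwise start a new visit.
        if visits and ts - visits[-1][-1] < SESSION_TIMEOUT:
            visits[-1].append(ts)
        else:
            visits.append([ts])
    return visits

events = [
    datetime(2024, 1, 1, 9, 0),
    datetime(2024, 1, 1, 9, 10),
    datetime(2024, 1, 1, 9, 45),   # 35 minutes after the previous event
    datetime(2024, 1, 1, 10, 0),
    datetime(2024, 1, 1, 15, 0),   # hours later
]
visits = split_into_visits(events)
print(len(visits))  # 3 visits
```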

If a user visits frequently, there will be many different visit IDs. In A/B testing, if the visit level is chosen as the experimental unit, there may be a problem where a user appears in both the experimental group and the control group.

For example, if I visited the Geek Time app both today and yesterday, I would have two visit IDs. If the visit ID is chosen as the experimental unit, I may simultaneously belong to both the control group and the experimental group.
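How likely is this overlap? Here is a small calculation, assuming each visit is assigned independently with a fixed probability of landing in the experimental group (that independence is my simplifying assumption). A user sees both groups unless every one of their visits lands in the same group.

```python
def prob_user_sees_both_groups(num_visits: int, treatment_share: float = 0.5) -> float:
    """Probability that a user's visits are spread over both groups."""
    p = treatment_share
    all_treatment = p ** num_visits
    all_control = (1.0 - p) ** num_visits
    return 1.0 - all_treatment - all_control

for k in (1, 2, 5):
    print(k, prob_user_sees_both_groups(k))
# 1 visit  -> 0.0
# 2 visits -> 0.5
# 5 visits -> 0.9375
```

Under a 50/50 split, a user with only five visits already has a better than 90% chance of ending up in both groups.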

Page Level #

The page level refers to considering each new page view as the smallest unit.

The keyword here is “new”: even if the same person views the same page at different times, each view counts as a different page view. For example, suppose I first visit the homepage of Geek Time, then click into a column, and finally return to the homepage. If the page view ID is chosen as the experimental unit, the two homepage views get different page view IDs, so one may be assigned to the experimental group and the other to the control group.

At this point, we can compare and understand these three levels.

  1. Visit-level and page-level experimental units are better suited to A/B tests whose changes users do not easily notice, such as algorithm improvements or the effects of different advertisements. If users can easily notice the change, it is recommended to choose user-level experimental units.
  2. From the user level to the visit level and then to the page level, the granularity of the experimental unit becomes finer, so a larger sample size can be obtained. The reason is simple: one user can have multiple visits, and one visit can include multiple page views.
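The second point is easy to check on a log. Below is a minimal sketch that counts distinct units at each level; the six log rows are made up for illustration.

```python
# Each row is (user_id, visit_id, page_view_id); the data is invented.
page_view_log = [
    ("u1", "v1", "p1"),
    ("u1", "v1", "p2"),
    ("u1", "v2", "p3"),
    ("u2", "v3", "p4"),
    ("u2", "v3", "p5"),
    ("u2", "v3", "p6"),
]

def units_per_level(rows):
    """Count distinct experimental units at the user, visit, and page level."""
    return {
        "user": len({user for user, _, _ in rows}),
        "visit": len({visit for _, visit, _ in rows}),
        "page": len({page for _, _, page in rows}),
    }

counts = units_per_level(page_view_log)
print(counts)  # {'user': 2, 'visit': 3, 'page': 6}
```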

By now, you may feel overwhelmed by this information. With so many units, how do you choose the appropriate experimental units in practice? Don’t worry, next, I will take you through a specific case of “increasing user retention rate by adding product features to a video app” to help you choose the suitable experimental units step by step.

A Case Study: How to Choose Experimental Units? #

A video app recently received a lot of user feedback, and a significant number of users expressed the desire to be able to watch videos even without a network connection or in areas with poor network connectivity. Consequently, the product manager wants to add an “offline download” feature to improve user retention.

Now, the product manager wants to conduct an A/B test to see if adding the “offline download” feature really enhances user retention. So, how should they select the experimental units?

If they use user-level IDs as experimental units (i.e., grouping each user as the smallest unit), it may not be possible to collect a sufficient sample size due to time constraints. Therefore, they need to find more granular experimental units to generate a larger sample size. Hence, they can choose either the visit level or the page level as experimental units.

Looking at the data, the data analyst finds that there are visit IDs but no page view IDs. They therefore select the visit level and treat each visit as the smallest unit, since one user can have multiple visits.

With this choice, the sample size becomes sufficient. However, after analyzing the experimental results, they find that the retention rate of the experimental group not only failed to increase but actually decreased compared with the control group.

This is quite strange. Could it be that the “offline download” feature caused a deterioration in user experience? Doesn’t this contradict the feedback from users before?

Thus, they conduct further interviews and research with these users, and the conclusion they obtain is indeed that the user experience has worsened. However, this is not because users dislike the addition of the new feature. So, what went wrong?

In fact, the problem lies in choosing inappropriate experimental units. In the previous experiment, they divided users into different groups based on each individual visit, causing the same user to be assigned to different groups due to multiple visits.

Therefore, when users were in the experimental group, they could use the new feature, but when they were assigned to the control group, they found that the new feature was missing, leaving them confused. It’s similar to suddenly losing a very useful feature that you were using yesterday—wouldn’t that be frustrating?

Therefore, when the business change is noticeable to users, I would recommend choosing user-level as the experimental unit.

In this case, if the sample size is insufficient, we should explain to the business side that the test needs more time, rather than switch to a finer-grained unit. If the business cannot be persuaded to extend the testing period, we need other ways to make up for the shortfall, such as allocating a larger share of overall traffic to this A/B test or using evaluation metrics with lower variability (I will discuss these methods in Lesson 9).
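To see why a larger traffic share helps, here is a back-of-the-envelope sketch. The numbers (100,000 required users, 10,000 new eligible users per day) are invented, and the calculation assumes every day brings the same number of users who have not been counted before, which real traffic will not do exactly.

```python
import math

def days_to_reach_sample(required_units: int, daily_units: int,
                         traffic_percent: int) -> int:
    """Days needed when only traffic_percent of daily units enter the test."""
    if not 0 < traffic_percent <= 100:
        raise ValueError("traffic_percent must be between 1 and 100")
    units_per_day = daily_units * traffic_percent / 100
    return math.ceil(required_units / units_per_day)

print(days_to_reach_sample(100_000, 10_000, 10))  # 100 days at 10% of traffic
print(days_to_reach_sample(100_000, 10_000, 50))  # 20 days at 50% of traffic
```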

Looking back at this case study, can we extract some key experiences and pitfalls in choosing experimental units? Yes, we can summarize them into three principles:

  1. Ensure the coherence of the user experience.
  2. The experimental units should align with the units of evaluation metrics.
  3. The sample size should be as large as possible.

By mastering these three principles, you will be able to choose the best experimental units based on the specific circumstances!

Three principles for determining experimental units #

1. Ensuring the coherence of user experience

Ensuring that users have the best experience is almost a goal for all products, and the coherence of user experience is particularly important. The example of video apps tells us that if the changes in the A/B test are perceivable by the users, then the experimental units should be chosen at the user level.

Otherwise, if the same user appears in both the experimental group and the control group, they will experience different features and have different experiences. This incoherence in experience can confuse and frustrate users, easily leading to user attrition.

2. Keeping the experimental units consistent with the units of evaluation metrics

Why is this important? We need to understand from a statistical perspective.

One prerequisite for A/B testing is that the experimental units are independent and identically distributed (IID). If the experimental unit differs from the unit of the evaluation metric, the independence assumption is violated, the theoretical foundation of A/B testing is undermined, and the experimental results become inaccurate.

Let’s take an example. Suppose we use A/B testing to evaluate how effective a music app’s push notifications for new albums are, and the evaluation metric is the percentage of users who listen to the new albums (number of users who listen to the new albums / number of users who receive the push notifications). This metric is defined at the user level, so the experimental units should generally be at the user level as well.

Now suppose we change the experimental unit to a view of the new album page. Each user can browse this page multiple times, so the page views that come from the same user are not independent of one another, and the IID assumption is violated. The experimental results therefore become inaccurate.
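A toy calculation can show what goes wrong when correlated page views are treated as independent. The data below is invented and deliberately extreme: each of four users views the page ten times and behaves the same way on every view. I compare the standard error computed over page views with the one computed over users.

```python
import math
import statistics

# Invented data: 1 = listened after this page view, 0 = did not.
views_by_user = {
    "u1": [1] * 10,
    "u2": [1] * 10,
    "u3": [0] * 10,
    "u4": [0] * 10,
}

def standard_error(values):
    """Standard error of the mean, assuming the values are independent."""
    return statistics.stdev(values) / math.sqrt(len(values))

all_views = [v for views in views_by_user.values() for v in views]
per_user_rate = [statistics.mean(views) for views in views_by_user.values()]

naive_se = standard_error(all_views)     # pretends there are 40 independent units
user_se = standard_error(per_user_rate)  # uses the 4 users as the units

print(round(naive_se, 3), round(user_se, 3))
```

The page-view-level figure comes out at roughly 0.08, while the user-level figure is roughly 0.29. Treating the 40 views as independent makes the estimate look far more precise than four users can justify, which is one way inaccurate conclusions arise.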

Hence, when selecting experimental units, remember that the experimental units in an A/B test should be consistent with the units of the evaluation metrics.

3. Increasing the sample size as much as possible

In A/B testing, the larger the sample size, the more accurate the experimental results. However, there are many ways to increase the sample size, and we should never choose finer-grained experimental units solely for the purpose of obtaining more samples, without considering the first two principles.

Therefore, the third principle for selecting experimental units is: as long as the coherence of user experience is ensured and the experimental units stay consistent with the units of the evaluation metrics, you can choose finer-grained experimental units to increase the sample size as much as possible.

Now we have finished explaining the three principles. Let me summarize for you: the first two principles must be considered and fulfilled, while the third principle is an additional consideration that can be taken under certain conditions.

Summary #

In this lesson, I have explained in detail the commonly used experimental units and their applicable scope in practice. I have also summarized the main factors to consider when selecting different units based on my practical experience, helping you truly understand and master the logic behind them, thereby assisting you in making correct judgments in future practices.

I have also provided a simplified decision tree for you to review and memorize:

In practice, the two most important points to consider are the coherence of user experience and the consistency between experimental units and evaluation metric units. After all, the customer is king, and maintaining a good user experience matters for every business and product. Therefore, for changes that are visible to users (such as UI improvements), most experiments use the user (user ID/anonymous ID/device ID) as the smallest experimental unit and also as the unit for evaluation metrics.

If you want a larger sample size and the changes in the A/B test are not easily noticeable to users (such as improvements in recommendation algorithms), you can use finer-grained visits or page views as experimental units. At the same time, it is important to keep the evaluation metrics consistent with the experimental units.
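The two paragraphs above can be condensed into a few lines of code. This is only my reading of the rules stated in this lesson, written as a hypothetical helper; it is not a reproduction of the decision tree itself.

```python
LEVELS = ("user", "visit", "page")

def choose_experimental_unit(change_visible_to_users: bool, metric_unit: str) -> str:
    """Suggest an experimental unit level from the lesson's two main rules.

    change_visible_to_users: can users notice the change (e.g. a UI change)?
    metric_unit: level at which the evaluation metric is defined.
    """
    if metric_unit not in LEVELS:
        raise ValueError(f"metric_unit must be one of {LEVELS}")
    if change_visible_to_users:
        # Coherent experience comes first; the metric should then also be
        # defined at the user level.
        return "user"
    # The change is not noticeable, so match the unit of the metric, which
    # may be a finer-grained visit or page view.
    return metric_unit

print(choose_experimental_unit(True, "visit"))   # user
print(choose_experimental_unit(False, "page"))   # page
```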

Thought Questions #

Do you always conduct A/B testing on a user basis? After completing this lesson, you can take a moment to reflect and consider if some A/B testing can be done using other units. Why?

Feel free to share your thoughts and ideas in the comments section, and let’s discuss them. If you have gained something from this lesson, you are also welcome to share today’s content with your friends so you can make progress together!