042 How to Evaluate the Online Performance of Search Systems #

In my previous two articles, I introduced offline evaluation metrics based on the principles of “binary relevance” and “graded relevance.” Using these metrics, researchers have developed generations of search systems over the past half century, and both the metrics and the systems have continued to evolve.

Although the metrics we have discussed are all very informative, most of them were proposed for offline static datasets rather than for measuring real interactions between users and systems (researchers have since applied these evaluation tools directly to online evaluation, which has caused some problems in practice). So, what methods are available for evaluating the online performance of a search system?

To answer this question, let’s explore a few topics related to conducting online evaluations today.

Online Controlled Experiments #

Let’s first go back to the original motivation behind these evaluation criteria. Why do we need offline testing at all?

The first reason is that during the early development of information retrieval systems (the precursors of today’s search systems), it was difficult to conduct online controlled experiments: developers had not yet built reliable means of capturing user behavior. In that era, the more dependable methods were user surveys and, later, offline static evaluation. These methods served as “proxies” for truly understanding user behavior.

Another reason to conduct evaluations, whether offline or online, is that we need a way to distinguish the quality of two systems, so that we can keep improving them through data-driven iteration.

The tool that lets us correctly observe the differences between two systems is the “online controlled experiment,” sometimes called an “online experiment” or “online A/B test.”

Online controlled experiments are an important tool for establishing “causal relationships,” arguably the only completely reliable one. The foundation of this tool is statistical hypothesis testing.

Specifically, we divide the population of website or application users into groups. Typically this is done by random assignment: 50% of users are placed in the “control group” and the other 50% in the “treatment group.” The only difference between the two groups is the system they are exposed to.
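In practice, the random 50/50 split is often implemented by hashing a user ID together with an experiment-specific salt, so that each user’s assignment is stable across visits. A minimal sketch in Python (function and salt names are illustrative):

```python
import hashlib

def assign_group(user_id: str, experiment_salt: str = "search-exp-1") -> str:
    """Deterministically assign a user to 'control' or 'treatment'.

    Hashing (salt + user_id) gives a stable, roughly uniform split:
    the same user always lands in the same group, and changing the
    salt re-randomizes assignments for a new experiment.
    """
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # bucket in [0, 100)
    return "control" if bucket < 50 else "treatment"

# Over many users, the split is close to 50/50:
groups = [assign_group(f"user-{i}") for i in range(10_000)]
print(groups.count("control"), groups.count("treatment"))
```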

Suppose we want to improve one part of a search system. We keep everything else unchanged and make that part the only “independent variable” in the experimental setup. Through the online experiment and hypothesis testing, we then hope to determine whether this independent variable improves or degrades system performance.

We also need to decide which metrics to evaluate, especially user metrics such as click-through rate and number of searches. These metrics are called “dependent variables.” Put simply, we hope to establish, through hypothesis testing, a relationship between the independent variable and the dependent variables.
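To connect the independent and dependent variables through hypothesis testing, a common choice for a metric like click-through rate is a two-proportion z-test: under the null hypothesis that the two groups share the same CTR, the standardized difference in observed rates should look like a draw from a standard normal. A small self-contained sketch (the numbers are toy data):

```python
import math

def two_proportion_z_test(clicks_a, users_a, clicks_b, users_b):
    """z statistic and two-sided p-value for the difference in
    click-through rate between two groups, under the pooled null
    hypothesis that both groups have the same true CTR."""
    p_a, p_b = clicks_a / users_a, clicks_b / users_b
    p_pool = (clicks_a + clicks_b) / (users_a + users_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / users_a + 1 / users_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# control: 5,000 users, 400 clicks; treatment: 5,000 users, 460 clicks
z, p = two_proportion_z_test(400, 5000, 460, 5000)
print(round(z, 2), round(p, 3))  # significant at the 5% level
```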

Although the concept of online controlled experiments is easy to understand, there are many challenges in practical implementation.

Ideally, random assignment splits users evenly between the control group and the treatment group. In practice, however, the users routed into the two groups by a randomization algorithm may not share the same characteristics. What does this mean?

For example, the control group may contain more female users than the treatment group, or the treatment group may contain more users from Beijing. In such cases, differences in the dependent variables, such as click-through rate, between the two groups become difficult to attribute solely to the independent variable.

In other words, if the click-through rate in the control group is higher than in the treatment group, is it because we changed part of the system, because of the extra female users, or because of some complex interaction between female users and certain parts of the system? This is hard to determine. The same applies to the earlier example of more users from Beijing.

In practice, even if we can easily balance one or two extra variables across the control and treatment groups algorithmically, it is still challenging to balance a dozen or more important variables (e.g., age, gender, geographic location, income level) at the same time.

And even if randomization balances the known variables between the two groups, we cannot control unknown variables the same way. Handling the impact of population characteristics on conclusions is therefore one of the practical challenges of online experiments.
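A common diagnostic for this kind of imbalance is the standardized mean difference (SMD) of each covariate between the two groups; a frequently cited rule of thumb is that an absolute SMD below roughly 0.1 indicates acceptable balance. A sketch with toy data (variable names and ages are illustrative):

```python
from statistics import mean, stdev

def standardized_mean_difference(control, treatment):
    """SMD of one covariate across two groups: the difference in
    group means divided by the pooled standard deviation."""
    pooled_sd = ((stdev(control) ** 2 + stdev(treatment) ** 2) / 2) ** 0.5
    return (mean(treatment) - mean(control)) / pooled_sd

# ages of users in each group (toy data)
control_ages = [25, 31, 40, 29, 35, 28, 33, 38]
treatment_ages = [27, 30, 41, 28, 36, 29, 34, 37]
print(round(standardized_mean_difference(control_ages, treatment_ages), 3))
```

In a real system this check would be repeated for every covariate we care about (age, gender, location, and so on), and a large SMD on any of them is a warning that a difference in the dependent variable may not be caused by the independent variable.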

The second challenge of online experiments is that, even after removing the population disparities discussed above, it is difficult to make one particular part of the system the only independent variable between the control and treatment groups.

A modern website or application is composed of many services, subsystems, pages, and modules, each with its own frontend and backend systems and often owned by different product and engineering teams. Each team wants to run its own controlled experiment, with the part it is improving as the only independent variable. Viewed globally, however, if every team experiments on its own while still treating each user as the experimental unit, it becomes difficult to keep users comparable across experiments.

For example, suppose user U1 lands in the “control group” of the homepage experiment, then visits the “treatment group” of the search-page experiment before leaving the site, while user U2 goes directly to the “treatment group” of the help-page experiment and then visits the “control group” of the search-page experiment. Given these browsing paths, it is hard to draw conclusions from the difference in click-through rate between U1 and U2. Even with a large amount of data, it is difficult to truly balance the user groups across all of these pages.
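One widely used mitigation is to run experiments in separate “layers” and hash each user independently per layer, so that a user’s assignment in one experiment is statistically independent of their assignment in another. A sketch of the idea (layer names are illustrative):

```python
import hashlib

def layered_assignment(user_id: str, layer: str) -> str:
    """Assign a user independently within each experiment 'layer'.

    Hashing (layer, user_id) decorrelates assignments across layers,
    so the homepage experiment does not systematically skew the
    population seen by the search-page experiment.
    """
    digest = hashlib.sha256(f"{layer}:{user_id}".encode()).hexdigest()
    return "control" if int(digest, 16) % 2 == 0 else "treatment"

# the same user can land in different groups in different layers
u1 = {layer: layered_assignment("U1", layer)
      for layer in ("homepage", "search", "help")}
print(u1)
```

Because each layer uses its own salt, roughly half of the users in any one layer’s control group end up in the other layer’s treatment group, which is exactly what keeps the experiments comparable.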

In reality, how to effectively conduct online experiments, including experimental design and evaluation, is still a very cutting-edge research topic. There are many new research results published each year at academic conferences such as KDD, WSDM, and ICML.

Using causal inference for analyzing experimental results #

Today, I would like to talk about causal inference. Causal inference is not covered in ordinary statistics textbooks, nor in the statistics curriculum that engineering students typically encounter. In recent years, however, the field has received increasing attention from the machine learning community, so understanding causal inference is essential for keeping up with cutting-edge machine learning.

As mentioned earlier, experiments can produce uneven user characteristics, and separate experiments can interact with one another; causal inference provides tools for analyzing both problems. Moreover, for a real product, not every question about content, models, or product design can be answered by A/B testing within a reasonable time frame; in many cases testing is simply infeasible. In those situations we can still obtain the desired answers through data analysis, in effect simulating an online experiment. This is the core value of causal inference.

More generally, the need for causal inference in machine learning is quite common. For example, we train new models or algorithms on data collected from the current online system, but that system has certain biases, which are then recorded in the data. Causal inference provides machine learning with a set of tools for training and evaluating models and algorithms in an unbiased manner on biased data.
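A concrete example of such a tool is inverse propensity scoring (IPS): by reweighting each logged interaction by how much more (or less) likely a new policy would have been to take the logged action, we obtain an unbiased estimate of the new policy’s performance from the old system’s biased logs. A toy sketch (the data and policy are illustrative):

```python
def ips_estimate(logs, target_policy_prob):
    """Inverse propensity scoring: estimate the click-through rate a
    *new* ranking policy would achieve, using only logs collected
    under the *old* (biased) policy.

    Each log entry is (action, click, logging_prob), where logging_prob
    is the probability with which the old policy chose that action.
    """
    total = 0.0
    for action, click, logging_prob in logs:
        weight = target_policy_prob(action) / logging_prob
        total += weight * click
    return total / len(logs)

# toy logs: old policy shows doc "a" 80% of the time, doc "b" 20%
logs = [("a", 1, 0.8), ("a", 0, 0.8), ("a", 0, 0.8), ("a", 0, 0.8),
        ("b", 1, 0.2)]
# new policy would show "a" and "b" each 50% of the time
new_policy = lambda action: 0.5
print(round(ips_estimate(logs, new_policy), 3))
```

The reweighting is what corrects the bias: the rare-but-clicked action “b” gets a large weight because the old policy under-served it relative to the new policy.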

Summary #

Today, I discussed how to use online experiments, particularly controlled experiments, to evaluate the search technologies we build in modern search systems.

Let’s review the key points: First, I gave a detailed introduction to the factors involved in online experiments and analyzed the potential issues of user imbalance and interactions between experiments. Second, I briefly discussed using causal inference to analyze online experiment data and an approach to adjusting for “bias.”

In conclusion, I leave you with a question to ponder: How can we establish the relationship between online experiment evaluation results and offline metrics such as nDCG?