040 What are the Fundamental Indicators for Search System Evaluation #

In my previous columns over the past few weeks, I mainly discussed the classic Information Retrieval techniques and machine learning-based ranking algorithms (Learning to Rank). I also talked about how to understand query keywords, including query keyword classification, query keyword parsing, and query keyword expansion. These classic techniques are the core technologies behind various search engines that have become popular since the year 2000.

Before diving into more search engine technologies, I think it is necessary to spend a week looking specifically at the evaluation of search systems and the various metrics we often use. As the saying goes, “If You Can’t Measure It, You Can’t Improve It.” In other words, if we cannot measure the quality of a system because we lack the corresponding evaluation metrics, it is difficult to know how to improve those metrics and, ultimately, the system itself.

Although we are discussing evaluation and metrics in the context of search systems, much of what we cover this week applies beyond search. Recommendation systems, advertising systems, and other similar scenarios can readily reuse many of the same ideas.

Offline Evaluation #

Suppose you have just developed a new piece of software, say the latest mobile app. How do you know whether your users like it? How do you know whether they are willing to pay for it?

The core of evaluation is really about understanding users’ preferences. The most direct way, of course, is to ask users for feedback. For example, you could survey every user who downloads your mobile app and ask about their attitude toward the new software.

However, we quickly realize that this method is not feasible. Setting aside whether users will resent being pressed for answers, there is the question of whether such surveys can capture users’ true feedback at all, which touches on the science of survey design.

Even if these surveys could accurately reflect users’ opinions of the mobile app, carrying them out would face various difficulties. If the app has millions or even tens of millions of users, running such a large-scale survey and processing the data afterwards would be a huge amount of work. Furthermore, these surveys cannot be reused: users’ attitudes will change after the next version of the software is released, so surveys are of little help in systematically supporting software iteration.

So, how do we create a set of data to help with iterative improvements and reduce manual costs? This becomes a key question.

In the early days of information retrieval system development, researchers and engineers recognized the importance of this core issue. British computer scientist Cyril Cleverdon can be considered one of the earliest developers of offline test collections.

Cyril was born in 1914 and worked at the Bristol library in the UK for a long time. Starting from 1950, Cyril dedicated himself to developing information retrieval systems to improve the efficiency of library queries. In 1953, he attempted to establish a small test dataset to measure the speed at which librarians could find documents. This work was first published in a 1955 paper (see reference [1]).

Afterwards, the development of several early information retrieval systems in the UK and the US followed this approach: compare multiple systems by first constructing an offline test collection, then use that collection to continuously improve and enhance existing systems. If you want to learn more about how early test collections were constructed, I recommend reading reference [2] at the end of this article.

So, what are the characteristics of these test collections created at that time?

Each of these test collections includes a set of query keywords, typically containing dozens to hundreds of them. On the one hand, the keywords were mostly chosen based on experience; on the other hand, starting with Cyril, researchers realized they had to ensure that some information could definitely be found for these keywords. In effect, this already tests what we will discuss later: “recall”.

Alongside the query keywords, these test collections often contain hundreds to thousands of documents. When constructing the dataset, researchers already know which portion of these documents contains the information needed for each query keyword; these are what we will later call relevant documents.

As you can see, dozens to hundreds of query keywords and thousands of documents obviously cannot represent the behavior of all possible users of a system. You can even say that this cannot represent the behavior of the vast majority of users. However, the advantage of this type of test collection is that the query keywords and documents themselves are independent of the system being tested. In other words, whether we want to test system A today or system B tomorrow, we can repeatedly use the same test collection. The benefits of doing this compared to the previously mentioned surveys are evident.

Furthermore, I need to emphasize that in a test collection, the concept of a “user” is “abstracted” away. When we discuss the relevance of a document to a specific query keyword, we assume that this relevance is fixed and applies to all users, so it does not matter which particular user is using the system. As long as the developed system performs well on this “standardized” set of query keywords and documents, we believe it can meet the needs of all users.

Because the test collection is not the real feedback result generated by user-product interactions, we often refer to the test collection as “offline evaluation data”.

Evaluation Metrics Based on Binary Relevance #

After collecting evaluation data offline, the easiest thing we can do is to use a set of evaluation metrics defined by “binary relevance” to measure the quality of the system at hand.

So what is “binary relevance”? Simply put, it means that for a given query keyword, each document in the entire test set is labeled either “relevant” or “non-relevant”. There are no degrees or percentages of relevance. Note that the same document may carry different relevance labels for different keywords.

Assuming that a system extracts a certain number of documents from the test dataset for a particular keyword instead of returning all documents, we can define a series of metrics based on this extracted subset of documents.

Two metrics defined on “binary relevance” serve as the foundation for many other important metrics. One is called “precision”, which measures how many of the extracted documents are relevant. The other is “recall”, which measures how many of the relevant documents were extracted.

Both “precision” and “recall” have the same numerator, which is the number of documents that were both extracted and relevant. What differentiates these two metrics is their denominator. The denominator for “precision” is the total number of extracted documents, while the denominator for “recall” is the total number of relevant documents. If we were to return all documents, “recall” would be 1, and “precision” would be the proportion of relevant documents in the entire set. Therefore, we notice that both of these metrics assume that the number of extracted documents is relatively small compared to the whole set.
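
To make these definitions concrete, here is a minimal Python sketch that computes precision and recall for one query under binary relevance; the document IDs and the counts are hypothetical, chosen purely for illustration.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query under binary relevance."""
    hits = retrieved & relevant  # documents that are both retrieved and relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: the system returns 5 documents for one keyword,
# and the test collection marks 6 documents as relevant for that keyword.
retrieved = {"d1", "d2", "d3", "d4", "d5"}
relevant = {"d2", "d3", "d5", "d7", "d8", "d9"}

p, r = precision_recall(retrieved, relevant)
print(p, r)  # 0.6 (3 of the 5 retrieved are relevant), 0.5 (3 of the 6 relevant were retrieved)
```

The only point of the sketch is that both metrics share the same numerator, the intersection of the retrieved and relevant sets, and differ only in the denominator.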

Soon, people realized from practice that with “precision” and “recall” you cannot have your cake and eat it too: it is difficult for a system to achieve high values on both at the same time. In other words, we often need to find a balance between these two metrics. As a result, researchers began to look for a single number that could represent the “average level” of “precision” and “recall”. C. J. van Rijsbergen, a scholar from the United Kingdom, first used the “harmonic mean” to combine “precision” and “recall” in a paper (see reference [3]). This method later became known as the “F-measure” and has been used ever since.
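
As a rough sketch of how the harmonic mean combines the two numbers, the snippet below shows the balanced form of the F-measure (often called F1), continuing the hypothetical numbers from the previous sketch.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Continuing the hypothetical numbers from the previous sketch:
print(f_measure(0.6, 0.5))  # roughly 0.545
```

Because the harmonic mean is pulled toward the smaller of the two values, a system cannot score well by maximizing one metric while neglecting the other.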

It is important to note that “precision” and “recall” are both defined on “binary relevance” over an unordered set of retrieved documents. As a result, these two metrics are not “ranking metrics”; in other words, they do not truly evaluate how a ranking system orders its results.

For example, suppose we retrieve 10 documents for a specific keyword and 3 of them are relevant. Neither “precision” nor “recall” tells us where those three documents sit in the result list, whether in the top three positions or the last three. This limitation of “binary relevance” is what led researchers to develop metrics that can truly evaluate ranking; the small sketch below makes the blind spot concrete.
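
In this hypothetical comparison, two systems return the same 10 documents for the same keyword, one with the 3 relevant documents at the top and the other with them at the bottom; precision and recall come out identical for both.

```python
# Two hypothetical result lists of 10 documents for the same keyword;
# "R" marks a relevant document, "N" a non-relevant one.
ranking_a = ["R", "R", "R"] + ["N"] * 7   # relevant documents ranked at the top
ranking_b = ["N"] * 7 + ["R", "R", "R"]   # relevant documents ranked at the bottom

total_relevant = 3  # assume only these 3 relevant documents exist for this keyword

for name, ranking in [("A", ranking_a), ("B", ranking_b)]:
    hits = ranking.count("R")
    precision = hits / len(ranking)   # 3 / 10 = 0.3 for both systems
    recall = hits / total_relevant    # 3 / 3  = 1.0 for both systems
    print(name, precision, recall)    # identical numbers despite very different rankings
```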

Summary #

Today, I have covered the evaluation of search systems and common metrics. This is a crucial aspect of modern search technology, determining how we assess the systems we build. We have discussed the origins of offline testing and the advantages it holds over surveys.

Let’s recap the key points: Firstly, we provided a brief introduction to the history of reusable offline test collections, discussing their characteristics and limitations. Secondly, we delved into two popular and important evaluation metrics based on “binary relevance,” namely “precision” and “recall.”

Finally, I leave you with a question to ponder: We have learned that the quality of ranking cannot be determined solely by the numerical values of “precision” and “recall.” Can we manipulate these metrics in any way? If we solely rely on “binary relevance,” is there a method to evaluate the quality of ranking?

References #

  1. R. G. Thorne. The Efficiency of Subject Catalogues and the Cost of Information Searches. Journal of Documentation, Vol. 11, Issue 3, pp. 130-148, 1955.
  2. K. Sparck Jones, C. J. van Rijsbergen. Information Retrieval Test Collections. Journal of Documentation, Vol. 32, Issue 1, pp. 59-75, 1976.
  3. C. J. van Rijsbergen. Foundation of Evaluation. Journal of Documentation, Vol. 30, Issue 4, pp. 365-373, 1974.