041 What are the Advanced Indicators for Search System Evaluation #

On Monday, we introduced offline evaluation metrics based on the principle of “binary relevance”. It is fair to say that this approach has dominated the development of document retrieval systems since the 1950s. However, the “binary relevance” principle fundamentally cannot express how good a ranking is, which has become an obstacle to developing more accurate ranking algorithms. As a result, researchers developed evaluation metrics based on the principle of “graded relevance”. Today, I will focus on this topic.

Evaluation Based on the Principle of Graded Relevance #

Starting from “binary relevance”, it is natural to define a more flexible measure of relevance. In a paper published at NIPS 2007 (reference [1]), Yahoo scientists introduced a five-point relevance scale, ranging from “highly relevant” to “highly irrelevant”. In the same year, Google scientists published a paper at SIGIR (reference [2]) describing their own multi-grade relevance scoring mechanism. Since then, evaluation standards based on the principle of “graded relevance” have gradually been adopted by developers of various search systems.

In this setting, “precision” and “recall”, which are built on “binary relevance”, are no longer adequate. We need new evaluation metrics based on “graded relevance”.

Finnish researchers presented a method for exactly this kind of relevance evaluation at SIGIR 2000 (reference [3]), and it has since been widely applied in “graded relevance” scenarios. So, what is this method?

This method is called “Discounted Cumulative Gain” (DCG for short). Before introducing DCG, let’s assume that position 1 is the top-ranking position, i.e., the document at the top, and the position number increases as the ranking decreases, with position 10 being the last document on the first page.

The idea of DCG is as follows.

First, the overall relevance of a ranking is a weighted sum of the relevance at each position. This describes the entire ranking with a single number, and a different ranking of the same documents generally produces a different number. This avoids the insensitivity of “precision” and “recall” to ranking.

Second, the “gain” at each position is related to the relevance defined for the document, but with different “discounts” based on different positions. The lower the position (i.e., the larger the position number), the greater the discount. That’s where the name DCG comes from.

In the original definition of DCG, the gain at each position is the relevance of the document divided by the logarithm of the position number. This ensures that the lower the position (the larger the position number), the heavier the discount. It also means that the difference between high positions (small position numbers) is larger than the difference between low positions. What does this mean? If a document moves from position 1 to position 2, the loss it suffers is greater than if it moves from position 9 to position 10. With this definition, DCG encourages placing relevant documents as high in the ranking as possible.
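
In symbols, one common way to write this original formulation is as follows, where $rel_i$ is the relevance grade of the document at position $i$; the first position is left undiscounted, since $\log_2 1 = 0$:

$$
\mathrm{DCG}@k = rel_1 + \sum_{i=2}^{k} \frac{rel_i}{\log_2 i}
$$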

For example, suppose we have 5 documents with relevance grades 1 through 5, representing “highly irrelevant”, “irrelevant”, “neutral”, “relevant”, and “highly relevant” respectively. According to the definition of DCG, the optimal ranking places these 5 documents in descending order of relevance, i.e., 5, 4, 3, 2, 1. Any other order yields a smaller DCG because of the position-based discount, and is therefore suboptimal. In this way, DCG captures ranking quality in a way that “precision” and “recall” cannot.
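
As a quick check, here is a minimal Python sketch of this original DCG definition applied to the example above; the function name `dcg` and the second ordering are my own illustrations, not from the paper.

```python
import math

def dcg(relevances):
    """Original-style DCG: the document at position 1 is undiscounted,
    and the document at position i >= 2 contributes rel_i / log2(i)."""
    if not relevances:
        return 0.0
    score = float(relevances[0])
    for i, rel in enumerate(relevances[1:], start=2):
        score += rel / math.log2(i)
    return score

ideal = [5, 4, 3, 2, 1]   # documents sorted by descending relevance grade
other = [3, 5, 4, 1, 2]   # the same documents in a different order

print(dcg(ideal))   # ~12.32 -- the largest DCG achievable with these grades
print(dcg(other))   # ~11.89 -- smaller, since high grades sit at discounted positions
```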

However, using DCG directly has a problem. If two query keywords return different numbers of documents, it is not “fair” to compare their DCG values directly. The reason is that DCG is a sum that only grows as documents are added, so a query that returns more documents can accumulate a larger DCG even if its ranking is no better. Therefore, the raw DCG values of different query keywords cannot be compared directly.

Is there a solution? Yes: the normalized version of DCG is called nDCG (Normalized Discounted Cumulative Gain). The idea of nDCG is as follows.

First, for the ranking returned for a given query keyword, use the relevance judgments to compute the DCG of an “ideal ranking”, which is simply the same documents sorted in descending order of relevance. Then divide the DCG of the ranking produced by the algorithm by this ideal DCG to obtain the “normalized” DCG, which is the nDCG value. Simply put, nDCG rescales DCG relative to the ideal case, and after this normalization the values for different query keywords can be compared.
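
Continuing the sketch above (and reusing the `dcg` helper defined there), the normalization step might look like this:

```python
def ndcg(relevances):
    """nDCG: DCG of the produced ranking divided by the DCG of the
    ideal ranking (the same relevance grades sorted in descending order)."""
    ideal_dcg = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal_dcg if ideal_dcg > 0 else 0.0

print(ndcg([3, 5, 4, 1, 2]))   # ~0.96, now comparable across queries
print(ndcg([5, 4, 3, 2, 1]))   # 1.0 for the ideal ordering
```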

It should be noted that the DCG definition introduced above is the original one. Around 2005, researchers at Microsoft introduced a variant (reference [4]) in which the basic principle is unchanged but the gain and discount terms take a slightly different algebraic form. This newer version has become the more widely used one in industry. If you are interested, see reference [4] for the details.
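
For reference, the variant commonly attributed to reference [4] is usually written with an exponential gain, 2^rel - 1, and a shifted logarithmic discount, log2(i + 1). A sketch of that formulation (reusing the `math` import from the earlier example) is:

```python
def dcg_variant(relevances):
    """Widely used DCG variant: gain (2^rel - 1), discount log2(i + 1),
    so even position 1 is divided by log2(2) = 1."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(relevances, start=1))
```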

Even today, nDCG and DCG are still the standard metrics for evaluating ranking algorithms and various ranking results.

Comparing Two Different Rankings #

Whether it is the “precision” and “recall” we discussed before or the nDCG we introduced today, we use a single number to describe the quality of a set of results for one query keyword. When we have many query keywords, how do we compare two different rankings?

One issue is that two different rankings, A and B, may each have their own merits: A may perform better than B on one keyword, while B produces better results than A on another. What should we do in this case?

You may think of using the average to describe the performance of A and B, and this is indeed a good first step: compute the average performance of A and of B across all query keywords, which gives two numbers summarizing the quality of the two rankings.

However, we soon run into a problem. Suppose the average nDCG of A is 0.781 and the average nDCG of B is 0.789. Can we conclude that B is a better ranking algorithm than A?

The answer is: not necessarily. In such cases, we rely on the statistical tool of “hypothesis testing” to compare the quality of the two rankings.

I won’t go into the details of hypothesis testing here, but I’ll briefly mention a frequently used tool.

If we compare A and B on the same set of query keywords, we often use the “paired two-sample t-test”. “Paired” means that the results of A and B can be matched one-to-one, query by query. The “t-test” refers to performing hypothesis testing with the t-distribution, commonly known as “Student’s t-distribution”. If we compare the two rankings on different sets of query keywords, other hypothesis-testing tools apply, but I won’t go into them here.
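
As an illustrative sketch, assuming SciPy is available and using made-up per-query nDCG values (not the numbers from the text), the paired t-test can be run like this:

```python
from scipy import stats

# Per-query nDCG values for rankings A and B over the same eight queries
# (illustrative numbers only).
ndcg_a = [0.81, 0.75, 0.79, 0.77, 0.78, 0.80, 0.76, 0.79]
ndcg_b = [0.83, 0.74, 0.80, 0.79, 0.77, 0.82, 0.78, 0.78]

# Paired (dependent-sample) t-test: each query contributes one A/B pair.
t_stat, p_value = stats.ttest_rel(ndcg_a, ndcg_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")

# A small p-value (say, below 0.05) suggests the average difference between
# A and B is unlikely to be due to chance alone; a large one means the data
# do not support calling either ranking better.
```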

It is worth noting that hypothesis testing itself is not a panacea. First, the most effective way to perform hypothesis testing on ranking results is still a research topic, and no method, including the paired t-test mentioned above, is an absolute standard. Second, the conclusion of a hypothesis test only speaks to statistical significance; how these systems actually perform in front of users may still differ from what the test suggests. The results of hypothesis testing should therefore be interpreted with a critical eye.

Summary #

Today I have explained how to evaluate the systems we build in modern search technology, especially how to evaluate ranking systems.

Let’s review the key points together: first, a brief introduction to the evaluation system based on “graded relevance”, including its origin and the concepts of DCG and nDCG; second, an explanation of how to compare the quality of two rankings.

Finally, I leave you with a question to ponder: If we only have “binary” relevance information, can we use nDCG to evaluate the quality?

References

  1. Ben Carterette and Rosie Jones. Evaluating search engines by modeling the relationship between relevance and clicks. Proceedings of the 20th International Conference on Neural Information Processing Systems (NIPS '07), J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis (Eds.). Curran Associates Inc., USA, 217-224, 2007.
  2. Scott B. Huffman and Michael Hochster. How well does result relevance predict session satisfaction? Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '07). ACM, New York, NY, USA, 567-574, 2007.
  3. Kalervo Järvelin and Jaana Kekäläinen. IR evaluation methods for retrieving highly relevant documents. Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '00). ACM, New York, NY, USA, 41-48, 2000.
  4. Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. Proceedings of the 22nd International Conference on Machine Learning (ICML '05). ACM, New York, NY, USA, 89-96, 2005.