036 Machine Learning Ranking Algorithms Listwise Method for Rank Learning #

This week, we have discussed two fundamental approaches to learning to rank: the pointwise method and the pairwise method. Pointwise learning to rank is a simple and practical approach that recasts classical information retrieval problems as standard machine learning problems. Pairwise learning to rank, on the other hand, transforms ranking into the problem of modeling the relative relevance between every pair of documents for a given query keyword. However, both approaches have clear limitations and leave room for further algorithmic improvement.

Today, I will talk about the “ultimate method” for directly optimizing ranking problems: listwise learning to rank. Instead of attempting to learn the relevance of each individual sample or the relative comparison between two documents, the basic idea of listwise learning to rank is to directly optimize metrics like NDCG (Normalized Discounted Cumulative Gain) in order to learn the best ranking results.

History of Listwise Ranking in Machine Learning #

After the year 2000, both academia and industry began researching how to use machine learning to solve the ranking problem. Five or six years later, researchers started to explore directly optimizing the entire ranked list.

Much of this research came from Microsoft Research. For example, AdaRank, which emerged around 2007, was developed by Jun Xu and Hang Li of Microsoft Research Asia; that paper can be considered one of the early works proposing the idea of listwise ranking. In the same year, ListNet, which laid the theoretical groundwork for listwise ranking, was published at the International Conference on Machine Learning (ICML 2007) by Tie-Yan Liu and colleagues, also of Microsoft Research Asia. Similar work sprang up in abundance that year.

I will also mention another line of work. LambdaRank appeared slightly earlier, while LambdaMART came later. Both were developed at Microsoft Research in Seattle under the lead of Christopher J.C. Burges. Burges retired in 2016 after 16 years at Microsoft, and much of the ranking technology behind Microsoft's search engine Bing was developed by his team.

Detailed Explanation of Listwise Ranking Learning #

There are two basic approaches to listwise ranking learning. The first is to directly optimize metrics such as NDCG; the objective is clear: optimize the very target used for evaluation. The second is to reconstruct a known optimal ranking and then minimize the difference between the learned ranking and that optimal ranking.

Let’s first discuss the first approach, which is to directly optimize metrics like NDCG.

First, let’s review how ranking is evaluated on a test set. In general, all ranking metrics examine whether a certain group of documents forms an optimal ranking for a specific query keyword in the test set. There are two common approaches. The first is suited to binary relevance information: for a specific query keyword, it evaluates the “top K” documents produced by the ranker using metrics such as precision and recall. The second uses graded relevance information, which allows evaluation metrics like NDCG. For a detailed discussion, please refer back to the previous two episodes.
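To make these metrics concrete, here is a minimal Python sketch of Precision@K and NDCG@K for a single query. The document IDs and relevance grades are illustrative assumptions, and the sketch uses the grade directly as the gain with a log2 position discount; a 2^rel − 1 gain is also common.

```python
import math

def precision_at_k(relevant, ranked_ids, k):
    """Fraction of the top-k ranked documents that are relevant (binary labels)."""
    return sum(1 for d in ranked_ids[:k] if d in relevant) / k

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k graded gains."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(gains, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

# Hypothetical query: binary judgments for precision, graded gains for NDCG,
# both listed in the order the ranker returned the documents.
print(precision_at_k({"d1", "d3"}, ["d1", "d2", "d3", "d4"], 2))  # 0.5
print(round(ndcg_at_k([3, 2, 3, 0, 1], 5), 4))
```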

So, what are the challenges and key points in directly optimizing ranking metrics like NDCG?

The difficulty lies in the fact that, although optimizing NDCG is desirable, it is far from easy in practice. NDCG and the top-K precision mentioned earlier are “non-continuous” and “non-differentiable” functions of the model’s scores, whereas the majority of optimization algorithms assume “continuous” and “differentiable” objectives. Direct optimization is therefore quite challenging.
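A tiny experiment illustrates the problem: NDCG depends on the model’s scores only through the sort order, so it is piecewise constant, flat almost everywhere and jumping only when two documents swap positions. A minimal sketch (the scores and labels are made up):

```python
import math

def ndcg(scores, labels):
    """NDCG as a function of model scores: sort documents by score,
    then discount each gain (relevance label) by its log2 position."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    dcg = sum(labels[i] / math.log2(pos + 2) for pos, i in enumerate(order))
    ideal = sum(g / math.log2(pos + 2)
                for pos, g in enumerate(sorted(labels, reverse=True)))
    return dcg / ideal

labels = [0, 1]                   # only the second document is relevant
print(ndcg([2.0, 1.0], labels))   # relevant document ranked last
print(ndcg([2.0, 1.9], labels))   # score moved a lot, NDCG unchanged (zero gradient)
print(ndcg([2.0, 2.1], labels))   # order flips, NDCG jumps discontinuously
```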

To address this situation, there are several methods:

The first method is to find an alternative metric that approximates NDCG. This alternative metric should be “continuous” and “differentiable”. As long as we establish an approximation relationship between this alternative metric and NDCG, we can optimize this alternative metric to achieve the goal of approximating NDCG optimization. Representative algorithms in this category include SoftRank and AppRank.
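As a sketch of this smoothing idea, in the spirit of such approximation methods rather than any one published algorithm: each document’s hard rank can be replaced by a sum of sigmoids over pairwise score differences, making the resulting “soft NDCG” continuous and differentiable in the scores (the temperature parameter `temp` is an assumption controlling how sharp the approximation is):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def smooth_ndcg(scores, labels, temp=0.1):
    """Differentiable NDCG surrogate: the hard rank of each document is
    replaced by 1 + a sum of sigmoids counting how many other documents
    are (softly) scored above it; smaller temp means a closer approximation."""
    n = len(scores)
    dcg = 0.0
    for i in range(n):
        soft_rank = 1.0 + sum(sigmoid((scores[j] - scores[i]) / temp)
                              for j in range(n) if j != i)
        dcg += labels[i] / math.log2(soft_rank + 1.0)
    ideal = sum(g / math.log2(pos + 2)
                for pos, g in enumerate(sorted(labels, reverse=True)))
    return dcg / ideal

scores, labels = [2.0, 1.0, 0.5], [2, 0, 1]
print(round(smooth_ndcg(scores, labels, temp=0.01), 4))  # approaches true NDCG as temp -> 0
```

With a gradient-based framework, this surrogate can be maximized directly, which is the whole point of the substitution.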

The second method is to mathematically derive a “bound” on NDCG and similar metrics, and then optimize that bound instead. For example, if one can derive an upper bound on the ranking loss, minimizing that upper bound indirectly optimizes NDCG. Representative algorithms in this category include SVM-MAP and SVM-NDCG.

The third method is to explore optimization algorithms in order to design complex optimization algorithms that can optimize NDCG and other similar metrics. For such algorithms, the objective function required by the algorithm can be “non-continuous” and “non-differentiable”. Representative algorithms in this category include AdaRank and RankGP.

After discussing the first approach, let’s take a look at the second approach. The main assumption in this approach is that we already know a perfect ranking for a specific search keyword. The goal is to use a learning algorithm to approximate this perfect ranking. We want to minimize the difference between the predicted ranking and the perfect ranking. It is worth noting that optimizing NDCG and other ranking metrics is not the main objective in this approach. Representative methods in this approach include ListNet and ListMLE.
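For instance, ListNet’s top-one approximation compares the probability distribution over “which document is ranked first” induced by the model scores against the one induced by the relevance labels, using cross-entropy. A minimal sketch, using raw relevance grades as the target scores (an illustrative simplification):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def listnet_loss(scores, labels):
    """ListNet-style top-one loss: cross-entropy between the top-one
    probability distribution implied by the labels and the one implied
    by the model scores."""
    p_true = softmax(labels)    # target distribution from relevance grades
    p_model = softmax(scores)   # model distribution from predicted scores
    return -sum(t * math.log(m) for t, m in zip(p_true, p_model))

# Scores that agree with the labels give a lower loss than scores that invert them.
print(listnet_loss([3.0, 2.0, 1.0], [3, 2, 1]) < listnet_loss([1.0, 2.0, 3.0], [3, 2, 1]))  # True
```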

After discussing these two approaches, let me briefly mention a third, hybrid approach. Its characteristic is to seek a middle ground between pure listwise and pairwise methods. Specifically, its core idea is to design an alternative objective function inspired by metrics like NDCG; this step resembles the first direction introduced earlier, namely finding surrogates.

Going one step further, once a surrogate is found, the idea of directly optimizing the whole list is reduced to optimizing over pairs. Representative methods in this direction are LambdaRank, invented at Microsoft, and its successor LambdaMART. Let me briefly explain the basic idea behind the LambdaRank family of models.

First, Microsoft researchers noticed that whether a ranking algorithm achieves the optimal scenario can be viewed as checking which pairwise document relationships are incorrect in the current ranking compared to the optimal one. The problem of learning the optimal ranking is transformed into reducing these pairwise ranking errors. Furthermore, in the design of this optimization process, we actually do not need to know the exact form of the target function. We only need some form of gradient.

There is an insight here: in most optimization procedures, the objective function often exists only so that gradients can be derived from it. If we can obtain suitable gradients directly, we do not need an explicit objective function at all. Finally, through experimentation, Microsoft researchers found that multiplying the pairwise gradient by the change in NDCG incurred when the two documents of the pair are swapped yields a marked improvement.
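The recipe above can be sketched as follows (a simplified illustration, not Microsoft’s actual implementation; it assumes the documents are listed in the order the current model ranks them, and uses the common 2^rel − 1 gain):

```python
import math

def delta_ndcg(labels, i, j, ideal_dcg):
    """|Change in NDCG| if the documents at positions i and j swap places."""
    gain = lambda rel: 2 ** rel - 1
    disc = lambda pos: 1.0 / math.log2(pos + 2)
    before = gain(labels[i]) * disc(i) + gain(labels[j]) * disc(j)
    after = gain(labels[i]) * disc(j) + gain(labels[j]) * disc(i)
    return abs(after - before) / ideal_dcg

def lambda_gradients(scores, labels, sigma=1.0):
    """Per-document 'lambda' gradients: the RankNet pairwise gradient
    scaled by the |delta NDCG| of swapping the pair (the LambdaRank recipe)."""
    n = len(scores)
    ideal = sum((2 ** g - 1) / math.log2(p + 2)
                for p, g in enumerate(sorted(labels, reverse=True)))
    lambdas = [0.0] * n
    for i in range(n):
        for j in range(n):
            if labels[i] > labels[j]:  # document i should rank above j
                rho = 1.0 / (1.0 + math.exp(sigma * (scores[i] - scores[j])))
                lam = sigma * rho * delta_ndcg(labels, i, j, ideal)
                lambdas[i] += lam      # force pushing i up
                lambdas[j] -= lam      # force pushing j down
    return lambdas

# Document 1 is relevant but scored below document 0: its lambda pushes it up.
lams = lambda_gradients([2.0, 1.0], [0, 1])
print(lams[1] > 0 and lams[0] < 0)  # True
```

These lambdas are then fed to whatever learner is at hand, a neural network in LambdaRank or boosted trees in LambdaMART, as if they were gradients of some (never written down) objective.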

Early on, RankNet, and then LambdaRank, used neural networks for model training. LambdaMART instead adopted the idea of “ensembles of decision trees” and switched to a gradient-boosted, tree-based method. It was later shown that tree-based methods are very effective for ranking problems, and they have become the standard configuration in many similar systems.

Finally, one thing to note. We have discussed several approaches to listwise ranking learning. In theory and research, listwise methods are considered the ideal form of ranking learning because they attempt to unify the evaluation metric and the learning objective. Although pure listwise methods perform well in academic studies, hybrid methods like LambdaRank, which combine pairwise and listwise ideas, are more popular in practice. Overall, listwise methods have higher computational complexity, and their practical advantage in industrial-scale applications is not large. For now, the main contribution of pure listwise methods is therefore academic.

Summary #

Today I talked to you about the listwise approach in learning to rank. You can see that there are many different ideas in listwise learning to rank, and it was a very active research field between 2000 and 2010, accumulating a large number of achievements.

Let’s review the key points together: First, a brief introduction to the history of listwise learning to rank. Second, a detailed introduction to the three main approaches of listwise learning to rank and the main details and methods within each approach.

Finally, I will leave you with a question to ponder: does the listwise approach completely solve the ranking problem?

References #

  1. Jun Xu and Hang Li. AdaRank: a boosting algorithm for information retrieval. Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval, 391-398, 2007.
  2. Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to rank: from pairwise approach to listwise approach. ICML, 129-136, 2007.
  3. Q. Wu, C.J.C. Burges, K. Svore, and J. Gao. Adapting boosting for information retrieval measures. Journal of Information Retrieval, 2010.
  4. C.J.C. Burges, R. Ragno and Q.V. Le. Learning to rank with non-smooth cost functions. Advances in Neural Information Processing Systems, 2006.
  5. C.J.C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton and G. Hullender. Learning to rank using gradient descent. Proceedings of the twenty second international conference on machine learning, 2005.
  6. F. Xia, T.-Y. Liu, J. Wang, W. Zhang, and H. Li. Listwise approach to learning to rank — Theorem and algorithm. ICML, 1192–1199, 2008.
  7. S. Chakrabarti, R. Khanna, U. Sawant, and C. Bhattacharyya. Structured learning for non-smooth ranking losses. SIGKDD, 88–96, 2008.
  8. T. Qin, T.-Y. Liu, and H. Li. A general approximation framework for direct optimization of information retrieval measures. Technical Report, Microsoft Research, MSR-TR-2008-164, 2008.
  9. M. Taylor, J. Guiver, S. Robertson, and T. Minka. SoftRank: Optimising non-smooth rank metrics. WSDM, 77–86, 2008.
  10. J.-Y. Yeh, J.-Y. Lin, et al. Learning to rank for information retrieval using genetic programming. SIGIR 2007 Workshop on Learning to Rank for Information Retrieval, 2007.
  11. Y. Yue, T. Finley, F. Radlinski, and T. Joachims. A support vector method for optimizing average precision. SIGIR, 271–278, 2007.