035 Machine Learning Ranking Algorithms Pairwise Method for Rank Learning #

In Monday’s article, I shared the most basic single-point (pointwise) learning to rank method. This approach is simple and practical, as it transforms a classical information retrieval problem into a standard machine learning step. To briefly recap, we introduced evaluating a ranking algorithm on the test set using precision and recall at a position K, as well as NDCG (Normalized Discounted Cumulative Gain).

As you can see, there is still a significant gap between what the single-point learning to rank algorithm models and the result we ultimately want. This gap is determined not by the quality of any particular algorithm, but by the objective the algorithm optimizes: it learns whether a single data point is relevant, while the real goal is structured, namely the NDCG-optimal ordering of a whole group of results. This structural mismatch has spurred researchers to keep looking for other ways to optimize ranking algorithms.

Today, I will talk about the “pairwise learning to rank” method that grew out of the single-point approach. Instead of trying to learn the relevance of each sample in isolation, the basic idea of pairwise learning is to compare samples two at a time and learn a ranking from these comparisons, bringing us one step closer to the ultimate goal.

History of Pairwise Ranking Learning #

Once researchers realized that machine learning could be applied to ranking, starting from the relative relationships between documents, known as pairwise ranking, became a very hot research direction. The field of machine learning for ranking has been active for more than 10 years, during which many pairwise ranking algorithms have been proposed. Below, I will discuss several very popular ones.

Around the year 2000, researchers began using Support Vector Machines (SVM) to train ranking algorithms. Thorsten Joachims of Cornell University built RankSVM on top of feature differences, and it became a classic algorithm for pairwise ranking learning. As mentioned before, Joachims received the KDD Test of Time Award this year.

In 2007, researchers including Zhaohui Zheng, who was working at Yahoo at the time, began using GBDT (Gradient Boosting Decision Tree) and other tree models to model pairwise relationships between documents. Zheng later became a co-founder of Yidian Zixun.

In 2005, researchers from Microsoft, including Chris Burges, started using neural networks to train RankNet, a ranking model for pairwise relationships between documents. This was an early attempt to use deep learning models for industrial applications. This paper received the 10-year Classic Paper Award at the 2015 International Conference on Machine Learning (ICML).

Detailed Explanation of Pairwise Ranking Learning #

Before introducing the central idea of pairwise ranking learning, let’s first review how the test set is evaluated. Generally speaking, the evaluation principle is the same as for the single-point method: we examine whether the ranking of a group of documents for a given query keyword on the test set is optimal.

For example, for a certain query keyword, we evaluate the top K documents produced by the ranking. First we look at precision: of the documents the algorithm judges relevant, how many are truly relevant. Then we look at recall: of the truly relevant documents, how many are retrieved. There is also the F1 value, the harmonic mean of precision and recall, an important indicator that balances the two. It should be noted that precision, recall, and F1 are all defined on binary relevance labels.
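The three metrics above can be sketched in a few lines. This is an illustrative helper (the function name and toy document ids are my own, not from the article), assuming binary relevance labels given as a set of relevant ids:

```python
def precision_recall_f1_at_k(ranked, relevant, k):
    """Compute precision, recall, and F1 for the top-k of a ranked list.

    ranked   -- document ids in the order produced by the ranker
    relevant -- set of document ids with a positive (binary) relevance label
    k        -- cutoff position
    """
    top_k = ranked[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

# Toy example: 5 ranked documents, 3 of which are truly relevant.
p, r, f1 = precision_recall_f1_at_k(
    ranked=["d1", "d2", "d3", "d4", "d5"],
    relevant={"d1", "d3", "d5"},
    k=3,
)
print(p, r, f1)  # top-3 contains d1 and d3: precision 2/3, recall 2/3
```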

If we instead have five-level relevance labels, commonly “most relevant”, “relevant”, “uncertain”, “not relevant”, and “least relevant”, then we need an evaluation metric like NDCG. The assumption behind NDCG is that in a ranking result, relevant information should appear above irrelevant information, with the most relevant information at the top and the least relevant at the bottom. Any ranking result that deviates from this assumption is penalized.
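As a concrete sketch, here is one common formulation of NDCG (the exponential-gain variant; the function names are mine), where each document carries an integer grade, e.g. 0 to 4 for the five relevance levels:

```python
import math

def dcg(gains):
    """Discounted cumulative gain: each document's gain grows with its
    relevance grade and is discounted logarithmically by its position."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains_in_ranked_order, k=None):
    """NDCG at position k: DCG of the actual ranking divided by the DCG
    of the ideal (descending-by-grade) ranking, so a perfect ranking
    scores 1.0 and any deviation is penalized."""
    actual = gains_in_ranked_order[:k] if k else gains_in_ranked_order
    ideal = sorted(gains_in_ranked_order, reverse=True)
    ideal = ideal[:k] if k else ideal
    ideal_dcg = dcg(ideal)
    return dcg(actual) / ideal_dcg if ideal_dcg > 0 else 0.0

print(ndcg([4, 3, 2, 1, 0]))  # grades already in ideal order: 1.0
print(ndcg([0, 1, 2, 3, 4]))  # fully reversed order scores below 1.0
```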

After understanding the test set, let’s look back at the setting of the training set. At the beginning of today’s article, I mentioned the “unclear objective” problem of ranking learning with single-point modeling. Whether viewed through NDCG or through top-K precision and recall, what matters most for a query keyword is not whether the relevance of any single document is estimated accurately, but whether the “relative relationships” within a group of documents are estimated correctly. As long as the relative relationships are correct, the final result is accurate from the perspective of ranking. Grasping this viewpoint is crucial for understanding how ranking differs from ordinary classification.

So, how do we proceed further from single-point modeling?

Obviously, a key relationship in ranking is the comparison between every two documents, commonly referred to as a pairwise relationship. Imagine that a perfect ranking exists for a certain query keyword. From this perfect ranking we can derive the pairwise relative relationships among its documents, and by learning these relative relationships we can then rank the documents for other query keywords.

Note that in this framework, the training set samples have changed from “keyword-document pairs” to “keyword-document-document pairs”. In other words, each data sample represents a comparison relationship. For example, if there are three documents: A, B, and C, and the perfect ranking is “B > C > A”, we want to reconstruct “B > C > A” by learning the pairwise relationships “B > C”, “B > A”, and “C > A”.
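The conversion from graded judgments to “keyword-document-document” samples can be sketched as follows (an illustrative helper of my own; each emitted tuple means the first document should rank above the second):

```python
from itertools import combinations

def make_pairwise_samples(docs):
    """Turn graded judgments for one query into pairwise training samples.

    docs -- list of (doc_id, grade) pairs for the same query keyword.
    Every ordered pair of documents with differing grades yields one
    comparison sample; ties yield no sample.
    """
    pairs = []
    for (a, ga), (b, gb) in combinations(docs, 2):
        if ga > gb:
            pairs.append((a, b))   # a should rank above b
        elif gb > ga:
            pairs.append((b, a))   # b should rank above a
    return pairs

# The perfect ranking B > C > A expressed as graded labels:
print(make_pairwise_samples([("A", 0), ("B", 2), ("C", 1)]))
# → [('B', 'A'), ('C', 'A'), ('B', 'C')]
```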

There are several crucial assumptions in this approach.

Firstly, we assume that a perfect ranking relationship can be obtained for a given keyword. In practice, this relationship can be derived from five-level relevance labels or from other information such as click-through rates. However, this perfect ranking does not always exist. For example, on an e-commerce website, for the query keyword “Harry Potter”, some users may want to buy the books, while others may want T-shirts printed with Harry Potter patterns. Clearly, no single perfect ranking exists in this case.

Secondly, we hope that learning the pairwise relationships between documents lets us “reconstruct” this perfect ranking. This is not guaranteed either. In the earlier example, we hope to learn the pairwise relationships “B > C”, “B > A”, and “C > A” in order to reconstruct the perfect ranking “B > C > A”. In practice, however, these three pairwise relationships are predicted independently. In particular, at prediction time, even if the model correctly judges “B > C” and “C > A”, it will not necessarily also produce “B > A”. The fact that the pairwise relationships cannot guarantee the perfect ranking reveals the inconsistency of this method: we cannot guarantee that the optimal ranking will be recovered.
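One way to see this concretely: independently predicted pairs must somehow be aggregated back into a total order. A simple win-counting heuristic (my own illustration, not part of the original pairwise formulation) reconstructs the ranking when the predictions are consistent, but a cyclic prediction set has no correct answer at all:

```python
from collections import Counter

def rank_by_wins(doc_ids, predicted_pairs):
    """Aggregate independently predicted pairwise preferences into a
    ranking by counting each document's 'wins' (a Copeland-style
    heuristic, used here only to illustrate the aggregation step)."""
    wins = Counter({d: 0 for d in doc_ids})
    for winner, _loser in predicted_pairs:
        wins[winner] += 1
    return sorted(doc_ids, key=lambda d: wins[d], reverse=True)

# Consistent predictions reconstruct the perfect ranking B > C > A:
print(rank_by_wins(["A", "B", "C"], [("B", "C"), ("B", "A"), ("C", "A")]))
# → ['B', 'C', 'A']

# But a cyclic prediction set -- B > C, C > A, and also A > B -- gives
# every document exactly one win, so no aggregation can recover a
# "correct" total order: this is the inconsistency in action.
print(rank_by_wins(["A", "B", "C"], [("B", "C"), ("C", "A"), ("A", "B")]))
```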

Thirdly, we need to construct samples that describe these pairwise comparison relationships. One relatively simple construction models the pairwise relationship between two documents through the difference of their feature vectors: the feature difference between two samples becomes a new feature vector, and the model learns a mapping from feature differences to relevance differences.

The RankSVM I mentioned earlier takes exactly this approach. RankSVM is essentially an SVM (support vector machine) whose modeling object has changed from a single document to a pair of documents. More complex models, such as GBRank, directly model the relationship between documents with the tree ensemble model GBDT, hoping to express the relevance difference between documents through the difference of function values.
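To make the feature-difference idea concrete, here is a minimal sketch of training a linear scorer on document pairs. Note the simplification: a perceptron-style update stands in for RankSVM’s hinge-loss optimization, and the toy features are invented, so this shows the sample construction rather than the actual RankSVM solver:

```python
def train_pairwise_linear(pairs, dim, epochs=50, lr=0.1):
    """Train a linear scoring vector w on feature differences.

    pairs -- list of (x_better, x_worse) feature vectors, meaning the
             first document should rank above the second.
    Each pair becomes one sample whose feature vector is the difference
    x_better - x_worse with implicit target +1; a perceptron-style update
    stands in here for RankSVM's hinge-loss optimization.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for xb, xw in pairs:
            diff = [a - b for a, b in zip(xb, xw)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            if margin <= 0:  # pair mis-ordered: nudge w toward the difference
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w

def score(w, x):
    """Score a single document's feature vector under the learned model."""
    return sum(wi * xi for wi, xi in zip(w, x))

# Toy features where a larger first coordinate means "more relevant".
pairs = [([3.0, 1.0], [1.0, 1.0]), ([2.0, 0.5], [0.5, 2.0])]
w = train_pairwise_linear(pairs, dim=2)
print(score(w, [3.0, 1.0]) > score(w, [1.0, 1.0]))  # True
```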

It should be noted that pairwise ranking learning can run into computational complexity issues, especially at test-set prediction time: in principle, a prediction must be made for every pair of documents. In practice, if samples are constructed from linear feature differences, prediction can fall back to scoring each document individually, and testing regresses to linear complexity. Other model classes are not so fortunate, and many speedup and approximation algorithms have been proposed to make pairwise comparison ranking feasible in practical applications.
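The linear case is easy to see: since w·(x_i − x_j) > 0 exactly when w·x_i > w·x_j, we never need to evaluate pairs at test time. A sketch (function name and toy numbers are mine):

```python
def rank_linear(w, docs):
    """Rank documents under a linear pairwise model without pairwise
    evaluation: because w.(x_i - x_j) > 0 iff w.x_i > w.x_j, scoring
    each document once and sorting (O(n log n)) yields the same order
    as comparing all O(n^2) pairs.

    docs -- list of (doc_id, feature_vector) pairs.
    """
    def score(x):
        return sum(wi * xi for wi, xi in zip(w, x))
    return [doc_id for doc_id, x in
            sorted(docs, key=lambda d: score(d[1]), reverse=True)]

print(rank_linear([1.0, -0.5],
                  [("A", [0.2, 0.4]), ("B", [0.9, 0.1]), ("C", [0.5, 0.2])]))
# → ['B', 'C', 'A']
```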

Summary #

Today I talked to you about pairwise learning in the field of document retrieval based on machine learning. As you can see, the whole problem setting is fundamentally different from traditional text search techniques, but it takes a big step forward in modeling the relationship between documents compared to pointwise methods.

Let’s review the key points together. First, in the active research on machine learning for ranking, many pairwise learning algorithms have been proposed, such as RankSVM, GBRank, and RankNet. Second, the testing principles of pairwise learning are consistent with those of pointwise methods: we can examine precision, recall, and F1, or use five-level relevance labels with NDCG. Third, to address the “unclear objective” problem of pointwise learning, pairwise learning adopts a different training set setting, and on this basis I introduced three key assumptions.

Finally, I’ll leave you with a question to ponder: Is there any way to combine pointwise and pairwise methods?

References #

  1. Zhaohui Zheng, Keke Chen, Gordon Sun, and Hongyuan Zha. A regression framework for learning ranking functions using relative relevance judgments. Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval, 287-294, 2007.
  2. Thorsten Joachims. Optimizing search engines using clickthrough data. Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining, 133-142, 2002.