034 Machine Learning Ranking Algorithms: Pointwise Learning to Rank #

In this column, we have already covered the most classical information retrieval techniques. These techniques provided the basic algorithmic support for search engines before 2000. Whether it is TF-IDF, BM25, or language models, these methods and their variants continue to play a role in many fields (not limited to text).

However, as ideas from machine learning gradually spread into fields such as information retrieval, a natural question arose: how can machine learning be used to improve the performance of information retrieval? This question guided research in the field from 2000 to 2010 and gave rise to a variety of machine learning-based ranking algorithms, which in turn drove the maturation of search engine technology.

Today, I will start with the simplest and most practical type of machine learning ranking algorithm: Pointwise Learning to Rank. This family of methods is widely applied in industry and performs robustly in practice. At the same time, understanding this type of model lays the foundation for learning more complex ranking algorithms later on.

History of Pointwise Learning to Rank #

As early as 1992, the German scholar Norbert Fuhr attempted, in a paper, to use machine learning to estimate the parameters of a retrieval system. Three years earlier, he had also published a paper using "polynomial regression" to achieve a similar goal. For his long-term contributions to the field of information retrieval, Norbert was awarded the Gerard Salton Award by the Association for Computing Machinery (ACM) in 2012.

Also in 1992, a group of scholars from the University of California, Berkeley published a paper at SIGIR that used a "logistic regression" classifier to learn a ranking function. This can be considered one of the earliest attempts to learn ranking algorithms with machine learning. However, because both machine learning and information retrieval research were still at an early stage, these early results were not satisfactory.

After 2000, support vector machines gradually gained popularity in both industry and academia, and training ranking algorithms with machine learning returned to researchers' attention. Search engines became an important battleground during the first dot-com bubble, and various search engine companies began using machine learning to improve the accuracy of search results. This trend opened up a decade of research and development in machine-learned ranking algorithms.

Introduction to Pointwise Learning to Rank #

To understand Pointwise Learning to Rank, we first need to understand some basic concepts. These concepts help us transform a ranking problem into a machine learning problem, specifically a supervised learning problem.

Traditional search ranking algorithms such as TF-IDF, BM25, and language models are examples of unsupervised learning ranking algorithms. The algorithms themselves do not know in advance which documents are “relevant” to which keywords. In essence, these algorithms are a process of “guessing” relevance. Therefore, traditional information retrieval has developed a series of theories to score each “query-document pair” in the hope that such scores reflect relevance.

However, from the perspective of modern machine learning, such ranking algorithms are clearly not optimal, especially when relevance information is available. It is possible to use this relevance information directly to improve the accuracy of the ranking algorithm.

To train a ranking algorithm with supervised learning, let us first consider what kind of dataset is needed. The objects we want to model are pairings of query keywords with documents. In other words, each training sample must at least include information about a query keyword and a specific document. The relevance between the two then defines the label of the training sample.

In a highly simplified case, if we define the label as whether a document is relevant to a specific keyword, i.e., binary labels, the problem of training a ranking algorithm is transformed into a binary classification problem. In this case, any existing binary classifier can be used directly for training the ranking algorithm without modification. For example, classical “logistic regression” classifiers or support vector machines are good choices.
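As a concrete illustration, here is a minimal sketch of this idea in Python, assuming each query-document pair has already been converted into a numeric feature vector (the feature choices and values below are hypothetical placeholders):

```python
# Pointwise learning to rank as binary classification: a minimal sketch.
# Each row of X is the feature vector of one (query, document) pair,
# e.g. [bm25_score, tfidf_cosine_similarity, document_length].
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.array([
    [12.3, 0.81, 310],
    [ 2.1, 0.10,  95],
    [ 9.7, 0.64, 210],
    [ 0.4, 0.02,  48],
])
y_train = np.array([1, 0, 1, 0])  # 1 = relevant, 0 = not relevant

# Any off-the-shelf binary classifier works; logistic regression is a classic choice.
clf = LogisticRegression()
clf.fit(X_train, y_train)

# At query time, score each candidate document for the query independently
# and sort by the predicted probability of relevance (higher = ranked earlier).
X_candidates = np.array([[8.5, 0.55, 180],
                         [1.2, 0.05,  60]])
scores = clf.predict_proba(X_candidates)[:, 1]
ranking = np.argsort(-scores)
print(ranking, scores[ranking])
```

Note that the classifier scores each query-document pair in isolation; the final ranking is obtained simply by sorting the candidates by these scores.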

We call this method "Pointwise Learning to Rank" because each training sample is just one pairing of a specific query keyword with a specific document, and whether the two are relevant does not depend on any other document or keyword. In other words, the learning algorithm treats the relevance of a document to a keyword in isolation, rather than considering documents jointly. Obviously, Pointwise Learning to Rank is a significant simplification of reality, but it is a good starting point for training ranking algorithms.

Once we know how to build a training set, let us look at the test set and focus on how to evaluate the effectiveness of the ranking algorithm. The data in the test set is similar to the training set: it consists of "query keyword and document pair" samples, and the labels are again the relevance information for each pairing. As mentioned earlier, if the relevance information is binary, evaluating the ranking algorithm becomes evaluating a binary classification problem.

For a binary classification problem, there are two main evaluation metrics. The first is precision, the proportion of truly relevant documents among all the documents that the classifier has judged to be relevant. The second is recall, the proportion of truly relevant documents that have been retrieved.

Since this is a ranking problem, there is a difference from an ordinary binary classification problem, namely the Top-K problem. What does this mean? For a specific query keyword, we do not evaluate all documents, but only the top K documents after ranking. In this setting, both precision and recall are defined with respect to this value of K. Without this restriction to the top K, precision and recall are computed over all documents and the evaluation reverts to the most basic evaluation of a classification problem.

In practical applications, the value of K is often very small, such as 3, 5, 10, or 25, while the number of documents that could be scored is huge; in theory, any document could be relevant to any query keyword. Therefore, restricting evaluation to the top K is a crucial simplification when evaluating ranking algorithms.
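As a concrete illustration, here is a minimal sketch of precision@K and recall@K for a single query under binary relevance; the document IDs below are hypothetical:

```python
# Precision@K and recall@K for one query, assuming binary relevance.
# `ranked_docs` is the system's ranked list; `relevant_docs` is the set of
# documents judged relevant for this query (hypothetical IDs).
def precision_at_k(ranked_docs, relevant_docs, k):
    top_k = ranked_docs[:k]
    hits = sum(1 for d in top_k if d in relevant_docs)
    return hits / k

def recall_at_k(ranked_docs, relevant_docs, k):
    top_k = ranked_docs[:k]
    hits = sum(1 for d in top_k if d in relevant_docs)
    return hits / len(relevant_docs) if relevant_docs else 0.0

ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d3", "d8"}
print(precision_at_k(ranked, relevant, k=3))  # 2/3: d3 and d1 in the top 3 are relevant
print(recall_at_k(ranked, relevant, k=3))     # 2/3: 2 of the 3 relevant docs appear in the top 3
```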

In addition to precision and recall, the information retrieval community also likes to use the F1 measure to evaluate ranking algorithms. Simply put, the F1 measure is the "harmonic mean" of precision and recall: it combines the two into a single value that balances them. It should be noted that in many practical situations, precision and recall behave like a pair of conflicting measures that are hard to maximize at the same time. The F1 measure therefore makes it more convenient to trade off these two potentially conflicting metrics.
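Concretely, if P denotes precision and R denotes recall, the F1 measure is F1 = 2PR / (P + R). Because the harmonic mean is dominated by the smaller of the two values, a high F1 score requires both precision and recall to be reasonably high.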

The evaluation I described above mainly concerns binary relevance information. However, the relevance labels can also be defined as richer, graded relevance information. For example, instead of only recording whether a document is relevant to a query keyword, we can grade the degree of relevance on five levels: "most relevant," "relevant," "uncertain," "irrelevant," and "most irrelevant." Under this definition, at least two additional methods for evaluating ranking algorithms have been derived.

First, we can use the evaluation method of multi-class classification, treating the five relevance levels as five different labels and examining the accuracy of the classifier. Of course, this evaluation method is problematic for ranking, because in a real dataset the amounts of data corresponding to the five relevance levels are very different.

Generally speaking, the number of documents labeled "most relevant" or "relevant" is relatively small, whether for a single query keyword or across the whole dataset, while there are large numbers of documents labeled "irrelevant" or "most irrelevant." Therefore, relying solely on classification accuracy may lead to misleading conclusions.

For example, a ranking algorithm may, as a classifier, correctly classify a large number of documents labeled "irrelevant" and "most irrelevant" yet miss all of the "most relevant" documents. Even though such an algorithm may look "acceptable" in terms of overall classification accuracy, it has no practical value. Therefore, evaluating ranking algorithms purely from the perspective of multi-class classification is incomplete.

To address this situation, researchers designed a ranking evaluation method based on the five-level definition: NDCG (Normalized Discounted Cumulative Gain). I will not go into the details of NDCG here; you only need to know that NDCG is not a measure of classification accuracy but a measure of ranking quality.

The assumption of this NDCG measure is that in a ranking result, relevant information should be ranked higher than irrelevant information, with the most relevant information ranked at the top and the most irrelevant information at the bottom. Any ranking result that deviates from this assumption will be penalized.
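For reference only (a minimal sketch, not a full treatment of NDCG), here is the standard DCG/NDCG@K computation, assuming the five relevance levels are encoded as integers from 4 ("most relevant") down to 0 ("most irrelevant"):

```python
import math

# DCG@K and NDCG@K, assuming graded relevance labels encoded as integers
# (e.g. 4 = "most relevant" ... 0 = "most irrelevant").
# `relevances` lists the label of each document in the system's ranked order.
def dcg_at_k(relevances, k):
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)  # best possible ordering
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# The ranking below buries the single "most relevant" (4) document at
# position 4, so it is penalized and NDCG@5 falls below 1.
print(ndcg_at_k([3, 0, 2, 4, 1], k=5))  # roughly 0.72
```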

It should be noted that the NDCG we are discussing here is only a ranking evaluation measure for the test set. Our ranking algorithm can still train a multi-class classifier based on the five-level relevance on the training set. It’s just that for the test set, a different method is used to evaluate the results of our multi-class classifier, instead of using the traditional classification accuracy. In a sense, this NDCG serves as a “model selection” method.

Summary #

Today, I discussed the topic of pointwise learning to rank with you. It can be seen that the entire problem setting is fundamentally different from traditional text search techniques.

Let's review the key points together: first, pointwise learning to rank started in the 1990s, but more effective research did not emerge until after 2000; second, I gave a detailed introduction to the problem setting of pointwise learning to rank, including training sets, test sets, and evaluation methods.

Finally, I’d like to leave you with a question to ponder: Is there a way to combine the traditional ranking algorithms we previously discussed, such as TF-IDF, BM25, and language models, with pointwise learning to rank?