
# 032 Classic Search Core Algorithm BM25 and Its Variants with Full-Year Directory

On Monday, we talked about the TF-IDF algorithm and its four variants. In the field of information retrieval and text mining, the BM25 algorithm has a stronger theoretical foundation than TF-IDF, and it is an important baseline algorithm in engineering practice. BM25 grew out of research in the 1970s and 1980s, so it has been around for several decades. Even so, it still performs excellently in many information retrieval tasks and remains a preferred algorithm for many engineers.

Today, I will talk about the history of the BM25 algorithm, the core concepts of the algorithm itself, and some important variants of BM25, helping you quickly grasp this powerful tool for information retrieval and text mining.

## The History of BM25

BM25, sometimes referred to as Okapi BM25, is a ranking algorithm developed by a group of British computer scientists in the field of information retrieval. “BM” in this context stands for “Best Match”.

There are two famous British computer scientists behind BM25. The first is Stephen Robertson. Stephen graduated from the mathematics department at the University of Cambridge, then obtained a master’s degree from City University London, and later earned his doctorate from University College London. From 1978 to 1998, Stephen served as a lecturer at City University London, and from 1998 to 2013 he worked at Microsoft Research Cambridge. As we mentioned before, the Association for Computing Machinery (ACM) awards the Gerard Salton Award every three years to recognize researchers who have made outstanding contributions to information retrieval technology. In 2000, this award was presented to Stephen in recognition of his theoretical contributions to information retrieval. BM25 can be considered the most important achievement of Stephen’s career.

The other important computer scientist is Karen Spärck Jones, whom we mentioned in the article on TF-IDF. Karen also obtained her doctorate from the University of Cambridge and dedicated her career to researching information retrieval technology. Karen’s major contribution was the discovery of IDF and the overall formulation of TF-IDF. In 1988, Karen received the second Gerard Salton Award.

## Explanation of the BM25 Algorithm

The modern BM25 algorithm is used to calculate the “relevance” of a target document to a query keyword. In general, BM25 is a typical representative of “unsupervised learning” ranking algorithms.

As the name suggests, “unsupervised” means that the algorithm does not know whether a document is relevant to a query keyword. In other words, the algorithm cannot simply learn relevance from the data, but rather “guesses” the characteristics of relevant documents based on some empirical rules.

So, how is BM25 defined? Let’s start with the definition of traditional BM25. Generally speaking, the classic BM25 consists of three parts:

  1. Relevance between a word and the target document
  2. Relevance between a word and the query keyword
  3. Weighting of the word

The product of these three parts forms the score of a word. The score of the entire document relative to a query keyword is the sum of the scores of all words in the query keyword.

Let’s start with the first part, the relevance between a word and the target document. The basic idea of relevance here is still “term frequency” (TF), which is part of TF-IDF. Term frequency refers to the number of times a word appears in the target document. If the word appears more frequently, it is generally considered to be more relevant. Unlike TF-IDF, one of the biggest contributions of BM25 is to uncover the non-linear relationship between term frequency and relevance, which is initially counterintuitive but makes sense upon further thought.

Specifically, the score of each word for document relevance does not exceed a certain threshold. This threshold is dynamic and can be adjusted based on the document itself. This feature distinguishes the term-frequency calculation in BM25 from plain TF. In other words, term frequency itself needs to be “normalized” to ensure that the contribution of a word to the final score does not increase without bound as its term frequency grows.

How is the normalization of term frequency in BM25 done? It is achieved by dividing the term frequency of a word by the sum of its term frequency and a weight. This weight includes two hyper-parameters that can be manually adjusted according to the situation. This approach is common in unsupervised ranking algorithms. At the same time, this weight also includes two important pieces of information: the length of the current document and the average length of all documents in the dataset.

By mixing these factors, we obtain a new term-frequency formula that preserves a positive relationship between a word’s term frequency and its relevance to the document, while restricting the term frequency based on the document’s relative length, that is, the ratio of the document’s length to the average length of all documents, together with some hyper-parameters.
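The term-frequency normalization described above can be sketched as follows. The hyper-parameter names `k1` and `b` follow common convention rather than anything stated in this text, and the default values are illustrative:

```python
def bm25_tf(tf: float, doc_len: float, avg_doc_len: float,
            k1: float = 1.2, b: float = 0.75) -> float:
    """Saturating term frequency: divide tf by tf plus a weight that
    mixes the document's length with the average document length."""
    weight = k1 * (1 - b + b * doc_len / avg_doc_len)
    return tf * (k1 + 1) / (tf + weight)
```

The score grows with `tf` but never exceeds `k1 + 1`, which is exactly the dynamic-threshold behavior described above, and longer-than-average documents are penalized through the `b` term.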

With the formula for a word’s relevance to a document in place, the word’s relevance to the query keyword is computed similarly. First, we calculate the term frequency of the word within the query keyword; then a similar normalization is applied to this term frequency.

The only difference from the document normalization process is that document length is not utilized here. Of course, if length is needed for query keywords, it should be the length of the query keyword compared to the average length. However, according to the classic BM25 formula, this part does not utilize length information for re-normalization.
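The query-side counterpart can be sketched the same way, minus the length normalization; `k3` is a commonly used (but not universal) name for the query-side hyper-parameter:

```python
def bm25_query_tf(qtf: float, k3: float = 8.0) -> float:
    """Saturate the term frequency within the query itself;
    no length information is used, per the classic formula."""
    return qtf * (k3 + 1) / (k3 + qtf)
```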

Now, let’s talk about the details of the last part, the word weighting. There are usually two choices.

The first choice is to weight the word directly with a transformed version of IDF. Generally, IDF takes the “document frequency,” that is, the number of documents that contain a certain word, and, as we recapped on Monday, transforms its reciprocal using a logarithmic function. If IDF is used here, the entire BM25 can be regarded as a kind of TF-IDF, except that the TF part is a more complex term-frequency function with two components, one based on the document and one based on the query keyword.
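One widely used IDF transform in BM25 implementations looks like the sketch below; the 0.5 smoothing terms are a common convention, not something mandated by the text:

```python
import math

def bm25_idf(num_docs: int, doc_freq: int) -> float:
    """Log of the (smoothed) reciprocal document frequency:
    rarer words receive larger weights."""
    return math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
```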

The second choice for word weighting is called “Robertson-Spärck-Jones” (RSJ) weighting. It was discovered by computer scientists Stephen Robertson and Karen Spärck Jones. As we just discussed, both are important authorities in the field of information retrieval. This weighting is actually a more sophisticated version of IDF. One key difference is that RSJ weighting requires supervised information, namely, whether a document is relevant to a query keyword, while IDF does not require this.

Comparing the two approaches: in many cases, it is more common to use IDF directly for word weighting. If supervised information is available, RSJ weighting is also a good choice.

Through this brief introduction, we can easily see that BM25 is actually an empirical formula. Each component in it has been gradually discovered through iterations by many researchers. Many studies have theoretically modeled BM25, starting from the “probabilistic relevance model,” and derived that BM25 is actually an approximation of a certain class of probabilistic relevance models. I won’t go into detail about this part here. What you need to remember is that although BM25 is an empirical formula, it often exhibits astonishingly good performance in practical use. Therefore, it is necessary to have an understanding of this type of document retrieval algorithm.
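Putting the three parts together, a minimal end-to-end sketch of classic BM25 might look like this; the hyper-parameter names `k1`, `b`, and `k3`, their defaults, and the choice of the IDF weighting option are all assumptions of this sketch:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len,
               k1=1.2, b=0.75, k3=8.0):
    """Sum, over the words of the query, of
    (word weight) x (normalized document TF) x (normalized query TF)."""
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for word, qtf in Counter(query_terms).items():
        tf = doc_tf[word]
        if tf == 0:
            continue  # words absent from the document contribute nothing
        df = doc_freqs.get(word, 0)
        weight = k1 * (1 - b + b * doc_len / avg_doc_len)
        tf_part = tf * (k1 + 1) / (tf + weight)
        qtf_part = qtf * (k3 + 1) / (k3 + qtf)
        idf = math.log((num_docs - df + 0.5) / (df + 0.5))
        score += idf * tf_part * qtf_part
    return score
```

A document containing the query word scores positively, while one without it scores zero, which is the ranking behavior described above.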

## Variations of the BM25 Algorithm

Due to the nature of BM25, which is both an empirical formula and an approximation of a certain theoretical model, various variations of BM25 have emerged. Here, I will only introduce some representative extensions.

One important extension is BM25F, which calculates relevance on multiple “fields” of a document, based on the original BM25 formula. The concept of “fields” here refers to different aspects of a document. For example, many documents consist of a title, an abstract, and a body, and each of these components can be considered a different “field”. The core of BM25F lies in how to combine the different fields so that a document’s relevance is unified into a single score.

Specifically, BM25F extends BM25 in a straightforward manner. It treats each field as a separate “sub-document”: we first compute and normalize the term frequencies of a word within each field, then combine the per-field values as a weighted average, and finally multiply by the word’s weight (the IDF or RSJ value mentioned earlier).
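Following that description, a per-word BM25F score can be sketched like this; the field names, field weights, and hyper-parameter values are all illustrative assumptions:

```python
def bm25f_word_score(field_tfs, field_lens, avg_field_lens, field_weights,
                     word_weight, k1=1.2, b=0.75):
    """Normalize the word's TF inside each field as if it were a
    sub-document, take the weighted average across fields, then
    multiply by the word's weight (an IDF or RSJ value)."""
    total = sum(field_weights.values())
    combined = 0.0
    for field, tf in field_tfs.items():
        weight = k1 * (1 - b + b * field_lens[field] / avg_field_lens[field])
        combined += field_weights[field] * (tf * (k1 + 1) / (tf + weight))
    return word_weight * combined / total
```

With a heavily weighted title field, a match in the title contributes more to the score than the same match in the body.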

Another important extension combines BM25 with other, non-textual document information. Before the emergence of the “learning to rank” approach, a common practice was to combine various signals directly in a linear weighted form. For example, a popular approach in the early 21st century was to determine the relevance of web pages by linearly combining BM25 and PageRank. Here, BM25 captures information related to a specific query keyword, while PageRank captures the query-independent, overall weight of a webpage.
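A minimal sketch of such a linear combination; the mixing weight `alpha` and the page scores below are purely illustrative numbers, not values from the text:

```python
def blended_score(bm25: float, pagerank: float, alpha: float = 0.8) -> float:
    """Blend the query-dependent BM25 signal with the
    query-independent PageRank signal."""
    return alpha * bm25 + (1 - alpha) * pagerank

# Rank three hypothetical pages given as (id, bm25, pagerank) tuples.
pages = [("a", 2.1, 0.3), ("b", 1.8, 0.9), ("c", 2.4, 0.1)]
ranked = sorted(pages, key=lambda p: blended_score(p[1], p[2]), reverse=True)
```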

## Summary

Today I have explained to you the most fundamental technique in the field of document retrieval or search, which is BM25. We can see that BM25 consists of three core concepts, including the relevance of a word in a document, the relevance of a word in a query keyword, and the weight of a word. BM25 is an empirically accumulated formula with deep theoretical support, and it is a powerful unsupervised learning method for text ranking.

Let’s review the key points together: first, a brief introduction to the history of BM25. Second, a detailed explanation of the three main components of the BM25 algorithm. Third, a brief introduction to some variations of BM25.

Finally, I leave you with a question to ponder: although BM25 is an unsupervised ranking method and we mentioned that it has some hyperparameters, can we learn the optimal values of these hyperparameters through machine learning?

