033 Classic Search Core Algorithms: Language Models and Their Variants #

In the field of information retrieval and text mining, we have previously discussed the TF-IDF algorithm and the BM25 algorithm. TF-IDF is often the first choice for many information retrieval tasks because of its simplicity and practicality, while BM25, with its well-tested empirical formula, has become a cornerstone of many industrial systems.

However, information retrieval researchers have long sought a retrieval model that is easy to interpret, readily extensible, and clearly effective in practice. The breakthrough did not come until the late 1990s and early 2000s, when a new class of models called “language models” emerged and matured. Over the following decade and more, many variants of the language model appeared one after another, and they have become an important research direction in information retrieval and search.

Today, I will walk through the history of language models, their algorithmic details, and their important variants, to help beginners grasp this model quickly.

History of Language Models #

The application of language models in information retrieval began at the 1998 International ACM SIGIR Conference on Research and Development in Information Retrieval, where Jay M. Ponte and W. Bruce Croft, information retrieval researchers at UMass Amherst, published the first paper applying language models to retrieval, marking the beginning of a new era.

Bruce Croft is a renowned academic authority in information retrieval. He obtained his PhD from the University of Cambridge in the United Kingdom and has been teaching at UMass Amherst since then. In 2003, he was awarded the Gerard Salton Award by the Association for Computing Machinery (ACM) in recognition of his outstanding contributions to the field of information retrieval. Additionally, Bruce is an ACM Fellow.

Since that paper was published, the Chinese scholar Chengxiang Zhai has made indisputable contributions to language models. His doctoral dissertation systematically examines smoothing techniques for language models and the deeper theoretical implications of various language models.

Chengxiang Zhai studied in the Computer Science Department at Nanjing University in China, where he received his bachelor’s, master’s, and doctoral degrees in 1984, 1987, and 1990, respectively. In 2002, he earned a second doctorate from the Language Technologies Institute of the School of Computer Science at Carnegie Mellon University in the United States.

Chengxiang Zhai received the 2004 NSF CAREER Award (from the National Science Foundation’s Faculty Early Career Development Program) and the 2004 ACM SIGIR Best Paper Award. He was also given the prestigious Presidential Early Career Award for Scientists and Engineers (PECASE) in 2004.

Explanation of Language Models #

The core idea of a language model is to use a probabilistic model to describe the relationship between query keywords and target documents. There are many types of language models, and the simplest and most basic one is called the Query Likelihood Retrieval Model. Now let’s talk about some details of this model.

First, let me describe what a language model is. Simply put, a language model is a probability distribution over a vocabulary. For example, if the vocabulary consists of ten thousand English words, a language model is a discrete probability distribution defined over those ten thousand words. You can think of it as a die with ten thousand faces: each roll produces one of ten thousand possible words.

With the concept of a language model in place, the next step in the “Query Likelihood Retrieval Model” is to assume that the query keywords are “sampled” from a language model. What does this mean? Just as we sample from any probability distribution, the model treats the query keywords as the outcome of a random process: they are drawn from the probability distribution defined by the language model. This view is not only the assumption behind this simple model but also the core assumption of many language models.

Assume for the moment that the parameters of this language model, that is, the parameters of this probability distribution, are known. Scoring a document for a query then amounts to computing a joint probability: the probability that this particular set of keywords appears together, given the distribution. In practice, because this joint probability can be extremely small, a logarithmic transformation is usually applied, turning the product of probabilities into a sum of log probabilities.
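
To make this concrete, the scoring rule can be written in the notation commonly used for the query likelihood model (a sketch in my own notation rather than anything stated above: q is the query, d the document, and c(w, q) the number of times word w appears in the query):

```latex
% Query likelihood score of document d for query q, assuming the
% query words are drawn independently from the document's language model:
\operatorname{score}(q, d)
  = \log P(q \mid d)
  = \log \prod_{w \in q} P(w \mid d)^{c(w, q)}
  = \sum_{w \in q} c(w, q) \, \log P(w \mid d)
```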

In reality, however, we do not know the parameters of this language model in advance; they must be estimated from the data.

To determine the parameters of this language model, we first need to decide on its form. As I mentioned earlier, a language model is essentially a discrete probability distribution over a vocabulary, and there are several classic choices. The first is the “categorical distribution”, which is essentially a multinomial distribution with the permutation-and-combination factor removed. This is also the most common implementation of language models.

Under the categorical assumption, each word is treated as an independent draw from the categorical distribution. A categorical distribution defined over ten thousand words therefore has ten thousand parameters, each representing the probability, or likelihood, of the corresponding word appearing. These parameters are, of course, unknown.
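
Written out explicitly (again in my own notation, not the text’s: theta_w is the parameter for word w, V the vocabulary, and c(w, d) the count of word w in document d), the likelihood of a document under this categorical model is:

```latex
% Likelihood of document d under a categorical (unigram) model
% with one parameter \theta_w per vocabulary word:
P(d \mid \theta) = \prod_{w \in V} \theta_w^{\,c(w, d)},
\qquad \theta_w \ge 0, \quad \sum_{w \in V} \theta_w = 1
```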

Besides the categorical or multinomial distribution, other discrete probability distributions, such as the Bernoulli distribution and the Poisson distribution, have also been proposed for modeling language. I will not go into these alternative assumptions today; in practical applications, language models built on other distributional assumptions remain largely at the pure research stage.

Let’s return to the language model based on categorical distribution we mentioned earlier. Since the parameters are unknown, the core of the problem becomes how to estimate these parameters, which falls within the scope of basic statistical parameter estimation.

Because the categorical distribution is a probability distribution, the most direct parameter estimation method given observed data (here, the actual documents and query keywords) is maximum likelihood estimation. I won’t go into the details of this algorithm here. Its core idea is to turn parameter estimation into an optimization problem and solve it by maximizing an objective function, the likelihood of the observed data. Under the categorical assumption, the maximum likelihood estimate has a closed analytical form.

In other words, given the data, we obtain a unique optimal parameter estimate. Moreover, the solution is very intuitive: the estimated probability of each word equals the number of times the word occurs in the target document divided by the total number of words in that document. That is, each word’s parameter is exactly its relative frequency in the document.
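
As a minimal sketch of this estimate in code (the function name and toy data are my own, not from the text), the maximum likelihood solution is just counting and normalizing:

```python
from collections import Counter

def mle_unigram_lm(document_tokens):
    """Maximum likelihood estimate of a unigram (categorical) language model:
    each word's probability is its count divided by the document length."""
    counts = Counter(document_tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Toy usage: any word absent from the document gets probability zero,
# which is exactly the problem smoothing addresses below.
doc = "the quick brown fox jumps over the lazy dog".split()
lm = mle_unigram_lm(doc)
print(lm["the"])         # 2/9, i.e. roughly 0.222
print(lm.get("cat", 0))  # 0 -- unseen word
```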

In this way, each document corresponds to its own categorical distribution: there are as many categorical distributions as there are documents, and the parameters of each are estimated entirely from its own document.

Maximum likelihood estimation has a significant problem: if a word does not appear in the training data, its parameter according to the above optimal solution will be zero.

What does this mean? Under maximum likelihood estimation, a word that never appears in the document receives a parameter of zero, so the model treats the possibility, or probability, of that word appearing as zero. This is clearly an overly pessimistic estimate: even if a word has not been observed, its probability of appearing should not be exactly zero.

Therefore, how to estimate these “zero” probabilities becomes an important problem in language modeling research and practice. One common technique is called smoothing. Its basic idea is to assign non-zero estimates to the zero values produced by maximum likelihood estimation. The simplest and most effective approach is to smooth using word frequencies computed over the entire collection.

Specifically, for each word we compute its relative frequency in the target document and its relative frequency in the entire collection, and the final estimate for the word is a weighted average of the two. The weights can be adjusted dynamically and serve as an additional set of hyperparameters.
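
This linear interpolation is usually written as follows (a sketch in standard notation, commonly known as Jelinek-Mercer smoothing; lambda is the interpolation weight, C the whole collection, and P_ml the maximum likelihood estimate from the document):

```latex
% Linearly interpolated (Jelinek-Mercer) smoothing with weight \lambda:
P_{\lambda}(w \mid d)
  = (1 - \lambda) \, P_{\mathrm{ml}}(w \mid d) + \lambda \, P(w \mid C),
\qquad 0 < \lambda < 1
```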

Another common smoothing strategy is based on Bayesian inference: place a prior distribution, usually a Dirichlet distribution, on the parameters of the categorical distribution, and compute each word’s posterior probability given both the prior and the data. I won’t go into more detail on this approach here.
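
For reference, the estimate that results from a Dirichlet prior centered on the collection model is commonly written as follows (again a sketch in standard notation, with mu controlling the strength of the prior and |d| the document length):

```latex
% Dirichlet-prior smoothing with pseudo-count mass \mu:
P_{\mu}(w \mid d) = \frac{c(w, d) + \mu \, P(w \mid C)}{|d| + \mu}
```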

It is important to note that researchers have found smoothing to be essential in language modeling. On the one hand, it addresses the zero-probability estimation problem mentioned earlier. On the other hand, after an algebraic transformation, the smoothed language-model score can be rewritten in a form very similar to TF-IDF.

As a result, researchers have pointed out that smoothing effectively discounts overly popular terms, so that the final estimate does not depend solely on raw word frequency. I discussed this relationship between word frequency and relevance when introducing the TF-IDF and BM25 algorithms: experience tells us that the contribution of frequency to relevance must level off beyond some threshold.
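
To illustrate the connection (a sketch following the standard derivation for Jelinek-Mercer smoothing, not a formula stated in the text above): after dropping terms that are the same for every document, the smoothed query likelihood score can be rearranged as

```latex
% Rank-equivalent form of the Jelinek-Mercer-smoothed query likelihood:
\log P(q \mid d)
  \;\stackrel{\text{rank}}{=}\;
  \sum_{\substack{w \in q \\ c(w, d) > 0}}
    c(w, q) \,
    \log\!\left( 1 + \frac{(1 - \lambda) \, P_{\mathrm{ml}}(w \mid d)}
                          {\lambda \, P(w \mid C)} \right)
```

Here the document term probability plays the role of term frequency, while dividing by the collection probability discounts common words, much like an inverse document frequency.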

Variants of Language Models #

There are many variants of language models; here I will briefly discuss two representative directions.

One direction explores different smoothing strategies, for example combining collection-level smoothing with Dirichlet smoothing, or dividing the documents into clusters or categories (such as topics) and smoothing against those clusters or topics.

The other direction modifies the definition of the language model itself. In the query likelihood retrieval model, for example, we assume a single language model from which the query keywords are sampled. At first glance this seems reasonable, but on closer inspection the model never explicitly defines the relationship between the target document and the query keywords: the document enters only to estimate the parameters of the language model, and the concept of “relevance” is never clearly defined.

Therefore, another mainstream family of language models assumes the existence of two models (distributions): the query keywords are generated from one distribution, and the target document from another. The distance between these two distributions then serves as the definition of relevance. In this setup, documents and query keywords are treated symmetrically, and relevance is defined directly in terms of that distance.
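
One common instantiation of this idea scores documents by the negative KL divergence between a query model and a smoothed document model. Below is a minimal sketch; the function names, the choice of Jelinek-Mercer smoothing, and the toy data are my own assumptions rather than anything specified in the text.

```python
import math
from collections import Counter

def unigram_lm(tokens):
    """Relative-frequency (maximum likelihood) unigram model."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def neg_kl_score(query_tokens, doc_tokens, collection_lm, lam=0.5):
    """Score a document by the negative KL divergence between the query model
    and a Jelinek-Mercer-smoothed document model (higher is better)."""
    query_lm = unigram_lm(query_tokens)
    doc_lm = unigram_lm(doc_tokens)
    score = 0.0
    for w, p_q in query_lm.items():
        p_d = (1 - lam) * doc_lm.get(w, 0.0) + lam * collection_lm.get(w, 1e-9)
        # The query-entropy part of the KL divergence is the same for every
        # document, so only this cross term matters for ranking.
        score += p_q * math.log(p_d)
    return score

# Toy usage with a two-document "collection".
docs = ["the cat sat on the mat".split(), "dogs chase cats in the park".split()]
collection_lm = unigram_lm(token for d in docs for token in d)
query = "cat on mat".split()
for d in docs:
    print(round(neg_kl_score(query, d, collection_lm), 3), " ".join(d))
```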

Summary #

Today I explained a theoretically grounded technique in document retrieval and search: the language model. As we have seen, language models are actually more intuitive and easier to understand than TF-IDF and BM25, and they are also a powerful unsupervised method for ranking text.

Let’s review the key points: first, a brief introduction to the history of language models; second, a detailed explanation of the simplest language model, the “Query Likelihood Retrieval Model”, and its main components; third, a brief introduction to two directions in which language models have been extended.

Finally, I leave you with a question to ponder: if the language model does not admit the closed-form optimal solution mentioned earlier for estimating its probability distribution, how should we solve for its parameters?