031 Classic Search Core Algorithm: TF-IDF and Its Variants #

Starting this week, we enter the core technology module on artificial intelligence, where I will focus on the classic core algorithms behind search. Today, let's start with the TF-IDF algorithm.

TF-IDF is a well-known algorithm in information retrieval, text mining, and natural language processing. Although these fields have since produced many newer text representation and scoring methods based on deep learning, TF-IDF remains an irreplaceable, fundamental method in many applications.

Understanding and mastering TF-IDF is highly valuable for beginners, because it makes more advanced and complex text mining algorithms and models much easier to grasp. Today, I will walk through the history of TF-IDF, the details of the algorithm itself, and several variants built on top of it.

The History of TF-IDF #

The effort to convert both query keywords and documents into “vectors” and attempt to solve information retrieval problems using tools such as linear algebra can be traced back to at least the 1970s.

In 1971, Gerard Salton, a professor at Cornell University in the United States, published “The SMART Retrieval System—Experiments in Automatic Document Processing,” in which he first described converting query keywords and documents into “vectors” and assigning different values to the elements of those vectors. The description of the SMART retrieval system in this work, particularly its treatment of TF-IDF and its variants, became an important reference for many later industrial systems.

In 1972, Karen Spärck Jones, a British computer scientist, first detailed the application of IDF in the paper “A Statistical Interpretation of Term Specificity and Its Application in Retrieval.” She later discussed the combination of TF and IDF in her paper “Index Term Weighting.” Karen can be regarded as the first computer scientist to give TF-IDF a complete theoretical justification, which is why many people credit her with its invention.

Gerard himself is considered the “father of information retrieval.” He was born in Nuremberg, Germany in 1927 and obtained a Bachelor’s and Master’s degree in Mathematics from Brooklyn College in New York in 1950 and 1952 respectively. In 1958, he received a Ph.D. in Applied Mathematics from Harvard University and then joined Cornell University to help establish the Computer Science Department. To honor Gerard’s outstanding contributions to modern information retrieval technology, the Association for Computing Machinery (ACM) in the United States presents the “Gerard Salton Award” every three years to recognize researchers who have made outstanding contributions to information retrieval technology. Karen Spärck Jones received the second Gerard Salton Award in 1988.

Explanation of the TF-IDF Algorithm #

To understand the TF-IDF algorithm, the first step is to understand the application background of TF-IDF. TF-IDF is derived from the most classic and oldest information retrieval model, the “Vector Space Model”.

Simply put, the Vector Space Model hopes to express both query keywords and documents as vectors, and then use vector operations to further express the relationship between vectors. For example, a commonly used operation is to calculate the “relevance” between the vector corresponding to the query keyword and the vector corresponding to the document.

With this vector representation, relevance can often be approximated by the “similarity” of the vectors in some sense, such as cosine similarity or the dot product, so that relevance is expressed as a single numeric value. Both cosine similarity and the dot product can be interpreted from the perspective of linear algebra or geometry.

In the most basic form of the vector space model, the vectors for query keywords and documents have dimension V, where V is the size of the vocabulary. For example, if we work with 10,000 common English words, then V is 10,000, and the query vector and every document vector are 10,000-dimensional. Each dimension of the vector corresponds to one distinct English word.

In this basic setting, a dimension is set to 1 if the corresponding word appears in the document or query represented by the vector, and 0 if it does not. This is the simplest way of assigning values (weights) to each dimension.
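
To make this concrete, here is a minimal sketch in Python of this 0/1 weighting. The toy vocabulary, the example texts, and the `binary_vector` helper are illustrative assumptions of mine, not part of any standard library.

```python
# A toy vocabulary standing in for the V = 10,000 common words.
vocabulary = ["car", "insurance", "the", "how", "to", "build", "a", "price"]

def binary_vector(text, vocabulary):
    """Return a 0/1 vector: 1 if the vocabulary word occurs in the text, else 0."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocabulary]

query_vec = binary_vector("Car Insurance", vocabulary)
doc_vec = binary_vector("how to build a car", vocabulary)
print(query_vec)  # [1, 1, 0, 0, 0, 0, 0, 0]
print(doc_vec)    # [1, 0, 0, 1, 1, 1, 1, 0]
```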

TF-IDF is a more sophisticated weighting scheme under the same vector space model. In its most basic form, as the name implies, TF-IDF is simply the product of TF and IDF.

TF stands for “term frequency.” For each word in the query, we count how many times that word appears in the target document. For example, given the query “Car Insurance,” we count, for each document, how many times “Car” appears and how many times “Insurance” appears. That is the TF calculation.

The hidden assumption behind TF is that the query words matter more than other words, and that a document's importance, that is, its relevance, is proportional to how often those words appear in it. For example, if “Car” appears 5 times in document A and 20 times in document B, TF assumes that document B is likely more relevant.
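
A raw TF count can be sketched in a few lines of Python. The example document and the `term_frequency` helper below are invented purely for illustration.

```python
from collections import Counter

def term_frequency(term, document):
    """Count how many times a term appears in a document (raw TF)."""
    counts = Counter(document.lower().split())
    return counts[term.lower()]

doc_a = "car repair and car insurance quotes for a new car"
print(term_frequency("Car", doc_a))        # 3
print(term_frequency("Insurance", doc_a))  # 1
```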

However, information retrieval researchers soon realized that TF alone cannot fully capture document relevance. For linguistic reasons, some words naturally recur in many documents, such as “The,” “An,” and “But” in English. These words mostly serve as connective tissue that keeps the language coherent, and they are indispensable. But for a query like “How to Build A Car,” the words “How,” “To,” and “A” will appear in the vast majority of documents, and in that case TF cannot help us distinguish which documents are relevant.

IDF, or “inverse document frequency,” was introduced to address exactly this situation. The idea behind it is very simple: we need to “penalize” words that appear in too many documents.

In other words, the words that truly carry “relevant” information appear in relatively few, sometimes very few, documents. This can be measured by the “document frequency” (DF), that is, how many documents contain the word. Clearly, if too many documents contain a word, that word is less important, or equivalently, carries less information. We therefore need to adjust TF, and the idea of IDF is to use the reciprocal of DF for that adjustment: the larger the DF, the less important the word.
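
As a rough sketch of this reciprocal-of-DF idea (before the smoothed, logarithmic variant discussed later), the following Python snippet counts document frequency over a toy corpus; the corpus and the function names are illustrative assumptions.

```python
def document_frequency(term, corpus):
    """Number of documents in the corpus that contain the term."""
    return sum(1 for doc in corpus if term.lower() in doc.lower().split())

def inverse_document_frequency(term, corpus):
    """Basic IDF: the reciprocal of document frequency."""
    df = document_frequency(term, corpus)
    return 1.0 / df if df > 0 else 0.0

corpus = [
    "how to build a car",
    "car insurance price comparison",
    "how to cook pasta",
]
print(inverse_document_frequency("car", corpus))        # 0.5 (appears in 2 of 3 docs)
print(inverse_document_frequency("insurance", corpus))  # 1.0 (appears in 1 of 3 docs)
```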

Having understood the basic calculation of TF and IDF, we can use their product to express the importance of a query word in a target document. It is worth noting that although we did not describe how the query keywords and the document are expressed as vectors when introducing TF and IDF, the TF-IDF algorithm implicitly includes this step.

Specifically, the query vector has length V, the vocabulary size mentioned earlier. The dimensions corresponding to words that appear in the query are set to 1, and all other dimensions are 0. For the target document, the dimensions corresponding to query words that appear in it are set to their TF-IDF values, and the other dimensions are 0. With this representation, taking the dot product of the query vector and the document vector yields exactly the TF-IDF relevance score.
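
Putting the pieces together, here is a minimal sketch of how this implicit dot product could be computed, assuming the basic TF count and the plain reciprocal IDF from above; the toy corpus and the `tf_idf_score` helper are my own illustrative choices, not a standard implementation.

```python
from collections import Counter

def tf_idf_score(query, document, corpus):
    """Relevance of a document to a query as the dot product of a 0/1 query
    vector and a TF-IDF document vector (raw TF, IDF = 1/DF)."""
    doc_counts = Counter(document.lower().split())
    score = 0.0
    for term in set(query.lower().split()):       # each query word contributes once
        tf = doc_counts[term]                     # term frequency in the document
        df = sum(1 for d in corpus if term in d.lower().split())
        idf = 1.0 / df if df > 0 else 0.0         # penalize terms covered by many docs
        score += tf * idf                         # 1 * (TF * IDF) for this dimension
    return score

corpus = [
    "how to build a car",
    "car insurance price comparison",
    "how to cook pasta",
]
print(tf_idf_score("Car Insurance", corpus[1], corpus))  # 0.5 * 1 + 1.0 * 1 = 1.5
```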

Variations of TF-IDF Algorithm #

It is obvious that the classic TF-IDF algorithm does not consider many factors. Over the years, researchers and engineers have developed many variations of TF-IDF. Here I will introduce several classic variations.

First, many people have noticed that, in its original definition, TF has no upper bound. Although we generally believe that a document mentioning a query word more often is to some degree more relevant, the relationship is hardly linear. Take the “Car Insurance” example again: document A may contain “Car” 100 times and document B 200 times. Does that mean document B is twice as relevant as document A? In fact, many people have realized that beyond a certain threshold, TF has little additional discriminative power.

Transforming TF with a logarithm, namely Log(TF), is a technique that prevents TF from growing linearly. Specifically, people often replace the raw TF value with 1 + Log(TF). Under this new calculation, if “Car” appears once, the value is 1; if it appears 100 times, the value is about 5.6; and if it appears 200 times, the value is about 6.3. Clearly this strikes a balance: it preserves discrimination while avoiding purely linear growth.
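
A tiny sketch, assuming the natural logarithm, reproduces these numbers:

```python
import math

def sublinear_tf(tf):
    """Dampened term frequency: 1 + ln(TF) for TF > 0, else 0."""
    return (1.0 + math.log(tf)) if tf > 0 else 0.0

for tf in (1, 100, 200):
    print(tf, round(sublinear_tf(tf), 1))  # 1 -> 1.0, 100 -> 5.6, 200 -> 6.3
```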

Another observation about TF is that the classic calculation ignores the difference between “long documents” and “short documents.” If document A has 3,000 words and document B has 250 words, then even if “Car” appears 20 times in both, we cannot say the two documents are equally relevant. Normalizing TF, in particular by the maximum TF value within a document, has become another common technique.
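
A minimal sketch of this maximum-TF normalization, with an invented example document, might look like this:

```python
from collections import Counter

def max_normalized_tf(term, document):
    """TF divided by the largest TF of any word in the same document,
    so that long and short documents become more comparable."""
    counts = Counter(document.lower().split())
    max_tf = max(counts.values()) if counts else 1
    return counts[term.lower()] / max_tf

short_doc = "car insurance car quotes"
print(max_normalized_tf("car", short_doc))        # 2 / 2 = 1.0
print(max_normalized_tf("insurance", short_doc))  # 1 / 2 = 0.5
```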

The third common technique also uses a logarithmic transformation, this time applied to IDF. Instead of using the raw reciprocal of DF as the “penalty factor,” we take (N + 1) divided by DF, where N is the total number of documents, and then apply a logarithm, giving log((N + 1) / DF). The benefits are, first, normalization by the total number of documents, similar to the normalization mentioned above; and second, non-linear growth thanks to the logarithm.
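
A small sketch of this smoothed IDF, assuming the natural logarithm and made-up document counts, could look like this:

```python
import math

def smoothed_idf(df, n_docs):
    """IDF variant: log((N + 1) / DF), where N is the total number of documents."""
    return math.log((n_docs + 1) / df) if df > 0 else 0.0

n_docs = 1000
print(round(smoothed_idf(10, n_docs), 2))   # rare word   -> 4.61
print(round(smoothed_idf(900, n_docs), 2))  # common word -> 0.11
```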

Another important variation of TF-IDF is to normalize both the query vector and the document vector, so that the score is not affected by the number of non-zero elements in each vector, that is, by documents having different lengths. In linear algebra, a vector can be normalized to unit length. Taking the dot product of such normalized vectors is equivalent to computing the cosine similarity of the original vectors. Hence, an equivalent way to apply this idea is simply to use cosine similarity instead of the dot product in most cases.
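
The equivalence can be sketched with two hypothetical vectors; the numbers below are made up purely to show that the dot product of unit-normalized vectors matches the cosine similarity of the originals:

```python
import math

def normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

doc_vec = [3.0, 0.0, 4.0]    # hypothetical TF-IDF vector of a document
query_vec = [1.0, 0.0, 1.0]  # hypothetical query vector

cosine = dot(normalize(doc_vec), normalize(query_vec))
print(round(cosine, 3))  # about 0.99, same as cosine similarity of the originals
```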

Summary #

Today I discussed TF-IDF, the most basic technique in document retrieval and search. We saw that TF-IDF consists of two core concepts, term frequency within a document and document frequency, and that it rests on the assumptions of the vector space model.

Let’s recap the key points: first, a brief history of TF-IDF; second, a detailed look at the main components of the TF-IDF algorithm; third, a brief introduction to several variations of TF-IDF.

Finally, I leave you with a question to ponder: if we want to apply TF-IDF to the Chinese language environment, do we need any preprocessing steps?