048 Overview of Search Indexing and Related Technologies

This week, our topic is a high-level analysis of modern search architectures. On Monday, I introduced a major classification of search systems: traditional text-matching information retrieval systems, developed and used since the 1950s, and machine-learning-based information retrieval systems, which emerged around 2000 and have gradually matured. On Wednesday, we analyzed another framework found in search systems, the multi-round ranking system, explaining why multi-round ranking is necessary and what characterizes each round of ranking.

Today, let’s look at a topic that has come up repeatedly this week: the inverted index, and discuss its core techniques together. It is worth noting that many indexing topics are really about “query keyword processing” in search, and today’s sharing focuses mainly on how indexing and related techniques are applied in that context.

Classic Index Structure #

The classic index structure consists of “fields” and their corresponding lists. Generally, a “field” refers to a specific query keyword; in English this is usually a single word, while in Chinese it may be a word or a phrase. The list associated with each field records the documents in which that keyword appears.

There are two points worth noting.

First, the documents in a document list are often arranged in some order of importance, making it easy to extract the most important documents first. For example, a document that is highly relevant to the query keyword will be ranked higher in the list.

Second, the list for a field does not necessarily include every document that contains the query keyword; it can be a truncated subset.

Also, as we mentioned earlier, the reason it is called an “index” is that the list does not actually store the documents themselves, but usually only their identifiers.
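To make the structure concrete, here is a minimal sketch in Python of such an index: a mapping from each keyword to a sorted list of document identifiers. The tiny corpus and whitespace tokenizer are illustrative assumptions, not how a production engine would tokenize text.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict mapping doc_id -> text. Returns term -> sorted list of doc IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():   # toy whitespace tokenizer
            index[term].add(doc_id)
    # Store sorted doc IDs so that merging and delta encoding are possible later.
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "Beijing subway travel guide",
    2: "Beijing cinema listings",
    3: "subway travel tips",
}
index = build_inverted_index(docs)
print(index["beijing"])  # [1, 2]
print(index["subway"])   # [1, 3]
```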

If the query entered by the user contains multiple keywords, this basic structure makes it easy to obtain the set of documents that contain all of them. The operation amounts to a “merge” of multiple sorted lists, as sketched below.
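Here is a small sketch of that merge-style intersection: two pointers walk two sorted posting lists and collect the document IDs that appear in both.

```python
def intersect(postings_a, postings_b):
    """Both inputs must be sorted lists of document IDs."""
    result = []
    i = j = 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result

print(intersect([1, 2, 5, 9], [2, 3, 5, 10]))  # [2, 5]
```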

Beyond the bare document identifiers, the index can also store other basic information about each document. One frequently stored piece of information is the number of times a query keyword occurs in a document. Saving this count essentially preserves the “term frequency” feature of the document.

When we shared the classic information retrieval models earlier, we introduced models such as TF-IDF, BM25, and language models, all of which rely heavily on term frequency. Storing term frequency in the index makes it possible to compute, or at least approximate, these basic retrieval models directly from the index.
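As a quick refresher on why term frequency is worth keeping in the index, here is one common formulation of BM25 (the exact variant and parameter choices differ across systems):

$$
\text{score}(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i)\cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b\cdot \frac{|D|}{\mathrm{avgdl}}\right)}
$$

Here \(f(q_i, D)\) is the term frequency of query term \(q_i\) in document \(D\), \(|D|\) is the document length, \(\mathrm{avgdl}\) is the average document length in the collection, and \(k_1\) and \(b\) are tuning parameters. Every quantity on the right-hand side can be stored in, or derived from, the index.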

Another frequently stored piece of information is the position where a query keyword appears in the document. Position information is especially important when there are multiple query keywords. For example, if the phrase we want to search for is “Wudaokou Cinema”, we really hope that, within a document, the position where “Wudaokou” appears is adjacent to the position where “Cinema” appears. That way we can confirm the document really is about “Wudaokou Cinema” and does not merely happen to contain the words “Wudaokou” and “Cinema” separately.
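A small sketch of how a positional index might support this adjacency check is shown below; the layout term -> {doc_id: [positions]} is an illustrative assumption.

```python
def phrase_match(positions_first, positions_second):
    """True if some occurrence of the second term immediately follows the first."""
    second = set(positions_second)
    return any(p + 1 in second for p in positions_first)

positional_index = {
    "wudaokou": {1: [0, 7], 2: [3]},
    "cinema":   {1: [1],    2: [9]},
}
# Only check documents that contain both terms, then verify adjacency.
for doc_id in positional_index["wudaokou"].keys() & positional_index["cinema"].keys():
    if phrase_match(positional_index["wudaokou"][doc_id],
                    positional_index["cinema"][doc_id]):
        print(f"doc {doc_id} contains the phrase")  # doc 1 contains the phrase
```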

At the same time, position information helps the search engine generate “result summaries” on the search result page. We often see a few sentences of summary on the results page, and generating these snippets requires knowing where the query keywords appear in the document.

Indexing Techniques #

Beyond the basic index structure, engineers have developed a variety of techniques to make indexing more efficient.

The first technique, of course, is index compression. The index grows rapidly as the number of possible keywords increases, and the document list for each keyword also keeps getting longer. The ability to process the index quickly, and thereby leave time for subsequent computation, therefore becomes crucial. This week, we discussed the multi-round ranking system, whose key constraint is that the entire process must finish within a response time of a few hundred milliseconds. Every step, including extracting the “top K documents” from the index, needs to be fast.

Compression is a vast topic, and we will not go into detail in today’s sharing; here we only need to grasp the basic idea at a high level. A core piece of the index is the document list associated with a query keyword. What is stored in that list is not the document data itself but information about the document, such as the document ID. Since an ID is a number, the document list is ultimately a sequence of numbers, and many compression algorithms are designed precisely to compress sequences of numbers.

So, how can we achieve compression? Here’s an example: one family of compression algorithms is based on a technique called “delta encoding”. Simply put, instead of recording each document ID directly, it sorts the list by document ID and records the difference between adjacent IDs.

Some very frequent query keywords appear in a large fraction, or even the majority, of documents. Re-encoding their document lists with “delta encoding” lets us represent each list with a set of small numbers, each of which is the gap between adjacent document IDs. Of course, this method is far less effective for query keywords that appear in very few documents, and it requires the document list to be sorted by document ID rather than by relevance. A minimal sketch follows.
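Here is a minimal sketch of delta encoding and decoding for a sorted posting list; the actual byte-level compression of the small gaps (for example with variable-byte codes) is omitted.

```python
def delta_encode(doc_ids):
    """doc_ids must be sorted ascending; returns the first ID followed by the gaps."""
    gaps = [doc_ids[0]]
    for prev, cur in zip(doc_ids, doc_ids[1:]):
        gaps.append(cur - prev)
    return gaps

def delta_decode(gaps):
    """Rebuild the original document IDs from the first ID and the gaps."""
    doc_ids = [gaps[0]]
    for gap in gaps[1:]:
        doc_ids.append(doc_ids[-1] + gap)
    return doc_ids

postings = [3, 7, 8, 15, 16, 40]
encoded = delta_encode(postings)
print(encoded)                         # [3, 4, 1, 7, 1, 24]
assert delta_decode(encoded) == postings
```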

Over the years, indexing has also accumulated some smaller tricks, such as “skipping”. Simply put, when a query has multiple keywords whose frequencies differ greatly, we can skip over some documents instead of examining every one.

For example, in the combination “Beijing, subway travel”, the keyword “Beijing” may appear several times, or even tens or hundreds of times, more frequently than “subway travel” across the whole data set. We therefore do not need to examine every document that contains “Beijing”, because what we ultimately want is only the intersection containing both keywords. When processing the document list for “Beijing”, we can “skip” ahead K documents and then check whether we have reached the next document that contains “subway travel”. Here K is a parameter that needs to be tuned. With this idea, the efficiency of processing multiple query keywords can be improved significantly, as in the sketch below.
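The following sketch illustrates the idea: when intersecting a sparse posting list with a very long one, we jump ahead K entries at a time in the long list and only fall back to a linear scan near the target. The example lists and the value of K are illustrative assumptions.

```python
def intersect_with_skips(short_list, long_list, k=4):
    """Intersect a short sorted posting list with a long one using skips of size k."""
    result = []
    j = 0
    for target in short_list:                       # iterate over the rarer term
        while j + k < len(long_list) and long_list[j + k] <= target:
            j += k                                  # take a skip of size k
        while j < len(long_list) and long_list[j] < target:
            j += 1                                  # finish with a short linear scan
        if j < len(long_list) and long_list[j] == target:
            result.append(target)
    return result

beijing = list(range(0, 1000, 2))                   # a dense, frequent term
subway_travel = [10, 11, 500, 777]                  # a sparse term
print(intersect_with_skips(subway_travel, beijing, k=8))  # [10, 500]
```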

Query Keyword Processing #

Finally, let’s talk about query keyword processing. Simply put, this is the problem of extracting relevant documents from the index and computing their scores. There are two basic approaches.

The first approach is the “Document-at-a-Time” strategy. Essentially, we first find all the documents in the index that match the query keywords. For example, when processing the query combination “Beijing, subway travel”, we first retrieve all the documents that contain these keywords. We then maintain a “priority queue” that holds the K documents with the highest scores. Next, we compute a score for each retrieved document, for instance with a simplified retrieval model based on term frequency, and push the document and its score into the priority queue.
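A minimal sketch of this Document-at-a-Time flow, using a toy term-frequency score and Python’s heapq as the priority queue, might look like this; the postings layout is an assumption for illustration.

```python
import heapq

def document_at_a_time(postings, query_terms, k=2):
    """postings: term -> {doc_id: term_frequency}. Returns the top-k (score, doc_id)."""
    candidates = set()
    for term in query_terms:
        candidates |= postings.get(term, {}).keys()

    heap = []                                    # min-heap holding the current top k
    for doc_id in sorted(candidates):            # score one document at a time
        score = sum(postings.get(term, {}).get(doc_id, 0) for term in query_terms)
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)

postings = {
    "beijing":       {1: 3, 2: 1, 4: 2},
    "subway travel": {1: 2, 3: 4},
}
print(document_at_a_time(postings, ["beijing", "subway travel"], k=2))
# [(5, 1), (4, 3)]
```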

The second approach, complementary to “Document-at-a-Time”, is the “Term-at-a-Time” strategy. Here, we process the query terms one at a time. Note that the first step is the same as before: we still retrieve the relevant document lists. After that, however, we first process the documents that contain “Beijing” and obtain a partial score for each of them; then we process “subway travel” and update the partial scores from the previous step, producing the final scores.
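And here is a matching sketch of the Term-at-a-Time flow, using per-document accumulators for the partial scores; on the same toy postings it produces the same ranking as the Document-at-a-Time sketch above.

```python
from collections import defaultdict

def term_at_a_time(postings, query_terms, k=2):
    """postings: term -> {doc_id: term_frequency}. Returns the top-k (score, doc_id)."""
    accumulators = defaultdict(int)              # doc_id -> partial score
    for term in query_terms:                     # e.g. first "beijing", then "subway travel"
        for doc_id, tf in postings.get(term, {}).items():
            accumulators[doc_id] += tf           # update the partial score
    ranked = sorted(((score, doc_id) for doc_id, score in accumulators.items()),
                    reverse=True)
    return ranked[:k]

postings = {
    "beijing":       {1: 3, 2: 1, 4: 2},
    "subway travel": {1: 2, 3: 4},
}
print(term_at_a_time(postings, ["beijing", "subway travel"], k=2))
# [(5, 1), (4, 3)]
```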

In practice, these two strategies form the basis for more sophisticated query processing optimizations. Many advanced algorithms have been built on top of them; these not only speed up the processing of textual features but also include techniques such as the WAND operator, which can efficiently evaluate linear scoring models by skipping documents that cannot make it into the current top K.

Summary #

Today I have talked about a core component of modern search technology: the inverted index. Let’s review the key points together: first, we discussed the basic components and principles of the index structure; second, we gave an overview of related indexing techniques, focusing on compression and “skipping”; third, we briefly explained two fundamental strategies for processing query keywords.

Finally, I will leave you with a question to ponder: If we have both image information and text information, how should we construct our index?