046 Macro Perspective on the Development of Large-Scale Search Frameworks - Features and Trends #

Over the past few weeks, we have discussed a series of classic information retrieval techniques and machine learning-based learning-to-rank algorithms. We have also spent some time on the core techniques behind two key search components: query understanding and document understanding. In addition, we have looked at how to evaluate a search system from both online and offline perspectives. I believe you now have a basic understanding of the fundamental components of a search system.

Today, for the first time, we will take a holistic look at how large-scale search system frameworks have evolved, to give you a macro-level understanding. With the earlier material as a foundation, I believe today’s overview will help everything fall into place.

Text Matching-Based Information Retrieval System #

When we introduced classic information retrieval methods such as TF-IDF and BM25, we were in fact introducing the core concepts of text matching-based information retrieval systems.

From the 1950s, when information retrieval systems began to emerge, until around 2000, these pure text matching systems were the foundation of mainstream search systems. Even today, many open-source search frameworks are built on these fundamental information retrieval systems.

In summary, these text matching based information retrieval systems have several characteristics.

Firstly, the foundation of a text matching system is an inverted index. Each entry (“field”) in the index is keyed by a query keyword and points to a list of the documents that contain that keyword. This list of documents is often arranged in some order of importance.

For example, if the relevance between a document and the query keyword is high, it will be ranked higher in this list. Of course, not all documents containing the query keyword will be included in this list. Additionally, the reason it is called an “index” is that this list does not actually store the entire document, often only storing the document ID.
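The structure described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular search engine implements its index: the function name `build_inverted_index`, the toy documents, and the use of raw term frequency as the "importance" ordering are all assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a posting list of (doc_id, term_frequency) pairs."""
    counts = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            counts[term][doc_id] = counts[term].get(doc_id, 0) + 1
    # Order each posting list by descending term frequency (ties: smaller doc ID first).
    # Only document IDs are stored, never the document text itself.
    return {
        term: sorted(postings.items(), key=lambda p: (-p[1], p[0]))
        for term, postings in counts.items()
    }

# Toy corpus (hypothetical): doc ID -> text
docs = {
    1: "search engines rank documents",
    2: "an inverted index maps terms to documents",
    3: "index compression keeps the index small",
}
index = build_inverted_index(docs)
print(index["index"])      # posting list for the term "index"
print(index["documents"])  # posting list for the term "documents"
```

A real system would typically also truncate or tier long posting lists, which is one way that "not all documents containing the keyword" end up in the list.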

From this basic index structure, many research and practical issues have emerged.

One example is how to further optimize the construction of the index. When the document lists are long, or when there are many distinct query keywords, compressing the index with a suitable encoding scheme becomes crucial.
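As one illustration of such an encoding, the sketch below combines two commonly used ideas: storing the gaps between sorted document IDs instead of the IDs themselves, and writing each gap with a variable number of bytes. The source does not name a specific scheme, so this choice, along with the function names `encode_postings` and `decode_postings`, is an assumption for the example. It also assumes the document IDs are non-negative and sorted in ascending order.

```python
def encode_postings(doc_ids):
    """Delta-encode ascending doc IDs, then variable-byte encode each gap."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        gap = doc_id - prev
        prev = doc_id
        # Emit 7 bits per byte, low bits first; a set high bit means "more bytes follow".
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)

def decode_postings(data):
    """Invert encode_postings: rebuild each gap, then sum gaps back into doc IDs."""
    doc_ids, current, shift, prev = [], 0, 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += current
            doc_ids.append(prev)
            current, shift = 0, 0
    return doc_ids

postings = [3, 7, 150, 100000, 100004]
encoded = encode_postings(postings)
decoded = decode_postings(encoded)
print(len(encoded), "bytes, versus", 4 * len(postings), "bytes as fixed 32-bit integers")
```

Small gaps fit in a single byte, so dense posting lists (frequent keywords) benefit the most from this kind of encoding.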

A large index also brings performance problems. When the index is too large, part of it, sometimes most of it, cannot be held in memory. The performance of the whole search system then suffers badly, because processing query keywords requires swapping content between memory and disk. Designing indexes that fit in memory and still support fast queries is therefore a very important topic.

Another characteristic of text matching systems is their reliance on traditional retrieval methods, such as TF-IDF or BM25 and their variants. These methods bridge the gap between query keywords and the index, assigning a numerical value to each document-query keyword pair, which can be used for sorting.
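To make "assigning a numerical value to each document-query keyword pair" concrete, here is a minimal BM25 sketch. It uses one common variant of the formula (the IDF term with `1 +` inside the logarithm, which keeps scores non-negative) and the frequently quoted defaults `k1=1.2` and `b=0.75`; these choices, the whitespace tokenizer, and the toy corpus are assumptions for illustration, not something the text above prescribes.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with a simple BM25 variant."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue  # term absent from this document contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores[doc_id] = score
    return scores

# Toy corpus (hypothetical)
docs = {
    1: "search engines rank documents",
    2: "an inverted index maps terms to documents",
    3: "index compression keeps the index small",
}
scores = bm25_scores("index compression", docs)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Note that `k1` and `b` are set by hand rather than learned from data, which is exactly the limitation discussed next.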

However, the biggest issue with these methods is that they are not based on machine learning. In other words, they rest on the assumptions and experience of researchers and often cannot adapt to the data at hand. For precisely this reason, the development of such methods often feels as though it lacks a theoretical foundation.

Finally, traditional text matching systems also have difficulty handling multimodal data naturally. As we mentioned before, if the data contains a mixture of text, images, graphs, and other kinds of information, text matching methods offer little theoretical guidance.

So, what are the advantages of text matching systems? Even today, the biggest disadvantage of text matching systems is also their biggest advantage: they do not rely on machine learning. In other words, if you want to build a new search system or add search functionality to an app, a text matching system is the easiest place to start, because it has no data dependency and needs only minimal tuning to go live. This advantage is often overlooked today.

Machine Learning-Based Information Retrieval System #

Since around 2000, machine learning-based information retrieval systems have gradually become the mainstream approach to building search systems. Systems within this framework have the following characteristics.

First, machine learning-based systems now have a complete set of theoretical support. For example, we have previously discussed methods such as pointwise ranking, pairwise ranking, and listwise ranking, which use general machine learning language to describe search problems.

What is this general machine learning language? It means having a clear objective function, explicit features, and specific algorithms to solve machine learning problems within these frameworks. At the same time, a series of basic principles in machine learning, such as training data, test data, and evaluation methods, can also be applied to the scenarios of information retrieval. This provides important guidance for the performance of search systems and the overall development of search systems.
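As a small illustration of this "objective function, features, algorithm" framing, the sketch below trains a linear scoring function with a pairwise logistic loss using plain gradient descent. It is a toy example under stated assumptions: the two features (a text-match score and a click-through rate), the training pairs, and the names `train_pairwise` and `score` are all hypothetical, and real learning-to-rank systems typically use far richer models and data.

```python
import math

def score(w, x):
    """Linear scoring function: the dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train_pairwise(pairs, n_features, lr=0.1, epochs=200):
    """Fit weights w so that the preferred document in each pair scores higher.

    Objective: sum over pairs of log(1 + exp(-(w.x_pos - w.x_neg))).
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        for x_pos, x_neg in pairs:
            diff = [p - n for p, n in zip(x_pos, x_neg)]
            margin = score(w, diff)
            # d/dw log(1 + exp(-margin)) = -diff / (1 + exp(margin))
            grad_scale = -1.0 / (1.0 + math.exp(margin))
            w = [wi - lr * grad_scale * di for wi, di in zip(w, diff)]
    return w

# Each pair: (features of the more relevant doc, features of the less relevant doc).
# Hypothetical features: [text-match score, click-through rate]
pairs = [
    ([2.0, 0.9], [1.0, 0.9]),
    ([1.5, 0.8], [1.0, 0.3]),
    ([1.0, 0.7], [1.0, 0.2]),
]
w = train_pairwise(pairs, n_features=2)
print(w)
```

Held-out pairs would play the role of test data here: the fraction of pairs the model orders correctly is one simple offline evaluation measure.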

At the same time, this has also opened a convenient door to improve the effectiveness of search systems. Any advancements within the field of machine learning can be easily attempted within the existing machine learning-based search system framework. For example, the recent rapid development of deep learning can be easily applied to the established machine learning-based search system framework.

Second, machine learning-based search systems can easily utilize multimodal data. For machine learning, the fusion of multimodal data, or multiple types of data, can be naturally expressed through different types of features. Therefore, machine learning has a natural advantage when it comes to multimodal data. Learning the connections between these features and predicting relevance is a strength of machine learning.

As a result, ranking algorithms can draw on a rich understanding of the data produced by different parts of the search system. For example, as mentioned before, it is hard to imagine features generated by query understanding (such as query keyword classification and query keyword parsing) or by document understanding (such as document classification) being used in a traditional text matching system. In machine learning-based search systems, however, this information often becomes an important tool for improving relevance modeling.
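One simple way to picture this is a function that concatenates signals from different sources into a single feature vector for the ranking model. The sketch below is only an assumption about how such features might be laid out: the function name `build_features`, the category list, and the `has_image` flag are made up for the example.

```python
def build_features(text_score, query_category, doc_category, has_image, categories):
    """Concatenate signals from different sources into one numeric feature vector."""
    def one_hot(value):
        return [1.0 if value == c else 0.0 for c in categories]

    return (
        [text_score]                                          # text matching, e.g. a BM25 score
        + one_hot(query_category)                             # from query understanding
        + one_hot(doc_category)                               # from document understanding
        + [1.0 if query_category == doc_category else 0.0]    # cross feature: categories agree
        + [1.0 if has_image else 0.0]                         # a non-text signal
    )

# Hypothetical category vocabulary shared by queries and documents
categories = ["shopping", "news", "travel"]
features = build_features(7.3, "travel", "travel", True, categories)
print(features)
```

The ranking model then learns how much weight each of these signals deserves, instead of a researcher fixing that by hand.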

At the same time, as we mentioned in previous discussions, there is specialized research in machine learning specifically targeting multimodal data, considering how to better integrate different types of data in modeling. Such research simply does not exist in traditional text matching search systems.

Machine learning-based search systems are not without flaws. Without the right safeguards, machine learning may not achieve satisfactory results in practice, because it places higher demands on the entire system.

Machine learning often requires large amounts of data, and in a real software product, building reliable and clean data is not a simple task. Without reliable data, the saying “garbage in, garbage out” holds for most machine learning algorithms, and the actual results are often worse than not using machine learning at all.

Machine learning systems can also suffer from problems that other software systems do not have, such as feature anomalies, model anomalies, and data anomalies. If a production system does not anticipate and handle these situations, machine learning search systems often fall short of expectations.
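One possible way to handle such situations, sketched below, is to validate the features and the model output at serving time and fall back to a plain text matching score when something looks wrong. This is only an illustration of the idea, not a method described in this series: the function name `safe_score`, the per-feature bounds, and the stand-in model are all assumptions.

```python
import math

def safe_score(features, model_score, fallback_score, bounds):
    """Use the learned model only when the features and its output look sane."""
    for value, (low, high) in zip(features, bounds):
        if math.isnan(value) or not (low <= value <= high):
            return fallback_score  # feature anomaly: fall back to the text matching score
    result = model_score(features)
    if math.isnan(result) or math.isinf(result):
        return fallback_score      # model anomaly: the output is not a usable number
    return result

# Hypothetical stand-in for a trained ranking model
def model(features):
    return 2.0 * features[0] + features[1]

# Hypothetical valid ranges: text-match score in [0, 50], click-through rate in [0, 1]
bounds = [(0.0, 50.0), (0.0, 1.0)]

normal = safe_score([3.0, 0.5], model, fallback_score=1.0, bounds=bounds)
missing = safe_score([float("nan"), 0.5], model, fallback_score=1.0, bounds=bounds)
out_of_range = safe_score([3.0, 7.0], model, fallback_score=1.0, bounds=bounds)
print(normal, missing, out_of_range)
```

In practice, a system would probably also log and monitor how often the fallback is used, since a rising rate is itself a sign of a data or feature pipeline problem.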

More Intelligent Search Systems #

Search systems will clearly not stop at ordinary machine learning algorithms. In recent years, their development has moved in two directions.

On the one hand, it builds on the development of deep learning, and many scholars and researchers are thinking about how to use deep learning to take search systems to a higher level. This work is not limited to applying ordinary deep learning algorithms; it also rethinks search problems using paradigms that are specific to deep learning, such as deep reinforcement learning.

On the other hand, it studies more meaningful evaluation methods from the user’s perspective: that is, how to truly capture users’ preferences for a system and then optimize the system’s performance accordingly.

Summary #

Today I talked about the development of modern search technology frameworks and briefly mentioned current trends in search system development. Let’s review the key points together. First, we discussed the characteristics of classical search systems based on text matching. Second, we discussed the characteristics of search systems based on machine learning.

Finally, I’ll leave you with a question to ponder: In the era of machine learning and deep learning, can the core of traditional search systems, which is the index we mentioned before, be generated through machine learning?