039 Query Keyword Understanding Part Three - Expansion #

In the previous two articles this week, we introduced the basic concepts and ideas of query keyword classification and query keyword parsing respectively. Today, I will talk about a slightly different module for understanding query keywords: query keyword expansion.

The problem that query keyword expansion aims to solve is slightly different from classification and parsing. Its main purpose is not only to understand the keywords a user enters, but also to supplement the information the user provides, thereby enriching the query results.

Concept of query keyword expansion #

Why do we provide query keyword expansion? The main reason is that the information users provide in their search keywords is often insufficient. Do you remember the example we mentioned last time, “apple price”? From this query keyword alone, it is impossible to determine whether the user wants to know the price of “apple” the fruit or the phone. Therefore, if the search engine offers the user some “expansion options”, that is, reformulated query keywords, it can provide a better user experience and more accurate search results.

In addition to improving the user experience when displayed, query keyword expansion also serves to increase document “recall”, thus laying the foundation for improving search results. Consider the following example: a user searches for “iphone 6 backup” in order to find information on how to back up their iPhone 6. Since the backup process for most Apple phone models is similar, if we expand “iphone 6” to other iPhone models and then check if there are any good web pages introducing backup methods, we can provide the user with relevant results.
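The model-substitution idea above can be sketched as a simple rewrite step. The expansion table below is purely illustrative (a real system would mine these related phrases from data):

```python
# Hypothetical expansion table: phrases mapped to known-similar variants.
# The entries here are illustrative, not mined from real data.
EXPANSIONS = {
    "iphone 6": ["iphone 5s", "iphone 6s", "iphone 7"],
}

def expand_query(query):
    """Return the original query plus variants produced by substituting
    any known phrase with its similar phrases."""
    results = [query]
    for phrase, variants in EXPANSIONS.items():
        if phrase in query:
            results += [query.replace(phrase, v) for v in variants]
    return results

print(expand_query("iphone 6 backup"))
# ['iphone 6 backup', 'iphone 5s backup', 'iphone 6s backup', 'iphone 7 backup']
```

Each expanded variant can then be issued against the index, increasing the pool of candidate documents and hence recall.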

It is worth noting that expanding keywords may also result in loss of “precision”. For example, let’s assume that Apple has made significant improvements to the backup process for the iPhone 7, which may not apply to other models. Therefore, when a user searches for “iphone 7 backup”, if we expand it to other models, the information they see is likely to be less relevant. Thus, striking a balance between “precision” and “recall” becomes an important consideration in query keyword expansion.

Another important application of query keyword expansion is handling synonyms and abbreviations. Take, for example, Donald Trump, President of the United States at the time of writing. When a user searches for “Donald Trump”, “Trump”, “US President”, “POTUS” (an acronym for “President of the United States”), or similar terms, the search engine should return similar results. However, these terms can differ greatly in surface form (compare “Trump” and “POTUS”), so other means are needed to learn that these words are synonymous.

Techniques for Query Keyword Expansion #

Now that we understand the meaning of query keyword expansion, let’s take a look at the techniques that can support this expansion.

Based on the examples above, you can see that the core task is to find terms that are “synonyms” with respect to search results. So, how do we discover such “synonyms” in search?

Today, I will share two approaches here.

The first approach exploits the natural association between query keywords and search results to identify synonyms. This requires mining user search behavior data at scale. The basic assumption is as follows: suppose we have two search keywords, A and B; from the search results of A, users click on some webpages, and from the results of B, they click on others. If the two sets of clicked webpages are very similar, then we can consider A and B synonymous query keywords.
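A minimal sketch of this assumption compares the sets of clicked pages with Jaccard similarity (the similarity measure and the toy click data are my own illustrative choices, not something the approach prescribes):

```python
def click_similarity(clicks_a, clicks_b):
    """Jaccard similarity between the sets of pages clicked for two queries."""
    a, b = set(clicks_a), set(clicks_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Toy click logs: pages users clicked after issuing each query.
clicks = {
    "trump":        {"whitehouse.gov", "wiki/Donald_Trump"},
    "us president": {"whitehouse.gov", "wiki/Donald_Trump", "wiki/POTUS"},
    "apple price":  {"apple.com"},
}

# High overlap suggests the two queries are near-synonymous.
sim = click_similarity(clicks["trump"], clicks["us president"])
```

Queries whose similarity exceeds some threshold would then be treated as candidate synonym pairs.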

A more complete approach is to represent query keywords and webpages as different types of nodes in a “graph”. Each keyword node is connected to multiple webpage nodes, symbolizing the relevance of these webpages to the keyword. From the perspective of each webpage, multiple keyword nodes are connected to the same webpage node, indicating that these keywords might be relevant to a particular webpage.

Taking the example of Trump mentioned earlier, if the homepage of the White House is represented as a node, it may be related to query keywords such as “Trump,” “US President,” and “POTUS.” Therefore, you can see that the task of finding synonyms becomes the work of mining similar nodes, especially similar keyword nodes, on this graph.

If we separate the keyword nodes from the webpage nodes, we obtain a typical “bipartite graph.” The defining characteristic of a bipartite graph is that nodes on the same side are never connected to each other (e.g., there are no edges between keywords), and every edge runs between nodes on different sides (a keyword and a webpage).

The clustering problem of bipartite graphs is a classic problem in machine learning. Many researchers have attempted to use clustering techniques on bipartite graphs to mine synonyms for query keywords. I have listed a few references at the end, such as Reference [2], which uses “random walk” on a bipartite graph and the “hitting time” generated by random walk to mine similar keywords. If you’re interested, you can check out this classic paper.
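The hitting-time idea from Reference [2] can be illustrated with a Monte Carlo sketch: walk randomly over a toy click bipartite graph and estimate how many steps it takes to first reach another query node. This is a simplification of the paper's method, with an invented graph and a step cap standing in for truncation:

```python
import random

# Toy bipartite click graph: query nodes link to page nodes and vice versa.
graph = {
    "trump":          ["whitehouse.gov", "wiki/Trump"],
    "potus":          ["whitehouse.gov"],
    "apple price":    ["apple.com"],
    "whitehouse.gov": ["trump", "potus"],
    "wiki/Trump":     ["trump"],
    "apple.com":      ["apple price"],
}

def estimate_hitting_time(start, target, walks=2000, max_steps=20, seed=0):
    """Monte Carlo estimate of the expected number of random-walk steps
    from `start` until `target` is first reached (capped at max_steps)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(walks):
        node, steps = start, 0
        while node != target and steps < max_steps:
            node = rng.choice(graph[node])  # uniform step to a neighbor
            steps += 1
        total += steps
    return total / walks

# "potus" shares the page whitehouse.gov with "trump", so its hitting time
# is short; an unrelated query like "apple price" is never reached.
```

Queries with short hitting times from a given query node are then proposed as synonym or suggestion candidates.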

After discussing the approach based on user behavior information and keyword mining, let’s take a look at the second approach.

The core of the second approach is to analyze the relatedness between words from a massive amount of text. It’s important to note that this relatedness may arise from language itself, as with the words “male” and “man.” It can also be contextual, such as “iPhone 6” and “iPhone 7” both being discussed in webpages about smartphones.

The idea behind this approach is to create a “representation” for each word and then find synonyms through this representation. A popular approach in recent years is to find numerical representations for words, commonly referred to as “embeddings.” If two words are close in an “embedding space,” typically a Euclidean space, we can consider them synonymous.

How can we generate “embedding” vectors for words? There are various approaches. One commonly used algorithm is Word2Vec (see Reference [3]), which predicts the probability of a word appearing based on the surrounding words in a sentence. Imagine the many help documents and websites discussing Apple iPhones: the phrases describing how to back up data for an iPhone 6 or an iPhone 7 are nearly identical, differing perhaps only in the model name.

Therefore, in such cases, learning the “embedding vectors” of words from the “contextual information” surrounding them can effectively capture word semantics, and through semantics we can identify synonymous words. Of course, applying this to query keyword expansion may require additional adjustments, as discussed in Reference [4]. If you’re interested, I recommend reading that paper in detail.
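Once embeddings exist, synonym lookup reduces to a nearest-neighbor search in the embedding space. Here is a minimal sketch using cosine similarity; the three-dimensional vectors are hand-made illustrations, not output from a trained Word2Vec model:

```python
from math import sqrt

# Illustrative embedding vectors (a real model would learn hundreds of
# dimensions from a large corpus).
embeddings = {
    "iphone6": [0.90, 0.10, 0.20],
    "iphone7": [0.85, 0.15, 0.25],
    "banana":  [0.10, 0.90, 0.70],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def nearest_synonyms(word, k=1):
    """Rank all other words by cosine similarity to `word` in embedding space."""
    ranked = sorted(
        (w for w in embeddings if w != word),
        key=lambda w: cosine(embeddings[word], embeddings[w]),
        reverse=True,
    )
    return ranked[:k]
```

In this toy space, “iphone6” and “iphone7” point in nearly the same direction, so each is the other's nearest neighbor, while “banana” stays far away.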

Finally, I need to point out that the first approach requires a significant amount of user interaction data, while the second approach can be learned from other corpora (e.g., Wikipedia) without the need for user data. This is another valuable point to consider.

Summary #

Today I talked to you about the problem of query keyword expansion in understanding query keywords. As you can see, there are two technical approaches to query keyword expansion. One is to generate a graph based on user interaction data and use graph mining techniques to obtain relationships between query keywords. The other is to generate word embedding vectors to obtain synonyms.

Let’s review the key points: First, I briefly introduced what query keyword expansion is; because user queries often carry insufficient information, expansion can provide a better user experience and more accurate search results. Second, I introduced in detail the two main techniques for query keyword expansion.

Finally, I leave you with a question to ponder: how can we evaluate whether a query keyword expansion method helps or hurts search quality?

References #

  1. Claudio Carpineto and Giovanni Romano. A Survey of Automatic Query Expansion in Information Retrieval. ACM Computing Surveys. 44, 1, Article 1 (January 2012), 50 pages. 2012.
  2. Qiaozhu Mei, Dengyong Zhou, and Kenneth Church. Query suggestion using hitting time. Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM '08). ACM, New York, NY, USA, 469-478. 2008.
  3. Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
  4. Diaz, Fernando, Bhaskar Mitra, and Nick Craswell. Query expansion with locally-trained word embeddings. arXiv preprint arXiv:1605.07891 (2016).