038 Query Keyword Understanding Part Two - Analysis #

The core content I shared this week is Query Understanding. On Monday, I introduced the basic concepts and ideas of Query Classification. Today, I will talk about a more refined module of Query Understanding: Query Parsing.

If Query Classification is an overview of query keywords, then Query Parsing is a detailed analysis. In fact, Query Parsing is a general term for a set of techniques, and today I will discuss a few popular topics.

Query Segmentation #

First, let’s imagine a scenario in an English search engine. If a user enters the query “White House Opening,” what is the user’s intent? To understand the user’s intent, we need to know the meaning of the words the user entered.

So, in the query “White House Opening,” should we interpret each individual word “White,” “House,” and “Opening” separately? Should we treat “White House” as one phrase and “Opening” as another? Or is “White House Opening” a single phrase as a whole? Deciding this is what we call “query segmentation.”

In the example above, how we segment “White House Opening” directly impacts the quality of search results. In a standard modern search engine, there is usually a module that extracts documents from the “inverted index” based on the query keywords. The number of extracted documents in this stage is usually a few hundred to a few thousand, and this process is often referred to as the “retrieval phase.”

Once these documents are obtained, modern search engines use complex ranking algorithms, usually based on machine learning models, to re-rank the documents.

As you can see, across these two stages, if good documents are not extracted in the first stage, then no matter how powerful the second-stage ranking is, the overall search results cannot be very good. For the “retrieval phase,” the key question is which “words” or “phrases” to use when querying the “inverted index.”

Using the previous example, the question is whether documents should match “White House,” the separate words “White” and “House,” or the whole phrase “White House Opening.” The resulting document sets obviously differ in these three cases. If the user’s true intent is to find the opening hours of the White House, segmenting the query into the separate words “White” and “House” will clearly degrade the extracted document set.

So how should we do query segmentation?

Here, I introduce a paper called “Query Segmentation Revisited”. In this paper, the authors survey the mainstream “query segmentation” techniques, and the article is well worth reading. Let me summarize the main points for you.

The first technique is to generate “N-grams” from the query keywords. An N-gram is a contiguous subsequence of N words from a word sequence. For example, from “White House Opening” we can generate two bigrams: “White House” and “House Opening.”
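Generating N-grams is a one-liner over the token list. A minimal sketch:

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

query = "White House Opening".split()
print(ngrams(query, 2))  # ['White House', 'House Opening']
```

Running the same function with `n=3` yields the single trigram “White House Opening”, i.e., the whole-phrase candidate.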

The first method based on N-grams determines how meaningful a segmentation is by the frequency of its N-grams in a large corpus. Of course, raw frequency favors shorter phrases, since short phrases naturally occur more often, so the authors introduce two ways to correct the frequency.

One is based on the frequency itself, and the other is based on Wikipedia as an external resource for correction. The aim of both methods is to give longer phrases a chance to have a higher score than shorter words. The required word frequencies in the paper are obtained from Google’s “N-grams” corpus released in 2005, which means that the frequencies of all words are directly obtained from this corpus.
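The idea of frequency scoring with a length correction can be sketched as follows. The frequencies below are invented for illustration, and the exact weighting scheme (here, a `|s|^|s|` boost for multi-word phrases) is only one of the variants discussed in the paper:

```python
# Toy frequencies standing in for counts from a large N-gram corpus.
# All numbers here are invented for illustration.
FREQ = {"white house": 500_000, "house opening": 2_000,
        "white": 9_000_000, "house": 8_000_000, "opening": 3_000_000}

def score(segmentation):
    """Score a candidate segmentation. Multi-word phrases receive a
    |s|^|s| length boost so that they can beat individually more
    frequent single words; single words contribute nothing."""
    total = 0
    for phrase in segmentation:
        k = len(phrase.split())
        if k >= 2:
            total += (k ** k) * FREQ.get(phrase.lower(), 0)
    return total

candidates = [["white house", "opening"],
              ["white", "house opening"],
              ["white", "house", "opening"]]
best = max(candidates, key=score)
print(best)  # ['white house', 'opening']
```

With these toy counts, the length boost lets the segmentation containing “white house” win over splitting into single words, which matches the intuition described above.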

The second technique is based on the “Mutual Information” of phrases. “Mutual Information” measures the correlation between two random events. Here, it is computed between every pair of adjacent words in the query keywords. When this value exceeds a pre-set threshold, the two adjacent words are considered to form a phrase. Computing “mutual information” requires the probabilities of words appearing, which are obtained from an “N-grams” corpus released by Microsoft.
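The pointwise form of this computation for two adjacent words is a short formula. In the sketch below, the probabilities and the threshold are invented for illustration; in practice they would come from the N-gram corpus and from tuning:

```python
import math

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information of two adjacent words:
    log2 of how much more often they co-occur than chance predicts."""
    return math.log2(p_xy / (p_x * p_y))

THRESHOLD = 3.0  # hypothetical tuned value

# Invented probability estimates for illustration.
p_white, p_house, p_white_house = 0.002, 0.003, 0.0009

score = pmi(p_white_house, p_white, p_house)
print(score > THRESHOLD)  # True: join "White" and "House" into a phrase
```

Independent words give a PMI near zero, so a positive threshold keeps only pairs that co-occur far more often than chance.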

The third technique is based on “Conditional Random Fields” (CRF). “Conditional Random Fields” is a sequence model proposed by the famous machine learning scholars John D. Lafferty, Andrew McCallum, and Fernando Pereira in 2001. The basic idea of CRF is to model complex labels of the output and try to establish a correspondence between the feature space and the complex labels.

In the scenario of “query segmentation,” we can regard the complex label as a sequence of binary decisions mapping a query to its phrases: each decision says whether a candidate boundary starts a new segmentation phrase. Conditional Random Fields model this kind of problem naturally, whereas traditional binary classifiers have difficulty capturing the sequence information. I won’t go into a detailed introduction of Conditional Random Fields here. If you are interested, you can refer to the relevant papers.
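To make the “sequence of binary decisions” view concrete, here is a sketch of how a binary label sequence (one decision per gap between adjacent tokens) maps back to a segmentation. A trained CRF would predict the label sequence; here we hand-code it:

```python
def decode(tokens, breaks):
    """Turn binary break decisions into a segmentation.
    `breaks` has one entry per gap between adjacent tokens:
    1 = start a new phrase at this gap, 0 = continue the current phrase.
    A trained CRF would predict `breaks`; here they are hand-coded."""
    phrases, current = [], [tokens[0]]
    for token, brk in zip(tokens[1:], breaks):
        if brk:
            phrases.append(" ".join(current))
            current = [token]
        else:
            current.append(token)
    phrases.append(" ".join(current))
    return phrases

tokens = ["White", "House", "Opening"]
print(decode(tokens, [0, 1]))  # ['White House', 'Opening']
```

The advantage of a CRF over per-gap binary classifiers is that it scores the entire label sequence jointly, so a decision at one gap can depend on the decisions around it.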

Query Keyword Annotation #

Just now, I talked about the basic “segmentation” problem in understanding query keywords. It can be said that the “segmentation problem” is the first step in understanding query keywords. The next step is to analyze query keywords in more detail.

Returning to the previous example “White House Opening”, we not only want to know that this query keyword can be segmented into “White House” and “Opening”, but also want to know if “White House” is the name of a building or a geographical location, while “Opening” may be a noun suggesting “opening hours”. In other words, we want to “annotate” the phrases in the query keyword to obtain their “attributes”. The component that annotates the phrases segmented from the query keyword is called “Query Keyword Annotation”.

So, how does annotation information help with search results? Consider the query keyword “apple price”. If, in the user’s context, “apple” carries the attribute “fruit”, then the result of this query should be fruit prices, and the search engine might also return information about nearby supermarkets. But if “apple” actually means the “mobile phone”, then the optimal result may be the official sales website of Apple Inc. You can see that, depending on which attribute “apple” carries, the best results can vary greatly.

There are many methods for annotating query keywords. Here I recommend another classic paper titled “Structural annotation of search queries using pseudo-relevance feedback”. This paper uses PRF (Pseudo-Relevance Feedback) for annotation. A core technical challenge is that a query contains very little information on its own, so a lot of auxiliary information is needed for annotation, and PRF supplies exactly that.
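The PRF idea can be illustrated with a deliberately simplified sketch: treat the top-ranked documents for the query as pseudo-relevant, and mine them for evidence about each term’s attribute. The snippets, cue lexicons, and attribute names below are all invented for illustration and are not the method of the cited paper, which is considerably more sophisticated:

```python
# Toy top-ranked snippets for the query "apple price", standing in for
# real retrieval results; PRF assumes the top results are relevant.
pseudo_relevant = [
    "apple iphone price drops ahead of new launch",
    "best price on apple iphone and ipad today",
    "apple quarterly results lift share price",
]

# Hypothetical attribute lexicons: context words hinting at each sense.
CUES = {"device": {"iphone", "ipad", "launch"},
        "fruit": {"orchard", "juice", "kg"}}

def annotate(term, snippets):
    """Pick the attribute whose cue words co-occur most often with the
    term across the pseudo-relevant snippets."""
    counts = {attr: 0 for attr in CUES}
    for snippet in snippets:
        words = set(snippet.split())
        if term in words:
            for attr, cues in CUES.items():
                counts[attr] += len(words & cues)
    return max(counts, key=counts.get)

print(annotate("apple", pseudo_relevant))  # device
```

The point of the sketch is only the feedback loop: the retrieved documents, not the two-word query itself, supply the context that disambiguates the attribute.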

Another mainstream method for query keyword annotation is still using Conditional Random Fields (CRF). As I mentioned earlier, CRF is a good tool for sequence modeling. So in this case, taking “apple price” as an example, CRF needs to predict whether the label is a combination output of “mobile phone noun” or “fruit noun”. Traditional binary or multi-class classifiers are not good at capturing the sequence information here, while CRF is a powerful tool to solve this problem.

Therefore, what we need to do is to construct features for query keywords, and then directly feed them into CRF. One thing to note is that the success of applying CRF depends a lot on the amount of data. Therefore, constructing a dataset with annotated information becomes a core challenge in query keyword annotation.
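Constructing per-token features is the part we control directly. Below is a minimal sketch of the kind of feature dictionaries typically fed to a CRF toolkit (for example, sklearn-crfsuite accepts this dict-per-token format); the specific feature set is an illustrative choice of mine, not one taken from the papers above:

```python
def token_features(tokens, i):
    """Build a simple feature dict for token i: its identity, shape,
    and immediate neighbors, with sentence-boundary markers."""
    return {
        "word": tokens[i].lower(),
        "is_title": tokens[i].istitle(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

query = ["apple", "price"]
features = [token_features(query, i) for i in range(len(query))]
print(features)
```

The neighbor features are what allow the model to learn, for instance, that “apple” followed by “price” behaves differently from “apple” followed by “pie”, which is exactly the sequence information a per-token classifier would miss.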

Summary #

Today I discussed an important aspect of modern search technology: query keyword parsing, a key part of query keyword understanding. As you have seen, query keyword parsing can be divided into two important modules: query keyword segmentation and query keyword annotation.

Let’s review the key points together: First, I briefly introduced the scenarios and three main techniques of query keyword segmentation, which are “N-gram”, “mutual information”, and “conditional random fields”. Second, I provided a detailed introduction to the scenarios and main techniques of query keyword annotation, including the use of pseudo-relevance feedback (PRF) and conditional random fields (CRF) as two mainstream annotation methods.

Finally, I leave you with a question to ponder: I discussed the parsing problem of English query keywords, so what are the special challenges for Chinese query keywords?

References #

  1. Matthias Hagen, Martin Potthast, Benno Stein, and Christof Bräutigam. Query segmentation revisited. Proceedings of the 20th International Conference on World Wide Web (WWW '11), 97-106. ACM, New York, NY, USA, 2011.
  2. Michael Bendersky, W. Bruce Croft, and David A. Smith. Structural annotation of search queries using pseudo-relevance feedback. Proceedings of the 19th ACM International Conference on Information and Knowledge Management (CIKM '10), 1537-1540. ACM, New York, NY, USA, 2010.