037 Query Keyword Understanding Part One Classification

037 Query Keyword Understanding Part One - Classification #

In the previous two columns, we mainly discussed the most classic Information Retrieval (IR) techniques and machine learning-based ranking algorithms.

Classic IR techniques provided the basic algorithmic support for search engines before the year 2000. The derived methods such as TF-IDF, BM25, and Language Models, as well as their variations, continue to play a role in many fields (not limited to text).

Learning-to-rank algorithms, on the other hand, drove the emergence and development of a wide range of machine learning-based search algorithms between 2000 and 2010, and helped search engine technology mature further.

This week, we move from ranking algorithms to a very important part of the ranking problem: query keyword understanding. In other words, we want to use query keywords to understand the purpose behind a user's behavior. Features derived from query keywords are often strong signals for ranking and a vital source of information for personalizing search results. It is therefore worth gaining a solid understanding of the techniques used in query keyword understanding.

The most basic step in query keyword understanding is categorizing query keywords to understand the user’s intent. Today, I will discuss some basic concepts and techniques of query keyword classification, allowing you to have a basic understanding of development and research in this area.

History of Query Keyword Classification #

From the first day that commercial search engines came into existence, people realized that a lot of information about users, especially their intent, could be derived from query keywords. As early as 1997, the commercial search engine Excite began studying a data set on the order of a million query keywords. However, the first systematic treatment of query keyword classification came from Andrei Broder’s paper “A Taxonomy of Web Search”.

Andrei is well known in the field. During his PhD at Stanford University, he was advised by Turing Award winner Donald Knuth. He then worked as Chief Scientist at AltaVista, the once-famous first-generation search engine company (later acquired by Yahoo). After that, he joined IBM Research in New York to build an enterprise search platform. Since 2012, he has been with Google as a Distinguished Scientist. He is also a fellow of both ACM (Association for Computing Machinery) and IEEE (Institute of Electrical and Electronics Engineers).

Andrei’s paper laid a solid foundation for query keyword classification. Since then, many researchers have focused on how to automate the classification and how to define user intent more precisely.

Explanation of Keyword Categories in Queries #

Let me start with Andrei’s famous paper. Before web search became a mainstream means of information retrieval, traditional information retrieval held that the main purpose of a query was to fulfill an abstract “information need.” The main applications of traditional information retrieval were library search and search within organizations such as government institutions and schools. In such scenarios, it made sense to assume that every query primarily aimed to satisfy some “information need.”

However, as early as 2002, Andrei argued that this traditional assumption no longer suited the internet era. He grouped the purposes behind query keywords into three main categories:

Navigational Intent;
Informational Intent;
Transactional Intent.

In the more than ten years that followed, these three categories of query keywords became the cornerstone of research and practice in this field. Let’s first take a look at the meaning of these categories.

The first category refers to queries with navigational intent, i.e., queries whose goal is to reach a certain website. This could be a website that the user has visited before or a website the user assumes exists based on the submitted query keyword. This category includes company names (such as “Microsoft”), people’s names (such as “Obama”), or the names of certain services (such as “FedEx”), among others.

An important characteristic of these query keywords is that, in most cases, they correspond to a single “standard answer” website, or at most a handful of them. For example, when searching for “Microsoft Corporation,” the desired result is the official website of Microsoft Corporation. The answer is not always limited to an official site, though: when searching for “Obama,” an aggregation website that lists all U.S. presidents would also be considered an acceptable answer.

The second category refers to queries with informational intent, i.e., queries whose goal is to gather information. This category is very similar to traditional information retrieval. It is worth mentioning that, according to subsequent research, the goals behind these query keywords include not only finding web pages with authoritative content but also finding “hub” websites, which in plain terms are pages that list sources of authoritative information.

The third category refers to queries with transactional intent, i.e., queries whose goal is to reach an intermediary site in order to complete a transaction. These query keywords are mainly related to shopping. Today e-commerce feels entirely natural to us. More than a decade ago, however, when search research was still dominated by traditional information retrieval, proposing a “transactional” type of intent was quite innovative.

Of course, a categorization that existed only at the conceptual level would be of limited value. Andrei also ran a survey on the AltaVista search engine and collected feedback from more than 3,000 users, which counted as a large-scale study for 2001. The results were as follows: among the queries users reported, navigational query keywords accounted for 26%, transactional query keywords accounted for 24%, and the remaining nearly 50% were informational query keywords. An analysis of the query logs further confirmed these figures.

As you can see, categorizing query keywords in this way is a necessary step in modeling user behavior, and many researchers quickly recognized its value. However, relying solely on user feedback to obtain this kind of information does not scale.

There are three main reasons for this. First, we cannot rely on users to report the intent behind every query keyword they submit. Second, manual annotation is infeasible when faced with billions of user-submitted query keywords. Lastly, Andrei’s three-category taxonomy is still too coarse, and practical applications call for finer-grained user intent.

Turning query keyword classification into a standard machine learning task is fairly straightforward: we treat it as a supervised learning problem. Each query keyword is a data sample, and the response variable is its category. The exact formulation depends on whether we assume the categories are mutually exclusive, so that each query keyword belongs to exactly one of them (multi-class classification), or allow several categories to apply to the same query keyword at once (multi-label classification).
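To make the two formulations concrete, here is a minimal sketch of how the labels might be encoded in each case. The `LabeledQuery` type, the helper names, and the example labels are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

INTENTS = ("navigational", "informational", "transactional")


@dataclass(frozen=True)
class LabeledQuery:
    """One training sample: a query keyword plus its annotated intent(s)."""
    query: str
    intents: frozenset  # one element for multi-class, possibly several for multi-label


def to_class_index(sample):
    """Multi-class view: exactly one intent per query, encoded as an index."""
    if len(sample.intents) != 1:
        raise ValueError("multi-class encoding needs exactly one intent")
    (intent,) = sample.intents
    return INTENTS.index(intent)


def to_indicator_vector(sample):
    """Multi-label view: one 0/1 flag per intent; several flags may be set."""
    return [1 if intent in sample.intents else 0 for intent in INTENTS]


# Hypothetical annotations, not taken from any real data set.
single = LabeledQuery("microsoft", frozenset({"navigational"}))
mixed = LabeledQuery("canon camera", frozenset({"informational", "transactional"}))

print(to_class_index(single))      # 0
print(to_indicator_vector(mixed))  # [0, 1, 1]
```

The multi-class encoding rejects `mixed` outright, which is exactly the modeling decision the two formulations differ on.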

In the simplest assumption, query keyword classification is a multi-class classification problem that can be solved using generic multi-class classifiers such as Support Vector Machines (SVM), Random Forests, and Neural Networks.
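As a rough illustration of the multi-class setup, the sketch below trains a tiny multi-class perceptron on bag-of-words features. The perceptron is only a dependency-free stand-in for the classifiers named above (SVM, Random Forests, Neural Networks), and the training queries and labels are made up for this example.

```python
from collections import defaultdict

LABELS = ["navigational", "informational", "transactional"]

# Hypothetical toy training set; real systems learn from large labeled query logs.
TRAIN = [
    ("microsoft", "navigational"),
    ("fedex tracking site", "navigational"),
    ("facebook login", "navigational"),
    ("history of the roman empire", "informational"),
    ("how do vaccines work", "informational"),
    ("what is bm25", "informational"),
    ("buy canon camera", "transactional"),
    ("cheap flights to tokyo", "transactional"),
    ("order pizza online", "transactional"),
]


def featurize(query):
    """Bag-of-words features (token -> count) plus a constant bias feature."""
    feats = defaultdict(float)
    for token in query.lower().split():
        feats[token] += 1.0
    feats["<bias>"] = 1.0
    return feats


def score(weights, feats, label):
    """Linear score of one label for the given feature vector."""
    return sum(weights[label][name] * value for name, value in feats.items())


def predict(weights, query):
    """Return the label with the highest linear score."""
    feats = featurize(query)
    return max(LABELS, key=lambda label: score(weights, feats, label))


def train(samples, epochs=10):
    """Multi-class perceptron: on a mistake, reward the gold label and penalize the guess."""
    weights = {label: defaultdict(float) for label in LABELS}
    for _ in range(epochs):
        for query, gold in samples:
            guess = predict(weights, query)
            if guess != gold:
                for name, value in featurize(query).items():
                    weights[gold][name] += value
                    weights[guess][name] -= value
    return weights


weights = train(TRAIN)
print(predict(weights, "buy cheap camera"))  # transactional
```

In practice one would swap the perceptron for an off-the-shelf classifier and a far larger labeled set; the sample/label structure stays the same.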

For most supervised learning tasks, one of the most important components is the choice of features. Much of the research and development in the years that followed focused on trying different features to see whether they improved classification accuracy.

Past research has repeatedly shown that the following types of features are very effective.

The first type of feature is the information contained in the query keyword itself. For example, if the query keyword contains a known person or company name, the classification result is unlikely to be the transactional intent category. In other words, certain words or phrases in the query keyword are associated with particular categories, and this association can be captured directly as features.
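A minimal sketch of such query-text features might look like the following. The two lexicons and the feature names are hypothetical; a production system would rely on much larger entity dictionaries and learned vocabularies.

```python
# Hypothetical lexicons for illustration only.
KNOWN_ENTITIES = {"microsoft", "fedex", "obama"}
TRANSACTION_CUES = {"buy", "price", "cheap", "order", "download"}


def query_text_features(query):
    """Features computed from the query keyword alone."""
    tokens = query.lower().split()
    # One indicator feature per word, so a classifier can learn word/category associations.
    feats = {f"word={token}": 1.0 for token in tokens}
    feats["num_tokens"] = float(len(tokens))
    feats["has_known_entity"] = float(any(t in KNOWN_ENTITIES for t in tokens))
    feats["has_transaction_cue"] = float(any(t in TRANSACTION_CUES for t in tokens))
    return feats


print(query_text_features("Buy Canon camera"))
```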

The second type of feature is the information in the pages that the search engine returns for the query keyword. If a search for “Barack Obama” returns only Wikipedia pages and Obama Foundation pages, those pages are unlikely to contain any commercial purchase information. For the query keyword “Canon camera,” on the other hand, the returned pages are likely to be product pages from e-commerce websites, which helps determine the category of “Canon camera” more accurately.
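One simple way to turn returned pages into features, sketched below, is to measure what share of the results comes from known types of sites. The domain lists and example URLs are placeholders invented for this sketch, not a real classification of websites.

```python
from urllib.parse import urlparse

# Hypothetical domain lists; real systems would use curated or learned site categories.
COMMERCE_DOMAINS = {"shop.example.com", "store.example.org"}
REFERENCE_DOMAINS = {"en.wikipedia.org"}


def result_page_features(result_urls):
    """Share of returned results that come from commerce or reference sites."""
    hosts = [urlparse(url).netloc.lower() for url in result_urls]
    n = len(hosts) or 1  # avoid division by zero for an empty result list
    return {
        "commerce_share": sum(h in COMMERCE_DOMAINS for h in hosts) / n,
        "reference_share": sum(h in REFERENCE_DOMAINS for h in hosts) / n,
    }


canon_results = [
    "https://shop.example.com/canon-eos",
    "https://store.example.org/cameras",
    "https://en.wikipedia.org/wiki/Canon_Inc.",
]
print(result_page_features(canon_results))
```

A high `commerce_share` would push a classifier toward transactional intent, while a high `reference_share` would point toward informational intent.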

The third type of feature is user behavior information, such as which websites users click on and stay on after entering a query keyword. Generally, a high click-through rate and a long dwell time on a website indicate that it is among the more relevant of the returned results. Using these websites to represent what the query keyword is about may therefore be more reliable than using all returned results.
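The idea can be sketched as a per-query profile of the sites users actually engaged with. The log format `(query, host, dwell_seconds)` and the 30-second dwell threshold are assumptions made for this example, not values from any particular search engine.

```python
from collections import defaultdict


def behavior_profile(click_log, query, min_dwell_seconds=30.0):
    """Distribution over clicked hosts for a query, counting only long-dwell clicks."""
    counts = defaultdict(int)
    for logged_query, host, dwell_seconds in click_log:
        if logged_query == query and dwell_seconds >= min_dwell_seconds:
            counts[host] += 1
    total = sum(counts.values())
    if total == 0:
        return {}  # no usable behavior signal for this query
    return {host: count / total for host, count in counts.items()}


# Hypothetical click log entries.
log = [
    ("canon camera", "shop.example.com", 120.0),
    ("canon camera", "shop.example.com", 45.0),
    ("canon camera", "en.wikipedia.org", 5.0),   # short visit, ignored
    ("canon camera", "store.example.org", 60.0),
    ("obama", "en.wikipedia.org", 200.0),
]
print(behavior_profile(log, "canon camera"))
```

The resulting distribution can be fed to the classifier directly, or combined with site categories such as commerce versus reference.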

In practical applications, query keyword classification is often challenging. In a typical modern search engine, around one-third of the query keywords that appear each day, or even more, have never been seen before. How to handle previously unseen query keywords and low-frequency query keywords in the long tail has therefore become an important factor in improving the accuracy of search results. I won’t delve into these topics today, but if you’re interested, you can look up the relevant research papers.

Summary #

Today I talked about a very basic but crucial aspect of modern search technology: categorizing user intent as part of query keyword understanding. As you have seen, query keywords are broadly divided into three categories: informational intent, transactional intent, and navigational intent.

Let’s review the key points together: first, a brief introduction to the historical background of query keyword categorization, where Andrei Broder’s paper laid a solid foundation for this field. Second, a detailed explanation of the main categories and how to achieve automation through the construction of multi-class classifiers.

Finally, I’ll leave you with a thought-provoking question: how should we use the results of query keyword categorization in machine learning ranking algorithms?

Further reading: Andrei Broder. A taxonomy of web search. ACM SIGIR Forum, 36(2), 2002.