002 Fine Reading of the Best Research Paper at KDD 2017 #

In our previous article, we introduced the Test of Time Award at the KDD conference. Another highlight of the conference is the Best Paper Award, which has two categories: Best Research Paper and Best Applied Data Science Paper. Today, let’s talk about the former.

Each year, the conference selects the most innovative and valuable work from a large pool of submissions and awards a first place and a runner-up for Best Research Paper. Judging by the past decade, KDD’s Best Research Papers have had a pioneering impact on research in many fields. So whether your goal is to read classic literature or to keep up with the latest research results, carefully analyzing and discussing each year’s Best Research Paper is a great choice.

Today, I will walk you through a thorough analysis of the KDD 2017 Best Research Paper, titled “Accelerating Innovation Through Analogy Mining”.

Author Group Information Introduction #

First author Tom Hope is a third-year computer science PhD student at The Hebrew University of Jerusalem. He is also a senior data scientist at Intel Israel, with research interests spanning many aspects of deep learning. He is currently writing a concise technical book on deep learning with TensorFlow.

Fourth author Dafna Shahaf is Tom Hope’s doctoral advisor and an assistant professor in the Computer Science department at The Hebrew University. Dafna received her PhD from Carnegie Mellon University in 2012. She has interned at Microsoft Research and Fujitsu Labs, and did postdoctoral research at Stanford University. Her work won the Best Research Paper award at KDD 2010, establishing her as a leading researcher in machine learning.

Second author Joel Chan is a research scientist in the School of Computer Science at Carnegie Mellon University. Joel obtained his PhD in cognitive psychology from the University of Pittsburgh in 2014 and has since been doing research in human-computer interaction.

Third author Aniket Kittur is an associate professor in the School of Computer Science at Carnegie Mellon University. He received his PhD in cognitive psychology from the University of California, Los Angeles in 2009 and has taught at Carnegie Mellon University ever since.

The composition of the author team makes it clear that this paper is a typical interdisciplinary result, combining machine learning with human-computer interaction research.

Main Contributions of the Paper #

Let’s first take a look at the main contributions of this article. Of course, to deeply understand the contributions of this article, we need to first understand the problem that this article mainly addresses.

This paper tackles an important step in facilitating innovation: how to find suitable and effective analogical cases. What is an analogical case? Throughout the history of human development, and especially in scientific and technological innovation, many important inventions and discoveries came about because scientists borrowed solutions from similar scenarios, or drew inspiration from those scenarios and innovated further. This borrowing from similar scenarios is commonly called analogy.

For example, the Wright brothers drew inspiration from bicycles when they built their gliding aircraft. Salvador Luria, a Nobel laureate in Physiology or Medicine, worked out the pattern of bacterial gene mutation by watching slot machines in operation. Analogy is also ubiquitous in the laws of many countries, both in writing and in practice. Clearly, finding appropriate analogies and drawing inspiration from them can be a key factor in innovation.

In today’s internet era, we already have many databases and repositories that can serve as rich sources of analogical inspiration. Google Scholar hosts millions of papers and patents; OpenIDEO collects hundreds of analyses of social issues; Quirky has over two million product ideas; InnoCentive holds over forty thousand solutions to social, political, and technological problems; and the United States Patent and Trademark Office has over nine million patents, to name just a few.

At the same time, this abundance of data makes finding inspiration much harder. Locating potential analogical scenarios in repositories with millions of entries has become a major bottleneck limiting the pace of innovation.

Finding analogical scenarios poses one major challenge: how to define “similarity”. If we only consider surface features, many genuinely analogous scenarios appear very dissimilar, so a good method must identify cases with deep similarities that go beyond superficial resemblance. Another challenge is whether a sufficiently rich data structure exists to support analogical reasoning; when such a structure is available, simpler methods often suffice.

The focus of this paper is how to extract such analogical scenarios from massive unstructured or weakly structured text data. It can be seen that this is indeed a huge challenge.

Having understood the paper’s purpose, I can now summarize its contribution directly: an automatic method for mining analogical scenarios from massive unstructured text data, focusing on product-description data. The authors validate the proposed method on real data and demonstrate its effectiveness compared with several earlier text-processing methods.

Core Method of the Paper #

After understanding the purpose and contribution of this article, let’s analyze the method proposed by the authors.

First, the authors put forward a pair of concepts: “Purpose” and “Mechanism”. What does “Purpose” mean? It is the problem the product is intended to solve. And “Mechanism”? It is the means or method the product uses to solve that problem. Once we can identify a product’s purpose and mechanism, finding analogies becomes much easier.

For example, we can use different mechanisms for the same purpose or the same mechanism for different problems. The authors believe that “this classification of product information is in line with many engineering design processes and is a necessary part of the innovation process.”

With this idea in mind, the natural next step is to learn the purpose and mechanism from data, so that the purpose and mechanism of massive amounts of product information can be mined automatically. To learn this information, the authors propose a supervised learning approach that relies on labeled data. Specifically, they have crowd workers on Amazon Mechanical Turk label each sentence or phrase in the text as purpose information or mechanism information. In other words, the proposed algorithm is trained on labeled data.

First, we have a collection of texts, and for each text we collect K purpose annotations and K mechanism annotations. We then define a set of “Purpose Annotation” vectors, which are simply binary (0/1) vectors: when a word in the original text is marked as purpose, the corresponding element of the vector is set to 1; otherwise it is 0. “Mechanism Annotation” vectors are defined in the same way. Because there are K annotations, we get K “Purpose Annotation” vectors and K “Mechanism Annotation” vectors. Together, these two groups of vectors form a vector representation of the raw label information.
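To make the construction concrete, here is a minimal sketch (not the authors’ code) of how such binary annotation vectors could be built; the tokens and the worker labels below are hypothetical.

```python
# One annotated product description; each of K = 2 hypothetical crowd
# workers marks which token positions express purpose or mechanism.
tokens = ["a", "collar", "that", "keeps", "dogs", "from", "barking"]
purpose_labels = [{3, 4, 5, 6}, {4, 6}]   # e.g. "keeps dogs from barking"
mechanism_labels = [{1}, {0, 1}]          # e.g. "collar"

def annotation_vector(n_tokens, labeled_indices):
    """Return a 0/1 vector with 1 at every labeled token position."""
    return [1 if i in labeled_indices else 0 for i in range(n_tokens)]

purpose_vectors = [annotation_vector(len(tokens), s) for s in purpose_labels]
mechanism_vectors = [annotation_vector(len(tokens), s) for s in mechanism_labels]

print(purpose_vectors)    # [[0, 0, 0, 1, 1, 1, 1], [0, 0, 0, 0, 1, 0, 1]]
print(mechanism_vectors)  # [[0, 1, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 0]]
```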

The next step is to distill a single purpose vector and a single mechanism vector from each labeled document. The method used in this paper builds these vectors out of the embedding vectors of the individual words.

The specific procedure is as follows. First, for each of the K annotations, we collect the embedding vectors of the words belonging to that annotation and concatenate them. We then compute the TF-IDF (Term Frequency–Inverse Document Frequency) values of the corresponding words and keep the embedding vectors of the words with the highest TF-IDF values. A weighted average of these embeddings yields the document’s purpose vector or mechanism vector. The authors found that weighting by TF-IDF expresses the important information in the text more effectively. Note that this step depends on the document labels, which means these vectors can only be constructed on the training data.
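As an illustration, here is a hedged sketch of this pooling step, assuming a `tfidf` table and pretrained word `embeddings` are given; all values below are toy data.

```python
import numpy as np

def pooled_vector(words, embeddings, tfidf, top_n=5):
    """TF-IDF-weighted average of the embeddings of the top-n weighted words."""
    ranked = sorted(words, key=lambda w: tfidf.get(w, 0.0), reverse=True)[:top_n]
    weights = np.array([tfidf.get(w, 0.0) for w in ranked])
    vectors = np.stack([embeddings[w] for w in ranked])
    return (weights[:, None] * vectors).sum(axis=0) / weights.sum()

# Toy inputs: hypothetical 4-dimensional embeddings and TF-IDF weights.
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=4) for w in ["keeps", "dogs", "from", "barking"]}
tfidf = {"keeps": 0.2, "dogs": 0.9, "from": 0.1, "barking": 1.3}

purpose_vec = pooled_vector(["keeps", "dogs", "from", "barking"],
                            embeddings, tfidf, top_n=3)
print(purpose_vec.shape)  # (4,)
```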

So far, we have described how to obtain the purpose and mechanism vectors for labeled texts. How, then, can we extract purpose and mechanism vectors for unseen documents? The authors use a deep model, an RNN (Recurrent Neural Network), specifically a bidirectional RNN with GRUs (Gated Recurrent Units).

I won’t go into the details here, but the overall idea is to compute a set of hidden representations (intermediate states) of a document from its word embeddings, and then use these hidden representations to predict the purpose and mechanism vectors. Note that because there are two prediction targets, the purpose vector and the mechanism vector, the model needs at least two sets of output parameters.
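Here is a minimal sketch of this kind of model in PyTorch, with illustrative dimensions. The mean pooling over tokens and the two linear heads are my simplifications, not necessarily the paper’s exact architecture; training would minimize, say, the mean squared error between each head’s output and the target vectors built above.

```python
import torch
import torch.nn as nn

class PurposeMechanismRNN(nn.Module):
    """Bidirectional GRU over word embeddings with two regression heads."""
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=128, target_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Two separate parameter sets: one head per target vector.
        self.purpose_head = nn.Linear(2 * hidden_dim, target_dim)
        self.mechanism_head = nn.Linear(2 * hidden_dim, target_dim)

    def forward(self, token_ids):
        hidden, _ = self.rnn(self.embed(token_ids))  # (batch, seq, 2*hidden)
        pooled = hidden.mean(dim=1)                  # simple pooling over tokens
        return self.purpose_head(pooled), self.mechanism_head(pooled)

model = PurposeMechanismRNN()
tokens = torch.randint(0, 10000, (2, 12))            # toy batch of 2 documents
purpose_pred, mechanism_pred = model(tokens)
print(purpose_pred.shape, mechanism_pred.shape)      # torch.Size([2, 300]) each
```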

In addition to predicting a document’s purpose and mechanism vectors, the authors also propose explaining these two vectors with a few keywords. Specifically, a new learning objective is set up that minimizes the error between the purpose or mechanism vector and the embedding vectors of a small set of keywords.
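A hedged sketch of this idea, simplified to a nearest-neighbor search: given a predicted purpose (or mechanism) vector, pick the few vocabulary words whose embeddings are closest to it, so the dense vector can be read as a handful of keywords. The vocabulary and embeddings below are toy data.

```python
import numpy as np

def explain(vector, embeddings, k=3):
    """Return the k vocabulary words whose embeddings best match `vector`."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    ranked = sorted(embeddings, key=lambda w: cosine(vector, embeddings[w]), reverse=True)
    return ranked[:k]

rng = np.random.default_rng(2)
vocab = {w: rng.normal(size=4) for w in ["collar", "barking", "dogs", "noise", "strap"]}
# A predicted purpose vector that happens to sit near "barking".
purpose_vec = vocab["barking"] + 0.1 * rng.normal(size=4)
print(explain(purpose_vec, vocab, k=2))  # e.g. ['barking', ...]
```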

Experimental Results of the Method #

The authors used the Quirky dataset and collected annotations for more than eight thousand product descriptions via Amazon Mechanical Turk. First, they tested whether the learned purpose and mechanism vectors make it easy to retrieve analogical information from massive data; here, each product is represented simply by concatenating its purpose and mechanism vectors. The results were striking: among the top 1% to 25% of retrieved results, both precision and recall were 10% to 20% higher than with previous standard text-processing methods such as LDA, TF-IDF, and global embedding vectors.
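To make the retrieval setup concrete, here is a small sketch, with toy vectors, of representing each product by the concatenation of its purpose and mechanism vectors and ranking candidates by cosine similarity to a query.

```python
import numpy as np

def concat_repr(purpose_vec, mechanism_vec):
    """Represent a product by its concatenated purpose and mechanism vectors."""
    return np.concatenate([purpose_vec, mechanism_vec])

def rank_analogies(query, corpus):
    """Return corpus indices sorted by cosine similarity to the query."""
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    sims = [cosine(query, doc) for doc in corpus]
    return np.argsort(sims)[::-1]

rng = np.random.default_rng(1)
corpus = [concat_repr(rng.normal(size=4), rng.normal(size=4)) for _ in range(5)]
query = concat_repr(rng.normal(size=4), rng.normal(size=4))
print(rank_analogies(query, corpus))  # indices of the most similar products first
```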

The authors also tested whether the proposed method recommends better analogical scenarios to users. Here, 38 crowd workers on Amazon Mechanical Turk rated ideas for 12 products. Without knowing which method produced each recommendation, the workers judged that the method proposed in this paper recommended more novel and more deeply similar scenarios. This partially addresses the challenge raised earlier in the paper.

Summary #

Today I discussed the Best Research Paper of KDD 2017, which presented an automated approach to mine analogy information, paving the way for rapid innovation. Let’s review the key points:

  1. I briefly introduced the authors of the paper, helping you understand that it combines machine learning and human-computer interaction research.
  2. I provided a detailed explanation of the problem that the paper aims to address and its contributions.
  3. I gave a brief overview of the core content of the proposed method in the paper.

Finally, I leave you with a question to ponder: Is it possible to achieve the same results without using labeled information, using a completely unsupervised approach?

Further reading: Accelerating Innovation Through Analogy Mining