006 Fine Reading of the Best Short Paper at EMNLP 2017 #

At this year’s EMNLP conference, two types of research papers were published: 8-page long papers presenting more complete research results, and 4-page short papers presenting newer results that still invite further scrutiny. The conference selected two best papers from the long papers and one from the short papers.

Previously, we discussed the two best long papers; today, I’ll take you through an in-depth analysis of the best short paper at EMNLP 2017, “Natural Language Does Not Emerge ‘Naturally’ in Multi-Agent Dialog.” Although this paper was honored in the short-paper category, the authors have since published a longer version on arXiv, so my explanation today is based on that longer version.

The main focus of this paper is how to avoid generating “unnatural” dialogue in multi-agent conversations. Previous research on dialogue mostly emphasized accuracy, but in reality the dialogue generated by machines is unnatural, quite unlike the way humans communicate. This paper explores how such unnatural dialogue arises and whether there are ways to avoid it.

Author Information #

The first author, Satwik Kottur, is a fourth-year Ph.D. student at Carnegie Mellon University, specializing in computer vision, natural language, and machine learning. In the summer of 2016, he interned on Snapchat’s research team, working on personalization in dialogue systems. In the summer of 2017, he interned at Facebook Research, working on visual dialog systems. Over the past two years, Satwik has published several high-quality research papers at top international conferences such as ICML 2017, IJCAI 2017, CVPR 2017, ICCV 2017, and NIPS 2017, in addition to this best short paper at EMNLP 2017. He can fairly be called a rising star in the field.

The second author, José M. F. Moura, is Satwik’s advisor at Carnegie Mellon University. José is a member of the National Academy of Engineering (NAE) and a fellow of the Institute of Electrical and Electronics Engineers (IEEE). He has long worked on signal processing, big data, and data science. He was elected IEEE President-Elect for 2018 and will help steer the future development of the IEEE.

The third author, Stefan Lee, is a research scientist at the Georgia Institute of Technology. He previously worked at Virginia Tech and has long conducted research in areas such as computer vision and natural language processing. Stefan received his Ph.D. in Computer Science from Indiana University in 2016.

The fourth author, Dhruv Batra, is currently a scientist at Facebook Research and an assistant professor at the Georgia Institute of Technology. Dhruv received his Ph.D. from Carnegie Mellon University in 2010. From 2010 to 2012, he was a research assistant professor at the Toyota Technological Institute at Chicago. From 2013 to 2016, he taught at Virginia Tech. Dhruv works on artificial intelligence, particularly visual systems and human-computer interaction. Stefan, the third author, is a long-time research collaborator of Dhruv’s, and together they have published multiple high-quality papers, including this one.

Main Contributions of the Paper #

Let’s first take a look at what problem this article primarily addresses.

One core scenario, or goal, of artificial intelligence is to create a goal-driven automatic dialogue system. Specifically, in such a system, robots are able to perceive their environment (including visual, auditory, and other sensory inputs) and engage in conversations with humans or other robots using natural language in order to achieve certain objectives.

Currently, there are two main approaches to research on goal-driven automatic dialogue systems.

One approach treats the entire problem as a static supervised learning task, aiming to model the dialogue system using a large amount of data and neural dialogue models. Although this approach has achieved some success in recent years, it still struggles to address a fundamental challenge: the generated “dialogue” does not resemble human conversation and lacks many characteristics of real language.

The other approach considers the task of learning dialogue systems as a continuous process and models the entire dialogue system using reinforcement learning.

This paper explores under what circumstances robots can learn a language similar to human language. Its key finding is that natural language does not emerge “naturally”: in the current state of research, how natural language comes to emerge remains an open question without a definitive answer. This observation is the main contribution of this excellent short paper.

Core Method of the Paper #

The entire paper is built on a virtual interaction scenario: an environment in which two robots engage in a dialogue. The environment contains a very limited number of objects, each with three attributes (color, shape, and style). Each attribute has four possible values, for a total of 64 objects in this virtual world.
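To make the setup concrete, here is a minimal Python sketch (not the authors’ code) that enumerates this instance space; the concrete attribute values below are illustrative placeholders.

```python
from itertools import permutations, product

# Illustrative attribute values; the paper's world likewise has 3 attributes
# with 4 values each, but the concrete names here are placeholders.
ATTRIBUTES = {
    "color": ["red", "green", "blue", "purple"],
    "shape": ["square", "triangle", "circle", "star"],
    "style": ["dotted", "solid", "filled", "dashed"],
}

# An object fixes one value per attribute: 4 * 4 * 4 = 64 objects in total.
objects = list(product(*ATTRIBUTES.values()))
assert len(objects) == 64

# A task is an ordered pair of attributes whose values Q must guess: 6 tasks.
tasks = list(permutations(ATTRIBUTES, 2))
assert len(tasks) == 6
```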


The interaction task is essentially a “guessing game” between the two robots, which we will call the Q robot and the A robot. At the beginning of the game, the A robot is given an object, that is, a particular assignment of values to the three attributes. The Q robot cannot see this object. Q is then given the names of two attributes and must guess the values those attributes take on A’s object.

During this “game,” A does not know which two attributes Q has been asked about, and Q knows neither A’s object nor the values of its attributes. Dialogue is therefore the key to Q’s success.

In this paper, the Q-and-A game is modeled with reinforcement learning. Q maintains an internal state that records the attribute pair to be guessed, along with all of Q’s questions and A’s answers up to the current round. Similarly, A maintains a state recording the information it has seen so far. The reinforcement signal arrives at the end of the game: a reward of +1 when the final prediction is completely correct, and a penalty of -10 when it is wrong.
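The following sketch walks through one episode of this game with the reward attached at the end. The round structure and the +1/-10 reward follow the paper’s description; the agent interfaces (`observe`, `listen`, `speak`, `predict`) and the random stub policy are assumptions made purely for illustration.

```python
import random

class RandomBot:
    """Stub agent standing in for a learned policy; it replies with random tokens."""
    def __init__(self, vocab):
        self.vocab = vocab
    def observe(self, instance):      # A-bot: see the hidden object
        self.instance = instance
    def listen(self, message):        # both bots: absorb the task / a message
        pass
    def speak(self):                  # both bots: emit one token
        return random.choice(self.vocab)
    def predict(self, values, task):  # Q-bot: one value guess per task attribute
        return tuple(random.choice(values) for _ in task)

def play_episode(q_bot, a_bot, instance, task, values, num_rounds=2):
    """One episode of the guessing game; structure follows the paper's description."""
    q_bot.listen(task)        # Q starts from the task: a pair of attribute names
    a_bot.observe(instance)   # A starts from the hidden object
    for _ in range(num_rounds):
        a_bot.listen(q_bot.speak())   # Q asks
        q_bot.listen(a_bot.speak())   # A answers
    guess = q_bot.predict(values, task)
    target = tuple(instance[attr] for attr in task)
    # Terminal reward, shared by both agents: +1 if fully correct, -10 otherwise.
    return 1.0 if guess == target else -10.0

# Usage (candidate values are pooled here as a simplification):
reward = play_episode(
    q_bot=RandomBot(["q1", "q2", "q3"]),
    a_bot=RandomBot(["a1", "a2", "a3", "a4"]),
    instance={"color": "red", "shape": "star", "style": "solid"},
    task=("color", "shape"),
    values=["red", "green", "blue", "purple", "square", "triangle", "circle", "star"],
)
```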

The Q model has three modules: listen, speak, and predict; the A model has the corresponding listen and speak modules. Taking Q as an example, the “listen” module first takes in the task (the attributes to be guessed) and then, at each subsequent step, takes in A’s utterance to update Q’s internal state. The “speak” module decides, based on the current internal state, what to say next. Finally, the “predict” module predicts the attribute values from the accumulated state.

The A robot’s structure mirrors Q’s, except that instead of a prediction module it encodes the object it holds. Each module is built around an LSTM (Long Short-Term Memory) network, and the different modules have separate parameters. The whole model is trained with the REINFORCE algorithm (the “vanilla” policy gradient), and the implementation uses the PyTorch package.
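To make the module structure concrete, here is a schematic PyTorch sketch of the Q robot. It is an illustration under stated assumptions (layer sizes, the task being fed in as special token ids, a linear head per module), not the authors’ released model.

```python
import torch
import torch.nn as nn

class QBot(nn.Module):
    """Schematic Q-bot with listen/speak/predict modules (illustrative sizes)."""
    def __init__(self, in_vocab, out_vocab, num_values, hidden=64):
        super().__init__()
        # Assumption: the task is encoded as special token ids within in_vocab.
        self.embed = nn.Embedding(in_vocab, hidden)
        self.listener = nn.LSTMCell(hidden, hidden)     # "listen": update dialogue state
        self.speaker = nn.Linear(hidden, out_vocab)     # "speak": policy over tokens
        self.predictor = nn.Linear(hidden, num_values)  # "predict": one attribute's value
        self.state = None                               # (h, c) of the LSTM

    def listen(self, token_id):
        # Call listen() at least once before speak()/predict().
        x = self.embed(torch.tensor([token_id]))
        self.state = self.listener(x, self.state)

    def speak(self):
        dist = torch.distributions.Categorical(logits=self.speaker(self.state[0]))
        token = dist.sample()
        return token.item(), dist.log_prob(token)   # keep log-prob for REINFORCE

    def predict(self):
        # Called once per attribute to be guessed (Q must guess two).
        dist = torch.distributions.Categorical(logits=self.predictor(self.state[0]))
        guess = dist.sample()
        return guess.item(), dist.log_prob(guess)

# REINFORCE update for one episode: with terminal reward r and the log-probs of
# every sampled token/guess collected in `log_probs`:
#   loss = -r * torch.stack(log_probs).sum()
#   loss.backward(); optimizer.step()
```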

Experimental Results of the Method #

The authors show that, with this method, Q quickly learns to make predictions with relatively high accuracy, and that a “language” does emerge from the interaction with A. Regrettably, on closer inspection the authors found that this “language” is often not natural. The most telling case: A can ignore Q’s utterances entirely and simply “leak” its internal information to Q through some arbitrary encoding, letting Q win the game with near-perfect predictions. This is obviously not the desired outcome.

The authors found that this degenerate behavior, A encoding its entire state for Q, is especially likely when the vocabulary is very large. They therefore hypothesize that, for communication to be meaningful, the vocabulary must not be too large.

Accordingly, the authors limited the vocabulary, setting Q’s vocabulary size to the number of attributes and A’s to the number of values per attribute. This caps the complexity of what can be communicated, making it impossible for A to over-communicate. Under this constraint alone, however, the agents learn to communicate a single attribute accurately but fail to compose attributes (and Q ultimately needs to guess two of them).
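Concretely, the minimal setting might look like the following configuration sketch; the counts follow the environment described above, and treating them as the two vocabulary sizes is my reading of the paper rather than a quoted configuration.

```python
# Illustrative "minimal vocabulary" setting (an assumption for exposition):
NUM_ATTRIBUTES = 3            # color, shape, style
NUM_VALUES_PER_ATTRIBUTE = 4  # four possible values each

VOCAB_SIZE_Q = NUM_ATTRIBUTES            # 3 symbols: just enough to name an attribute
VOCAB_SIZE_A = NUM_VALUES_PER_ATTRIBUTE  # 4 symbols: just enough to name a value
```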

The remedy proposed in the paper is to make A forget its past dialogue states, forcing A to reuse the same symbols for the same meanings rather than inventing new encodings. Under these combined constraints, limited vocabulary and no memory, the dialogue between A and Q exhibits clear compositional characteristics of natural language, and accuracy on unseen attribute combinations nearly doubles, something previous settings could not achieve.
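Here is a sketch of the “no memory” constraint, reusing the schematic `QBot`-style interface above: A’s dialogue state is reset at every round, so it cannot accumulate a private code across rounds. This is an illustrative reconstruction, not the authors’ implementation.

```python
def memoryless_a_round(a_bot, object_tokens, question):
    """One A-bot round under the "no memory" constraint (illustrative sketch).

    Wiping A's state each round means its reply can depend only on the hidden
    object and Q's current question, never on the earlier dialogue history.
    """
    a_bot.state = None                 # forget everything said so far
    for t in object_tokens:            # re-encode the hidden object from scratch
        a_bot.listen(t)
    a_bot.listen(question)             # hear Q's current question
    return a_bot.speak()               # emit one token from A's small vocabulary
```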

Summary #

Today, I talked to you about the best short paper at EMNLP 2017. This paper discusses how to make the dialogue of a robot dialogue system more human-like.

This paper is also the first to analyze dialogue systems from the perspective of naturalness rather than prediction accuracy alone. One key takeaway is that if we want dialogue to be natural, we must prevent a robot from simply leaking the answer to its partner, for instance through an overly large vocabulary.

Let’s review the main points: First, I briefly introduced the paper’s authors, who have published many high-quality research papers in related fields. Second, the paper argues that natural language does not emerge naturally in multi-agent dialogue. Third, the paper shows that under vocabulary-size and no-memory constraints, robot dialogue can exhibit certain features of natural language.

Finally, I leave you with a question to ponder: The paper talks about a relatively simple dialogue scenario with a limited vocabulary. If it were a real conversation between people or between machines, how do we determine the appropriate size of the vocabulary?

Glossary:

ICML 2017, International Conference on Machine Learning.

IJCAI 2017, International Joint Conference on Artificial Intelligence.

CVPR 2017, Conference on Computer Vision and Pattern Recognition.

ICCV 2017, International Conference on Computer Vision.

NIPS 2017, Annual Conference on Neural Information Processing Systems.

Further reading: Natural Language Does Not Emerge ‘Naturally’ in Multi-Agent Dialog