
009 How to Apply Deep Reinforcement Learning to Visual Question Answering Systems #

This week, we are analyzing papers from ICCV 2017. On Monday and Wednesday, we discussed the best research paper and the best student paper, respectively. Today, we turn to a completely different article, titled “Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning,” which explores how to apply deep reinforcement learning to visual question answering systems.

Author Information #

The first author, Abhishek Das, is a PhD student at the Georgia Institute of Technology. He interned at Facebook AI Research in 2017 and 2018, and has received research fellowships from Adobe and Snapchat. He has published several papers in the field of intelligent systems, particularly on using reinforcement learning to build intelligent conversational agents.

Co-first author Satwik Kottur is a fourth-year PhD student at Carnegie Mellon University, specializing in computer vision, natural language, and machine learning. He interned with Snapchat’s research team in the summer of 2016, working on personalized questions in dialog systems, and at Facebook Research in the summer of 2017, working on visual dialog systems. Satwik has published multiple high-quality papers at top international conferences such as ICCV 2017, ICML 2017, IJCAI 2017, CVPR 2017, NIPS 2017, and EMNLP 2017, making him a rising academic star.

The third author, José M. F. Moura, is Satwik’s advisor at Carnegie Mellon University. José is a member of the National Academy of Engineering and an IEEE Fellow, with extensive research experience in signal processing, big data, and data science. He was elected IEEE President-Elect for 2018 and is responsible for guiding the next phase of IEEE’s development.

The fourth author, Stefan Lee, is a research scientist at the Georgia Institute of Technology. He previously worked at Virginia Tech, conducting research in areas including computer vision and natural language processing. Stefan received his PhD in 2016 from the Indiana University School of Informatics and Computing.

The fifth author, Dhruv Batra, is currently a scientist at Facebook Research and an assistant professor at the Georgia Institute of Technology. Dhruv obtained his PhD in 2010 from Carnegie Mellon University. From 2010 to 2012, he served as a research assistant professor at the Toyota Technological Institute at Chicago. From 2013 to 2016, he was a faculty member at the University of Virginia. Dhruv’s research focuses on artificial intelligence, particularly in the areas of vision systems and human-computer interaction. Stefan Lee, the fourth author of the paper, has been a long-term research collaborator with Dhruv, and together they have published multiple high-quality papers, including this one.

Main Contributions of the Paper #

Let’s first take a look at the main contributions of this paper to understand the problem it primarily addresses.

This paper is based on a virtual “game”.

First, we have two “agents”: one called “Q-Bot” and the other called “A-Bot”. The rules of the game are as follows: initially, A-Bot receives an image I, while Q-Bot receives only a textual description c of the image and never sees the image itself. Q-Bot then asks A-Bot a series of questions about the image, and A-Bot answers them to help Q-Bot deepen its understanding. Q-Bot’s ultimate goal is to “guess” the image, meaning it can “retrieve” the image from a database. In practice, this is measured by the distance between Q-Bot’s estimated feature vector of the image and the feature vector of the actual image: the smaller the distance, the more successful the retrieval.
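To make the retrieval criterion concrete, here is a minimal sketch, assuming image features come from some pretrained CNN and using Euclidean distance; the function name `retrieve_image` and the 4096-dimensional features are illustrative assumptions, not the authors’ code.

```python
import numpy as np

def retrieve_image(predicted_vec, database_vecs):
    """Rank database images by Euclidean distance to Q-Bot's
    predicted feature vector; the closest image is the 'guess'."""
    dists = np.linalg.norm(database_vecs - predicted_vec, axis=1)
    return np.argsort(dists)  # image indices, best match first

# Hypothetical usage with stand-in features
database = np.random.randn(1000, 4096)   # 1,000 candidate images
prediction = np.random.randn(4096)       # Q-Bot's estimate after the dialog
print("Top-5 guesses:", retrieve_image(prediction, database)[:5])
```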

As you can see, this is a difficult problem. Q-Bot must extract clues from the caption and from A-Bot’s answers and ask meaningful questions, while A-Bot must keep track of what Q-Bot has understood so far in order to assist it successfully.

The entire game, or task, is often referred to as a “Cooperative Visual Dialog System.” The paper’s main contribution is that it is the first to use deep reinforcement learning to model such a system; moreover, the proposed solution significantly improves accuracy over previous non-reinforcement-learning models.

Core Methods of the Paper #

Since we want to model the whole problem with deep reinforcement learning, we first need to define the standard components of a reinforcement learning setup.

First, let’s take a look at the “actions” of the model. The action space of the two agents is the vocabulary of natural language: in each round of the game, each agent must produce its next utterance, a question for Q-Bot or an answer for A-Bot, based on the current state. This is a discrete action space. In addition, Q-Bot must update its estimate of the image feature vector after each round, which constitutes a continuous action space.
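The following toy sketch illustrates this hybrid action space; the vocabulary size, feature dimension, and the names `word_head`, `feature_head`, and `h` are all hypothetical choices for illustration.

```python
import torch
import torch.nn as nn

vocab_size, hidden_dim, feat_dim = 10000, 512, 4096
h = torch.randn(1, hidden_dim)  # stand-in for the agent's current state

# Discrete action: sample the next word from a distribution over the vocabulary
word_head = nn.Linear(hidden_dim, vocab_size)
word_dist = torch.distributions.Categorical(logits=word_head(h))
next_word = word_dist.sample()

# Continuous action: regress Q-Bot's updated image-feature estimate
feature_head = nn.Linear(hidden_dim, feat_dim)
image_estimate = feature_head(h)
```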

Second, let’s take a look at the “state” of the model. For Q-Bot, the state in each round is the collection of everything it has seen so far: the initial caption of the image, plus every question and answer in the dialog up to that point. A-Bot’s state additionally includes the image itself, alongside the caption and the full dialog history.

Third, let’s take a look at the “policy” of the model. Both A-Bot and Q-Bot must evaluate the probability of the next utterance given the current state. Each agent’s policy is learned by its own neural network. In addition, Q-Bot needs a further network that updates its estimate of the image features based on the answers A-Bot provides.

Fourth, let’s take a look at the “environment” and “reward” of the model. In this game, both agents receive the same reward, which is based on the distance between Q-Bot’s estimated image feature vector and the true image feature vector, or, more precisely, the change in that distance from one round to the next.
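Concretely, if $\hat{y}_t$ denotes Q-Bot’s estimate after round $t$, $y^{gt}$ the true feature vector, and $\ell$ the Euclidean distance, the per-round reward takes the form

$$ r_t = \ell\big(\hat{y}_{t-1},\, y^{gt}\big) - \ell\big(\hat{y}_{t},\, y^{gt}\big) $$

The reward is positive exactly when the round’s exchange moved Q-Bot’s guess closer to the ground-truth image.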

These are the settings of the whole model.

Now, let’s take a look at some details of the two agents’ policy networks. Q-Bot has four important components. First, Q-Bot takes the question it asked and the answer A-Bot gave in the current round, and encodes the pair with an LSTM into an intermediate “fact” vector F. Second, the F of the current round, together with the F’s of all previous rounds, is passed through another LSTM to produce a state vector S. Third, based on S, Q-Bot generates the next question; fourth, also based on S, it updates its estimate of the image features. In other words, F describes a single round of the exchange, while S is a compressed summary of the dialog history so far, and S serves as the springboard for the next step. The architecture of A-Bot’s policy network is very similar, so it will not be elaborated here; the difference is that A-Bot does not need to produce an estimate of the image.
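Below is a simplified sketch of Q-Bot’s policy network following this description; all dimensions are hypothetical, and details such as the question-decoding loop and sampling are omitted, so this is an illustration rather than the authors’ implementation.

```python
import torch
import torch.nn as nn

class QBotPolicy(nn.Module):
    """Simplified sketch: fact encoder -> history encoder -> two heads."""
    def __init__(self, vocab_size=10000, emb_dim=300, hidden_dim=512, feat_dim=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Encodes the current round's question-answer pair into F
        self.fact_encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        # Summarizes F_1 ... F_t into the state S
        self.history_encoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        # Two outputs conditioned on S: the next question and the image estimate
        self.word_head = nn.Linear(hidden_dim, vocab_size)
        self.feature_regression = nn.Linear(hidden_dim, feat_dim)

    def forward(self, qa_tokens, fact_history):
        # qa_tokens: (batch, seq_len) token ids for the current QA pair
        # fact_history: (batch, rounds_so_far, hidden_dim) past fact vectors
        _, (f, _) = self.fact_encoder(self.embed(qa_tokens))
        facts = torch.cat([fact_history, f.transpose(0, 1)], dim=1)
        _, (s, _) = self.history_encoder(facts)
        s = s.squeeze(0)                      # state S for this round
        next_word_logits = self.word_head(s)  # seeds question generation
        image_estimate = self.feature_regression(s)
        return facts, next_word_logits, image_estimate
```

The design choice worth noting is the two-level encoding: F compresses one round, S compresses the whole dialog, and both the question generator and the feature regression condition only on S.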

The parameters of the whole model are estimated with REINFORCE, a popular policy-gradient algorithm in deep reinforcement learning.
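As a sketch of how REINFORCE estimates these gradients, assuming per-round rewards as defined above and log-probabilities collected while sampling the dialog (the constant baseline here is just one simple variance-reduction choice):

```python
import torch

def reinforce_loss(log_probs, rewards, baseline=0.0):
    """REINFORCE: weight each action's log-probability by its
    (baseline-subtracted) reward; minimizing this loss ascends
    the expected-reward gradient."""
    advantage = rewards - baseline
    return -(log_probs * advantage).sum()

# Hypothetical usage over a 10-round dialog
log_probs = torch.randn(10, requires_grad=True)  # log pi(action_t | state_t)
rewards = torch.randn(10)                        # change-in-distance rewards
loss = reinforce_loss(log_probs, rewards)
loss.backward()  # gradients estimate the policy gradient
```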

There are actually many technical details in this paper. In today’s presentation, we can only summarize them from a relatively high level. If you are interested, be sure to read the original paper.

Experimental Results of the Method #

The authors conducted experiments on a dataset called VisDial. This dataset consists of 68,000 images drawn from the COCO dataset mentioned earlier, with over 680,000 question-answer pairs, making it a relatively large dataset.

The paper compared results under traditional supervised learning and under a “curriculum learning” training scheme. In terms of effectiveness, the impact of reinforcement learning is quite evident. The most direct effect is that reinforcement learning generates dialogs that resemble real conversations, whereas other methods, such as pure supervised learning, tend to fall into repetitive “infinite loop” exchanges, which is far from ideal. From the perspective of image retrieval, however, although reinforcement learning outperforms supervised learning, the difference is not particularly large, and the current gap can be considered within the margin of error.

Summary #

Today, I have discussed an interesting article from ICCV 2017 with you. The article shows how to use deep reinforcement learning to train two agents that cooperate through dialogue to understand image information.

Let’s recap the main points: first, we briefly introduced the authors of the article; second, we described in detail the problem the article aims to solve and its contributions; third, we focused on the core ideas of the proposed method.

Lastly, I’ll leave you with a question to ponder: what do you think is the main difficulty of applying reinforcement learning in this kind of dialog scenario?