007 Fine Reading of the Best Research Paper at ICCV 2017 #

ICCV (the International Conference on Computer Vision) is a top-tier computer vision conference held every two years. Since its inception in 1987, it has a 30-year history. ICCV 2017 took place in Venice, Italy, from October 22 to 29.

At each ICCV conference, two papers are selected as the Best Research Paper and Best Student Paper from numerous academic papers. The Best Paper Award at ICCV is also known as the “Marr Prize,” named in honor of David Marr, a British psychologist and neuroscientist. Marr integrated research findings from psychology, artificial intelligence, and neuroscience to propose a new theory of visual processing, and he is considered the founder of computational neuroscience.

Today, I will analyze in depth the Best Research Paper at ICCV 2017, “Mask R-CNN.” This paper is a comprehensive work that introduces a new method for addressing the tasks of “Object Detection,” “Semantic Segmentation,” and “Instance Segmentation” simultaneously.

What does this mean? In simple terms, given an input image, the model proposed in this paper can analyze the image to determine which objects are present, such as a cat or a dog. It can also locate these objects within the entire image. Moreover, it can assign each pixel in the image to a specific object, allowing for precise segmentation of objects from the image.

Author Group Information Introduction #

The authors of this paper all come from Facebook’s Artificial Intelligence Research Institute (Facebook AI Research).

The first author is Dr. Kaiming He, a rising star in the field of computer vision in recent years. He joined Facebook AI Research in 2016, and previously conducted research in computer vision at Microsoft Research Asia. He is also the recipient of the Best Paper Award at CVPR 2016 and CVPR 2009. Currently, Dr. He has made three major contributions in the field of computer vision.

First, the ResNet architecture invented by him and other collaborators has become an important force in deep learning for computer vision since 2016. It has been applied to fields beyond computer vision, such as machine translation and AlphaGo. The related papers have been cited over 5,000 times.

Second, the Faster R-CNN technique developed by him and other collaborators, published in NIPS 2015, is an important technique for object recognition and semantic analysis in images. It forms the basis of the paper we are discussing today, and has been cited over 2,000 times.

Third, in 2015, he and other collaborators published the paper “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification” at ICCV, which proposed PReLU (the Parametric Rectified Linear Unit), an improved variant of ReLU, together with a matching weight-initialization scheme, to achieve better results in ImageNet classification. This paper has been cited nearly 2,000 times.

The second author, Georgia Gkioxari, is currently a postdoctoral researcher at Facebook AI Research. Georgia has an impressive academic pedigree: before joining Facebook, she earned her PhD at the University of California, Berkeley, under the computer vision expert Jitendra Malik. She has also interned at Google Brain and Google Research. In the past few years, she has published multiple high-quality papers in the field of computer vision.

The third author, Piotr Dollár, is a manager at Facebook AI Research. He received his PhD from the University of California, San Diego in 2007 and joined Facebook in 2014, after working at Microsoft Research. Piotr has been engaged in research in computer vision for a long time.

The last author, Ross Girshick, is a scientist at Facebook AI Research. He received his PhD in Computer Science from the University of Chicago in 2012. Ross has also worked at Microsoft Research and served as a postdoctoral researcher in the laboratory of computer vision expert Jitendra Malik.

Main Contributions of the Paper #

First, let’s take a look at the main contributions of this paper. We need to understand the problem this paper primarily addresses.

As we briefly mentioned earlier, the problem this paper aims to solve is the integration of three tasks on input images: object classification, object localization, and instance segmentation. A previous work, “Faster R-CNN” [1], had already tackled the first two tasks. Therefore, this paper is essentially an extension of Faster R-CNN in its logical framework. However, this extension is not as straightforward as it seems. To solve the task of instance segmentation, Mask R-CNN proposes an innovation in the deep learning network structure, which is an important contribution of this paper.

The model proposed in this paper not only performs exceptionally well on the standard instance segmentation dataset COCO, surpassing all previously proposed models, but it can also be easily extended to other tasks such as “human pose estimation,” thereby establishing the status of Mask R-CNN as a versatile framework.

Core Methods of the Paper #

To understand the core idea of Mask R-CNN, we must first briefly understand some basic principles of Faster R-CNN. As mentioned earlier, Mask R-CNN is an improvement and extension of Faster R-CNN.

For each candidate object in an input image, Faster R-CNN produces two outputs: a label (such as cat, dog, or horse) and a bounding box giving the object’s position in the image. The first output is a classification problem; the second is a regression problem.
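The regression output needs a concrete parameterization. Throughout the R-CNN family of papers, a target box is encoded as scale-invariant offsets relative to a reference box. The sketch below (plain Python; the center/size box format is assumed for illustration) shows that encoding and its inverse:

```python
import math

def bbox_to_deltas(anchor, target):
    """Encode a target box as (tx, ty, tw, th) offsets relative to a
    reference box. Boxes are (x_center, y_center, width, height).
    Centers are normalized by the reference size; sizes use a log ratio,
    making the encoding invariant to the reference box's scale."""
    xa, ya, wa, ha = anchor
    x, y, w, h = target
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

def deltas_to_bbox(anchor, deltas):
    """Decode predicted offsets back into an absolute box."""
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = deltas
    return (tx * wa + xa, ty * ha + ya, wa * math.exp(tw), ha * math.exp(th))

anchor = (50.0, 50.0, 20.0, 40.0)
target = (55.0, 48.0, 30.0, 36.0)
deltas = bbox_to_deltas(anchor, target)
decoded = deltas_to_bbox(anchor, deltas)
# decoding the encoded deltas recovers the original target box
```

The regression head learns to predict these deltas rather than raw pixel coordinates, which keeps the targets in a similar numeric range regardless of object size.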

Faster R-CNN consists of two stages. The first stage is called the Region Proposal Network (RPN), which proposes potential bounding boxes in the image. The second stage uses a technique called RoIPool to extract features from these candidate boxes for label classification and bounding-box regression. The two stages share their underlying convolutional features.
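To make RoIPool concrete, here is a minimal single-channel sketch (the real operator works on multi-channel feature maps and first maps image coordinates down to feature-map coordinates). Note the two rounding steps; it is exactly this quantization that motivates RoIAlign later in the paper:

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=2):
    """A simplified RoIPool on a single-channel feature map.
    The RoI is first rounded to integer coordinates, then split into an
    output_size x output_size grid whose bin boundaries are rounded again,
    and each bin is max-pooled. roi is (x0, y0, x1, y1) in feature coords."""
    x0, y0, x1, y1 = (int(round(v)) for v in roi)  # quantization step 1
    region = feature_map[y0:y1, x0:x1]
    h, w = region.shape
    out = np.zeros((output_size, output_size))
    for i in range(output_size):
        for j in range(output_size):
            # integer bin boundaries: quantization step 2
            r0, r1 = i * h // output_size, (i + 1) * h // output_size
            c0, c1 = j * w // output_size, (j + 1) * w // output_size
            out[i, j] = region[r0:max(r1, r0 + 1), c0:max(c1, c0 + 1)].max()
    return out

fm = np.arange(36, dtype=float).reshape(6, 6)
pooled = roi_pool(fm, (0.7, 0.7, 4.3, 4.3))  # the RoI snaps to (1, 1, 4, 4)
```

Whatever the size of the candidate box, the output is a fixed output_size x output_size grid, so it can feed fully connected classification and regression heads.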

The general process of the Region Proposal Network is as follows: the original input image passes through classic convolution layers to form an image feature map. On this feature map, the model slides a small window across every position. The sliding window serves two purposes.

First, the features covered by the window are transformed into an intermediate layer, which is then used for both object classification and position localization. Second, the window is used to propose candidate regions, also referred to as Regions of Interest (RoIs), which feed into the localization prediction just mentioned.
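In the Faster R-CNN paper, the candidate regions at each window position are generated relative to k reference boxes called anchors; the paper uses 3 scales and 3 aspect ratios, giving k = 9 anchors per position. A small sketch of that anchor generation (the function name and box format are illustrative):

```python
def make_anchors(center, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate k = len(scales) * len(ratios) anchor boxes centered at one
    sliding-window position. Each anchor keeps an area of scale**2 while
    its height/width ratio varies. Returns (x_center, y_center, w, h)."""
    cx, cy = center
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / r ** 0.5  # width shrinks as the ratio (h/w) grows
            h = s * r ** 0.5  # so that w * h == s**2 for every ratio
            anchors.append((cx, cy, w, h))
    return anchors

anchors = make_anchors((100, 100))
# 9 anchors per position, each with area scale**2 and h/w equal to ratio
```

The RPN then scores each anchor as object or background and regresses box offsets relative to it, which is how one sliding window can propose objects of very different shapes.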

After the Region Proposal Network “boxes in” the approximate region of a potential object, the model uses an object detection network to perform the final detection. This detection part actually uses the architecture of Fast R-CNN [2], which is why the name “Faster” R-CNN was chosen to distinguish the two. The contribution of Faster R-CNN lies in the fact that the Region Proposal Network and the detection part of Fast R-CNN share convolutional features and network architecture, which accelerates the whole pipeline.

In the first part, Mask R-CNN directly reuses the Region Proposal Network proposed by Faster R-CNN, and makes its changes in the second part. In other words, the second part not only outputs the categories and positions of regions, but also outputs a pixel-level segmentation. Crucially, and unlike many similar works, pixel segmentation, category classification, and position prediction are treated as three parallel tasks that do not depend on one another. The authors believe this is a key factor in the success of Mask R-CNN. In previous works, by contrast, pixel segmentation depended on category classification, which led to interference among the tasks.
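This decoupling can be made concrete: the mask branch outputs one m x m mask per class through a per-pixel sigmoid, and only the ground-truth class’s mask enters the training loss, so there is no per-pixel softmax competition between classes. A minimal NumPy sketch (the shapes and helper name are illustrative):

```python
import numpy as np

def mask_loss(mask_logits, gt_class, gt_mask):
    """Decoupled mask loss in the spirit of Mask R-CNN. mask_logits has
    shape (K, m, m): one mask per class. A per-pixel sigmoid plus binary
    cross-entropy is applied ONLY to the ground-truth class's mask; the
    other K-1 masks do not affect the loss, so segmentation does not
    compete with classification."""
    logits = mask_logits[gt_class]         # select only the gt class's mask
    probs = 1.0 / (1.0 + np.exp(-logits))  # per-pixel sigmoid
    eps = 1e-7
    bce = -(gt_mask * np.log(probs + eps)
            + (1 - gt_mask) * np.log(1 - probs + eps))
    return bce.mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4, 4))              # K=3 classes, m=4
gt = (rng.random((4, 4)) > 0.5).astype(float)    # toy binary ground truth
loss = mask_loss(logits, gt_class=1, gt_mask=gt)
# changing the logits of any class other than gt_class leaves loss unchanged
```

At inference time the category predicted by the classification branch simply selects which of the K masks to use.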

When Mask R-CNN performs pixel segmentation, it needs to preserve the spatial layout of the original image, since segmentation assigns labels to the original pixels. Category classification and position prediction do not have this requirement: in Faster R-CNN they rely on a compressed intermediate layer that does not preserve exact spatial alignment. Obviously, Mask R-CNN cannot rely solely on this compressed layer. The authors therefore propose a technique called RoIAlign to ensure that the extracted features remain precisely aligned with the original pixels. If you are interested in this part, I recommend reading the paper for more details.
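The core operation inside RoIAlign is sampling the feature map at exact, non-rounded locations via bilinear interpolation. Below is a single-point sketch; the actual operator samples several such points per output bin and averages them:

```python
import numpy as np

def bilinear_sample(feature_map, y, x):
    """Sample a 2D feature map at a continuous (y, x) location by
    bilinear interpolation over its 4 integer neighbors. Unlike RoIPool,
    no coordinate is ever rounded, so spatial alignment is preserved."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feature_map.shape[0] - 1)
    x1 = min(x0 + 1, feature_map.shape[1] - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * feature_map[y0, x0] + dx * feature_map[y0, x1]
    bot = (1 - dx) * feature_map[y1, x0] + dx * feature_map[y1, x1]
    return (1 - dy) * top + dy * bot

fm = np.arange(16, dtype=float).reshape(4, 4)
exact = bilinear_sample(fm, 2.0, 3.0)  # integer location: the stored value
mid = bilinear_sample(fm, 1.5, 1.5)    # midpoint: average of 4 neighbors
```

Because the interpolation is differentiable in the sampling coordinates, gradients flow cleanly through the operation during training.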

Experimental Results #

The authors evaluated Mask R-CNN on the popular COCO object detection challenges of 2015 and 2016. The experimental results show that Mask R-CNN significantly improves accuracy over the previous champions of these two competitions: in terms of the “Average Precision” metric, it outperforms the best COCO 2015 result by nearly 13% and the best COCO 2016 result by 4%. This improvement is quite remarkable. In ablation experiments, the authors also tested the effectiveness of each component of Mask R-CNN individually. Both the separation of the three tasks and the RoIAlign method have a significant impact, showing that these components are necessary for achieving the excellent results.

Summary #

Today I talked to you about the best research paper from ICCV 2017. The article introduced Mask R-CNN, a state-of-the-art algorithm for object detection and instance segmentation in images.

Let’s review the key points together: First, we briefly introduced the author information of this article. Second, we detailed the problem that this article aims to solve and its contributions. Third, we briefly introduced the core content of the proposed method in the article.

Finally, I leave you with a question to think about: why do you think Mask R-CNN, along with some previous work, divides the object detection task into two steps, where the first step analyzes a large rectangular box and the second step performs object detection? Are both of these steps necessary?

References #

  1. Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), June 2017, 1137-1149.
  2. Ross Girshick. Fast R-CNN. Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV ’15). IEEE Computer Society, Washington, DC, USA, 1440-1448.