008 Fine Reading of the Best Student Paper at ICCV 2017 #

On Monday, we carefully analyzed the best research paper of ICCV 2017, “Mask R-CNN”. Today, we will share the best student paper from ICCV 2017 titled “Focal Loss for Dense Object Detection” (Link to the Paper).

It can be said that this paper is the twin brother of the best paper we shared on Monday. First, the author group of this paper also comes primarily from Facebook AI Research. Second, this paper addresses a closely related problem, object detection, although unlike Mask R-CNN it does not deal with instance segmentation.

Author Information #

Except for the first author, all the authors of this paper come from the Facebook Artificial Intelligence Research (FAIR) institute.

The first author, Tsung-Yi Lin, currently works at Google Brain; he interned at FAIR while this paper was being written. He obtained his bachelor’s degree from National Taiwan University and his master’s degree from the University of California, San Diego, and completed his Ph.D. at Cornell University in 2017. During his Ph.D., he was advised by the computer vision expert Serge Belongie and published several high-quality computer vision papers.

The second author, Priya Goyal, is a research engineer at FAIR. Before joining Facebook, she obtained her bachelor’s and master’s degrees from the Indian Institute of Technology.

The third author, Ross Girshick, the fourth author, Kaiming He, and the last author, Piotr Dollár, are also authors of the best research paper from Monday, as mentioned before. You can go back and learn more about them.

Main Contributions of the Paper #

Let’s first take a look at the main contributions of this paper.

As we mentioned earlier, the problem addressed in this paper is object detection: deciding which objects appear in an input image and localizing each one with a rectangular box. There are two main approaches to this problem, and both continue to evolve.

The first approach starts directly from the input image: a single network extracts features and, in one pass, predicts for every image region whether it contains an object of a certain category and where the enclosing rectangular box lies. Models of this kind are called one-stage models.

Although this approach is intuitive, it has a fatal flaw: in a typical image, the vast majority of candidate regions contain no target object at all, so they act as “negative instances” during learning. How to learn effectively from such an extremely “imbalanced” data set is the central difficulty of this approach.

Because of this, researchers turned to a second approach: first train a network to propose a set of candidate regions, and then, in a second stage, determine the object category and refine the rectangular box for each candidate. Models of this kind are called two-stage models.

In recent years, two-stage models, including the Faster R-CNN discussed in our previous sharing and its variants, have generally performed well in practical evaluations. Before this paper was published, one-stage models could not reach the accuracy of two-stage models.

This paper proposes a new objective function, “Focal Loss”, to replace the traditional cross-entropy objective. Its main purpose is to let a one-stage model train well even when the ratio of positive to negative instances is extremely imbalanced, thereby allowing a one-stage model to achieve performance comparable to two-stage models. The paper also proposes a relatively simple deep network structure that can easily be trained end to end as a whole.

Core methods of the paper #

In this section, let’s talk about what “focal loss” means. Because this is a new objective function, I recommend reading the original paper to understand its mathematical details; here we give a highly condensed explanation.

We start with the cross-entropy (CE) objective function commonly used in ordinary binary classification problems. Suppose the model predicts the positive class with probability p. The CE objective is essentially the negative logarithm of the probability assigned to the true class, known in machine learning as the “negative log-likelihood”. The model learns its parameters by minimizing this negative log-likelihood.
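In the paper’s notation, writing p_t for the probability the model assigns to the true class, this is:

```latex
p_t =
\begin{cases}
p & \text{if } y = 1, \\
1 - p & \text{otherwise,}
\end{cases}
\qquad
\mathrm{CE}(p_t) = -\log(p_t)
```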

The authors observed a phenomenon: even when p_t is relatively large, say greater than 0.5, the CE objective still produces a noticeable “loss”; for example, at p_t = 0.8 the loss is -log(0.8) ≈ 0.22, not zero. What does this mean? For such a data point we already roughly know the answer, yet the objective function still asks the learning algorithm to spend effort on it in order to reduce this “loss”.

This is actually the core of the whole problem. The traditional CE objective does not guide the learning algorithm to focus its effort where it is most needed; instead, effort is dispersed over data points that require no further attention. Since easy negatives vastly outnumber hard examples in dense detection, their many small losses add up and dominate the total, which leads to learning difficulties.

The “focal loss” proposed in this paper makes a seemingly minor modification to CE: it multiplies the “negative log-likelihood” by a modulating coefficient, the “opposite probability” (one minus the probability of the true class) raised to an exponent parameter that adjusts its effect. If you are interested in the details, I recommend reading the original paper; otherwise, just focus on understanding what this objective function accomplishes.
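Written out in the paper’s notation, with the exponent (the “focusing parameter”) denoted γ ≥ 0:

```latex
\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t)
```

When γ = 0 this reduces exactly to CE. The paper additionally places a class-balancing weight α_t in front of this expression and finds γ = 2 to work well in its experiments.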

The “focal loss” has two properties. First, when a data point is misclassified and p_t is small, the modulating coefficient is close to 1, so the loss remains similar to CE; when p_t approaches 1, that is, when the algorithm is already fairly confident, the coefficient approaches 0 and the loss shrinks relative to CE. Second, the exponent parameter plays an adjusting role, determining how strongly the losses of “easy-to-classify data points” are reduced.
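A minimal numeric sketch in plain Python (assuming the paper’s commonly cited setting γ = 2, and omitting the α_t weight) makes the down-weighting concrete:

```python
import math

def ce(p_t):
    """Cross entropy: negative log-likelihood of the true class."""
    return -math.log(p_t)

def focal(p_t, gamma=2.0):
    """Focal loss: CE scaled by the modulating factor (1 - p_t)^gamma."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

for p_t in (0.1, 0.5, 0.9, 0.99):
    print(f"p_t={p_t:.2f}  CE={ce(p_t):.4f}  FL={focal(p_t):.4f}  "
          f"ratio={focal(p_t) / ce(p_t):.4f}")

# Output:
# p_t=0.10  CE=2.3026  FL=1.8651  ratio=0.8100
# p_t=0.50  CE=0.6931  FL=0.1733  ratio=0.2500
# p_t=0.90  CE=0.1054  FL=0.0011  ratio=0.0100
# p_t=0.99  CE=0.0101  FL=0.0000  ratio=0.0001
```

At p_t = 0.9 the focal loss is already 100 times smaller than CE, so a flood of easy negatives contributes almost nothing, while hard, misclassified points keep nearly their full loss.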

Based on the new “focal loss”, the paper proposes a network structure called RetinaNet, a one-stage model for the object detection task. Here, I will briefly summarize some characteristics of RetinaNet.

First, RetinaNet uses ResNet to extract basic image features from the original input image.

Second, the paper adopts a network architecture called the Feature Pyramid Network (FPN), which builds feature maps at several resolutions on top of the ResNet features so that objects of different sizes can be detected at an appropriate scale.

Third, similar to Faster R-CNN, RetinaNet uses the concept of anchors: a fixed set of reference rectangles of several scales and aspect ratios tiled at every position of the feature maps, against which the network predicts object presence and box adjustments. A rough sketch of the idea follows.
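As an illustration only (the scales, ratios, and stride below are made-up values, not the paper’s configuration), anchor generation can be sketched like this:

```python
import itertools

def make_anchors(feat_h, feat_w, stride,
                 scales=(32, 64), ratios=(0.5, 1.0, 2.0)):
    """Sketch: tile reference boxes (anchors) over a feature map.

    Each feature-map cell gets len(scales) * len(ratios) anchors centered
    on the corresponding image location; boxes are (x1, y1, x2, y2).
    """
    anchors = []
    for i, j in itertools.product(range(feat_h), range(feat_w)):
        # Map the feature-map cell back to image coordinates.
        cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
        for scale, ratio in itertools.product(scales, ratios):
            # Width/height chosen so area = scale**2 and w/h = ratio.
            w = scale * ratio ** 0.5
            h = scale / ratio ** 0.5
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

# A 4x4 feature map with stride 16 yields 4 * 4 * 6 = 96 anchors.
print(len(make_anchors(4, 4, stride=16)))  # 96
```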

Finally, RetinaNet feeds the features extracted by the FPN into two parallel subnetworks: one classifies the object (if any) at each anchor, and the other regresses the coordinates of the rectangular box. This division of labor is similar to the second stage of a two-stage model.
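A minimal PyTorch sketch of these two parallel heads, assuming 256-channel FPN features, 9 anchors per location, and 80 COCO classes (the paper’s actual heads are deeper, with four convolutional layers each):

```python
import torch
from torch import nn

class DetectionHeads(nn.Module):
    """Simplified sketch of RetinaNet's two parallel subnetworks,
    applied to every FPN level (not the paper's exact head design)."""

    def __init__(self, in_channels=256, num_anchors=9, num_classes=80):
        super().__init__()
        # Classification head: one score per anchor per class.
        self.cls_head = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(in_channels, num_anchors * num_classes, 3, padding=1),
        )
        # Box head: 4 offsets (dx, dy, dw, dh) per anchor.
        self.box_head = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(in_channels, num_anchors * 4, 3, padding=1),
        )

    def forward(self, feature_map):
        # The two heads run in parallel on the same FPN feature map.
        return self.cls_head(feature_map), self.box_head(feature_map)

heads = DetectionHeads()
p3 = torch.randn(1, 256, 32, 32)           # a fake FPN level
cls_logits, box_deltas = heads(p3)
print(cls_logits.shape, box_deltas.shape)
# torch.Size([1, 720, 32, 32]) torch.Size([1, 36, 32, 32])
```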

Experimental Results of the Method #

The authors conducted experiments with RetinaNet on the popular COCO object detection dataset. First, RetinaNet’s “average precision” outperforms all previous one-stage models, providing initial evidence for the superiority of the proposed objective function and network architecture. The authors also varied the exponent parameter of the “focal loss” to demonstrate its effect on the results. Furthermore, they showed that RetinaNet can match or even exceed the classical two-stage model Faster R-CNN and several of its variants.

Summary #

Today I have discussed the best student paper of ICCV 2017, which introduces a new objective function, “Focal Loss”, for object detection in images.

Let’s review the key points: First, we briefly introduced the authors of this paper. Second, we analyzed the problem and main contributions of this paper. Third, we provided a detailed explanation of the core content of the proposed method.

Finally, I leave you with a question to ponder: Besides the method of modifying the objective function mentioned in this paper, what other commonly used methods do you think can be employed for imbalanced datasets?