
010 Fine Reading of the Best Research Paper at NIPS 2017 Part One - How to Solve Non-Convex Optimization Problems #

The top conference in machine learning and artificial intelligence, NIPS (the Conference on Neural Information Processing Systems), has been held annually since its inception in 1987, a run of three decades. NIPS 2017 took place in Long Beach, California, from December 4 to 9, 2017.

Every year, the conference singles out a few of the most innovative and valuable submissions as its best research papers. At NIPS 2017, three papers received the best paper award. Today, I will analyze one of them in detail: “Variance-based Regularization with Convex Objectives”, authored by researchers from Stanford University.

This paper is highly theoretical and focuses on a class of problems known as “robust optimization”. In this setting, when optimizing a loss function one must account not only for its mean but also for its variance. However, a combined loss function that captures both typically yields a non-convex problem, and for general non-convex problems it is hard to find a global optimum; even a local optimum can be difficult to reach. Addressing this difficulty is the goal of the paper.

About the Authors #

The first author, Hongseok Namkoong, is a doctoral student in Operations Research at Stanford University, advised by John C. Duchi and Peter W. Glynn. Before coming to Stanford in 2013, Namkoong received his Bachelor’s degree in Industrial Engineering and Mathematics from the Korea Advanced Institute of Science and Technology (KAIST) in South Korea. Over the past two or three years, he has published two papers at NIPS (including this best paper) and one at ICML.

The second author, John C. Duchi, is one of Namkoong’s advisors and a highly accomplished researcher in his own right. He earned his bachelor’s degree from Stanford in 2007, stayed on for a master’s degree in computer science under machine learning authority Daphne Koller, and then completed a PhD in computer science at the University of California, Berkeley, under statistical learning authority Michael Jordan. During his doctoral studies, Duchi also had a valuable internship at Google Research, working with Yoram Singer. He later joined Stanford University as an assistant professor in the Departments of Statistics and Electrical Engineering.

With this solid foundation, Duchi has built a remarkable academic record. He received the Best Paper Award at ICML 2010. His AdaGrad work, begun during the 2011 Google internship, went on to become a classic algorithm in machine learning optimization, with over 2,500 citations, and an important foundation for deep learning optimization methods. In total, Duchi’s papers have been cited over 6,000 times.

Main Contributions of the Paper #

First, let’s look at the paper’s main contributions and the problem it primarily addresses.

Many machine learning problems ultimately reduce to optimizing an objective function, often called a loss function. A model that optimizes (minimizes or maximizes) the loss on the training dataset and still performs well on the test dataset is said to “generalize” well.

Typically, the loss function describes a mean, such as the average error or accuracy over the entire training dataset. But the mean is a poor summary of a highly skewed distribution: a model that optimizes the loss “on average” may still perform badly on some, or even most, individual data points.

This is what motivates “robust optimization”: we want the loss to be well behaved on as many data points as possible, which amounts to asking that the variance of the loss be small.

With this concept in mind, the next step seems natural: combine the mean part of the loss (the part we usually deal with) and its variance part into a new objective function. This objective has two terms, mean and variance, linked by a free trade-off parameter, giving a robust optimization problem that accounts for both.
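In symbols, the combined objective has roughly the following shape (my notation, not the paper’s exact formulation):

```latex
\min_{\theta}\;
\underbrace{\mathbb{E}_{\hat{P}_n}\big[\ell(\theta; X)\big]}_{\text{mean term}}
\;+\;
C \,\underbrace{\sqrt{\tfrac{1}{n}\,\mathrm{Var}_{\hat{P}_n}\big(\ell(\theta; X)\big)}}_{\text{variance term}}
```

Here the expectation and variance of the loss are taken over the n training points under the empirical distribution, and C > 0 is the free trade-off parameter. The variance term is precisely what can destroy convexity: even when the loss is convex in the model parameters, its variance generally is not.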

However, a comprehensive loss function that considers both mean and variance is usually “non-convex”. What does that mean? For a “convex” problem, any local minimum is guaranteed to be a global minimum, and we have efficient algorithms for finding it. For non-convex problems, by contrast, we often cannot find a global optimum, and even a local optimum can be hard to reach.
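For reference, here is the standard definition underlying these claims: a function f is convex if its graph never rises above the chord joining any two of its points, that is, for all points x, y and all λ between 0 and 1,

```latex
f\big(\lambda x + (1 - \lambda)\, y\big) \;\le\; \lambda f(x) + (1 - \lambda)\, f(y).
```

One important consequence is the local-to-global property just mentioned: any local minimum of a convex function is automatically a global minimum, which is what makes convex problems tractable.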

Robust optimization problems had already appeared in earlier research. The main contribution of this paper is to construct a convex problem that approximates the robust optimization problem, and to propose an optimization algorithm that yields an approximate solution to this new convex problem.

It is worth noting that approximating a non-convex problem by a convex one is an important, well-established approach to non-convex optimization: many classical non-convex problems have been partially or fully solved through such convex approximations. In this regard, the paper continues a long-standing strategy.

Core Method of the Paper #

The core method and its proofs are highly theoretical, and a solid mathematical background, ideally with similar research experience, is needed to follow them in full. If you are interested in the details, it is recommended to read not only the original NIPS paper but also its supplementary material, which runs to more than 50 pages. Here, we give only a highly condensed summary of the ideas.

In the paper, the authors propose a new objective function called the “robustly regularized risk”. It is built on two concepts: the “empirical distribution” of the data and a “divergence” between probability distributions. Crucially, minimizing this robustly regularized risk is a convex problem.

In simple terms, the robustly regularized risk can be regarded as the sum of two terms: the expectation of the loss over the dataset plus a variance term for the loss, with both the expectation and the variance defined on the empirical distribution of the data. This ties the newly proposed risk to the actual problem we need to solve. Much of the rest of the paper is devoted to bounding the gap between the two quantities and showing that the bound is “tight”.
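Written out, the robustly regularized risk has roughly the following form (my reconstruction of the paper’s setup; consult the original for the exact constants). The supremum ranges over all distributions P within a divergence ball of radius ρ/n around the empirical distribution:

```latex
R_n(\theta) \;=\;
\sup_{P \,:\; D_{\phi}(P \,\|\, \hat{P}_n) \,\le\, \rho / n}
\mathbb{E}_{P}\big[\ell(\theta; X)\big],
\qquad
\phi(t) = \tfrac{1}{2}\,(t - 1)^2 .
```

Because this is a supremum of functions that are each convex in the model parameters (for any fixed P, the objective is just a weighted average of per-example losses), the robustly regularized risk is itself convex whenever the loss is, and the paper shows that it tracks the mean-plus-variance objective above up to higher-order terms.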

The paper then shows how the robustly regularized risk can be rewritten as a simpler optimization problem, and the appendix provides a procedure for solving this simplified problem.
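To make the structure concrete, below is a minimal sketch in Python of the kind of alternating scheme such a formulation suggests: an inner maximization that adversarially reweights the training examples within a divergence ball around the uniform weights, followed by an outer gradient step on the model. This is illustrative code under my own assumptions, not the authors’ algorithm: the ball radius rho, the learning rate, the logistic-loss example, and the use of a generic SLSQP solver for the inner step are all my choices.

```python
import numpy as np
from scipy.optimize import minimize

def worst_case_weights(losses, rho):
    """Inner problem: maximize the weighted loss sum_i p[i] * losses[i] over
    probability vectors p with (1/2) * ||n*p - 1||^2 <= rho, a chi-square-type
    ball around the uniform weights 1/n. Solved here with a generic solver."""
    n = len(losses)
    res = minimize(
        lambda p: -(p @ losses),          # negate: minimizing = maximizing weighted loss
        np.full(n, 1.0 / n),              # start at the uniform distribution
        jac=lambda p: -losses,
        bounds=[(0.0, 1.0)] * n,
        constraints=[
            {"type": "eq", "fun": lambda p: p.sum() - 1.0},
            {"type": "ineq", "fun": lambda p: rho - 0.5 * np.sum((n * p - 1.0) ** 2)},
        ],
        method="SLSQP",
    )
    return res.x

def robust_logistic_regression(X, y, rho=1.0, lr=0.5, steps=50):
    """Outer problem: gradient descent on the adversarially reweighted
    logistic loss. Labels y are assumed to be in {-1, +1}."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        margins = y * (X @ theta)
        losses = np.logaddexp(0.0, -margins)             # per-example logistic loss
        p = worst_case_weights(losses, rho)              # adversarial reweighting
        grad = X.T @ (-p * y / (1.0 + np.exp(margins)))  # gradient of weighted loss
        theta -= lr * grad
    return theta

# Tiny synthetic demo.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0, 1.5]) + 0.1 * rng.normal(size=100))
print("learned weights:", np.round(robust_logistic_regression(X, y), 2))
```

The design point worth noticing is that the inner problem is linear in the weights p over a convex set, so the outer objective stays convex whenever each per-example loss is convex in the parameters; the paper derives a far more efficient specialized routine for the inner step than the generic solver used here.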

Experimental Results #

Although the core contribution of the paper is theoretical, the authors also ran experiments on two datasets. On a dataset from the UCI Machine Learning Repository, the proposed robust objective function achieved better results than the standard objective; on the RCV1 text classification dataset, the optimized objective function likewise performed better than the standard one.

Summary #

Today I presented one of the best research papers from NIPS 2017, a highly theoretical piece of work. Its main idea is to make the objective function robust by modeling both the mean and the variance of the loss function simultaneously.

Let’s review the key points together: first, we briefly introduced the paper’s authors; second, we discussed in detail the problem the paper aims to solve and its contributions; third, we sketched the core ideas of the proposed method.

Finally, I leave you with a question to ponder: besides folding both the mean and the variance into the objective function, as this paper does, what other methods could be used to control the variance of a model’s loss?