011 Fine Reading of the Best Research Paper at NIPS 2017 Part Two - KSD Test on How to Check the Difference between Two Distributions #

This week, we will analyze and discuss three best papers from NIPS 2017. On Monday, we shared an article that mainly focused on “robust optimization problems,” which means considering not only the “mean” of a loss function but also the “variance” when optimizing it.

Today, we will look at another best paper, “A Linear-Time Kernel Goodness-of-Fit Test,” which discusses how to measure whether a set of data comes from a particular distribution.

This paper is also very theoretical, so here I will try to give a high-level summary. If you are interested in the details, I highly recommend reading the original text.

Introduction to the Authors #

The paper has five authors in total; here is a brief introduction to each.

The first author is Wittawat Jitkrittum, who has just graduated with a Ph.D. from the Gatsby Computational Neuroscience Unit at University College London. His main research during his Ph.D. was in “Statistical Tests”, particularly focusing on how to use “Kernel Methods” to test “Distributional Features”. Jitkrittum completed his undergraduate studies in Thailand and obtained his master’s degree from the Tokyo Institute of Technology in Japan. In recent years, Jitkrittum has published multiple high-quality papers at conferences such as NIPS, ICML, and UAI, making him a rising star in the field of statistical testing.

The second author, Wenkai Xu, is a Ph.D. student at the Gatsby Computational Neuroscience Unit.

The third author, Zoltán Szabó, comes from a renowned French institution, École Polytechnique. Szabó has previously worked at the Gatsby Computational Neuroscience Unit and currently holds the position of research associate professor at École Polytechnique, where he has been involved in research on kernel methods, information theory, and statistical machine learning.

The fourth author, Kenji Fukumizu, is a professor at the Institute of Statistical Mathematics, specializing in the research of kernel methods.

The last author, Arthur Gretton, is a machine learning professor at the Gatsby Computational Neuroscience Unit, with a long-standing focus on machine learning, particularly in the area of kernel methods. His papers have been cited over 9,000 times.

Main Contributions and Core Methods of the Paper #

First, let’s take a look at the main contributions of this paper and understand what problems it solves.

In general modeling scenarios, we often propose a model for a set of data to describe the underlying process that generated it. This process is usually hidden and cannot be observed directly. So, after proposing a model, how do we know whether it accurately describes reality? This is where statistical testing methods come in.

One common approach is to denote our model by P and the data-generating distribution by Q; the task is then to verify whether P and Q are equal. One basic way to do this is to “generate” a set of samples from P and another set from Q, and then use a “two-sample test” to check whether the distributions behind these two sets of data are the same.
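
For intuition, here is a minimal sketch of this workflow using a classical univariate two-sample test from SciPy; the Kolmogorov–Smirnov test is chosen purely for illustration and is not the test used in the paper.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
samples_p = rng.normal(loc=0.0, scale=1.0, size=500)  # samples generated from the model P
samples_q = rng.normal(loc=0.3, scale=1.0, size=500)  # samples from the data distribution Q

# A small p-value is evidence that P and Q are different distributions.
stat, p_value = ks_2samp(samples_p, samples_q)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.4f}")
```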

This idea seems flawless, but implementing it is often challenging. The biggest practical difficulty is “generating samples from P.” For example, if P is a deep neural network model, generating samples from it is neither simple nor computationally cheap, which makes “two-sample testing” built on this approach difficult.

On the other hand, when conducting such statistical tests, it would be ideal to obtain a numerical value for each data point describing how well that point agrees with the model, so that we get a more intuitive picture of whether the model fits the data.

Here, a testing method called “Maximum Mean Discrepancy” (MMD) can achieve this effect. MMD was proposed by the paper’s last author, Arthur Gretton, and collaborators at NIPS 2006 as a method for testing whether two samples come from the same distribution. When the MMD value is large, the two samples are more likely to come from different distributions.

Compared with general methods for measuring the distance between two distributions, the distinctive feature of MMD is that it maps both distributions, via a kernel, into another space called a “Reproducing Kernel Hilbert Space” (RKHS), where the distance between them becomes easier to measure. Unfortunately, MMD still requires samples from both distributions, which means we still need to obtain samples from P.
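
To make this concrete, here is a minimal Python sketch of the (biased) empirical estimate of MMD² with a Gaussian kernel; the kernel choice and the bandwidth `sigma` are illustrative assumptions, not the paper’s settings.

```python
import numpy as np

def gaussian_gram(A, B, sigma):
    """Gram matrix of the Gaussian kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-sq_dists / (2.0 * sigma**2))

def mmd2_biased(X, Y, sigma=1.0):
    """Biased empirical estimate of MMD^2 between samples X ~ P and Y ~ Q."""
    return (gaussian_gram(X, X, sigma).mean()
            + gaussian_gram(Y, Y, sigma).mean()
            - 2.0 * gaussian_gram(X, Y, sigma).mean())
```

Note that the estimate requires both `X` (samples from P) and `Y` (samples from Q); removing the dependence on samples from P is exactly what the rest of the paper addresses.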

Therefore, the biggest contribution of this paper is a series of techniques that make the comparison between P and Q independent of sampling from P, so that validating the model against the data relies only on the so-called “score function” of P, i.e., the gradient of its log-density, which can be evaluated without drawing samples from P and without knowing its normalizing constant.

In fact, this kind of construction already exists in MMD: the samples drawn from P and from Q are each transformed by a function f (drawn from some function class), then by an operator T, and finally the two transformed sample means are subtracted.

In this paper, the authors build on the concept of the “Kernel Stein Discrepancy” (KSD) and the corresponding “KSD test,” whose essential idea is to make the term involving P in this two-term expression equal to zero.

What does this mean? As mentioned earlier, one problem with MMD is that it still relies on P and samples from P. Assuming we can make the term depending on samples from P equal to zero, then our test would not require samples from P, bypassing the aforementioned difficulty.

The essence of KSD is to make the second, P-dependent term of this MMD-style statistic zero at all times. By “at all times” we mean that KSD constructs a specific operator T, called the “Stein operator,” under which the expectation of the transformed samples from P is zero regardless of which f is chosen; the paper proves this in detail. As a result, KSD does not rely on samples from P.
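
Written out, the contrast looks as follows. This is the standard one-dimensional form from the KSD literature (the so-called Langevin Stein operator), not necessarily the paper’s exact notation:

$$\mathrm{MMD}(P, Q) = \sup_{f \in \mathcal{F}} \Big( \mathbb{E}_{y \sim Q}[f(y)] - \mathbb{E}_{x \sim P}[f(x)] \Big)$$

$$\mathrm{KSD}(P, Q) = \sup_{f \in \mathcal{F}} \mathbb{E}_{y \sim Q}\big[(T_P f)(y)\big], \qquad (T_P f)(x) = f(x)\,\partial_x \log p(x) + \partial_x f(x)$$

The P-term is always zero because, under mild boundary conditions,

$$\mathbb{E}_{x \sim P}\big[(T_P f)(x)\big] = \int \big(f(x)\,p'(x) + f'(x)\,p(x)\big)\,dx = \int \frac{d}{dx}\big(f(x)\,p(x)\big)\,dx = 0 .$$

Note that $T_P$ involves only $\partial_x \log p(x)$, the score function of P, so the normalizing constant of p cancels and no samples from P are needed.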

This paper not only explains the idea of KSD but also takes it a step further, reducing the computational complexity of the KSD test from quadratic in the number of samples to linear. In other words, the goal is for the cost of the test to grow only linearly as the number of data points grows, so that it can be applied to big data. We won’t go into the details here.
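
To show why linear time is plausible, here is a rough Python sketch in the spirit of the paper’s linear-time statistic (the FSSD): Stein-operator features are evaluated at a few fixed test locations and averaged over the data in a single O(n) pass. The Gaussian kernel, the bandwidth, and the names used here are illustrative assumptions; the paper’s actual estimator uses a slightly different, unbiased form.

```python
import numpy as np

def fssd2_sketch(X, score_fn, locations, sigma=1.0):
    """Simplified, biased FSSD^2-style statistic; one O(n) pass over the data.

    X         : (n, d) samples from the data distribution Q
    score_fn  : maps (n, d) -> (n, d), the model's score function grad_x log p(x)
    locations : (J, d) fixed test locations v_1, ..., v_J
    """
    S = score_fn(X)                                            # (n, d)
    feature_means = []
    for v in locations:
        diff = X - v                                           # (n, d)
        k = np.exp(-np.sum(diff**2, axis=1) / (2 * sigma**2))  # k(x_i, v), shape (n,)
        grad_k = -diff / sigma**2 * k[:, None]                 # grad_x k(x, v), (n, d)
        xi = S * k[:, None] + grad_k                           # Stein feature, (n, d)
        feature_means.append(xi.mean(axis=0))                  # average over all samples
    tau = np.concatenate(feature_means)                        # (J * d,)
    return np.sum(tau**2) / tau.size

# Example: model P = standard normal (score(x) = -x), data Q slightly shifted.
rng = np.random.default_rng(0)
X = rng.normal(loc=0.5, size=(2000, 1))
print(fssd2_sketch(X, lambda x: -x, locations=np.array([[0.0], [1.0]])))
```

Because the data are touched only once per test location, the cost grows as O(n · J) rather than the O(n²) of pairwise-kernel statistics.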

Experimental Results of the Method #

Although the core content of this paper is a theoretical result and an algorithmic innovation, experiments were conducted on the “Restricted Boltzmann Machine” (RBM). Essentially, a small change was made to a single connection of the RBM while the rest of the model was kept the same.

Given only samples from these two RBMs, it is hard to tell the difference between them. In the experiments, the traditional Maximum Mean Discrepancy (MMD) test could barely distinguish the two sets of samples, while both the Kernel Stein Discrepancy (KSD) test and its linear-time version reached the correct conclusion; moreover, the runtime of the linear version indeed grows only linearly with the number of data points.

Finally, the authors took Chicago crime records as an illustration, visualizing a “scoring function” to identify which data points do not conform to the model. For a paper with such heavy theoretical content, this kind of intuitive result is truly commendable.

Summary #

Today I discussed another best research paper from NIPS 2017. The core idea of the paper is a special operator that makes it possible to measure the discrepancy between two distributions without relying on samples from the model distribution; moreover, the resulting test statistic can be computed in linear time.

Let’s review the key points: First, I briefly introduced the author group of this paper. Second, I explained in detail the problem this paper aims to solve and its contributions. Third, I briefly introduced the experimental results of the paper.

Finally, here is a question for you to consider: Besides hypothesis testing, where else in machine learning do we often encounter the need to measure the distance between distributions?