001 Chat About the Test of Time Award at KDD 2017 #

The ACM SIGKDD Conference on Knowledge Discovery and Data Mining, known as KDD, is the top conference in the field of data mining. It is hosted by the Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD) of the Association for Computing Machinery (ACM).

KDD started in 1989 as a workshop held alongside top artificial intelligence conferences such as IJCAI and AAAI. In 1995 it became a standalone conference, and it has now been held for more than 20 years. This year's KDD was held in Halifax, Canada, from August 13th to 17th.

Each year SIGKDD presents a "Test of Time Award" to a paper that has had a significant impact on research, methodology, and practice over the past ten years. Citation count and influence on the field are important selection criteria.

The 2017 KDD Test of Time Award went to Thorsten Joachims, Director of Information Science and a professor in the Department of Computer Science at Cornell University in the United States. He was honored for his paper "Training Linear SVMs in Linear Time," which also won the Best Paper Award at KDD 2006 and has been cited over 1,600 times.

Thorsten’s Academic Contributions #

Thorsten is a renowned scholar in machine learning and a Fellow of both ACM and AAAI. His papers have been cited more than 40,000 times in total. After completing his PhD at the University of Dortmund in Germany in 2001, he joined Cornell University to pursue machine learning research.

Prior to receiving this award, Thorsten had already won several significant honors, including the Best Paper Award at ACM WSDM in 2017, Test of Time Awards at ACM SIGIR in 2016 and ACM KDD in 2015, the Best Paper Award at ECML in 2009, the 10-Year Best Paper Award at ICML in 2009, the Best Paper Award at ACM KDD in 2006, the Best Paper Award at ICML in 2005, the Outstanding Student Paper Award at ICML in 2005, and the Best Student Paper Award at ACM KDD in 2005.

Thorsten has made remarkable contributions to machine learning. First, he has done extensive work on making Support Vector Machines (SVMs) practical. The Test of Time Award, for example, recognizes his achievement in reducing SVM training to linear complexity, which made it possible to apply SVMs to large-scale data.

Thorsten has also worked to extend the basic SVM algorithms, originally designed only for classification and regression, to more complex structured output tasks, giving rise to what are commonly called structured support vector machines. Thanks to this work, SVMs can directly optimize many complex non-binary evaluation metrics used in information retrieval, such as the F1 score and Mean Average Precision, greatly broadening their applications.
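To see why a metric like F1 needs structured optimization, note that it is computed over the whole prediction vector rather than as a sum of independent per-example losses. A minimal illustration (the function below is my own sketch, not code from the paper):

```python
def f1_score(y_true, y_pred):
    """F1 over binary labels in {0, 1}. The score depends jointly on all
    predictions (via precision and recall), so it cannot be decomposed
    into a sum of per-example losses - the property that structured SVMs
    are designed to handle."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Flipping a single prediction changes precision and recall for the entire set, which is exactly what makes such metrics awkward for classic per-example training objectives.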

In the process of applying SVMs to information retrieval, Thorsten also identified another problem: how to use implicit user feedback from search engines to train ranking algorithms (often structured SVM models). Traditional search and information retrieval systems rely primarily on manually labeled training data for optimization and evaluation, that is, human judgments of whether a query is relevant to the corresponding web pages.

Early on, researchers observed that while search engines could use such labeled data to optimize ranking algorithms, the engines also generated large amounts of user data during normal use, such as clicks on search results and dwell time on result pages. Initially this information went unused. Thorsten and a group of researchers recognized the importance of click data and began using it to train and evaluate ranking algorithms. This is his second major academic contribution.

Thorsten’s third major academic contribution, and his recent academic success, is combining causal inference with machine learning, which enables training models in a less biased manner. It can be said that this work has opened up a new field.

For a long time, how to effectively apply user-generated interaction data to model training has been a challenge in large-scale machine learning, especially in industrial machine learning. On one hand, industrial systems generate a lot of user data; on the other hand, these user data are influenced by the currently deployed systems and generally have certain biases.

Therefore, industrial-grade machine learning systems face a long-standing challenge: how to consider such biases when evaluating and training models in order to remove them.

Thorsten successfully introduced propensity scoring techniques from causal inference, together with ideas from multi-armed bandits, into machine learning, making unbiased training possible. New research and ideas in this area are gaining increasing traction in machine learning research and applications.
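As a rough illustration of the propensity-scoring idea (a generic inverse propensity scoring estimator, not Thorsten's exact formulation): logged rewards are reweighted by the inverse of the probability with which the deployed system chose each action, which corrects the bias that system introduced into the logs.

```python
def ips_estimate(logged, target_prob):
    """Off-policy value estimate via inverse propensity scoring (IPS).

    logged: list of (action, reward, propensity) tuples, where propensity
            is the logging policy's probability of choosing that action.
    target_prob: function giving the new policy's probability of an action.
    """
    n = len(logged)
    # Reweighting by target_prob / logging propensity corrects the bias
    # introduced by the currently deployed system.
    return sum(target_prob(a) / p * r for a, r, p in logged) / n
```

For example, with uniform logging over two actions (propensity 0.5) and a target policy that always picks action 0, only samples where action 0 was logged contribute, each upweighted by a factor of 2.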

Large-scale Linear Support Vector Machines #

Returning to the award-winning paper: it addresses large-scale optimization of support vector machines, particularly linear SVMs. The paper proposes the first simple and practical training algorithm for linear SVMs, including support for ordinal regression. For classification the algorithm runs in O(sn), where s is the average number of non-zero features per example and n is the number of data points, i.e., linear complexity in the data. For ordinal regression the complexity is O(sn log(n)). The algorithm is simple, efficient, and easy to implement, and in theory extends to the kernelized case.
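The role of sparsity is visible in how the primal objective is evaluated: with sparse feature vectors, each dot product touches only the non-zero entries, so one pass over the data costs O(sn). A schematic sketch in Python (my notation and a C/n loss scaling, not the paper's code):

```python
def svm_primal_objective(w, data, C):
    """Primal linear SVM objective: 0.5*||w||^2 + (C/n) * sum of hinge losses.

    w:    sparse weight vector as a dict {feature_index: value}
    data: list of (x, y) with x a sparse dict and y in {-1, +1}

    Each sparse dot product costs O(s), where s is the number of non-zero
    features, so one full evaluation over n examples costs O(s * n).
    """
    n = len(data)
    reg = 0.5 * sum(v * v for v in w.values())
    hinge = 0.0
    for x, y in data:
        score = sum(w.get(j, 0.0) * v for j, v in x.items())
        hinge += max(0.0, 1.0 - y * score)
    return reg + C * hinge / n
```

The point of the paper is that not just objective evaluation but the entire training procedure can be kept at this O(sn) cost.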

Before this work, many linear SVM implementations could not achieve linear complexity. For example, the decomposition algorithms used in LibSVM (developed at National Taiwan University), SVM-Torch, and early versions of SVM-Light handle large numbers of features well, but their complexity in the number of data points n is super-linear.

Other methods whose training time grows linearly with the data actually scale quadratically (N^2) in the number of features N, so none of these earlier methods apply to truly large-scale data. The situation is even worse for ordinal regression SVMs. Ever since ordinal regression SVMs were proposed by the German scholar Ralf Herbrich, they have been solved by reduction to an ordinary binary classification SVM, a transformation that generates O(n^2) training examples and therefore drives the complexity of the whole problem to that magnitude.
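The quadratic blow-up of the classic reduction is easy to see: every pair of examples with different ranks becomes one binary training example on their difference vector. A small sketch (a hypothetical helper of my own, assuming dense feature lists):

```python
from itertools import combinations

def pairwise_examples(data):
    """Classic reduction of ordinal regression to binary classification.

    data: list of (features, rank). Each pair with distinct ranks yields
    one difference-vector example, so n inputs produce O(n^2) pairs.
    """
    pairs = []
    for (x_i, r_i), (x_j, r_j) in combinations(data, 2):
        if r_i != r_j:
            label = 1 if r_i > r_j else -1
            pairs.append(([a - b for a, b in zip(x_i, x_j)], label))
    return pairs
```

Even a few hundred thousand examples would generate tens of billions of pairs, so materializing them is infeasible; this is the bottleneck the paper's structured formulation removes.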

In the paper, Thorsten first reformulates the ordinary classification SVM. He rewrites the traditional classification SVM as a structured SVM and proves a theorem establishing their equivalence. At first glance, this equivalent structured SVM does not seem to add much; however, the dual of the new objective has a special sparsity that makes large-scale training possible. Thorsten then rewrites the objective of the traditional ordinal regression SVM in structured SVM form and again proves equivalence.

After expressing both models as special cases of structured support vector machines, Thorsten applies a specific algorithm for solving structured SVMs - the Cutting-Plane algorithm (hereafter the CP algorithm) - to the two special cases. First, he demonstrates its application to classification. In brief, the algorithm maintains a working set of the constraints still violated by the current iterate, then focuses on satisfying those constraints in the next round.

The process starts with an empty working set, and each iteration solves an SVM subproblem over the current working set. The algorithm terminates once every constraint is violated by less than a global error tolerance. Thorsten gives a detailed proof of the algorithm's correctness and time complexity in the paper. The same approach turns the ordinal regression SVM into a far more computationally efficient optimization.
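The working-set loop can be sketched generically. The two callbacks below stand in for the SVM-specific steps (finding the most violated constraint and re-solving the quadratic subproblem); this is a schematic of the CP algorithm's control flow, not SVM-Perf's actual implementation:

```python
def cutting_plane(solve_subproblem, most_violated, epsilon, max_iter=1000):
    """Generic cutting-plane loop over a growing working set.

    solve_subproblem(working_set) -> solution optimal w.r.t. the
        constraints collected so far (empty set = unconstrained start).
    most_violated(solution) -> (constraint, violation) for the current
        solution; the loop stops once no violation exceeds epsilon.
    """
    working_set = []
    solution = solve_subproblem(working_set)
    for _ in range(max_iter):
        constraint, violation = most_violated(solution)
        if violation <= epsilon:
            break                       # all constraints within tolerance
        working_set.append(constraint)  # keep the violated constraint
        solution = solve_subproblem(working_set)
    return solution
```

As a toy instance, minimizing w^2 subject to w >= 1 can be encoded with solve_subproblem returning the largest lower bound collected so far and most_violated reporting the violation 1 - w; the loop adds the single constraint and converges in one step.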

Thorsten conducts detailed experiments in the paper to demonstrate the effectiveness of the new algorithm. On the data side, he uses five different datasets, including several subsets of the Reuters RCV1 dataset. They range from over 60,000 to over 800,000 data points, with feature counts from tens to over 40,000, making them reasonably representative. For baselines, Thorsten mainly compares against traditional decomposition methods.

There are two key axes of comparison, the first being training time. On all datasets, the proposed algorithms are orders of magnitude faster than the traditional ones, with speedups approaching 100x. For ordinal regression, the traditional algorithms fail to finish on any dataset. Thorsten further demonstrates that training time grows linearly with dataset size, verifying the algorithm's behavior on real data.

The second key metric is whether accuracy is sacrificed. Since algorithmic speedups sometimes come at the cost of accuracy, validating it is essential. Thorsten shows that the proposed algorithm's classification accuracy shows no statistically significant difference from the baselines, confirming its effectiveness.

Thorsten implemented this algorithm in his software package SVM-Perf. This software package has become a standard tool for support vector machine research and development.

Summary #

Today I shared with you Thorsten's paper, a classic in the support vector machine literature. Let's review the key points together: first, Thorsten has made three major academic contributions to machine learning; second, this paper combines solid theoretical proofs with clear algorithms, and the effectiveness of the proposed algorithms is thoroughly validated through experiments. The paper opened the door to the wide application of support vector machines in search, fully deserving both the Best Paper Award at KDD 2006 and this year's Test of Time Award.

Finally, here’s a question for you to ponder: In what application scenarios can linear large-scale support vector machines achieve good results?

Further reading: Training Linear SVMs in Linear Time