003 Fine Reading of the Best Application Data Science Paper at KDD 2017 #

On Monday, we discussed the best research paper at KDD 2017, and today we will continue to talk about the best applied data science paper of the year.

Unlike research papers, the applied papers at KDD place more emphasis on the methods or systems they describe and on their practical application. For example, many of them summarize systems that are already deployed, which often gives valuable insight to researchers and engineers in industry. As with research papers, the best applied paper of each year deserves careful analysis and discussion, both as classic literature and as a window onto the latest research results.

The title of the best applied data science paper at KDD 2017 is “HinDroid: An Intelligent Android Malware Detection System Based on Structured Heterogeneous Information Network.” It can be said that 2017 was a year when information security received considerable attention. During the 2016 US presidential election, various news reports emerged about Russia using hackers to infiltrate the candidates’ campaigns, which made the entire society exceptionally sensitive to the topic of information security. This paper is about how to intelligently analyze Android malware, making it highly relevant to the times.

Author Information #

The first and second authors of this paper are both from the Department of Computer Science and Electrical Engineering at West Virginia University. The first author, Shifu Hou, is a doctoral student in the department and has published multiple papers. The second author, Yanfang Ye, is an assistant professor in the department. Yanfang Ye received her doctoral degree from Xiamen University in 2010 and then worked on information security research and development at Kingsoft (Jinshan) and at Comodo Security Solutions. She joined West Virginia University as a faculty member in 2013. Because the first author is a student, this KDD paper was also recognized as the best student paper.

The third author, Yangqiu Song, is an assistant professor in the Department of Computer Science at The Hong Kong University of Science and Technology. Yangqiu Song has extensive academic and industrial experience. He joined The Hong Kong University of Science and Technology in 2016, and prior to that, he taught at West Virginia University. From 2012 to 2015, he visited the University of Illinois at Urbana-Champaign, The Hong Kong University of Science and Technology, and Huawei Noah’s Ark Lab. From 2009 to 2012, he worked at Microsoft Research Asia and IBM Research. He received his doctoral degree from Tsinghua University in 2009.

The last author is the Turkish entrepreneur Melih Abdulhayoğlu. He is the CEO of Comodo, a company he founded in 1998. His name is included in this paper because the data used in the research comes from Comodo.

Main Contributions of the Paper #

First, let’s take a look at the main contributions of this paper. As with the best research paper we analyzed on Monday, we first need to understand what problem the paper primarily addresses.

The problem this paper aims to solve is how to effectively detect malicious software on the Android operating system. It is predicted that by 2019, 77.7% of the global mobile market will be smartphones, with Android holding a market share of at least 80% within this category. Because the Android system is open and the Android software market is fragmented, monitoring and analyzing Android software poses significant challenges. There is a constant stream of malicious software in the Android ecosystem, such as Geinimi, DroidKungfu, and Lotoor. A more pessimistic statistic comes from Symantec’s “Internet Security Threat Report,” which claims that one-fifth of Android software is malicious.

Previously, many methods for analyzing and detecting malicious software relied on a form of “fingerprinting” technology, but such techniques were often circumvented by new tactics employed by malicious software developers. Therefore, finding more complex and effective detection methods has become a goal pursued by information security companies.

The main contribution of this paper is a new detection method built on the API calls of the Android system. The method uses a structured heterogeneous information network to model the API patterns of Android programs in a richer way, so as to understand the semantics of the entire program. On top of this network, the authors employ a technique called “multi-kernel learning” to classify the semantic patterns of programs.

Ultimately, the method proposed in the paper achieves a very high level of accuracy on real data from Comodo, surpassing several mainstream methods currently in use. Additionally, Comodo has already deployed this method in its product.

Core Methods of the Paper #

Having understood the purpose and contributions of this paper, let’s now analyze the methods proposed by the authors.

First, the Android program code has to be converted into a form that can be analyzed. An Android App is typically distributed as an APK package whose code is stored as a Dalvik executable file with the extension .dex, which cannot be analyzed directly. This executable therefore needs to be disassembled into Smali code with the disassembler from the Smali toolchain (baksmali). The semantics of the software can then be parsed from the Smali code. The authors extract all API calls from the Smali code and model program behavior through API analysis.
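To make the extraction step concrete, here is a minimal sketch of pulling API calls out of Smali text. It assumes that API calls show up as `invoke-*` instructions, which is how Smali writes method calls. The regular expression, the function name `extract_api_calls`, and the sample snippet are illustrative assumptions, not the authors' actual pipeline.

```python
import re

# Matches Smali invoke instructions such as:
#   invoke-virtual {v0, v1}, Ljava/lang/Runtime;->exec(Ljava/lang/String;)Ljava/lang/Process;
# Captures the invoke kind, the class descriptor, and the method name.
INVOKE_RE = re.compile(r"^\s*(invoke-[\w/-]+)\s+\{[^}]*\},\s*(L[^;]+;)->([^\s(]+)\(")

def extract_api_calls(smali_text):
    """Return (invoke_kind, api) pairs in order of appearance."""
    calls = []
    for line in smali_text.splitlines():
        m = INVOKE_RE.match(line)
        if m:
            kind, cls, method = m.groups()
            calls.append((kind, f"{cls}->{method}"))
    return calls

# Hypothetical Smali method body used only for illustration.
sample = """
.method public run()V
    invoke-static {}, Ljava/lang/Runtime;->getRuntime()Ljava/lang/Runtime;
    move-result-object v0
    const-string v1, "su"
    invoke-virtual {v0, v1}, Ljava/lang/Runtime;->exec(Ljava/lang/String;)Ljava/lang/Process;
    return-void
.end method
"""
calls = extract_api_calls(sample)
for kind, api in calls:
    print(kind, api)
```

Keeping the invoke kind next to each API is a deliberate choice here, since one of the relations discussed later compares how APIs are invoked.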

The next step is to explore the patterns within the complex API calls. To do so, the authors construct four types of matrices that capture the basic relationships among Apps and APIs:

  1. Whether an App includes a specific API.
  2. Whether two APIs appear simultaneously in a segment of code.
  3. Whether two APIs belong to the same package.
  4. Whether two APIs use the same invocation method.

These matrices capture basic information between APIs and an App, as well as characteristics of a series of API co-occurrences. These matrices serve as the basis for discovering higher-order patterns.
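The four relations can be sketched as 0/1 matrices built from a toy corpus. The input layout (Apps as lists of code blocks, each block a list of `(invoke_kind, api)` pairs), the matrix names, and the helper `package_of` are all hypothetical. The third relation is taken here to mean "same package", which is an assumption about the paper's definition. Diagonals are left at zero for simplicity.

```python
from itertools import combinations

# Hypothetical toy input: App -> list of code blocks -> (invoke_kind, api) pairs.
apps = {
    "app_benign": [
        [("invoke-virtual", "Landroid/widget/TextView;->setText")],
    ],
    "app_suspicious": [
        [("invoke-static", "Ljava/lang/Runtime;->getRuntime"),
         ("invoke-virtual", "Ljava/lang/Runtime;->exec")],
        [("invoke-virtual", "Landroid/widget/TextView;->setText")],
    ],
}

app_names = sorted(apps)
apis = sorted({api for blocks in apps.values() for block in blocks for _, api in block})
api_idx = {api: j for j, api in enumerate(apis)}

def zeros(rows, cols):
    return [[0] * cols for _ in range(rows)]

def package_of(api):
    # "Ljava/lang/Runtime;->exec" -> "java/lang"
    cls = api.split(";->")[0][1:]
    return cls.rsplit("/", 1)[0]

contains = zeros(len(app_names), len(apis))   # App x API: App includes the API
same_block = zeros(len(apis), len(apis))      # API x API: co-occur in a code block
same_package = zeros(len(apis), len(apis))    # API x API: share a package (assumed)
same_invoke = zeros(len(apis), len(apis))     # API x API: share an invoke kind

kinds = {}  # api -> set of invoke kinds it was seen with
for i, name in enumerate(app_names):
    for block in apps[name]:
        for kind, api in block:
            contains[i][api_idx[api]] = 1
            kinds.setdefault(api, set()).add(kind)
        for (_, a), (_, b) in combinations(block, 2):
            same_block[api_idx[a]][api_idx[b]] = 1
            same_block[api_idx[b]][api_idx[a]] = 1

for a, b in combinations(apis, 2):
    ia, ib = api_idx[a], api_idx[b]
    if package_of(a) == package_of(b):
        same_package[ia][ib] = same_package[ib][ia] = 1
    if kinds[a] & kinds[b]:
        same_invoke[ia][ib] = same_invoke[ib][ia] = 1

print(apis)
print(contains)
```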

To discover more complex patterns, the authors introduce a tool called the heterogeneous information network. The concept was originally proposed by Jiawei Han, an authority in data mining at the University of Illinois at Urbana-Champaign, and his then student Yizhou Sun (currently a professor at the University of California, Los Angeles). The core idea of a heterogeneous information network is to express complex patterns among a set of entities.

Traditional methods represent entities as nodes in a graph, with the relationships between entities represented as links between nodes. This approach overlooks the differences in entities themselves and the different types of relationships. A heterogeneous information network is a modeling tool that more comprehensively and systematically represents multiple entities and entity relationships. In this paper, there are two types of entities: App and API calls, and there are four types of relationships (corresponding to the previously defined matrices). The matrices defined earlier are actually the adjacency matrices of the graphs corresponding to these four types of relationships.
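As a rough illustration of what "typed" means here, the sketch below stores a node type for every node and checks each edge against a schema. The class `HeteroNetwork`, the schema layout, and the relation labels are hypothetical names chosen for this example. In particular, `same_package` assumes the third relation groups APIs by package.

```python
from collections import defaultdict

# Hypothetical schema: relation name -> (source node type, target node type).
SCHEMA = {
    "contains": ("App", "API"),
    "same_block": ("API", "API"),
    "same_package": ("API", "API"),
    "same_invoke": ("API", "API"),
}

class HeteroNetwork:
    """A graph whose nodes and edges both carry a type."""

    def __init__(self, schema):
        self.schema = schema
        self.node_type = {}            # node id -> node type
        self.edges = defaultdict(set)  # relation name -> {(src, dst)}

    def add_node(self, node, ntype):
        self.node_type[node] = ntype

    def add_edge(self, relation, src, dst):
        # Reject edges whose endpoint types do not match the schema.
        src_type, dst_type = self.schema[relation]
        if (self.node_type[src], self.node_type[dst]) != (src_type, dst_type):
            raise ValueError(f"{relation} must link {src_type} -> {dst_type}")
        self.edges[relation].add((src, dst))

hin = HeteroNetwork(SCHEMA)
hin.add_node("app_1", "App")
hin.add_node("Ljava/lang/Runtime;->exec", "API")
hin.add_node("Ljava/lang/Runtime;->getRuntime", "API")
hin.add_edge("contains", "app_1", "Ljava/lang/Runtime;->exec")
hin.add_edge("same_block", "Ljava/lang/Runtime;->exec", "Ljava/lang/Runtime;->getRuntime")
```

An untyped graph would accept any edge. Here, an attempt to add a `contains` edge from an API to an App fails, and that check is what the extra type information buys.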

After describing the relationships between Apps and APIs as a heterogeneous information network, the next step is to define higher-order relationship patterns. To define these complex relationships more precisely, the authors use a tool called the meta-path. A meta-path provides a descriptive template language for defining higher-order relationships.

For example, we can define a “path” from an App to an API and then to another App, which describes the possibility that two Apps both contain the same API calls. This path can help us deduce more complex matrices from the initial four matrices to express additional information. Based on domain knowledge (in this case, the security domain), the authors define up to 16 meta-paths to comprehensively capture various relationships between Apps and APIs.
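One common way to turn a meta-path into a matrix is to multiply the adjacency matrices along the path. The sketch below does this for two paths on toy data. The matrices, their names, and the helpers `matmul` and `transpose` are illustrative assumptions, and the second path is only an example of the kind of path a domain expert might define.

```python
def matmul(X, Y):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

# Toy matrices (hypothetical): 2 Apps x 3 APIs, plus one API-to-API relation.
contains = [[1, 0, 0],
            [1, 1, 1]]
same_block = [[0, 0, 0],
              [0, 0, 1],
              [0, 1, 0]]

# Meta-path App -> API -> App: entry (i, j) counts APIs shared by App i and App j.
app_api_app = matmul(contains, transpose(contains))

# Meta-path App -> API -(same block)-> API -> App: entry (i, j) counts pairs of
# APIs, one in App i and one in App j, that co-occur in some code block.
app_block_app = matmul(matmul(contains, same_block), transpose(contains))

print(app_api_app)    # [[1, 1], [1, 3]]
print(app_block_app)  # [[0, 0], [0, 2]]
```

Each resulting matrix is App by App, so it can be read as a similarity between programs under one particular notion of "related".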

Once the semantic representation of the program is built using the heterogeneous information network and meta-paths, the next step is to identify malicious software. Here, the authors adopt the idea of multi-kernel learning. In essence, each new App-to-App matrix generated by a meta-path is treated as a “kernel”, that is, a measure of similarity between Apps. Multi-kernel learning then trains a classifier on a weighted combination of these kernels, and the weights are obtained during training. In other words, the process simultaneously learns a classifier that decides whether a program is malicious and how much each kernel, and therefore each meta-path, should contribute to that decision.
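The kernel idea can be made concrete with a deliberately simplified sketch. It is not the optimizer used in the paper: the weights come from a kernel-target alignment heuristic rather than joint optimization, and the classifier is a kernel perceptron rather than an SVM. All names, kernels, and labels below are hypothetical toy data.

```python
def alignment(K, y):
    """Kernel-target alignment: <K, yy^T> / (||K||_F * ||yy^T||_F), labels in {+1, -1}."""
    n = len(y)
    num = sum(K[i][j] * y[i] * y[j] for i in range(n) for j in range(n))
    norm_k = sum(K[i][j] ** 2 for i in range(n) for j in range(n)) ** 0.5
    return num / (norm_k * n) if norm_k else 0.0

def combine(kernels, weights):
    """Weighted sum of kernel matrices."""
    n = len(kernels[0])
    return [[sum(w * K[i][j] for w, K in zip(weights, kernels)) for j in range(n)]
            for i in range(n)]

def kernel_perceptron(K, y, epochs=20):
    """Return per-example mistake counts (dual coefficients)."""
    n = len(y)
    alpha = [0] * n
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            score = sum(alpha[j] * y[j] * K[j][i] for j in range(n))
            if y[i] * score <= 0:
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha

def predict(K, y, alpha, i):
    score = sum(alpha[j] * y[j] * K[j][i] for j in range(len(y)))
    return 1 if score > 0 else -1

# Four toy Apps: +1 = malicious, -1 = benign.
y = [1, 1, -1, -1]
k_informative = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
k_noise = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1]]
kernels = [k_informative, k_noise]

# Weight each kernel by how well it agrees with the labels (negatives clipped).
scores = [max(alignment(K, y), 0.0) for K in kernels]
total = sum(scores)
weights = [s / total for s in scores] if total else [1 / len(kernels)] * len(kernels)

K = combine(kernels, weights)
alpha = kernel_perceptron(K, y)
predictions = [predict(K, y, alpha, i) for i in range(len(y))]
print(weights, predictions)
```

The point of the sketch is the shape of the computation: several App-to-App similarity matrices go in, one weight per matrix comes out, and the classifier only ever sees the weighted sum. In a production setting a proper multi-kernel solver would replace both heuristics.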

Experimental Results of the Method #

The authors used Comodo’s dataset and collected information from 1834 apps over a two-month period in 2017. The dataset consisted of an almost equal number of normal programs and malicious programs. Additionally, there was another dataset with information from 30,000 apps, also with an almost equal distribution of positive and negative examples. Based on the experimental results, the multi-kernel learning approach incorporating 16 predefined meta-paths can achieve an F1 score of up to 98%. The F1 score is the harmonic mean of precision and recall, so it balances the two. The accuracy is also as high as 98%.
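For readers who want the F1 definition spelled out, here is a small sketch. The confusion counts are made up for illustration and are not taken from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp)   # of the Apps flagged as malicious, how many really are
    recall = tp / (tp + fn)      # of the truly malicious Apps, how many were flagged
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Hypothetical counts: 90 true positives, 10 false positives, 30 false negatives.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because the harmonic mean is pulled toward the smaller value, a detector cannot reach a high F1 by maximizing precision or recall alone.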

The paper also compared the method with other popular approaches, such as neural networks, Naive Bayes classifiers, decision trees, and support vector machines. The F1 scores of these methods ranged from 85% to 95%, clearly below that of the proposed method. Furthermore, the paper compared the method with some commercial software, such as Norton, Lookout, and CM. The accuracy of these commercial products hovered around 92%. On this evidence, the proposed method is indeed more effective than many previous methods.

Summary #

Today I talked to you about the best applied data science paper at KDD 2017. This paper proposes a way to analyze the behavior of Android smartphone software in order to detect malicious applications. Let’s review the key points. First, we briefly introduced the authors of the paper. Second, we looked in detail at the problem the paper aims to solve and its contributions. Third, we briefly analyzed the core of the proposed method.

In summary, the problem addressed in the paper is how to effectively detect malicious software on the Android smartphone system. The main contribution is the introduction of a new method based on structured heterogeneous information networks to understand the semantics of Android programs. Complex relationships are defined using meta-paths, and a multi-kernel learning method is used to classify malicious software. The paper uses the dataset from Comodo to validate that the proposed method is more effective than some other popular methods.

Finally, I’ll leave you with a question to ponder: is the multi-kernel learning method used in the paper necessary? Can it be replaced with other methods?

Further reading: HinDroid: An Intelligent Android Malware Detection System Based on Structured Heterogeneous Information Network