Clustering Analysis of Malware Behavior

Student thesis: Master Thesis and HD Thesis

  • Radu-Stefan Pirscoveanu
4. term, Networks and Distributed Systems, Master (Master Programme)
At present, behavioral classification of malware relies on labels generated by Antivirus (AV) vendors. This study investigates the inconsistencies associated with that practice by applying unsupervised learning to malware behavior. Following the problem isolation, research was undertaken to determine how Antivirus vendors label detected malware and to raise the problem of inconsistency in their labeling results. A customized version of Cuckoo Sandbox was used to collect actions from approximately 270,000 malware samples and to create their behavioral profiles, consisting of Passed and Failed API calls and their respective Return Codes. The detection results of Antivirus vendors were evaluated on Completeness, Consistency and Correctness, and based on this analysis a temporary solution was devised, which involved performing a Majority Vote between multiple vendors. A tokenized Levenshtein ratio was used to implement the vote and determine the appropriate labels for evaluation. After close examination of the limited options available in unsupervised Machine Learning for Feature Selection and for choosing the optimal number of clusters, it was decided to use Principal Component Analysis along with the Gap Statistic. The Self Organizing Map algorithm, preferred for clustering the behavioral data, provided an innovative approach for preserving the topological properties of the high-dimensional information present in the malware dataset. Evaluation of the Self Organizing Map clusterer, taking into consideration the limited range of tools provided by unsupervised learning, showed shortcomings in relying on AV vendors for labeling malware samples. This indicates that no link exists between AV-extracted type labels and the generated behavioral clusters.
To resolve this discrepancy, a cluster-based classification is proposed that is able to accurately classify new malicious software using the clusters created with the Self Organizing Map.
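The Majority Vote over vendor labels mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact definition of the tokenized Levenshtein ratio, so everything below is an assumption made for illustration: labels are split into lowercase alphanumeric tokens, the edit distance is computed over token sequences, the ratio is taken as 1 minus the distance divided by the longer token count, and the function names and the 0.5 threshold are hypothetical rather than taken from the thesis.

```python
import re


def tokenize(label):
    """Split an AV label such as 'Trojan.Win32.Agent' into lowercase tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", label.lower()) if t]


def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def token_ratio(label_a, label_b):
    """Assumed ratio: 1 - token-level edit distance / longer token count."""
    ta, tb = tokenize(label_a), tokenize(label_b)
    longest = max(len(ta), len(tb))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(ta, tb) / longest


def majority_vote(labels, threshold=0.5):
    """Return the label supported by the most vendors, with its vote count.

    A vendor label supports a candidate when their token ratio reaches the
    (hypothetical) threshold. Ties are resolved in favor of the earliest label.
    """
    best_label, best_votes = None, 0
    for candidate in labels:
        votes = sum(1 for other in labels
                    if token_ratio(candidate, other) >= threshold)
        if votes > best_votes:
            best_label, best_votes = candidate, votes
    return best_label, best_votes


# Made-up vendor labels for a single sample, for illustration only.
vendor_labels = [
    "Trojan.Win32.Agent",
    "Trojan/Win32.Agent",
    "Worm.Win32.Allaple",
    "trojan-win32-agent.b",
]
winner, votes = majority_vote(vendor_labels)
```

Comparing token sequences rather than raw characters makes the vote insensitive to vendor-specific separators such as dots, slashes and hyphens, which is one plausible reason for tokenizing before computing the ratio.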
Publication date: 2 Jun 2015
Number of pages: 139
ID: 213481763