Clustering Analysis of Malware Behavior
Student thesis: Master Thesis and HD Thesis
- Radu-Stefan Pirscoveanu
4. term, Networks and Distributed Systems, Master (Master Programme)
At present, behavioral classification of malware relies on labels generated by Antivirus (AV) products. This study investigates the inconsistencies associated with this practice by applying unsupervised learning to malware behavior. After isolating the problem, research was undertaken to determine how Antivirus vendors label detected malware and to highlight the inconsistency of their labeling results. A customized version of Cuckoo Sandbox was used to collect the actions of approximately 270,000 malware samples and to create behavioral profiles consisting of Passed and Failed API calls and their respective Return Codes.
The detection results of Antivirus vendors were evaluated on Completeness, Consistency and Correctness. Based on this analysis, a temporary solution was devised, which involved performing a Majority Vote between multiple vendors. A tokenized Levenshtein ratio was used to implement the vote and to determine the appropriate labels for evaluation. Following close examination of the limited options available in unsupervised Machine Learning for Feature Selection and for determining the optimal number of clusters, Principal Component Analysis was used together with the Gap Statistic. The Self Organizing Map algorithm, chosen for clustering the behavioral data, provided an innovative approach for preserving the topological properties of the high-dimensional information present in the malware dataset.
Upon evaluation of the Self Organizing Map clusterer, and taking into consideration the limited range of tools provided by unsupervised learning, the study showed shortcomings in relying on AV vendors for labeling malware samples. The results indicate that no link exists between the type labels extracted from AV detections and the generated behavioral clusters. To resolve this discrepancy, a cluster-based classification is proposed that is able to accurately classify new malicious software using the clusters created with the Self Organizing Map.
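
The Majority Vote over a tokenized Levenshtein ratio mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact procedure, so everything below is an assumption made for illustration: the `KNOWN_TYPES` list, the tokenization rule (split on non-alphanumeric characters), the ratio normalisation (1 minus distance divided by the longer length), the 0.8 threshold, the tie handling and the sample vendor labels are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical list of malware type names; the thesis may use a different set.
KNOWN_TYPES = ("trojan", "worm", "adware", "backdoor", "virus")


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def ratio(a: str, b: str) -> float:
    """Similarity in [0, 1]; assumed normalisation, not taken from the thesis."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def tokenize(label: str) -> list:
    """Split a vendor label into lowercase alphanumeric tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", label.lower()) if t]


def best_type(label: str, threshold: float = 0.8):
    """Return the known type whose name best matches any token, or None."""
    best, best_score = None, threshold
    for token in tokenize(label):
        for known in KNOWN_TYPES:
            score = ratio(token, known)
            if score >= best_score:
                best, best_score = known, score
    return best


def majority_vote(labels, threshold: float = 0.8):
    """Most frequent matched type across vendors; None on a tie or no match."""
    matched = (best_type(label, threshold) for label in labels)
    votes = Counter(t for t in matched if t is not None)
    if not votes:
        return None
    (top, count), *rest = votes.most_common(2)
    if rest and rest[0][1] == count:
        return None  # assumed tie handling: no label is assigned
    return top


# Made-up vendor labels for a single sample, including one misspelled type.
sample_labels = [
    "Trojan.Win32.Generic",
    "Win32/Trojn.Agent",
    "Worm:Win32/Conficker",
    "TR/Crypt.XPACK.Gen",
]
voted = majority_vote(sample_labels)
```

In this sketch the fuzzy ratio lets a slightly misspelled or abbreviated token such as `Trojn` still count as a vote for `trojan`, while labels with no recognizable type token simply abstain.
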
| Language | English |
|---|---|
| Publication date | 2 Jun 2015 |
| Number of pages | 139 |
| Keywords | malware analysis, Self Organizing Map, malware types, unsupervised, Levenshtein, Principal Component Analysis |