Dynamic Malware Analysis: Detection and Family Classification using Machine Learning
Authors
Hansen, Steven Strandlund ; Larsen, Thor Mark Tampus
Term
4th term
Education
Publication year
2015
Submitted on
2015-06-03
Pages
153
Abstract
This study extends earlier work by our group to address the fast-growing volume of malware. We use dynamic analysis (running programs in a controlled environment to observe their behavior) together with machine learning, both to detect malware and to assign it to known families. Building on the previous setup, a customized version of Cuckoo analyzed about 200,000 malware samples across 30 virtual machines in a scalable, distributed system. For detection, we also analyzed about 850 cleanware (benign) samples. We represented program behavior using API calls and their input arguments, and constructed feature sets based on sequence, frequency, and binary representations. Frequency features covered APIs, binaries (bins), and the most-used DLLs, while binary features captured family signatures. We trained Random Forests (ensembles of decision trees) on these behavioral features, with labels derived from Microsoft's malware detections. To remove redundant or uninformative features, we applied Information Gain Ratio for feature selection. Detection achieved a weighted-average TPR (true positive rate) of 0.969, PPV (positive predictive value, i.e. precision) of 0.970, and AUC (area under the ROC curve) of 0.996. Family classification achieved a weighted-average TPR of 0.865, PPV of 0.872, and AUC of 0.977. These results show that dynamic analysis combined with Random Forests can effectively detect malware and classify it into families, helping to manage the large influx of new malware seen each day.
[This abstract was generated with the help of AI]
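The feature construction and feature selection described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the helper names (`frequency_features`, `binary_features`, `gain_ratio`), the three-API vocabulary, and the four toy traces are all assumptions made up for illustration, and the gain ratio is computed only over binary (present/absent) features, since the abstract does not say how count-valued features were discretized.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy in bits of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())


def gain_ratio(feature, labels):
    """Information gain of `feature` about `labels`, normalized by the
    entropy of the feature itself (the split information)."""
    split_info = entropy(feature)
    if split_info == 0:
        return 0.0  # a constant feature carries no information
    total = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [lab for f, lab in zip(feature, labels) if f == value]
        conditional += len(subset) / total * entropy(subset)
    return (entropy(labels) - conditional) / split_info


def frequency_features(trace, vocabulary):
    """How many times each API in `vocabulary` appears in the trace."""
    counts = Counter(trace)
    return [counts[api] for api in vocabulary]


def binary_features(trace, vocabulary):
    """1 if the API appears at least once in the trace, else 0."""
    seen = set(trace)
    return [1 if api in seen else 0 for api in vocabulary]


# Hypothetical toy data: invented API-call traces, not samples from the study.
VOCAB = ["CreateFileW", "RegSetValueExW", "WriteProcessMemory"]
TRACES = [
    (["CreateFileW", "WriteProcessMemory", "WriteProcessMemory"], "malware"),
    (["RegSetValueExW", "WriteProcessMemory"], "malware"),
    (["CreateFileW", "RegSetValueExW"], "cleanware"),
    (["CreateFileW"], "cleanware"),
]

labels = [label for _, label in TRACES]
matrix = [binary_features(trace, VOCAB) for trace, _ in TRACES]

# Score each feature column; low-scoring features are candidates for removal.
scores = {
    api: gain_ratio([row[i] for row in matrix], labels)
    for i, api in enumerate(VOCAB)
}
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy data `WriteProcessMemory` separates the two classes perfectly and therefore ranks first, while `RegSetValueExW` appears equally often in both classes and scores zero. A real pipeline would then train a classifier such as a Random Forest on the retained columns.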
