Malware cluster analysis based on Windows API function call sequences

Author

Stegger, Peter Grinderslev

Term

4th term

Education

Master of Information Technology, Software Development (Continuing education)

Publication year

2021

Submitted on

2021-04-30

Abstract

This project investigates whether malicious software (malware) can be grouped by its runtime behavior, specifically by the order in which it calls functions of the Windows application programming interface (API calls). Malware samples are executed in a controlled test environment (a Cuckoo sandbox) on a virtual Windows machine. The resulting reports are processed in a Python application to extract sequences of API names. Three datasets are built from these sequences. In all three, only the first 200 API calls from each sample are kept. In the second dataset, less informative calls are filtered out so that only the most significant remain. The third dataset applies the same filtering and additionally collapses repeated sequences of API calls to reduce redundancy. Differences between sequences are measured with the Levenshtein distance (an edit distance that counts how many changes are needed to turn one sequence into another), normalized by dividing by the length of the longest sequence. The datasets are clustered using the OPTICS and hierarchical clustering algorithms. Cluster quality is evaluated with the silhouette coefficient and by visually inspecting plotted distance matrices. The results show that malware can be clustered based on its sequences of API calls. The best clusters are obtained with OPTICS on the third dataset, where the highest mean silhouette score is 0.8 when noise is ignored and 0.6 when it is included, indicating tight and well-separated clusters. The project highlights the potential of further research into temporal (sequence-based) malware analysis, especially using API call patterns.

[This abstract has been rewritten with the help of AI based on the project's original abstract]
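The distance measure described in the abstract can be illustrated with a minimal sketch. It is not the project's code: the function names are hypothetical, "collapsing" is simplified here to dropping immediate repeats of a single API name, and the normalization is assumed to divide by the longer of the two sequences being compared. The abstract does not state these details, nor the order in which truncation and collapsing are applied.

```python
from typing import List, Sequence


def collapse_runs(calls: Sequence[str]) -> List[str]:
    """Drop immediate repeats of the same API name.

    Assumption: a simplified stand-in for the thesis's collapsing of
    repeated call sequences, whose exact rule is not given in the abstract.
    """
    out: List[str] = []
    for name in calls:
        if not out or out[-1] != name:
            out.append(name)
    return out


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Edit distance over sequence items (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the rolling row as short as possible
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_distance(a: Sequence, b: Sequence) -> float:
    """Levenshtein distance divided by the longer length (assumed), in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return levenshtein(a, b) / longest


def distance_matrix(samples: Sequence[Sequence[str]],
                    max_calls: int = 200) -> List[List[float]]:
    """Pairwise normalized distances over the first max_calls calls per sample."""
    trimmed = [list(s[:max_calls]) for s in samples]
    n = len(trimmed)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = normalized_distance(trimmed[i], trimmed[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix
```

A symmetric matrix of this kind is the usual input for clustering with a precomputed metric, which is one way the OPTICS and hierarchical clustering steps could consume it.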

Keywords

Malware analysis ; Levenshtein ; cluster analysis ; Windows API functions ; sequential data

An executive master's programme thesis from Aalborg University
