• Peter Grinderslev Stegger
This project examines how malware samples can be analysed to find malware clusters based on sequential behaviour in sequences of API function calls. The analysis is conducted by running malware samples through a Cuckoo sandbox hosting a virtual Windows machine. The generated reports are processed in a Python application to extract the API function calls as sequences of API names. Three datasets are created from the API call sequences. All datasets are truncated so that only the first 200 API function calls from each malware sample are included. In addition, the second dataset has selected API function calls filtered out, so that only the most significant calls remain. The third dataset is like the second, but repeated sequences of API function calls are collapsed.
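The construction of the three datasets can be illustrated with a minimal Python sketch. It is not the project's actual code and rests on several assumptions: the set `SIGNIFICANT_CALLS` is a hypothetical placeholder (the abstract does not list which calls are kept), truncation is assumed to happen before filtering, and "collapsing" is assumed to mean merging consecutive repetitions of the same API name into one occurrence.

```python
from itertools import groupby

MAX_CALLS = 200  # stated in the abstract: only the first 200 calls are kept

# Hypothetical selection of "significant" API names, for illustration only.
SIGNIFICANT_CALLS = {
    "NtCreateFile",
    "NtWriteFile",
    "RegSetValueExA",
    "CreateProcessInternalW",
}


def truncate(calls, limit=MAX_CALLS):
    """Keep only the first `limit` API names of a sample."""
    return list(calls[:limit])


def keep_significant(calls, significant=SIGNIFICANT_CALLS):
    """Drop every API name that is not in the (assumed) significant set."""
    return [name for name in calls if name in significant]


def collapse_repeats(calls):
    """Merge runs of identical consecutive API names into a single entry."""
    return [name for name, _run in groupby(calls)]


def build_datasets(calls):
    """Return the three variants of one sample's API call sequence."""
    first = truncate(calls)            # dataset 1: truncated only
    second = keep_significant(first)   # dataset 2: truncated and filtered
    third = collapse_repeats(second)   # dataset 3: filtered and collapsed
    return first, second, third
```

Under these assumptions, a sample that writes to a file many times in a row contributes a single `NtWriteFile` entry to the third dataset, which shortens the sequences that are compared later.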
Distances between API call sequences are calculated with the Levenshtein distance and transformed into a ratio by dividing by the length of the longer sequence. The datasets are clustered using the OPTICS and hierarchical clustering algorithms. The silhouette coefficient is used to evaluate the quality of the clusters, and distance matrices are plotted to allow for visual evaluation as well.
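The distance measure can be sketched as follows. This is a plain Python illustration under the assumption that each API name is treated as one symbol, so that inserting, deleting, or substituting a whole call costs 1; the function names are hypothetical and the project may well have used an existing library instead.

```python
def levenshtein(a, b):
    """Edit distance between two sequences, with whole items as symbols."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer sequence, keep rows short
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def distance_ratio(a, b):
    """Levenshtein distance divided by the length of the longer sequence."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # two empty sequences are treated as identical
    return levenshtein(a, b) / longest


def distance_matrix(sequences):
    """Symmetric matrix of pairwise distance ratios."""
    n = len(sequences)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            ratio = distance_ratio(sequences[i], sequences[j])
            matrix[i][j] = matrix[j][i] = ratio
    return matrix
```

The ratio always lies between 0 (identical sequences) and 1 (nothing in common). A matrix of this kind could be passed to a clustering implementation that accepts precomputed distances, for example scikit-learn's `OPTICS` with `metric="precomputed"`; whether the project used that library is not stated in the abstract.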
The project concludes that it is possible to cluster malware by looking at the sequences of API function calls. The best clusters are found using the OPTICS algorithm on the third dataset. The best result is a mean silhouette score of 0.8 when noise points are excluded and 0.6 when they are included. This shows that highly cohesive clusters of malware can be found using the proposed approach.
The project shows potential for continued research into temporal analysis of malware in general, and into API function calls in particular.
Publication date: 30 Apr. 2021
Number of pages: 55
ID: 410480092