Dynamic Malware Analysis: Detection and Family Classification using Machine Learning
Student thesis: Master Thesis and HD Thesis
- Steven Strandlund Hansen
- Thor Mark Tampus Larsen
4. term, Networks and Distributed Systems, Master (Master Programme)
This study continues an earlier research project by members of this group, which aims to address the daily growth in the volume of malware. It performs both detection and family classification based on dynamic analysis using machine learning. Building on the analysis setup from the previous research, a customized version of Cuckoo analyzes approximately 200,000 malware samples using a total of 30 VMs. To cope with the large sample set, the setup is implemented in a scalable and distributed manner. In addition, a smaller setup was made to analyze approximately 850 cleanware samples, which are needed for detection.
API calls with their specified input arguments were used as features, represented in a combination matrix that includes the following representation techniques: sequence, frequency and binary. Sequence was represented in two ways: one where each API is combined with its input arguments and one where they are separated. The frequency representation consisted of APIs, bins and the most used DLLs. Finally, the binary representation captured the signatures of different families.
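The three representation techniques can be illustrated with a minimal sketch. It assumes a trace is simply a list of API call names; the function names, the use of n-grams for the sequence representation, and the example trace are illustrative assumptions, not the feature extraction used in the thesis.

```python
from collections import Counter

def sequence_features(calls, n=2):
    """Sequence view: the set of n-grams of consecutive API calls (assumed encoding)."""
    return {tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)}

def frequency_features(calls):
    """Frequency view: how often each API call occurs in the trace."""
    return Counter(calls)

def binary_features(calls, vocabulary):
    """Binary view: 1 if an item from the vocabulary occurs in the trace, else 0."""
    present = set(calls)
    return {name: int(name in present) for name in vocabulary}

# Hypothetical trace, for illustration only.
trace = ["NtCreateFile", "NtWriteFile", "NtWriteFile", "NtClose"]

seq = sequence_features(trace)
freq = frequency_features(trace)
binary = binary_features(trace, ["NtClose", "NtReadFile"])
```

Concatenating the three views per sample would give one row of a combined feature matrix; how the thesis actually merges them is not specified in this abstract.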
Random Forests are trained on features and labels based on behavioral information extracted from malware detected by Microsoft. Feature selection based on Information Gain Ratio is performed to remove redundant and irrelevant features.
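As a rough illustration of the selection criterion, the sketch below computes the Information Gain Ratio of one discrete feature with respect to the class labels, using the standard definition (information gain divided by the entropy of the feature itself). The function names and toy data are assumptions for illustration; the thesis tooling is not described in this abstract.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())

def gain_ratio(feature, labels):
    """Information gain of `feature` about `labels`, normalised by the feature's entropy."""
    total = len(labels)
    groups = {}
    for f, y in zip(feature, labels):
        groups.setdefault(f, []).append(y)
    # Expected label entropy after splitting on the feature.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature)
    # A constant feature carries no information; avoid dividing by zero.
    return info_gain / split_info if split_info > 0 else 0.0

# Toy data, for illustration only.
labels = ["malware", "malware", "cleanware", "cleanware"]
informative = [1, 1, 0, 0]   # perfectly separates the classes
irrelevant = [1, 0, 1, 0]    # independent of the classes
constant = [1, 1, 1, 1]      # never varies
```

Features are then ranked by this score and those with the lowest scores dropped. Note that the score treats each feature on its own, so removing features that are redundant with one another needs an additional step.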
Detection gave a weighted average TPR, PPV and AUC of 0.969, 0.970 and 0.996, respectively. Family classification gave a weighted average TPR, PPV and AUC of 0.865, 0.872 and 0.977, respectively. From these results, it was concluded that detection and family classification can indeed be done based on dynamic analysis using Random Forests. It is believed that this study can help address the large amount of new malware that emerges every day.
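To make the reported metrics concrete, the following sketch computes a weighted average TPR (recall) and PPV (precision) from true and predicted labels. It assumes each class is weighted by its share of the true samples, which is a common convention; the abstract does not state which weighting was used, and the labels below are hypothetical.

```python
def weighted_tpr_ppv(y_true, y_pred):
    """Per-class TPR and PPV, averaged with weights equal to each class's share of y_true."""
    total = len(y_true)
    w_tpr = w_ppv = 0.0
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        support = sum(1 for t in y_true if t == c)      # true samples of class c
        predicted = sum(1 for p in y_pred if p == c)    # samples predicted as class c
        tpr = tp / support
        ppv = tp / predicted if predicted else 0.0
        w_tpr += support / total * tpr
        w_ppv += support / total * ppv
    return w_tpr, w_ppv

# Hypothetical family labels, for illustration only.
y_true = ["A", "A", "A", "B"]
y_pred = ["A", "A", "B", "B"]
tpr, ppv = weighted_tpr_ppv(y_true, y_pred)
```

With this weighting, the weighted average TPR equals the overall accuracy, which is why large families dominate the figure when the family sizes are unbalanced.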
Language | English |
---|---|
Publication date | 3 Jun 2015 |
Number of pages | 153 |