Optimizing the Performance of Machine Learning Algorithms in Detecting Malicious Files using Hybrid Models

Author

Haque, A S M Farhan Al

Term

4th semester

Education

Cyber Security and Privacy, Master

Publication year

2023

Submitted on

2023-08-04

Abstract

The compromise of digital systems through maliciously crafted files has been an evolving problem for the last two decades. Malicious actors deliver diverse payloads in files that can evade detection mechanisms and cause serious harm. Their universal file formats, support for advanced features such as JavaScript, and ability to embed additional files make Portable Document Format (PDF) and Portable Executable (PE) files obvious choices for attackers to weaponize. This project explores the performance of different branches of machine learning in malware detection. Two datasets each for PDF and PE files are selected after an extensive review of the existing research. First, Gaussian Naive Bayes (GNB) and Logistic Regression (LR) are applied as classical algorithms. From ensemble classification, Random Forest (RF) is selected as a bagging method and Adaptive Boosting (AdaBoost) as a boosting method. Next, three variants of Artificial Neural Network (ANN) are deployed to improve detection. Finally, a novel hybrid approach integrating ANN and ensemble techniques is proposed for both PDF and PE files, and it is found to outperform all the previous models. The hybrid model combining ANN with AdaBoost achieves an accuracy of 99.51% and an F1-score of 99.53% for malware detection in PDF files. Similarly, it achieves an accuracy of 98.45% and an F1-score of 98.95% for PE files.
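
For readers unfamiliar with the two metrics quoted in the abstract, the sketch below shows how accuracy and F1-score are conventionally computed from the counts of a binary confusion matrix. It is a minimal illustration, not the thesis's evaluation code: the function names and the counts (`tp`, `tn`, `fp`, `fn`) are hypothetical and are not taken from the thesis datasets.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all files (malicious and benign) classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall for the malicious class."""
    precision = tp / (tp + fp)  # share of flagged files that are truly malicious
    recall = tp / (tp + fn)     # share of malicious files that were flagged
    return 2 * precision * recall / (precision + recall)


# Hypothetical confusion-matrix counts, for illustration only.
tp, tn, fp, fn = 95, 90, 5, 10

acc = accuracy(tp, tn, fp, fn)
f1 = f1_score(tp, fp, fn)
print(f"accuracy={acc:.4f}, f1={f1:.4f}")
```

Because F1 ignores true negatives, it can differ noticeably from accuracy on imbalanced datasets, which is presumably why the abstract reports both figures.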

Keywords

Machine learning; Malware detection; Artificial neural network; Ensemble technique; PDF malware detection; PE malware detection

An executive master's programme thesis from Aalborg University
