Optimizing the Performance of Machine Learning Algorithms in Detecting Malicious Files using Hybrid Models
Student thesis: Master programme thesis
- A S M Farhan Al Haque
4. semester, Cyber Security and Privacy, Master (Continuing Education Programme (Master))
The infiltration of digital systems using maliciously crafted files has been an evolving issue for the last two decades. Malicious actors deploy diverse payloads through files that can evade detection mechanisms and cause serious harm. Their universal file formats, support for advanced features such as JavaScript, and ability to embed additional files make Portable Document Format (PDF) and Portable Executable (PE) files an obvious choice for attackers to weaponize. This project explores the performance of different branches of machine learning in malware detection. Two datasets each for PDF and PE files were selected after an extensive review of the existing research. First, Gaussian Naive Bayes (GNB) and Logistic Regression (LR) are applied as classical algorithms. From ensemble classification, Random Forest (RF) is selected as a bagging method and Adaptive Boosting (AdaBoost) as a boosting method. Next, three variants of Artificial Neural Network (ANN) are deployed to improve detection. Finally, a novel hybrid approach integrating ANN and ensemble techniques is proposed for both PDF and PE files, and the hybrid model is found to outperform all the previous models. The hybrid model combining ANN with AdaBoost achieves an accuracy of 99.51% and an F1-score of 99.53% for malware detection in PDF files. Similarly, the hybrid model achieves an accuracy of 98.45% and an F1-score of 98.95% for PE files.
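The comparison outlined in the abstract can be illustrated with a minimal sketch using scikit-learn. Everything specific here is an assumption made for illustration: the data is synthetic (generated with `make_classification`, not the thesis datasets), the ANN is a small `MLPClassifier`, and the hybrid is formed by soft voting between the ANN and AdaBoost. The abstract does not state how the thesis actually combines the two, so this is one plausible way to build such a hybrid, not the author's method, and the scores it prints say nothing about the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for features extracted from files (assumption:
# label 1 = malicious, 0 = benign). These are NOT the thesis datasets.
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)


def make_ann():
    """Small feed-forward network on standardized inputs (hypothetical sizes)."""
    return make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000,
                      random_state=0))


models = {
    "GNB": GaussianNB(),
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=0),
    "ANN": make_ann(),
    # Assumed hybrid: average the predicted class probabilities of the
    # ANN and AdaBoost (soft voting).
    "ANN+AdaBoost": VotingClassifier(
        estimators=[("ann", make_ann()),
                    ("ada", AdaBoostClassifier(n_estimators=100,
                                               random_state=0))],
        voting="soft"),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = (accuracy_score(y_test, pred), f1_score(y_test, pred))
    print(f"{name:>13}: accuracy={results[name][0]:.4f} "
          f"F1={results[name][1]:.4f}")
```

Soft voting is used here only because it is a simple, standard way to combine a neural network with a boosted ensemble. Stacking, or feeding learned ANN features into AdaBoost, would be equally valid readings of "hybrid".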
Language | English |
---|---|
Publication date | 4 Aug 2023 |
Number of pages | 71 |
Keywords | Machine learning, Malware detection, artificial neural network, ensemble technique, PDF malware detection, PE malware detection |