Optimizing the Performance of Machine Learning Algorithms in Detecting Malicious Files using Hybrid Models
Author
Haque, A S M Farhan Al
Term
4th semester
Publication year
2023
Submitted on
2023-08-04
Pages
71
Abstract
For years, attackers have hidden malicious code inside seemingly ordinary files to bypass security checks. Portable Document Format (PDF) files are widely used and can contain JavaScript and embedded files, while Portable Executable (PE) files are Windows executables. These properties make both formats attractive carriers for malware. This project examines how well different branches of machine learning detect such malware. After an extensive review of prior research, two datasets each for PDF and PE files were selected. We test classical methods such as Gaussian Naive Bayes and Logistic Regression, ensemble methods such as Random Forest (bagging) and AdaBoost (boosting), and three variants of Artificial Neural Networks. We then propose a new hybrid approach that combines neural networks with ensemble techniques for both file types. The results show that the hybrid model combining a neural network with AdaBoost performs best. It achieves 99.51% accuracy and a 99.53% F1-score for PDF malware detection, and 98.45% accuracy and a 98.95% F1-score for PE files. Accuracy is the share of files classified correctly, while the F1-score is the harmonic mean of precision and recall, balancing the detection of malware against the avoidance of false alarms. Overall, the findings indicate that combining neural networks with ensemble methods can deliver highly effective malware detection for common file formats.
[This abstract was generated with the help of AI]
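The abstract reports accuracy and F1-score without showing how they are computed. The following is a minimal sketch of the standard definitions for a binary malicious/benign classifier. The function name `accuracy_and_f1` and the sample labels are hypothetical and chosen only for illustration; they are not taken from the thesis or its datasets.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, f1) for binary labels, treating `positive` as malware."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    pairs = list(zip(y_true, y_pred))
    # Malicious files flagged as malicious.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # Benign files flagged as malicious (false alarms).
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # Malicious files that were missed.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Benign files correctly left alone.
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN).
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


# Hypothetical labels: 1 = malicious, 0 = benign.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 0, 0, 1]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.2f} f1={f1:.2f}")  # accuracy=0.80 f1=0.75
```

In this toy example one missed malicious file and one false alarm lower the F1-score more than the accuracy, because the many correctly classified benign files count toward accuracy but not toward F1.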