A master's thesis from Aalborg University

WordAdjust: Et deobfuskerings frontend til content-aware anti-spam værktøjer

Translated title

WordAdjust: A Deobfuscation Frontend to Content-Aware Anti-Spam Tools

Term

2nd term

Publication year

2008

Pages

21

Abstract

Spam is a major problem for email systems, and spammers keep inventing ways to hide messages from filters. This work focuses on three such tricks: Unicode look-alike characters, scrambled words, and intentional misspellings. We present WordAdjust, a deobfuscation front end that runs as a pre-filter to the widely used spam filter SpamAssassin and restores obfuscated text to its normal form before SpamAssassin analyzes the message. WordAdjust reverses all three types of obfuscation so that hidden words become recognizable again. The misspelling module uses n-gram fuzzy search, which compares short letter sequences to recognize words even when some letters are missing or wrong. Because SpamAssassin struggles with obfuscated messages, this preprocessing improves detection. In experiments with intentionally obfuscated spam, WordAdjust increased the average SpamAssassin score by 56%, making spam more likely to be flagged. Additional tests showed that, on average, WordAdjust enabled SpamAssassin to catch 10% of the spam that would otherwise have reached users' inboxes. These results suggest that deobfuscation as a preprocessing step can strengthen existing anti-spam systems.
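
The three deobfuscation steps named in the abstract can be illustrated with a minimal Python sketch. This is not the thesis implementation: the look-alike table, the vocabulary, the function names, the assumption that scrambling keeps the first and last letter in place, the trigram size, the Dice coefficient as similarity measure, and the 0.5 threshold are all hypothetical choices made for illustration only.

```python
import unicodedata

# Hypothetical look-alike table (Cyrillic -> Latin); a real table would be far larger.
HOMOGLYPHS = {
    "\u0430": "a", "\u0435": "e", "\u043e": "o",
    "\u0440": "p", "\u0441": "c", "\u0456": "i",
}

# Hypothetical vocabulary of words a spam filter would want to recognize.
VOCAB = ["viagra", "mortgage", "pharmacy", "discount"]


def fold_lookalikes(word):
    """Map look-alike characters to plain letters and strip accents."""
    mapped = "".join(HOMOGLYPHS.get(ch, ch) for ch in word.lower())
    decomposed = unicodedata.normalize("NFKD", mapped)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def scramble_key(word):
    """Signature that ignores the order of the inner letters (assumed scrambling model)."""
    if len(word) < 4:
        return word
    return word[0] + "".join(sorted(word[1:-1])) + word[-1]


def unscramble(word, vocab):
    """Return the first vocabulary word with the same scramble signature, if any."""
    key = scramble_key(word)
    for candidate in vocab:
        if scramble_key(candidate) == key:
            return candidate
    return None


def ngrams(word, n=3):
    """Set of character n-grams, padded so word boundaries count as well."""
    padded = "$" * (n - 1) + word + "$" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice(a, b):
    """Dice coefficient over the two words' trigram sets (0.0 to 1.0)."""
    ga, gb = ngrams(a), ngrams(b)
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def fuzzy_lookup(word, vocab, threshold=0.5):
    """Return the most similar vocabulary word if it clears the threshold."""
    best = max(vocab, key=lambda candidate: dice(word, candidate))
    return best if dice(word, best) >= threshold else None


def deobfuscate(word, vocab=VOCAB):
    """Apply the three steps in turn; fall back to the folded word."""
    folded = fold_lookalikes(word)
    if folded in vocab:
        return folded
    return unscramble(folded, vocab) or fuzzy_lookup(folded, vocab) or folded
```

In a pre-filter of this kind, each token of a message would presumably be passed through something like `deobfuscate` and the rewritten text handed to the downstream filter. Words that match nothing in the vocabulary are returned unchanged apart from the look-alike folding, so ordinary text is left mostly alone.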

[This abstract was generated with the help of AI]