E-mail Categorization

Author

Bukszynski, Damian

Term

4th term

Education

Software Development, Master

Publication year

2013

Submitted on

2013-11-07

Abstract

Email has become a central medium for personal and professional communication. Messages carry not only text but also attachments, can be archived, and may even serve as evidence in some contexts. Keeping inboxes well organized is challenging, especially as message volume grows. Basic features such as journaling can offer limited help, but achieving a high level of organization calls for more advanced tools like machine learning. This thesis implements four machine learning algorithms to classify (i.e., automatically categorize) emails and evaluates their performance. It also develops and tests two methods that combine the outputs of individual classifiers. All methods are compared, and the one that delivers the best improvement to email classification is identified. A subset of the Enron email collection is used for training and testing. The best result comes from combining multiple classifiers, achieving an F-measure of 0.7102. The F-measure is a standard quality metric that balances precision and recall. The work provides readers with insights into information retrieval, machine learning, and related topics, both conceptually and through the accompanying software.
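As a rough illustration of the F-measure mentioned in the abstract, the Python sketch below computes it from counts of correct and incorrect category assignments. The function name, the `beta` parameter (defaulting to the balanced F1 variant), and the example counts are assumptions made for illustration only; the thesis may weight or average the measure differently, for instance across categories.

```python
def f_measure(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-measure from true positives, false positives, and false negatives.

    beta = 1.0 weights precision and recall equally (the F1 score).
    """
    if tp == 0:
        # No correct assignments: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)  # share of assigned emails that were correct
    recall = tp / (tp + fn)     # share of relevant emails that were found
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts for a single email category (not taken from the thesis).
print(round(f_measure(tp=8, fp=2, fn=4), 4))  # prints 0.7273
```

With these made-up counts, precision is 0.8 and recall is about 0.667, so the balanced score falls between the two but closer to the lower value, which is the behaviour that makes the measure useful for comparing classifiers.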


[This abstract has been rewritten with the help of AI based on the project's original abstract]

Keywords

categorization; machine learning; information retrieval

A master's thesis from Aalborg University
