E-mail Categorization

Student thesis: Master's thesis and HD graduation project

  • Damian Bukszynski
4th semester, Software Construction, Master's (Master's Programme)
E-mail has become the most important medium of communication between individuals, companies and other organizations, and it is now embedded in almost every aspect of everyday activity. E-mails are more than simple text messages: they can also carry various kinds of attachments. They can be archived to form a powerful, non-volatile source of knowledge, and in some cases they can even constitute clear evidence in trials.
Keeping mailboxes in a structured form is a challenging task. When the volume of incoming and outgoing correspondence is low, the task is relatively easy, but as the volume increases the problem becomes more complicated and its solutions more time-consuming. The process can be improved in a few ways. Most mailboxes offer helper options, such as Journaling, that address and automate it at a basic level. However, achieving a really good level of organization requires external tools such as machine learning methods.
In this project, four machine learning algorithms were implemented and their performance examined. Two additional methods, based on combining the results of the single classifiers, were also implemented. Finally, all methods were compared and the one that gives the greatest improvement to the e-mail classification process was chosen.
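The abstract does not say how the results of the single classifiers are combined. As a minimal sketch of one common approach, assuming plain majority voting (an assumption, not necessarily the method used in the thesis), the combination step could look like this. The function name `majority_vote` and the folder labels are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine the labels predicted by several classifiers for one e-mail.

    `predictions` is a non-empty list with one label per classifier.
    The label with the most votes wins; on a tie, the label that
    appears earliest in the list is returned.
    NOTE: majority voting is an illustrative assumption here.
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Hypothetical folder labels from four single classifiers for one message.
combined = majority_vote(["work", "personal", "work", "archive"])
```

Resolving ties in favour of the earliest classifier is only one option; weighting each vote by the classifier's validation score would be another.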
This master thesis uses a part of the Enron e-mail collection [1] for the training and testing phases. The best result was achieved by a combination of single classifiers, with an F-measure of 0.7102.
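For readers unfamiliar with the F-measure, the sketch below computes it as the weighted harmonic mean of precision and recall from true-positive, false-positive and false-negative counts. The abstract does not state whether the reported value is averaged per folder or over all messages, so this shows only the basic per-class definition; the function name `f_measure` is hypothetical and the counts in the usage line are illustrative, not taken from the thesis.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """Return the F-measure for one class.

    tp, fp, fn: counts of true positives, false positives and
    false negatives. beta=1.0 gives the balanced F1 score.
    Returns 0.0 when there are no true positives.
    """
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative counts only (not results from the thesis).
score = f_measure(tp=8, fp=2, fn=2)
```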
The topics elaborated in the thesis, in both the text and the software part, give the reader substantial knowledge of Information Retrieval, Machine Learning and related topics.
Publication date: 7 Nov. 2013
Number of pages: 78
ID: 81933275