POPFile Automatic Email Sorting using Naive Bayes

Welcome to the home of POPFile, the open source POP3 proxy that classifies and sorts email using Naive Bayes.

History

The Reverend Thomas Bayes lived in the English town of Tunbridge Wells in the 18th century.

Three years after his death in 1761, the Philosophical Transactions of the Royal Society of London published his paper "An Essay towards solving a Problem in the Doctrine of Chances" [PDF], which laid out the underpinnings of what would become known as Bayesian Statistics.

The most famous result from the paper is Bayes' Theorem, which shows how to calculate the probability of one event given that you know some other event has occurred. Algebraically that is:

P(A|B) = P(A) * P(B|A) / P(B)
In words: the probability of A occurring given that B has occurred, P(A|B), is the probability of A occurring, P(A), times the probability of B occurring if A has occurred, P(B|A), divided by the probability of B occurring, P(B).
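To make the formula concrete, here is a minimal sketch in Python that plugs numbers into it. The function name posterior and the figures are invented purely for illustration: assume 40% of your mail is spam, the word "free" appears in 25% of spam messages, and "free" appears in 12.5% of all messages.

```python
def posterior(prior_a, likelihood_b_given_a, evidence_b):
    """Bayes' Theorem: P(A|B) = P(A) * P(B|A) / P(B)."""
    if evidence_b <= 0:
        raise ValueError("P(B) must be greater than zero")
    return prior_a * likelihood_b_given_a / evidence_b


# Illustrative numbers only (not taken from any real mailbox):
p_spam = 0.4              # P(A): a message is spam
p_free_given_spam = 0.25  # P(B|A): "free" appears in a spam message
p_free = 0.125            # P(B): "free" appears in any message

p_spam_given_free = posterior(p_spam, p_free_given_spam, p_free)
print(f"{p_spam_given_free:.2f}")  # 0.80
```

So under these made-up figures, seeing the word "free" raises the probability that a message is spam from 40% to 80%.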

No doubt he had time to think this up because he wasn't spending all day sorting 200 emails into appropriate categories and deleting spam...

Meanwhile in the 21st century

Luckily, Bayes' 300-year-old idea has a direct application to email sorting and text classification in general.

Imagine that you have three folders you'd like to sort email into: work, personal and spam (POPFile calls these folders 'buckets'). Setting up an email client to sort the mail ranges from hard (for work, where you'd have to tell it about everyone in your company) to impossible (for spam, since spammers keep changing their emails to evade filtering).

Bayes' Theorem gives POPFile a way to calculate the probability that an email is work, personal or spam by calculating P(work|E), P(personal|E), and P(spam|E), where E is the new email and P(work|E) is the probability of email E being a 'work' email, and so on. By choosing the largest of the three probabilities, POPFile can automatically pick the appropriate folder. POPFile calculates these probabilities by looking at the frequency with which words occur in each folder and applying Bayes' Theorem.
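The idea can be sketched in a few lines of Python. This is a toy illustration of a Naive Bayes text classifier, not POPFile's actual code (POPFile is written in Perl): the names train and classify, the whitespace tokenizer, the add-one smoothing for unseen words, and the training messages are all assumptions made for the example. Because P(E) is the same for every bucket, it can be dropped when only the largest probability matters, and logarithms are used so that many small probabilities can be added rather than multiplied.

```python
import math
from collections import Counter


def train(examples):
    """Count words per bucket from (bucket, text) pairs."""
    word_counts = {}            # bucket -> Counter of words seen in it
    message_counts = Counter()  # bucket -> number of messages seen
    for bucket, text in examples:
        word_counts.setdefault(bucket, Counter()).update(text.lower().split())
        message_counts[bucket] += 1
    return word_counts, message_counts


def classify(text, word_counts, message_counts):
    """Return the bucket with the largest log P(bucket) + sum of log P(word|bucket)."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_messages = sum(message_counts.values())
    scores = {}
    for bucket, counts in word_counts.items():
        # Prior: fraction of training messages that landed in this bucket.
        score = math.log(message_counts[bucket] / total_messages)
        bucket_total = sum(counts.values())
        for word in text.lower().split():
            # Add-one smoothing (an assumption) keeps unseen words from
            # driving the probability to zero.
            score += math.log((counts[word] + 1) / (bucket_total + len(vocabulary)))
        scores[bucket] = score
    return max(scores, key=scores.get)


# Made-up training data for illustration.
examples = [
    ("work", "meeting agenda for the project review"),
    ("work", "project deadline and budget report"),
    ("personal", "dinner with family on saturday"),
    ("personal", "happy birthday see you saturday"),
    ("spam", "cheap pills buy now"),
    ("spam", "win money now click here"),
]
word_counts, message_counts = train(examples)

print(classify("project budget meeting", word_counts, message_counts))  # work
print(classify("buy cheap pills now", word_counts, message_counts))     # spam
```

The more mail each bucket has seen, the better the word frequencies reflect what that kind of mail actually looks like.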

A complete description of how POPFile calculates the probability for each email can be found here.

Once POPFile has determined a folder for an email, it modifies the Subject: line of the email to include the folder name. For example, a mail with the header Subject: hello john that belongs in the personal folder would end up with the header Subject: [personal] hello john. POPFile also adds a new mail header called X-Text-Classification containing the folder name.
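The effect on a message can be illustrated with Python's standard email package. This is only a sketch of the transformation described above, not how POPFile itself does it; the function name tag_message and the sample message are hypothetical.

```python
from email import message_from_string


def tag_message(raw_message, bucket):
    """Prefix the Subject with [bucket] and add an X-Text-Classification header."""
    msg = message_from_string(raw_message)
    new_subject = f"[{bucket}] {msg.get('Subject', '')}".strip()
    if "Subject" in msg:
        # replace_header keeps the header in its original position.
        msg.replace_header("Subject", new_subject)
    else:
        msg["Subject"] = new_subject
    msg["X-Text-Classification"] = bucket
    return msg.as_string()


# Hypothetical message for illustration.
raw = "From: jane@example.com\nSubject: hello john\n\nSee you on Saturday.\n"
print(tag_message(raw, "personal"))
```

The printed message carries both Subject: [personal] hello john and X-Text-Classification: personal, with the body left untouched.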

Either header can be used to set up simple filters in almost any mail client (it's been tested with Eudora, Outlook, Outlook Express, Mozilla, Netscape, ...). Instead of tens of complex, half-working filters, you add one filter per folder, point your mail client at POPFile, and away it goes.
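Each mail client has its own way of defining filters, but the rule each one applies amounts to something like the following Python sketch. The function name folder_for and the fallback folder "inbox" are assumptions for illustration only.

```python
from email import message_from_string


def folder_for(raw_message, default="inbox"):
    """Pick a destination folder from the X-Text-Classification header."""
    msg = message_from_string(raw_message)
    bucket = msg.get("X-Text-Classification")
    # Messages that were never classified fall back to the default folder.
    return bucket.strip() if bucket else default


# Hypothetical classified message for illustration.
classified = (
    "X-Text-Classification: spam\n"
    "Subject: [spam] win money now\n"
    "\n"
    "Click here.\n"
)
print(folder_for(classified))  # spam
```

One rule of this shape per bucket replaces any hand-written rules about senders or keywords.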

Getting POPFile

POPFile is an open source project written in the Perl programming language and can be downloaded from here. Comments are very welcome in the Forums; bug reports should be placed in the Bug Database.

To install POPFile, follow the instructions in the manual.

POPFile is intended to be cross-platform (at least Windows, Unix and Macintosh). It is totally free and works with any mail client that uses the POP3 protocol.

Who wrote this?

POPFile was originally written by John Graham-Cumming.