Pierre Palatin's corner

Random posts about some stuff I’ve been doing.

Automatic detection of mails moved from/to the spam folder

Posted at — Jan 1, 2008

A script that lets you train your antispam filter simply by moving mail between IMAP folders.

This script works but is unmaintained. It has an issue that can degrade the automatic learning: it should not drop your mails, but it might produce bad classifications or nuke your dspam learning base.

I’m currently using dspam to filter my mail. However, as I’m using IMAP, spam filtering is done server-side. So, to report false negatives (FN) and false positives (FP), I cannot use the built-in features of my mail clients (I have several); I need to communicate with the server. Until recently, I was using the classic approach: when I got an FN or FP, I redirected the mail (with full headers) to a special address, which sent it to dspam, telling it that it was a misclassification.

The problem with this approach in practice is that to mark an FP/FN I need to both retransmit the mail and move it to the correct folder, which is redundant. Of course, most mail clients can help with that given some configuration, but still, that’s several operations where one should be enough. Moreover, in the case of an FN, it means sending a spam through SMTP, which can sometimes be a problem.

So, I’ve made a script which watches the content of the Spam folder and detects mails which are added and removed. This way, to mark an FN as spam, I just need to move it to the Spam folder: the script will detect that a mail has been added, and will re-train dspam with the signature of the email. For an FP, it’s the same thing: I just need to move the mail out of the Spam folder, and the script will detect that and call dspam with the signature of the moved email.
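The detection step described above boils down to a set difference between two scans of the Spam folder. Here is a minimal sketch, assuming each mail is identified by its dspam signature; the function names are made up for illustration and are not taken from the actual script:

```python
def diff_spam_folder(previous, current):
    """Return (added, removed) signatures between two scans of the Spam folder."""
    added = current - previous
    removed = previous - current
    return added, removed


def plan_retraining(previous, current):
    """Map folder changes to the dspam class each mail should be re-trained as.

    A mail that appeared in Spam is treated as a false negative -> "spam".
    A mail that left Spam is treated as a false positive -> "innocent".
    Note: this sketch cannot tell a mail moved out of Spam from a mail
    that was simply deleted; both show up as "removed".
    """
    added, removed = diff_spam_folder(previous, current)
    plan = [(signature, "spam") for signature in sorted(added)]
    plan += [(signature, "innocent") for signature in sorted(removed)]
    return plan


if __name__ == "__main__":
    before = {"sig-a", "sig-b"}
    after = {"sig-b", "sig-c"}
    for signature, dspam_class in plan_retraining(before, after):
        print(signature, "->", dspam_class)
```

The two sets are plain Python sets of signature strings; where they come from (a database, a file, a directory listing) is left open here.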

The script is a single-file Python script.

It works with Maildir-style mailboxes, dspam and a MySQL database. However, the principle is simple and can easily be adapted. The implementation is currently really dumb and could be improved (especially resource-wise, for the regular scan), but it works.

The principle of the script is to scan the directory regularly to look for missing and added mails. The script must be plugged into the delivery system too (procmail in my case), so that it does not try to re-learn a spam that was already classified as spam at delivery.
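To make the scan and the delivery hook more concrete, here is a rough sketch. It assumes that dspam stores its signature in an `X-DSPAM-Signature` header of each delivered mail, and it keeps the list of known mails in an in-memory set instead of the MySQL database; `scan_signatures` and `mark_delivered` are hypothetical names, not the ones used in the script:

```python
import os
from email.parser import BytesHeaderParser

# Assumption: dspam is configured to add its signature as a mail header.
SIGNATURE_HEADER = "X-DSPAM-Signature"


def scan_signatures(maildir):
    """Collect the dspam signatures of all mails in a Maildir folder."""
    parser = BytesHeaderParser()
    signatures = set()
    for sub in ("new", "cur"):  # "tmp" only holds mails still being delivered
        folder = os.path.join(maildir, sub)
        if not os.path.isdir(folder):
            continue
        for name in os.listdir(folder):
            with open(os.path.join(folder, name), "rb") as fh:
                headers = parser.parse(fh)  # reads the headers, skips the body
            signature = headers.get(SIGNATURE_HEADER)
            if signature:
                signatures.add(str(signature).strip())
    return signatures


def mark_delivered(known, signature):
    """Delivery hook: record a mail that was delivered straight to Spam.

    Called at delivery time, so that the next scan does not see this mail
    as "added" and does not re-train dspam on something it already
    classified as spam.
    """
    known.add(signature)
```

With this, the mails to re-train as spam after a scan would be `scan_signatures(path) - known`, where `known` already contains everything the delivery hook registered.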

How to use it:

Configuration is done. As long as you’re in dry-run mode, you can watch the effect of the script by moving mail in and out of the Spam folder. Typically, moving a spam out and then back in (don’t forget to wait for the cron scan between operations) will produce log lines like these:

INFO 2008-11-09 11:40:09,710 [dryrun] Classify command: /usr/bin/dspam --signature=4916ba3b179033708835974 --class=innocent --source=error --client --user pierre
INFO 2008-11-09 11:50:09,338 [dryrun] Classify command: /usr/bin/dspam --signature=4916ba3b179033708835974 --class=spam --source=error --client --user pierre

Once you think that you’re all set (i.e. you’ve configured the above and at least one scan has fully completed), you can set DRY_RUN to False and enjoy a simple way to mark FP and FN in IMAP :)
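For reference, the dry-run behaviour can be pictured roughly like this. The dspam options are the ones visible in the log lines above; the function names and the structure are only an illustration of the idea, not the script's actual code:

```python
import logging
import subprocess

DRY_RUN = True               # switch to False once a full scan has been done
DSPAM = "/usr/bin/dspam"     # path taken from the log lines above


def build_command(signature, dspam_class, user):
    """Build the dspam command that re-trains one mail by signature."""
    return [
        DSPAM,
        "--signature=%s" % signature,
        "--class=%s" % dspam_class,   # "spam" or "innocent"
        "--source=error",
        "--client",
        "--user", user,
    ]


def classify(signature, dspam_class, user, dry_run=DRY_RUN):
    """Log the command in dry-run mode, otherwise actually run it."""
    command = build_command(signature, dspam_class, user)
    if dry_run:
        logging.info("[dryrun] Classify command: %s", " ".join(command))
    else:
        subprocess.run(command, check=True)
    return command
```

Keeping the command construction separate from its execution makes it possible to check in the logs what would be sent to dspam before letting the script touch the learning base.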