A script that lets you train your antispam filter simply by moving mail between IMAP folders.
This script works but is unmaintained. There is a known issue that can damage the automatic learning: it should not drop your mail, but it might cause bad classifications or nuke your dspam learning base.
I’m currently using dspam to filter my mail. However, as I’m using IMAP, spam filtering is done server-side. So, to report false negatives (FN) and false positives (FP), I cannot use the built-in features of my mail clients (I have several); I need to communicate with the server. Until recently, I was using the classic approach: when I got a FN or FP, I redirected the mail (with full headers) to a special address, which sent it to dspam, telling it that it was a misclassification.
The problem with this approach in practice is that to mark a FP/FN I need to both retransmit the mail and move it to the correct folder, which is redundant. Of course, most mail clients can help with that given some configuration, but that is still several operations where one should be enough. Moreover, in the case of a FN, it means sending spam through SMTP, which can sometimes be a problem.
So, I’ve made a script which watches the content of the Spam folder and detects mails that have been added or removed. This way, to mark a FN as spam, I just need to move it to the Spam folder: the script will detect that a mail has been added, and will retrain dspam with the signature of the email. For a FP, it’s the same thing: I just need to move the mail out of the Spam folder, and the script will detect that and call dspam with the signature of the moved email.
The script is a single-file Python script.
It works with Maildir-style mailboxes, dspam and a MySQL database. However, the principle is simple and can easily be adapted. The implementation is currently really dumb and could be improved (especially resource-wise, for the regular scan), but it works.
The principle of the script is to scan the directory regularly, looking for missing and added mails. The script must also be hooked into the delivery system (procmail in my case) to avoid trying to re-learn a spam that dspam has already classified as spam.
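The scan described above boils down to a set difference. Here is a minimal sketch, not the actual code of the script: it assumes each mail carries its dspam signature in an X-DSPAM-Signature header, and it keeps the previously seen signatures in a plain Python set instead of the MySQL table. The names folder_signatures and diff_scan are made up for the illustration.

```python
import mailbox

# Assumption: dspam stores the signature in this header on delivery.
SIGNATURE_HEADER = "X-DSPAM-Signature"


def folder_signatures(maildir_path):
    """Return the set of dspam signatures found in a Maildir folder."""
    box = mailbox.Maildir(maildir_path, create=False)
    signatures = set()
    for message in box:
        signature = message.get(SIGNATURE_HEADER)
        if signature:
            signatures.add(str(signature).strip())
    return signatures


def diff_scan(known, present):
    """Compare the last known state of the Spam folder with the current one.

    added   -> mails moved into the folder (false negatives, retrain as spam)
    removed -> mails moved out of the folder (false positives, retrain as innocent)
    """
    added = present - known
    removed = known - present
    return added, removed
```

After each scan, the stored state would simply be replaced by the set that was just read, so the next run only sees what changed in between.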
How to use it:
~/bin/dspam_auto.py init
*/10 * * * * $HOME/bin/dspam_auto.py update $HOME/Maildir/.Spam
This crontab line runs the scan every 10 minutes, which is probably more than enough (especially as the current version of the script is not really nice to the database :). Note that the first scan will detect all existing spam as FN, so double-check that DRY_RUN is True before screwing up your dspam.
The delivery side is then hooked into procmail:
# Spam filtering:
:0fw
| /usr/bin/dspam --stdout --deliver=spam,innocent --user pierre

# Tell the script for each detected spam
:0 ic
* ^X-DSPAM-Result: spam
| /home/pierre/dspam/dspam_auto.py push

# And deliver spam in the spam folder
:0:
* ^X-DSPAM-Result: spam
.Spam/
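To give an idea of what the push side has to do with the mail it receives from procmail, here is a rough sketch. It is an assumption about the mechanism, not the script's real code: the helper names are hypothetical, the signature is assumed to live in an X-DSPAM-Signature header, and a Python set stands in for the database of already-known spam.

```python
from email import message_from_bytes


def signature_from_raw(raw_message):
    """Extract the dspam signature from a raw mail, or None if absent.

    Assumption: the signature is carried in the X-DSPAM-Signature header.
    """
    message = message_from_bytes(raw_message)
    signature = message.get("X-DSPAM-Signature")
    return str(signature).strip() if signature else None


def push(known, raw_message):
    """Record a mail that dspam itself classified as spam.

    Adding its signature to the known set means the next scan of the
    Spam folder will not mistake it for a false negative.
    """
    signature = signature_from_raw(raw_message)
    if signature is not None:
        known.add(signature)
    return signature
```

With the recipe above, procmail pipes a copy of the message to the script, so the raw mail would typically come from sys.stdin.buffer.read().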
Note that the script is slightly racy, as calling the script and delivering the mail are not done atomically. However, as long as you don’t run the scan every 10 seconds it should not matter much, and the script recovers from previous mistakes anyway. The way to implement this with no race condition would be to do the delivery ourselves, but I prefer not to, for reliability reasons: if my script is screwed up, it won’t trash mails.
Configuration is done. As long as you’re in dry-run mode, you can watch the effect of the script by moving mail in and out of the Spam folder. Typically, moving a spam out then back in (don’t forget to wait for the cron scan between operations) will produce log lines like these:
INFO 2008-11-09 11:40:09,710 [dryrun] Classify command: /usr/bin/dspam --signature=4916ba3b179033708835974 --class=innocent --source=error --client --user pierre
INFO 2008-11-09 11:50:09,338 [dryrun] Classify command: /usr/bin/dspam --signature=4916ba3b179033708835974 --class=spam --source=error --client --user pierre
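For reference, commands like the ones in this log could be built and guarded by a dry-run flag roughly as follows. This is only a sketch: the dspam flags are copied from the log lines above, while the function names and the way DRY_RUN is passed around are my own assumptions.

```python
import logging
import subprocess

DRY_RUN = True  # keep True until at least one full scan has completed
DSPAM = "/usr/bin/dspam"


def classify_command(signature, as_spam, user):
    """Build the dspam retraining command for a misclassified mail."""
    return [
        DSPAM,
        "--signature=" + signature,
        "--class=" + ("spam" if as_spam else "innocent"),
        "--source=error",
        "--client",
        "--user", user,
    ]


def retrain(signature, as_spam, user, dry_run=DRY_RUN):
    """Log the command in dry-run mode, otherwise actually run it."""
    command = classify_command(signature, as_spam, user)
    if dry_run:
        logging.info("[dryrun] Classify command: %s", " ".join(command))
    else:
        subprocess.run(command, check=True)
    return command
```

Passing the command as a list avoids going through a shell, so a strange signature value cannot be interpreted as shell syntax.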
Once you think that you’re all set (i.e., you’ve configured the above and at least one scan has fully completed), you can set DRY_RUN to False and enjoy a simple way to mark FP and FN in IMAP :)