bogofilter-milter.pl — A Sendmail::Milter Milter for bogofilter

[View this page in Romanian, courtesy of azoft.]

Introduction

(Last updated 2012/05/02 19:23:42.)

Bogofilter is a mail filter that classifies mail as spam or ham (non-spam) by a statistical analysis of the message's header and content (body). The program is able to learn from the user's classifications and corrections.

Bogofilter can be used in three different ways:

  1. it can be integrated into the user's email client (e.g., Evolution);
  2. it can be integrated into the user's delivery agent (e.g., procmail); or
  3. it can be integrated into the mail transfer agent, a.k.a. MTA (e.g., sendmail, Postfix).

To make it easier to integrate bogofilter and other applications into MTAs, several of them, including sendmail and Postfix, implement support for mail filters called Milters. That's where bogofilter-milter.pl, a bogofilter Milter, comes into play. If you would like to do bogofilter spam filtering in the MTA, so that spam can be rejected before your mail server even tries to deliver it to a specific user, then you've come to the right place.

If, on the other hand, you don't understand anything written above, then you should probably go elsewhere. :-)

Installation and configuration

  1. Download the script here and save it somewhere on your mail server without the ".txt" extension.
  2. Run "perl -c" on it to confirm that you've got all the necessary Perl modules installed. If not, install the module that Perl reports is missing. Repeat until "perl -c" returns no errors.
  3. Search for "BEGIN CONFIGURABLE SETTINGS" in the script, carefully read through everything up until "END CONFIGURABLE SETTINGS", and modify the settings as appropriate for your site.
  4. Run "perl -c" on the script again to confirm that you didn't make a syntax error while editing settings.
  5. Set it up to be started automatically when your server reboots. For example, here's a script for Fedora and similar systems, e.g., RHEL and CentOS, and here's one (thanks to Tom Anderson) for Gentoo.
  6. Start it up using the init script you just installed.
  7. Tell your MTA to start using it. See, for example, documentation for sendmail and Postfix.
  8. Test, test, test! It's pretty much "set and forget" once it's working, but make sure it's working before you set and forget it!
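
Steps 2 and 4 can be partly scripted. The following is a minimal sketch, not part of the distribution. It assumes Perl's usual "Can't locate Foo/Bar.pm in @INC" error message, and the function names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers (not part of bogofilter-milter.pl).

# Read "perl -c" output on stdin and print the first missing module name,
# converting the path form (Foo/Bar.pm) to the module form (Foo::Bar).
missing_module() {
    sed -n "s/^Can't locate \([^ ]*\)\.pm in @INC.*/\1/p" | sed 's#/#::#g' | head -n 1
}

# Syntax-check a Perl script and print the missing module, if any.
# Prints nothing when no module is reported missing.
check_script() {
    perl -c "$1" 2>&1 | missing_module
}
```

Running "check_script ./bogofilter-milter.pl" after each module installation, until it prints nothing, mirrors the "repeat until no errors" loop in step 2. Other kinds of errors still need a plain "perl -c" run to be seen.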

Training

Bogofilter learns from each new message it sees. That is, when it sees a message that it thinks is spam, it learns that messages that look a lot like that one should also be considered spam. This is the essence of how bogofilter works.

When bogofilter makes a mistake (i.e., decides that something is spam that actually isn't, or vice versa), it needs to be told that it made a mistake, so that it won't make similar mistakes in the future. Furthermore, when bogofilter is unsure whether a particular message is spam, you need to tell it so that it'll have a better chance of being able to figure it out next time.
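
On the command line, a correction amounts to piping the message into bogofilter with the right pair of registration flags: -S and -N unregister a message as spam or ham, and -s and -n register it. Here is a minimal sketch, assuming the message is available as a single file and was previously registered in the wrong class. The function names are hypothetical.

```shell
#!/bin/sh
# Print the bogofilter flags that move a misclassified message to the
# given (correct) class.
correction_flags() {
    case "$1" in
        ham)  echo "-Sn" ;;   # wrongly spam: unregister as spam, register as ham
        spam) echo "-Ns" ;;   # wrongly ham: unregister as ham, register as spam
        *)    echo "usage: correction_flags ham|spam" >&2; return 1 ;;
    esac
}

# Retrain bogofilter with one message file as the given class.
# usage: retrain ham|spam MESSAGE_FILE
retrain() {
    flags=$(correction_flags "$1") || return 1
    bogofilter "$flags" < "$2"
}
```

For a message that bogofilter was merely unsure about, there is nothing to unregister, so plain "bogofilter -s" or "bogofilter -n" would be the corresponding call.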

When bogofilter is integrated into your email client, training bogofilter about what is and isn't spam is easy — you use commands in your email client to tell bogofilter whether a particular message is spam or not, and then the email client tells bogofilter to retrain the message as necessary based on your instructions. However, bogofilter-milter.pl has no such built-in training functionality, so you need to roll your own.

Here's how I do this. Feel free to use my method as-is, adapt it to your own tastes, or do something completely different. Understand, however, that if you do nothing, bogofilter will not work.

  1. I have four special IMAP folders called "bogotrain", "despam", "maybespam", and "spamtrain".
  2. I have a procmail recipe that redirects unsure messages into my "maybespam" folder so that I can classify them at my leisure and they don't clutter up my inbox:
    :0
    * ^X-Bogosity: (unsure|spam)
    { FOLDER=user.$LOGNAME.maybespam }
    
    ...
    
    :0 w : $MAILDIR/cyrus$LOCKEXT
    |formail -I "From " |/usr/lib/cyrus-imapd/deliver -a $LOGNAME -m $FOLDER -q
    
    Obviously, you'll have to do this slightly differently if you use something other than procmail and Cyrus imapd.
  3. I have bogofilter-milter.pl configured to save copies of both spam and ham messages. That is, I have symbolic links called "archive" and "ham_archive", the names configured for $archive_mbox and $ham_archive_mbox, in my .bogofilter directory, pointing at where I would like the copies to be archived. I also run logrotate once per hour to rotate these archives when they get too large and to save about a month's worth of old ham archives and about a week's worth of old spam archives (anything more than that would simply take up too much disk space!).
  4. When I see a spam message in my inbox or maybespam that made it through bogofilter, I move it to either bogotrain or spamtrain, depending on whether I want to automatically submit it to SpamCop in addition to retraining bogofilter.
  5. When I see a ham message in my maybespam, I put a copy of it in despam and move the original into my inbox.
  6. When I discover that bogofilter has falsely classified a ham message as spam, I find the incorrectly classified message in my spam mbox archive, pipe it into "bogofilter -Sn" to reclassify it as ham, and then remove it from the spam mbox and add it to the ham mbox. I'm OK with this particular task having to be done manually because it happens quite infrequently and I don't want a huge spam folder in my IMAP account.
    However, every once in a while I decide that I need to retrain bogofilter from scratch, i.e., remove my word list and let bogofilter start building a new one from new incoming email. When I'm doing this, bogofilter makes a lot of mistakes for a day or two, so I enable training mode in the Milter (which causes spam messages to be delivered to my inbox instead of being rejected), create an "isspam" folder in my IMAP inbox, and tell procmail to put messages there that bogofilter has classified as spam. Several times a day I go through this folder, delete the messages that are in fact spam, copy the non-spam messages into "despam", and then move the originals into my inbox.
  7. Now comes the magic... My spamtrain script is invoked once per hour out of my crontab to do retraining automatically. It reads my bogospam, spamtrain and despam folders, and for each message in them, it:
    1. finds the corresponding message in the ham or spam archive mbox;
    2. feeds it into bogofilter to retrain as necessary;
    3. moves it into the other archive mbox;
    4. (optionally) forwards it to SpamCop; and
    5. deletes the message from the IMAP folder.
    Because I have three different classification folders with different purposes, my script is invoked three times each hour: once with no arguments, once with "--mailbox bogospam --nospamcop", and once with "--mailbox despam --nospamcop --despam".
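
The three hourly invocations can be collected in one wrapper for cron to run. This is a sketch only: it assumes spamtrain is on the PATH, and the SPAMTRAIN variable and the function name are hypothetical. The options are the ones listed above.

```shell
#!/bin/sh
# Hypothetical wrapper for the three hourly spamtrain passes.
# SPAMTRAIN may be overridden to point at the script's real location.
SPAMTRAIN=${SPAMTRAIN:-spamtrain}

run_spamtrain_passes() {
    # No arguments: presumably the "spamtrain" folder, with SpamCop forwarding.
    "$SPAMTRAIN" || return 1
    # Spam that should be retrained but not forwarded to SpamCop.
    "$SPAMTRAIN" --mailbox bogospam --nospamcop || return 1
    # Ham that should be retrained as non-spam.
    "$SPAMTRAIN" --mailbox despam --nospamcop --despam
}
```

Calling run_spamtrain_passes at the end of the wrapper and scheduling the wrapper once per hour in the crontab gives the same behavior as three separate crontab entries, with the added property that a failing pass stops the later ones.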

You may wonder why I keep archive mboxes of ham and spam. Theoretically, I could take each message directly from the IMAP folder and feed it into bogofilter for retraining. I don't do this for two reasons:

  1. The messages that sendmail delivers into my IMAP account are actually not entirely identical to the messages that are seen by the Milter. For example, sendmail adds a new "Received" header before delivery, makes formatting changes to some header fields, sometimes even makes changes to the body of the message, etc. This is especially true now that I've started using the Milter's "milter-filter-script" functionality to reformat incoming messages with spamitarium before running them through bogofilter. Keeping ham and spam archives improves bogofilter's accuracy by ensuring that bogofilter is always retrained with messages that are identical to what it would see when called by the Milter.
  2. I use my ham and spam archives to retrain bogofilter when its accuracy starts to suffer.
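
Matching an IMAP copy to its archived original relies on the "milter_id" value that the Milter adds to the X-Bogosity line (see the ChangeLog). The exact header layout is not shown on this page, so the sketch below assumes a "milter_id=VALUE" token in the key=value style of bogofilter's standard X-Bogosity header. The function name is hypothetical.

```shell
#!/bin/sh
# Print the milter_id found on the first X-Bogosity header line of a
# message read from stdin. Assumes a "milter_id=VALUE" token (an assumption,
# not confirmed by the Milter's documentation here). Stops at the end of the
# header block, so body text is never matched; folded (continued) header
# lines are not handled.
extract_milter_id() {
    sed -n '/^$/q; /^X-Bogosity:/{s/.*milter_id=\([^,[:space:]]*\).*/\1/p;q;}'
}
```

Given the identifier, a retraining tool can search the archive mbox for the message carrying the same value instead of comparing message contents.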

Retraining

Bogofilter has a number of configuration parameters that can be tweaked to alter its behavior. The optimal values for these parameters vary over time and from person to person, because spammers are constantly changing the content of their messages to evade filters and because every person has a slightly different definition of what constitutes spam and ham.

You can tweak these configuration parameters by hand to try to make bogofilter work better, but it's easier to let bogotune figure out the optimal settings, and you'll probably get better results.

I use a script called my-bogotrain to do this. It uncompresses and concatenates my ham and spam mbox archives into separate files in /tmp, runs bogofilter over them to check for misclassified messages, and then runs bogotune on the files when they're clean. Bogotune spits out the recommended parameters when it's done, and I copy them into ~/.bogofilter.cf. I usually only bother to do this when I notice that spam filtering isn't working so well.
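
The overall flow can be sketched as follows. This is not the my-bogotrain script itself: the archive name prefixes, the work directory name, and the function names are assumptions for illustration. "gzip -cdf" is used so that rotated (compressed) and current (uncompressed) archives can be concatenated in one pass, and bogotune's -n and -s options name the ham and spam files.

```shell
#!/bin/sh
# Concatenate mbox archives, compressed or not, into one output file.
# usage: concat_archives OUTPUT FILE...
concat_archives() {
    out=$1; shift
    # -c: write to stdout, -d: decompress, -f: copy non-gzip files unchanged
    gzip -cdf "$@" > "$out"
}

# Build ham and spam corpora in /tmp and let bogotune suggest parameters.
# usage: tune HAM_ARCHIVE_PREFIX SPAM_ARCHIVE_PREFIX
tune() {
    work=$(mktemp -d /tmp/bogotrain.XXXXXX) || return 1
    concat_archives "$work/ham.mbox"  "$1"* || return 1
    concat_archives "$work/spam.mbox" "$2"* || return 1
    # The check for misclassified messages would go here, before tuning.
    bogotune -n "$work/ham.mbox" -s "$work/spam.mbox"
}
```

A call such as "tune ~/.bogofilter/ham_archive ~/.bogofilter/archive" would pick up the current archives and any rotated copies that share those prefixes, assuming logrotate keeps them alongside the originals.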

See the comments at the top of my-bogotrain for more information about how to use it.

Spamitarium

Tom Anderson wrote a neat little script, spamitarium, whose purpose is to preprocess incoming email before it gets fed into bogofilter to decrease "noise" in the email and improve bogofilter's effectiveness. Tom's version of his script can be found here.

I wanted to use spamitarium as a filter in bogofilter-milter.pl (see the documentation of $filter_script in the configuration section of the Milter), but I found a few problems in doing so, which I've fixed.

I've submitted all of my fixes back to Tom, and I hope that at some point he will merge my changes or equivalent ones back into his version, but in the meantime I'm posting my version here for people to use if they so choose.

You can find my version, with a comment at the top indicating what I've changed, here. You can find my bogofilter-milter.pl filter script which calls spamitarium, i.e., the script I put in ~/.bogofilter/milter-filter-script, here.

Seeing it all in action

You can see some neat graphs showing the performance of my bogofilter installation on my home page.

ChangeLog

Date        Description
2010-04-12  New version of "spamtrain" script which supports a "--redeliver" option for causing messages to be redelivered after they are processed. This is useful, e.g., if you've just despammed a message and you want it to go through your .procmailrc.
2010-04-08  Update "Training" section to discuss using an "isspam" folder during training to make it easier to correct bogofilter when it misclassifies ham as spam.
2010-04-07  New version (1.77): Messages should be archived in $archive_mbox and $ham_archive_mbox even when in training mode. This gives the user complete control over the behavior, since s/he can create or delete the archive files along with creating or deleting $training_file.
2010-04-07  New version (1.76):
  1. Add support for passing various important information from the Milter to the filter script through environment variables. See the $filter_script documentation in the configuration section for more information.
  2. Post my version of spamitarium.pl and its wrapper script.
2010-04-07
  1. Published this page, including releasing my spamtrain and my-bogotrain scripts for the first time.
  2. New version of bogofilter-milter.pl (1.74) that supports feeding messages through an external filter before feeding them to bogofilter. Search for $filter_script in the configuration section for more information.
  3. bogofilter-milter.pl now adds a unique identifier to each message by adding a variable called "milter_id" to the X-Bogosity line. This is extremely useful for automated retraining tools such as my spamtrain script documented above, which need to be able to match up a message in an IMAP folder with the same message in an mbox archive created by the Milter, when the MTA may have changed the copy in the IMAP folder such that it is no longer identical to the copy in the mbox archive.
  4. bogofilter-milter.pl now supports per-user Subject filters, i.e., user-specific regular expressions which are matched against incoming messages to detect messages that should not be filtered. Search for $subject_filter_file in the configuration section for more information.