Forget the Rules

Matt Haughey mentions that SpamAssassin no longer appears to be working effectively against spam. Why? Because SpamAssassin is rules-based. Not only that, but its rules list is on display for spammers to analyze. Just like in any game, if you show the other team your playbook or run the same plays over and over, they'll know your game.

Statistical probability filtering, or Bayes filtering, is the only way to effectively block spam. When working at SonicBoomerang filtering news and opinion postings, we quickly found that rules have inherent error due to human bias. At first we thought that classifying news and opinion was a matter of identifying simple rules. Try as you might, you can't come up with a rules-based system that is as effective as statistical analysis based on a large training set.

In my experience, 2,000 spam messages are an effective training set to start with. Did your filter get a false negative or false positive? No problem. Add the misclassified message to the training set under its correct label, and your filter just got smarter without the introduction of bias.
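
To make the retraining idea concrete, here is a minimal sketch of a naive Bayes classifier in Python. It is only an illustration under simplifying assumptions (a crude tokenizer, add-one smoothing, and a hypothetical class name, NaiveBayesFilter), not the filter used at SonicBoomerang or the one in any particular product. A real filter would be trained on thousands of messages, not four.

```python
import math
import re
from collections import Counter


class NaiveBayesFilter:
    """Toy multinomial naive Bayes spam filter (illustrative sketch only)."""

    LABELS = ("spam", "ham")

    def __init__(self):
        # Per-label word frequencies and message totals.
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.message_counts = {label: 0 for label in self.LABELS}

    @staticmethod
    def tokenize(text):
        # Crude tokenizer (an assumption): lowercase runs of letters, digits, $ and '.
        return re.findall(r"[a-z0-9$']+", text.lower())

    def train(self, text, label):
        # Training is just counting, so correcting a mistake is one more call.
        self.message_counts[label] += 1
        self.word_counts[label].update(self.tokenize(text))

    def spam_probability(self, text):
        tokens = self.tokenize(text)
        total_messages = sum(self.message_counts.values())
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        log_score = {}
        for label in self.LABELS:
            # Smoothed prior: how common this label is in the training set.
            score = math.log((self.message_counts[label] + 1) / (total_messages + 2))
            total_words = sum(self.word_counts[label].values())
            for word in tokens:
                # Add-one smoothing so unseen words never give a zero probability.
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_score[label] = score
        # Turn the two log scores into P(spam | message); clamp to avoid overflow.
        diff = max(min(log_score["ham"] - log_score["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))


spam_filter = NaiveBayesFilter()
spam_filter.train("buy cheap pills now", "spam")
spam_filter.train("cheap watches buy now", "spam")
spam_filter.train("meeting notes for monday", "ham")
spam_filter.train("lunch on monday with the team", "ham")

missed = "team lunch offer"                # suppose this was really spam
before = spam_filter.spam_probability(missed)
spam_filter.train(missed, "spam")          # add the error to the training set
after = spam_filter.spam_probability(missed)
print(f"before retraining: {before:.2f}, after: {after:.2f}")
```

Notice there is no rule anywhere in the code saying which words are bad. The only way to change the filter's behavior is to give it another labeled message, which is the point of the approach.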

I attempted to find a server-side Bayes filter, but couldn't install the one I found without numerous errors. I'll try again shortly and hopefully come up with a web interface so all the users on my server can have this at their disposal.