In the shower after my walk tonight, I was thinking about Google's PageRank and realized that spam is actually the opposite problem. The more people "paying attention" to a particular email message, the more likely it is to be spam. So, here's the idea: strip off the headers and create an MD5 hash of the body. Put that hash in an associative array as a key, with a count as its value. Every time someone sees the email, increment the count. Any message with a count over 1000 is likely spam (or a big mailing list). You could build this as a module in SpamAssassin and have a central clearinghouse that SpamAssassin queries. A test-and-increment function would increment the count and return it in a single call.
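Here is a minimal sketch of that idea in Python, not a real SpamAssassin module. The names `ClearingHouse`, `body_digest`, `test_and_increment`, and `looks_like_bulk` are hypothetical, the "central clearinghouse" is just an in-memory counter, and the header split assumes headers end at the first blank line.

```python
import hashlib
from collections import Counter

SPAM_THRESHOLD = 1000  # the cutoff suggested in the post


def body_digest(raw_message: str) -> str:
    """Strip the headers and return the MD5 hex digest of the body."""
    normalized = raw_message.replace("\r\n", "\n")
    # Assumption: headers end at the first blank line.
    _headers, separator, body = normalized.partition("\n\n")
    if not separator:
        # No blank line found, so treat the whole text as the body.
        body = normalized
    return hashlib.md5(body.encode("utf-8")).hexdigest()


class ClearingHouse:
    """Hypothetical in-memory stand-in for the central clearinghouse."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def test_and_increment(self, raw_message: str) -> int:
        """Increment the count for this body and return the new count."""
        digest = body_digest(raw_message)
        self.counts[digest] += 1
        return self.counts[digest]


def looks_like_bulk(count: int) -> bool:
    """True once a body has been seen more than SPAM_THRESHOLD times."""
    return count > SPAM_THRESHOLD


house = ClearingHouse()
first = "To: alice@example.com\nSubject: Hi\n\nBuy cheap widgets now!\n"
second = "To: bob@example.com\nSubject: Hello\n\nBuy cheap widgets now!\n"
house.test_and_increment(first)
sightings = house.test_and_increment(second)  # same body, different headers
```

Because only the body is hashed, the two messages above land on the same key even though their headers differ. A shared service would also need the increment to be atomic across clients, which this sketch does not attempt.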

So someone has to have already tried this or determined why it's a dumb idea. Which is it? One reason it might not work is that a spammer could individualize each message in some tiny way so that the hashes no longer match.
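To illustrate that weakness, here is a small Python sketch. The helper `md5_hex` and the per-recipient token are made up for the example; the point is only that any change to the body, however small, produces a different digest and therefore a separate count.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 hex digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


original = "Buy cheap widgets now!\n"
# Hypothetical per-recipient token a spammer might append to each copy.
tweaked = original + "ref: 48213\n"

same_key = md5_hex(original) == md5_hex(tweaked)
print(same_key)  # False: the two copies would be counted separately
```

Countering this would presumably require a fuzzier signature than an exact hash of the whole body.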

Update: Pat Ekman writes to say that this is essentially what the Vipul's Razor module for SpamAssassin does. Very good. Does anyone care to comment on how well it works?

