Detecting Splogs


I went to a session on blogging this afternoon. One talk was by Tim Finin on detecting splogs (spam blogs). He is part of the ebiquity research group at UMBC, where he and his students do some interesting work on recognizing splogs. Tim wrote a funny splog-bait post to see where it would get picked up.

Here's an interesting data point: the in-degree distribution of authentic blogs follows a power law, but that of splogs does not. The same is true of the out-degree. Ping times for real blogs are periodic, following the sleep cycle of the blogger, while splogs ping at a more constant rate.
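To make the ping-time observation concrete, here is a rough sketch of one way you might score how evenly a blog's pings are spread over the day. This is my own illustration, not the method from the talk: the entropy measure, the function name `hourly_entropy`, and the sample data are all assumptions.

```python
import math
from collections import Counter


def hourly_entropy(ping_hours):
    """Normalized Shannon entropy of pings over 24 hour-of-day bins.

    Returns a value between 0 and 1. Values near 1 mean pings are spread
    evenly around the clock; lower values mean they cluster in a few hours.
    """
    counts = Counter(h % 24 for h in ping_hours)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one ping")
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(24)


# Hypothetical data: hour of day (0-23) for each ping.
# A human blogger posts during waking hours; a bot pings around the clock.
human_pings = [9, 10, 10, 11, 13, 14, 20, 21, 21, 22] * 3
bot_pings = list(range(24)) * 3

human_score = hourly_entropy(human_pings)
bot_score = hourly_entropy(bot_pings)
```

On data like this, the bot's score comes out near 1 and the human's noticeably lower. A real classifier would need far more than this one signal, and any cutoff between the two would have to be learned from labeled data.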

Not surprisingly, English-language blogs are much more likely to be splogs, as are blogs in the .info TLD.

It can take a minute or more for a person to determine whether a given blog is a splog or not. This makes coming up with good training data a problem.

Our reputation framework might be useful here, drawing on data from places like registrars, Alexa, and Netcraft to attach reputation data to URLs. You could also imagine a toolbar that would let users classify blogs they visit as splogs or not.