Here's something I'd never heard of before: cloaking. Cloaking is the practice of returning a different page to a search engine crawler for a given URL than you return to other users. You can imagine why people intent on getting higher search engine rankings than they deserve might want to do this. When you change the meaning of the page (rather than merely its structure) it's called "semantic cloaking."
You can't reliably detect cloaking from a single copy of the page. You need a page from the browser's perspective and one from the search engine's perspective. But even this isn't enough--some sites serve up different versions of a page to every visitor (e.g. changing a page or a feature), so a difference between two copies doesn't prove cloaking. To reliably detect semantic cloaking, you need four copies: two from each perspective.
That's a significant resource problem for both the Web crawler and the sites serving up the pages. This research proposes a two-step process that reduces the resources required and still yields good results.
The first step uses just two copies, one from each perspective, to filter out sites that aren't cloaking. The second step downloads two more pages and then uses all four to find sites that are really cloaking.
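To make the two steps concrete, here's a rough Python sketch. It's my own illustration, not the paper's code: `two_step`, `looks_cloaked`, `fake_fetch`, and the `.example` hosts are all made up, pages are reduced to sets of terms, and the simple set comparison stands in for the real filter and classifier.

```python
import itertools

def looks_cloaked(crawler_copies, browser_copies):
    """Hypothetical stand-in for the classifier: flag a URL when some terms
    appear in every crawler copy but in no browser copy."""
    stable = set.intersection(*crawler_copies) - set.union(*browser_copies)
    return len(stable) > 0

def two_step(urls, fetch, classify):
    """Return (flagged URLs, number of page downloads)."""
    flagged, downloads = [], 0
    for url in urls:
        # Step 1: one copy from each perspective.
        crawler_1 = fetch(url, "crawler")
        browser_1 = fetch(url, "browser")
        downloads += 2
        if crawler_1 == browser_1:
            continue  # assumed filter rule: identical copies are not cloaking
        # Step 2: two more copies, then decide using all four.
        crawler_2 = fetch(url, "crawler")
        browser_2 = fetch(url, "browser")
        downloads += 2
        if classify([crawler_1, crawler_2], [browser_1, browser_2]):
            flagged.append(url)
    return flagged, downloads

# Fake pages for the demo: one static, one that changes on every request,
# and one that shows the crawler different content.
_counter = itertools.count()

def fake_fetch(url, perspective):
    if url == "static.example":
        return {"hello", "world"}
    if url == "rotating.example":
        return {"news", f"item{next(_counter)}"}
    base = {"cheap", "pills"} if perspective == "crawler" else {"family", "photos"}
    return base | {f"ad{next(_counter)}"}

urls = ["static.example", "rotating.example", "cloaking.example"]
flagged, downloads = two_step(urls, fake_fetch, looks_cloaked)
print(flagged, downloads)
```

In this toy run the static page is dropped after two downloads, the rotating page survives the filter but isn't flagged once all four copies are compared, and only the cloaking page is reported. That's 10 downloads instead of the 12 you'd need if you fetched four copies of every URL.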
The classifier uses Joachims' SVMlight and looks at 162 features from each URL. The investigators manually labeled over 1200 URLs and then used 60% of these for training and retained the remainder for testing effectiveness.
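For what it's worth, a 60/40 split like that is easy to sketch. This is only an illustration under my own assumptions: the function name `split_labeled`, the fixed seed, and the round figure of 1200 examples are mine, and I don't know how the investigators actually chose which URLs went where.

```python
import random

def split_labeled(examples, train_fraction=0.6, seed=0):
    """Shuffle the labeled examples and split them into (training, testing)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Illustrative only: integers stand in for 1200 labeled URLs.
train, test = split_labeled(range(1200))
print(len(train), len(test))
```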
The results are pretty good:
- Accuracy was 96%, precision was 93%, and recall was 85%. From the paper: "... precision means what percentage of the pages predicted by the classifier as semantic cloaking pages are actually utilizing semantic cloaking. Recall refers to the fraction of all semantic cloaking pages that are detected by the classifier."
- A test against the dmoz Open Directory Project (ODP) showed that the filtering step reduced the 4.3 million candidate pages to just under 400,000 URLs that needed to be looked at further. The classifier found 46,000 URLs that used cloaking.
- The ODP test also showed that, not surprisingly, some categories are more likely to contain cloaked pages than others. Pages in Arts and Games categories were much more likely to use cloaking than pages in News, for example.
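
If the definitions of precision and recall quoted above feel abstract, here's a small sketch of how they (and accuracy) fall out of the four possible outcomes for a classified page. The counts below are invented to keep the arithmetic easy; they are not the paper's numbers.

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall, and accuracy from confusion-matrix counts.

    tp: cloaking pages flagged as cloaking
    fp: normal pages wrongly flagged as cloaking
    fn: cloaking pages the classifier missed
    tn: normal pages correctly left alone
    """
    precision = tp / (tp + fp)   # of the flagged pages, how many really cloak
    recall = tp / (tp + fn)      # of the cloaking pages, how many were flagged
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# Hypothetical counts for 500 test pages.
precision, recall, accuracy = metrics(tp=90, fp=10, fn=30, tn=370)
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```

Note that accuracy can look high even when recall is mediocre, because most pages aren't cloaking and the true negatives dominate the count.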