Sciweavers

AIRWEB
2008
Springer

Cleaning search results using term distance features

The presence of Web spam in query results is one of the critical challenges facing search engines today. While search engines try to combat the impact of spam pages on their results, the incentive for spammers to use increasingly sophisticated techniques has never been higher, since the commercial success of a Web page is strongly correlated with the number of views that page receives. This paper describes a term-based technique for spam detection built on a simple new summary data structure, called Term Distance Histograms, that tries to capture the topical structure of a page. We apply this technique as a post-filtering step to a major search engine. Our experiments show that we are able to detect many of the artificially generated spam pages that remained in the results of the engine. Specifically, our method is able to detect many Web pages generated using techniques such as dumping, weaving, or phrase stitching [11], which are spamming techniques designed to achieve high ranki...
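The abstract names Term Distance Histograms but does not define them here. As a rough illustration only, the sketch below assumes such a histogram counts how far apart the occurrences of two terms lie in a page's token sequence, grouped into fixed-width buckets. The function names, the bucketing scheme, and the `bucket_size` and `max_distance` parameters are hypothetical and are not taken from the paper.

```python
from collections import Counter


def term_positions(tokens):
    """Map each token to the list of positions where it occurs."""
    positions = {}
    for index, token in enumerate(tokens):
        positions.setdefault(token, []).append(index)
    return positions


def distance_histogram(tokens, term_a, term_b, bucket_size=5, max_distance=50):
    """Count pairwise distances between occurrences of two terms.

    Assumption (not from the paper): distances are measured in tokens,
    capped at max_distance, and grouped into buckets of bucket_size.
    """
    positions = term_positions(tokens)
    histogram = Counter()
    for i in positions.get(term_a, []):
        for j in positions.get(term_b, []):
            distance = abs(i - j)
            if 0 < distance <= max_distance:
                # Bucket 0 holds distances 1..bucket_size, bucket 1 the next range, and so on.
                histogram[(distance - 1) // bucket_size] += 1
    return histogram


if __name__ == "__main__":
    page = "cheap flights to paris cheap hotels in paris".split()
    print(distance_histogram(page, "cheap", "paris"))
```

Under this assumption, a page whose related terms co-occur at short distances would concentrate its counts in the low buckets, while text stitched together from unrelated fragments might show a flatter spread. Whether the paper uses the histogram this way cannot be confirmed from the truncated abstract.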
Josh Attenberg, Torsten Suel
Added 12 Oct 2010
Updated 12 Oct 2010
Type Conference
Year 2008
Where AIRWEB
Authors Josh Attenberg, Torsten Suel