Template detection for large scale search engines

13 years 10 months ago
Template detection for large scale search engines
Templates in web sites hurt search engine retrieval performance, especially in content relevance and link analysis. Current template removal methods suffer from processing speed and scalability when dealing with large volume web pages. In this paper, we propose a novel two-stage template detection method, which combines template detection and removal with the index building process of a search engine. First, web pages are segmented into blocks and blocks are clustered according to their style features. Second, similar contents sharing the common layout style are detected during the index building process. The blocks with similar layout style and content are identified as templates and deleted. Our experiment on eight popular web sites shows that our method achieves 20-40% faster than shingle and SST methods with close accuracy. Categories and Subject Descriptors H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Information filtering General Terms Algorit...
Liang Chen, Shaozhi Ye, Xing Li
Added 14 Jun 2010
Updated 14 Jun 2010
Type Conference
Year 2006
Where SAC
Authors Liang Chen, Shaozhi Ye, Xing Li
Comments (0)