A Query-Dependent Duplicate Detection Approach for Large Scale Search Engines

13 years 8 months ago

Download www.net-glyph.org

Duplication of Web pages greatly hurts the perceived relevance of a search engine. Existing methods for detecting duplicated Web pages can be classified into two categories, i.e. offline and online methods. The offline methods target to detect all duplicates in a large set of Web pages, but none of the reported methods is capable of processing more than 30 million Web pages, which is about 1% of the pages indexed by today's commercial search engines. On the contrary, the online methods focus on removing duplicated pages in the search results at run time. Although the number of pages to be processed is smaller, these methods could heavily increase the response time of search engines. Our experiments on real query logs show that there is a significant difference between popular and unpopular queries in terms of query number and duplicate distributions. Then, we propose a hybrid query-dependent duplicate detection method which combines both advantage of offline and online methods. Th...

Shaozhi Ye, Ruihua Song, Ji-Rong Wen, Wei-Ying Ma

Real-time Traffic