A Query-Dependent Duplicate Detection Approach for Large Scale Search Engines

Duplication of Web pages greatly hurts the perceived relevance of a search engine. Existing methods for detecting duplicated Web pages fall into two categories: offline and online. Offline methods aim to detect all duplicates in a large set of Web pages, but none of the reported methods can process more than 30 million Web pages, which is about 1% of the pages indexed by today's commercial search engines. In contrast, online methods focus on removing duplicated pages from the search results at run time. Although the number of pages to be processed is smaller, these methods can substantially increase the response time of search engines. Our experiments on real query logs show that popular and unpopular queries differ significantly in terms of query counts and duplicate distributions. Based on this observation, we propose a hybrid query-dependent duplicate detection method that combines the advantages of both offline and online methods. Th...
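
The online category mentioned in the abstract, removing duplicates from a result list at query time, can be illustrated with a minimal sketch. This is not the authors' hybrid method, which the truncated abstract does not describe; it swaps in a generic word-shingle and Jaccard-similarity comparison. The function names, the shingle size k=3, and the 0.8 threshold are all assumptions chosen for illustration.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles in text (k is an assumed value)."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts become a single shingle; empty text yields an empty set.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedupe_results(results, threshold=0.8, k=3):
    """Keep each (url, text) result unless it is a near duplicate of one
    already kept. The threshold is hypothetical, not taken from the paper."""
    kept = []
    kept_shingles = []
    for url, text in results:
        s = shingles(text, k)
        if any(jaccard(s, t) >= threshold for t in kept_shingles):
            continue  # near duplicate of a higher-ranked result
        kept.append((url, text))
        kept_shingles.append(s)
    return kept


if __name__ == "__main__":
    results = [
        ("a", "the quick brown fox jumps over the lazy dog"),
        ("b", "the quick brown fox jumps over the lazy dog"),
        ("c", "a completely different page about search engines and indexing"),
    ]
    for url, _ in dedupe_results(results):
        print(url)
```

Each result is compared against every result already kept, so the cost grows quadratically with the length of the result list. That extra work at query time is the kind of response-time overhead the abstract attributes to online methods.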

Shaozhi Ye, Ruihua Song, Ji-Rong Wen, Wei-Ying Ma

APWEB 2004 | Internet Technology | Online Methods | Search Engines | Web Pages |

Post Info

Added	20 Aug 2010
Updated	20 Aug 2010
Type	Conference
Year	2004
Where	APWEB
Authors	Shaozhi Ye, Ruihua Song, Ji-Rong Wen, Wei-Ying Ma
