Finding Replicated Web Collections



Many web documents (such as JAVA FAQs) are being replicated on the Internet. Often entire document collections (such as hyperlinked Linux manuals) are being replicated many times. In this paper, we make the case for identifying replicated documents and collections to improve web crawlers, archivers, and ranking functions used in search engines. The paper describes how to efficiently identify replicated documents and hyperlinked document collections. The challenge is to identify these replicas from an input data set of several tens of millions of web pages and several hundreds of gigabytes of textual data. We also present two real-life case studies where we used replication information to improve a crawler and a search engine. We report these results for a data set of 25 million web pages (about 150 gigabytes of HTML data) crawled from the web.

Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina


Database | Document | Document Collections | Many Web Documents | SIGMOD 2000 |


» Improved File Synchronization Techniques for Maintaining Large Replicated Collections over...

» Finding the boundaries of information resources on the web

» An Architecture for Finding Entities on the Web

» Finding visual concepts by web image mining

» Finding Comparative Facts and Aspects for Judging the Credibility of Uncertain Facts

» Finding text reuse on the web

» Undue influence eliminating the impact of link plagiarism on web search rankings

» Building a dynamic classifier for large text data collections

Post Info

Added	01 Aug 2010
Updated	01 Aug 2010
Type	Conference
Year	2000
Where	SIGMOD
Authors	Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina
