Sciweavers

WWW
2002
ACM

Parallel crawlers

In this paper we study how we can design an effective parallel crawler. As the size of the Web grows, it becomes imperative to parallelize a crawling process, in order to finish downloading pages in a reasonable amount of time. We first propose multiple architectures for a parallel crawler and identify fundamental issues related to parallel crawling. Based on this understanding, we then propose metrics to evaluate a parallel crawler, and compare the proposed architectures using 40 million pages collected from the Web. Our results clarify the relative merits of each architecture and provide a good guideline on when to adopt which architecture.
Keywords Web Crawler, Web Spider, Parallelization
Junghoo Cho, Hector Garcia-Molina
Added 22 Nov 2009
Updated 22 Nov 2009
Type Conference
Year 2002
Where WWW