WWW
2003
ACM

Efficient URL caching for world wide web crawling

Crawling the web is deceptively simple: the basic algorithm is (a) Fetch a page (b) Parse it to extract all linked URLs (c) For all the URLs not seen before, repeat (a)–(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test. A crucial way to speed up the test is to cache, that is, to store in main memory a (dynamic) subset of the "seen" URLs. The main goal of this paper is to carefully investigate several URL caching techniques for we...
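The crawl loop (a) to (c) and the idea of caching part of the "seen" set in main memory can be illustrated with a minimal sketch. It rests on several assumptions not taken from the abstract: an LRU replacement policy is picked purely for illustration, a Python set stands in for the large seen-URL store that would not fit in main memory, and every name (LRUSeenCache, SeenTest, crawl, the toy link graph) is hypothetical.

```python
from collections import OrderedDict, deque
from typing import Callable, Iterable, List, Set


class LRUSeenCache:
    """Fixed-size in-memory cache holding a subset of the 'seen' URLs (LRU is an assumption)."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._entries: "OrderedDict[str, None]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def contains(self, url: str) -> bool:
        if url in self._entries:
            self._entries.move_to_end(url)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        return False

    def add(self, url: str) -> None:
        self._entries[url] = None
        self._entries.move_to_end(url)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used URL


class SeenTest:
    """Membership test (c): consult the cache first, then the slow backing store."""

    def __init__(self, cache: LRUSeenCache, store: Set[str]) -> None:
        self.cache = cache
        self.store = store  # stand-in for a disk-based or remote set
        self.store_lookups = 0

    def check_and_add(self, url: str) -> bool:
        """Return True if the URL was not seen before, and record it as seen."""
        if self.cache.contains(url):
            return False  # cache hit: the expensive lookup is avoided
        self.store_lookups += 1
        is_new = url not in self.store
        if is_new:
            self.store.add(url)
        self.cache.add(url)
        return is_new


def crawl(seeds: Iterable[str],
          fetch_links: Callable[[str], Iterable[str]],
          seen: SeenTest,
          max_pages: int) -> List[str]:
    """Breadth-first crawl: (a) fetch, (b) extract links, (c) enqueue unseen URLs."""
    frontier = deque(url for url in seeds if seen.check_and_add(url))
    fetched: List[str] = []
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        fetched.append(url)
        for link in fetch_links(url):  # steps (a) and (b), supplied by the caller
            if seen.check_and_add(link):
                frontier.append(link)
    return fetched


# Toy link graph standing in for real fetching and parsing.
web = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "d"], "d": []}
seen = SeenTest(LRUSeenCache(capacity=2), set())
fetched = crawl(["a"], lambda u: web.get(u, []), seen, max_pages=10)
print(fetched, seen.cache.hits, seen.store_lookups)
```

Each cache hit answers the membership test without touching the backing store, so the hit and lookup counters give a rough view of how much work a given cache size saves on a particular link stream.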
Andrei Z. Broder, Marc Najork, Janet L. Wiener
Added 22 Nov 2009
Updated 22 Nov 2009
Type Conference
Year 2003
Where WWW