NSDI
2010

The Architecture and Implementation of an Extensible Web Crawler

Many Web services operate their own Web crawlers to discover data of interest, despite the fact that large-scale, timely crawling is complex, operationally intensive, and expensive. In this paper, we introduce the extensible crawler, a service that crawls the Web on behalf of its many client applications. Clients inject filters into the extensible crawler; the crawler evaluates all received filters against each Web page, notifying clients of matches. As a result, the act of crawling the Web is decoupled from determining whether a page is of interest, shielding client applications from the burden of crawling the Web themselves. This paper describes the architecture, implementation, and evaluation of our prototype extensible crawler, and also relates early experience from several crawler applications we have built. We focus on the challenges and trade-offs in the system, such as the design of a filter language that is simultaneously expressive and efficient to execute, the use of filter ...
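The filter-injection model described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `ExtensibleCrawlerSketch`, the methods `inject_filter` and `process_page`, and the choice of regular expressions as the filter form are all assumptions made for illustration, since the abstract does not specify the paper's filter language. The sketch also checks every filter against every page in a naive loop, with no attempt at the efficiency concerns the abstract mentions.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple


class ExtensibleCrawlerSketch:
    """Toy model of the idea in the abstract: clients register filters,
    and each crawled page is evaluated against all registered filters."""

    def __init__(self) -> None:
        # Hypothetical storage: client id -> list of compiled regex filters.
        # Regexes stand in for the paper's filter language (an assumption).
        self._filters: Dict[str, List[re.Pattern]] = defaultdict(list)

    def inject_filter(self, client_id: str, pattern: str) -> None:
        """Register one filter on behalf of a client application."""
        self._filters[client_id].append(re.compile(pattern))

    def process_page(self, url: str, text: str) -> List[Tuple[str, str]]:
        """Evaluate all filters against one page.

        Returns one (client_id, url) notification per client that has
        at least one matching filter, in client registration order.
        """
        notifications = []
        for client_id, patterns in self._filters.items():
            if any(p.search(text) for p in patterns):
                notifications.append((client_id, url))
        return notifications
```

In this sketch a client never fetches pages itself: it calls `inject_filter` once, and the crawler calls `process_page` for every page it downloads and forwards the resulting notifications. That separation is the decoupling of crawling from interest detection that the abstract describes.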
Added 29 Oct 2010
Updated 29 Oct 2010
Type Conference
Year 2010
Where NSDI
Authors Jonathan M. Hsieh, Steven D. Gribble, Henry M. Levy