
IRLbot: scaling to 6 billion pages and beyond

This paper shares our experience in designing a web crawler that can download billions of pages using a single-server implementation, and models its performance. We show that with the quadratically increasing complexity of verifying URL uniqueness, BFS crawl order, and fixed per-host rate-limiting, current crawling algorithms cannot effectively cope with the sheer volume of URLs generated in large crawls, highly-branching spam, legitimate multi-million-page blog sites, and infinite loops created by server-side scripts. We offer a set of techniques for dealing with these issues and test their performance in an implementation we call IRLbot. In our recent experiment that lasted 41 days, IRLbot running on a single server successfully crawled 6.3 billion valid HTML pages (7.6 billion connection requests) and sustained an average download rate of 319 mb/s (1,789 pages/s). Unlike our prior experiments with algorithms proposed in related work, this version of IRLbot did not experience any bot...
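The scaling bottleneck the abstract points to is the "URL-seen" check: every extracted link must be tested against the set of already-crawled URLs, and performing that test with one random disk seek per URL makes the total cost grow quadratically with crawl size. The sketch below shows the batched alternative in the spirit of IRLbot's disk-based DRUM structure: buffer URL hashes in RAM, then reconcile them against a sorted on-disk archive in a single sequential pass per batch. The class name, SHA-1 keys, and batch size are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (assumptions, not IRLbot's code): batched URL deduplication
# that replaces per-URL random disk seeks with one sequential merge per batch.
import hashlib
import os
import tempfile

KEY_SIZE = 20      # bytes in a SHA-1 digest
BATCH_SIZE = 1024  # illustrative; a real crawler buffers millions of keys

def url_key(url: str) -> bytes:
    """Fixed-size hash key keeps the on-disk archive compact and sortable."""
    return hashlib.sha1(url.encode("utf-8")).digest()

class SeenURLs:
    def __init__(self, path: str):
        self.path = path
        self.pending: dict[bytes, str] = {}  # also dedups within a batch
        if not os.path.exists(path):
            open(path, "wb").close()  # start with an empty sorted archive

    def check(self, url: str) -> list[str]:
        """Queue a URL; when the batch fills, merge and return unseen URLs."""
        self.pending[url_key(url)] = url
        return self.flush() if len(self.pending) >= BATCH_SIZE else []

    def flush(self) -> list[str]:
        """One sequential pass over the sorted archive for the whole batch."""
        new_urls = []
        batch = sorted(self.pending.items())
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.path) or ".")
        with open(self.path, "rb") as old, os.fdopen(fd, "wb") as out:
            disk = old.read(KEY_SIZE)
            for key, url in batch:
                # Copy archive keys smaller than the probe key unchanged.
                while disk and disk < key:
                    out.write(disk)
                    disk = old.read(KEY_SIZE)
                if disk == key:
                    continue  # duplicate: URL was crawled earlier
                out.write(key)  # unseen: record it in the new archive
                new_urls.append(url)
            while disk:  # copy the remaining tail of the archive
                out.write(disk)
                disk = old.read(KEY_SIZE)
        os.replace(tmp, self.path)  # atomically swap in the merged archive
        self.pending.clear()
        return new_urls
```

Each batch costs one linear scan of the archive regardless of how the probe URLs are distributed, so the per-URL overhead is amortized across the whole batch rather than paying a disk seek per link; that amortization is what lets a single server keep verifying uniqueness as the seen set grows into the billions.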
Type: Conference
Year: 2008
Where: WWW (ACM)
Authors: Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, Dmitri Loguinov