
IRLbot: scaling to 6 billion pages and beyond

This paper shares our experience in designing a web crawler that can download billions of pages using a single-server implementation, and models its performance. We show that with the quadratically increasing complexity of verifying URL uniqueness, BFS crawl order, and fixed per-host rate-limiting, current crawling algorithms cannot effectively cope with the sheer volume of URLs generated in large crawls, highly-branching spam, legitimate multi-million-page blog sites, and infinite loops created by server-side scripts. We offer a set of techniques for dealing with these issues and test their performance in an implementation we call IRLbot. In our recent experiment that lasted 41 days, IRLbot running on a single server successfully crawled 6.3 billion valid HTML pages (7.6 billion connection requests) and sustained an average download rate of 319 mb/s (1,789 pages/s). Unlike our prior experiments with algorithms proposed in related work, this version of IRLbot did not experience any bot...
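The scaling bottleneck the abstract points to is the "URL-seen" check: every extracted link must be tested against the set of already-crawled URLs, and performing that test with one random disk seek per URL makes the total cost grow quadratically with crawl size. The sketch below shows the batched alternative in the spirit of IRLbot's disk-based DRUM structure: buffer URL hashes in RAM, then reconcile them against a sorted on-disk archive in a single sequential pass per batch. The class name, SHA-1 keys, and batch size are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (assumptions, not IRLbot's code): batched URL deduplication
# that replaces per-URL random disk seeks with one sequential merge per batch.
import hashlib
import os
import tempfile

KEY_SIZE = 20      # bytes in a SHA-1 digest
BATCH_SIZE = 1024  # illustrative; a real crawler buffers millions of keys

def url_key(url: str) -> bytes:
    """Fixed-size hash key keeps the on-disk archive compact and sortable."""
    return hashlib.sha1(url.encode("utf-8")).digest()

class SeenURLs:
    def __init__(self, path: str):
        self.path = path
        self.pending: dict[bytes, str] = {}  # also dedups within a batch
        if not os.path.exists(path):
            open(path, "wb").close()  # start with an empty sorted archive

    def check(self, url: str) -> list[str]:
        """Queue a URL; when the batch fills, merge and return unseen URLs."""
        self.pending[url_key(url)] = url
        return self.flush() if len(self.pending) >= BATCH_SIZE else []

    def flush(self) -> list[str]:
        """One sequential pass over the sorted archive for the whole batch."""
        new_urls = []
        batch = sorted(self.pending.items())
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.path) or ".")
        with open(self.path, "rb") as old, os.fdopen(fd, "wb") as out:
            disk = old.read(KEY_SIZE)
            for key, url in batch:
                # Copy archive keys smaller than the probe key unchanged.
                while disk and disk < key:
                    out.write(disk)
                    disk = old.read(KEY_SIZE)
                if disk == key:
                    continue  # duplicate: URL was crawled earlier
                out.write(key)  # unseen: record it in the new archive
                new_urls.append(url)
            while disk:  # copy the remaining tail of the archive
                out.write(disk)
                disk = old.read(KEY_SIZE)
        os.replace(tmp, self.path)  # atomically swap in the merged archive
        self.pending.clear()
        return new_urls
```

Each batch costs one linear scan of the archive regardless of how the probe URLs are distributed, so the per-URL overhead is amortized across the whole batch rather than paying a disk seek per link; that amortization is what lets a single server keep verifying uniqueness as the seen set grows into the billions.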
Type: Conference
Year: 2008
Where: WWW (ACM)
Authors: Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, Dmitri Loguinov