Evaluation of crawling policies for a web-repository crawler

We have developed a web-repository crawler that is used for reconstructing websites when backups are unavailable. Our crawler retrieves web resources from the Internet Archive, Google, Yahoo and MSN. We examine the challenges of crawling web repositories, and we discuss strategies for overcoming some of these obstacles. We propose three crawling policies which can be used to reconstruct websites. We evaluate the effectiveness of the policies by reconstructing 24 websites and comparing the results with live versions of the websites. We conclude with our experiences reconstructing lost websites on behalf of others and discuss plans for improving our web-repository crawler.
Categories and Subject Descriptors H.3.5 [Information Storage and Retrieval]: Online Information Services—Web-based services
General Terms Measurement, Experimentation, Design
Keywords digital preservation, website reconstruction, crawler policy, search engine
Frank McCown, Michael L. Nelson
Added 13 Jun 2010
Updated 13 Jun 2010
Type Conference
Year 2006
Where HT