Sciweavers

CIKM
2009
Springer

Graph-based seed selection for web-scale crawlers

One of the most important steps in web crawling is seed selection, that is, determining the starting points of the crawl. This paper identifies and explores the problem of seed selection in web-scale incremental crawlers. We argue that seed selection is a nontrivial and important problem. Selecting proper seeds can increase the number of pages a crawler discovers and can result in a repository with more “good” and fewer “bad” pages. We propose a graph-based framework for crawler seed selection and present several algorithms within this framework. Evaluation on real web data showed significant improvements over heuristic seed selection approaches.
Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval
General Terms: Algorithms, Design, Experimentation, Performance
Keywords: Crawler, Seed Selection, PageRank, Graph Analysis
Shuyi Zheng, Pavel Dmitriev, C. Lee Giles
Added 26 May 2010
Updated 26 May 2010
Type Conference
Year 2009
Where CIKM