SIGIR
2008
ACM

Exploring traversal strategy for web forum crawling

In this paper, we study the problem of Web forum crawling. Web forums have become an important data source for many Web applications, yet forum crawling remains a challenging task because of the complex in-site link structures and login controls of most forum sites. Without a carefully selected traversal path, a generic crawler usually downloads many duplicate and invalid pages from forums, wasting both precious bandwidth and limited storage space. To crawl forum data more effectively and efficiently, we propose an automatic approach to exploring an appropriate traversal strategy to direct the crawling of a given target forum. Specifically, the traversal strategy consists of identifying the skeleton links and detecting the page-flipping links. The skeleton links instruct the crawler to crawl only valuable pages while avoiding duplicate and uninformative ones, and the page-flipping links tell the crawler how to completely download a long...
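As a rough illustration of how such a traversal strategy could steer a crawler, the sketch below runs a breadth-first crawl over a toy in-memory forum and follows only links classified as skeleton or page-flipping links. It is not the authors' method: the paper derives these link roles automatically, whereas the regular expressions here are hand-written stand-ins, and the names SKELETON, PAGE_FLIP, SITE, allowed and crawl, along with every URL, are hypothetical.

```python
import re
from collections import deque

# Hypothetical URL patterns. The paper identifies link roles automatically;
# these hand-written regexes only stand in for that step.
SKELETON = [re.compile(r"^/board/\d+$"), re.compile(r"^/thread/\d+$")]
PAGE_FLIP = [re.compile(r"^/(board|thread)/\d+\?page=\d+$")]


def allowed(url):
    """Return True if the URL looks like a skeleton or page-flipping link."""
    return any(p.match(url) for p in SKELETON + PAGE_FLIP)


def crawl(site, start):
    """Breadth-first crawl of `site`, a dict mapping URL -> outgoing links.

    Only links accepted by `allowed` are followed, and each URL is
    visited at most once.
    """
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in site.get(url, []):
            if link not in seen and allowed(link):
                seen.add(link)
                queue.append(link)
    return order


# Toy forum (assumed structure): boards list threads, and long boards and
# threads span several pages. Login, profile, reply and re-sort links are
# the kind of invalid or duplicate pages a generic crawler would fetch.
SITE = {
    "/": ["/board/1", "/login", "/user/7"],
    "/board/1": ["/thread/10", "/board/1?page=2", "/board/1?sort=date"],
    "/board/1?page=2": ["/thread/11", "/board/1"],
    "/thread/10": ["/thread/10?page=2", "/reply?to=10", "/user/7"],
    "/thread/10?page=2": ["/thread/10"],
    "/thread/11": [],
}

if __name__ == "__main__":
    for url in crawl(SITE, "/"):
        print(url)
```

In this toy setup the crawl reaches every board and thread page, including the second pages found through page-flipping links, while never queuing the login, profile, reply, or re-sorted views.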
Added 15 Dec 2010
Updated 15 Dec 2010
Type Journal
Year 2008
Where SIGIR
Authors Yida Wang, Jiang-Ming Yang, Wei Lai, Rui Cai, Lei Zhang 0001, Wei-Ying Ma