Sciweavers

JUCS
2008

Structure-Based Crawling in the Hidden Web

The number of applications that need to crawl the Web to gather data is growing at an ever-increasing pace. In some cases, the criterion that determines which pages must be included in a collection is based on their contents; in others, it would be wiser to use a structure-based criterion. In this article, we present a proposal to build structure-based crawlers that requires only a few examples of the pages to be crawled and an entry point to the target web site. Our crawlers can deal with form-based web sites. Contrary to other proposals, ours does not require a sample database to fill in the forms, and does not require the user to interact heavily. In our experiments, precision was 100% on seventeen real-world web sites, with both static and dynamic content, and recall was 95% on the eleven static web sites examined.
Key Words: Web crawling, hidden web, tree-edit distance, web wrappers
Category: H.3.3, H.3.4, H.3.5, H.3.7
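The key words name tree-edit distance as the basis for a structure-based criterion. As a rough illustration only, and not the authors' algorithm, the sketch below compares two pages by a restricted top-down tree-edit distance over their tag trees. The `(tag, children)` tuple representation, the unit costs, the function names, and the sample trees are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical page representation: a tag tree as (tag, [child trees]).
Tree = Tuple[str, List["Tree"]]


def size(t: Tree) -> int:
    """Number of nodes in the tree."""
    return 1 + sum(size(c) for c in t[1])


def top_down_distance(a: Tree, b: Tree) -> int:
    """Restricted top-down tree-edit distance (assumed unit costs).

    Roots are always matched (cost 1 if their tags differ); the child
    sequences are then aligned like strings, where inserting or deleting
    a child costs the size of its whole subtree.
    """
    root_cost = 0 if a[0] == b[0] else 1
    ca, cb = a[1], b[1]
    m, n = len(ca), len(cb)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(ca[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(cb[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(ca[i - 1]),                        # delete subtree
                d[i][j - 1] + size(cb[j - 1]),                        # insert subtree
                d[i - 1][j - 1] + top_down_distance(ca[i - 1], cb[j - 1]),  # match
            )
    return root_cost + d[m][n]


def similarity(a: Tree, b: Tree) -> float:
    """Normalize the distance into a score where 1.0 means identical structure."""
    return 1.0 - top_down_distance(a, b) / (size(a) + size(b))


# Hypothetical example page and two candidates.
example: Tree = ("html", [("body", [("table", [("tr", []), ("tr", [])])])])
candidate: Tree = ("html", [("body", [("table", [("tr", []), ("tr", []), ("tr", [])])])])
other: Tree = ("html", [("body", [("div", [("p", [])])])])

if __name__ == "__main__":
    print(top_down_distance(example, candidate), similarity(example, candidate))
    print(top_down_distance(example, other), similarity(example, other))
```

Under these assumptions, a crawler could keep a candidate page when its similarity to one of the example pages exceeds a chosen threshold; the threshold and the cost model would need tuning per site.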
Added 13 Dec 2010
Updated 13 Dec 2010
Type Journal
Year 2008
Where JUCS
Authors Márcio L. A. Vidal, Altigran Soares da Silva, Edleno Silva de Moura, João M. B. Cavalcanti