Structure-Based Crawling in the Hidden Web

Abstract: The number of applications that need to crawl the Web to gather data is growing at an ever-increasing pace. In some cases, the criterion for deciding which pages to include in a collection is based on their contents; in others, a structure-based criterion is more appropriate. In this article, we present a proposal for building structure-based crawlers that requires only a few examples of the pages to be crawled and an entry point to the target web site. Our crawlers can deal with form-based web sites. Contrary to other proposals, ours does not require a sample database to fill in the forms, and does not require heavy user interaction. Our experiments show a precision of 100% on seventeen real-world web sites, with both static and dynamic content, and a recall of 95% on the eleven static web sites examined.

Key Words: Web crawling, hidden web, tree-edit distance, web wrappers

Category: H.3.3, H.3.4, H.3.5, H.3.7
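
The abstract lists tree-edit distance among its key words but does not describe how it is applied. As a rough illustration only, and not the authors' algorithm, the sketch below compares the tag structure of two pages with a simple top-down edit distance (unit relabel cost, subtree insert/delete cost equal to subtree size). The names `Node`, `TagTreeBuilder`, `distance` and `similarity`, the cost model, and the normalisation are all assumptions made for this example.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the open stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of the tag tree (hypothetical helper type)."""
    def __init__(self, tag):
        self.tag = tag
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Builds a tree of tag names only; text and attributes are ignored."""
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open tag, tolerating sloppy HTML.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def parse(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def size(node):
    return 1 + sum(size(child) for child in node.children)


def distance(a, b):
    """Top-down edit distance: relabel the roots, then align child subtrees."""
    relabel = 0 if a.tag == b.tag else 1
    xs, ys = a.children, b.children
    # dp[i][j]: cost of turning the first i subtrees of xs into the first j of ys.
    dp = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i, x in enumerate(xs, 1):
        dp[i][0] = dp[i - 1][0] + size(x)          # delete whole subtree
    for j, y in enumerate(ys, 1):
        dp[0][j] = dp[0][j - 1] + size(y)          # insert whole subtree
    for i, x in enumerate(xs, 1):
        for j, y in enumerate(ys, 1):
            dp[i][j] = min(dp[i - 1][j] + size(x),
                           dp[i][j - 1] + size(y),
                           dp[i - 1][j - 1] + distance(x, y))
    return relabel + dp[-1][-1]


def similarity(html_a, html_b):
    """Structural similarity in [0, 1]; 1.0 means identical tag trees."""
    a, b = parse(html_a), parse(html_b)
    return 1 - distance(a, b) / (size(a) + size(b))


EXAMPLE = "<html><body><h1>A</h1><table><tr><td>x</td><td>y</td></tr></table></body></html>"
CANDIDATE = "<html><body><h1>B</h1><table><tr><td>1</td><td>2</td></tr></table></body></html>"
OTHER = "<html><body><ul><li>a</li><li>b</li></ul></body></html>"
```

Under these assumptions, a crawler could keep a candidate page when its `similarity` to one of the example pages exceeds some threshold; the threshold itself would have to be chosen empirically and is not taken from the paper.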

Márcio L. A. Vidal, Altigran Soares da Silva, Edleno Silva de Moura, João M. B. Cavalcanti

Form-based Web Sites | JUCS 2008 | Structure-based Criterion | Web Sites |

Post Info

Added	13 Dec 2010
Updated	13 Dec 2010
Type	Journal
Year	2008
Where	JUCS
Authors	Márcio L. A. Vidal, Altigran Soares da Silva, Edleno Silva de Moura, João M. B. Cavalcanti
