Crawling the Content Hidden Behind Web Forms


Today's crawler engines cannot reach most of the information available on the Web. A great deal of valuable information is “hidden” behind the query forms of online databases or is generated dynamically by technologies such as JavaScript. This portion of the Web is usually known as the Deep Web or the Hidden Web. We have built DeepBot, a prototype hidden-web crawler able to access such content. DeepBot receives as input a set of domain definitions, each describing a specific data-collecting task, and automatically identifies the forms relevant to them and learns to execute queries on those forms. In this paper we describe the techniques employed for building DeepBot and report the experimental results obtained when testing it on several real-world data-collection tasks.
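The abstract does not specify what a domain definition contains or how form relevance is decided, so the following is only a minimal sketch under stated assumptions, not DeepBot's actual method. It assumes a domain definition is a set of named attributes, each with alias words, and that a form counts as relevant when enough attributes can be matched to its field labels. The names `DomainDefinition`, `match_form`, `is_relevant`, and the `threshold` parameter are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DomainDefinition:
    """Hypothetical description of one data-collecting task."""
    name: str
    # attribute name -> alias words that may appear in a form field label
    attributes: dict[str, set[str]]
    # assumed cutoff: minimum fraction of attributes a form must match
    threshold: float = 0.5


def tokenize(label: str) -> set[str]:
    """Lowercase a field label and split it into alphanumeric word tokens."""
    cleaned = "".join(c if c.isalnum() else " " for c in label.lower())
    return set(cleaned.split())


def match_form(domain: DomainDefinition, field_labels: list[str]) -> dict[str, str]:
    """Map each domain attribute to the first label sharing an alias token."""
    mapping: dict[str, str] = {}
    for attr, aliases in domain.attributes.items():
        for label in field_labels:
            if aliases & tokenize(label):
                mapping[attr] = label
                break
    return mapping


def is_relevant(domain: DomainDefinition, field_labels: list[str]) -> bool:
    """Treat a form as relevant if enough attributes found a matching field."""
    matched = match_form(domain, field_labels)
    return len(matched) / len(domain.attributes) >= domain.threshold


# Illustrative data only; not taken from the paper's experiments.
books = DomainDefinition(
    name="books",
    attributes={
        "title": {"title"},
        "author": {"author", "writer"},
        "isbn": {"isbn"},
    },
)
search_form = ["Book title:", "Author name", "Price range"]
login_form = ["Username", "Password"]
```

With this illustrative data, `is_relevant(books, search_form)` holds because two of the three attributes find a matching field, while `login_form` matches none. A crawler built along these lines would use the mapping returned by `match_form` to decide which form field receives which query value.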

Manuel Álvarez, Juan Raposo, Alberto Pan, F


Applied Computing | Crawler Engines | ICCSA 2007 | Prototype Hiddenweb Crawler | Specific Data-collecting Task |


Related Content

» Crawling the Hidden Web

» Google's Deep Web crawl

» Structure-Based Crawling in the Hidden Web

» HDSampler: revealing data behind web form interfaces

» DeepBot: a focused crawler for accessing hidden web content

» Sitemaps: above and beyond the crawl of duty

» Learning Deep Web Crawling with Diverse Features

» Why Johnny Can't Pentest: An Analysis of Black-Box Web Vulnerability Scanners

» User-centric Web crawling

Post Info

Added	08 Jun 2010
Updated	08 Jun 2010
Type	Conference
Year	2007
Where	ICCSA
Authors	Manuel Álvarez, Juan Raposo, Alberto Pan, Fidel Cacheda, Fernando Bellas, Victor Carneiro
