Google's Deep Web crawl

The Deep Web, i.e., content hidden behind HTML forms, has long been acknowledged as a significant gap in search engine coverage. Since it represents a large portion of the structured data on the Web, accessing Deep-Web content has been a long-standing challenge for the database community. This paper describes a system for surfacing Deep-Web content, i.e., pre-computing submissions for each HTML form and adding the resulting HTML pages into a search engine index. The results of our surfacing have been incorporated into the Google search engine and today drive more than a thousand queries per second to Deep-Web content. Surfacing the Deep Web poses several challenges. First, our goal is to index the content behind many millions of HTML forms that span many languages and hundreds of domains. This necessitates an approach that is completely automatic, highly scalable, and very efficient. Second, a large number of forms have text inputs and require valid input values to be submitted. We p...
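
The surfacing idea in the abstract (pre-computing form submissions so that the resulting pages can be indexed) can be illustrated with a minimal sketch. This is not the system described in the paper: it assumes a form submitted via GET, and the function name `surface_urls`, the `limit` parameter, the example URL, and the candidate values are all hypothetical. It only shows how a set of candidate input values expands into concrete submission URLs.

```python
from itertools import product
from urllib.parse import urlencode


def surface_urls(action, candidates, limit=100):
    """Enumerate GET submission URLs for one form (illustrative only).

    action     -- the form's action URL (hypothetical in the example below)
    candidates -- maps each input name to the values to try for it
    limit      -- cap on generated URLs, since the full product grows quickly
    """
    names = sorted(candidates)  # fixed order keeps the output deterministic
    urls = []
    for combo in product(*(candidates[name] for name in names)):
        if len(urls) >= limit:
            break
        query = urlencode(list(zip(names, combo)))
        urls.append(action + "?" + query)
    return urls


# Hypothetical form with one select-style input and one text input.
form_urls = surface_urls(
    "https://example.com/search",
    {"make": ["ford", "honda"], "zip": ["14850"]},
)
for url in form_urls:
    print(url)
```

Each generated URL could then be fetched and handed to an indexer like any other page. The cap on the number of URLs reflects the efficiency concern raised in the abstract; how to choose good candidate values for text inputs is the harder problem and is not addressed by this sketch.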

Jayant Madhavan, David Ko, Lucja Kot, Vignesh Ganapathy, Alex Rasmussen, Alon Y. Halevy

Deep-Web Content | HTML Forms | PVLDB 2008 | Search Engine

» ISP-Enabled Behavioral Ad Targeting without Deep Packet Inspection

» Learning Deep Web Crawling with Diverse Features

» Crawling the Content Hidden Behind Web Forms

» LeeDeo: Web-Crawled Academic Video Search Engine

» Service Class Driven Dynamic Data Source Discovery with DynaBot

» Exploiting the Deep Web with DynaBot: Matching, Probing, and Ranking

» Sitemaps: Above and Beyond the Crawl of Duty

» Querying Capability Modeling and Construction of Deep Web Sources

Post Info

Added	28 Dec 2010
Updated	28 Dec 2010
Type	Journal
Year	2008
Where	PVLDB
Authors	Jayant Madhavan, David Ko, Lucja Kot, Vignesh Ganapathy, Alex Rasmussen, Alon Y. Halevy

Sciweavers
