Web-scale extraction of structured data

A long-standing goal of Web research has been to construct a unified Web knowledge base. Information extraction techniques have shown good results on Web inputs, but even most domain-independent ones are not appropriate for Web-scale operation. In this paper we describe three recent extraction systems that can be operated on the entire Web (two of which come from Google Research). The TextRunner system focuses on raw natural language text, the WebTables system focuses on HTML-embedded tables, and the deep-web surfacing system focuses on "hidden" databases. The domain, expressiveness, and accuracy of extracted data can depend strongly on its source extractor; we describe differences in the characteristics of data produced by the three extractors. Finally, we discuss a series of unique data applications (some of which have already been prototyped) that are enabled by aggregating extracted Web information.

Michael J. Cafarella, Jayant Madhavan, Alon Y. Halevy

Database | Extracted Data | Information Extraction Techniques | SIGMOD 2008 | Unified Web Knowledge |

» Web Scale Competitor Discovery Using Mutual Information

» Automatic Event Extraction with Structured Preference Modeling

» Road network extraction from airborne LiDAR data using scene context

» Towards a Statistically Semantic Web

» Web Service Search on Large Scale

» Rule-Based Information Extraction for Structured Data Acquisition using TextMarker

» Automatically Extracting Structure and Data from Business Reports

» A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured D...

» Incorporating site-level knowledge to extract structured data from web forums

Post Info

Added	08 Dec 2009
Updated	08 Dec 2009
Type	Conference
Year	2008
Where	SIGMOD
Authors	Michael J. Cafarella, Jayant Madhavan, Alon Y. Halevy
