Sciweavers


SIGMOD
2003
ACM


Extracting Structured Data from Web Pages


Download infolab.stanford.edu

Many web sites contain large sets of pages generated using a common template or layout. For example, Amazon lays out the author, title, comments, etc. in the same way in all its book pages. The values used to generate the pages (e.g., the author, title, ...) typically come from a database. In this paper, we study the problem of automatically extracting the database values from such template-generated web pages without any learning examples or other similar human input. We formally define a template, and propose a model that describes how values are encoded into pages using a template. We present an algorithm that takes, as input, a set of template-generated pages, deduces the unknown template used to generate the pages, and extracts, as output, the values encoded in the pages. Experimental evaluation on a large number of real input page collections indicates that our algorithm correctly extracts data in most cases.
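To make the input/output contract in the abstract concrete, here is a minimal sketch in Python. It is not the algorithm from the paper: it swaps in a much simpler technique, aligning token sequences with the standard library's `difflib.SequenceMatcher` and treating the tokens shared by every page as the template. The function names (`tokenize`, `deduce_template`, `extract_values`) and the sample pages are hypothetical, chosen only for illustration.

```python
import re
from difflib import SequenceMatcher


def tokenize(page):
    """Split a page into HTML tags and whitespace-separated words."""
    return re.findall(r"<[^>]+>|[^<\s]+", page)


def common_tokens(a, b):
    """Return the tokens of `a` that SequenceMatcher aligns with `b`, in order."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    shared = []
    for block in matcher.get_matching_blocks():
        shared.extend(a[block.a:block.a + block.size])
    return shared


def deduce_template(pages):
    """Assumption: the template is the token sequence common to all pages."""
    token_lists = [tokenize(p) for p in pages]
    template = token_lists[0]
    for tokens in token_lists[1:]:
        template = common_tokens(template, tokens)
    return template


def extract_values(template, page):
    """Collect the non-empty runs of tokens that fall between template tokens."""
    tokens = tokenize(page)
    values, current, pos = [], [], 0
    for tok in template:
        # Everything before the next template token belongs to a value.
        while pos < len(tokens) and tokens[pos] != tok:
            current.append(tokens[pos])
            pos += 1
        pos += 1  # step over the template token itself
        if current:
            values.append(" ".join(current))
            current = []
    if tokens[pos:]:
        values.append(" ".join(tokens[pos:]))
    return values


if __name__ == "__main__":
    # Hypothetical book pages that share one layout.
    pages = [
        "<html><body><h1>Database Systems</h1><p>Author: Jeff Ullman</p></body></html>",
        "<html><body><h1>Transaction Processing</h1><p>Author: Jim Gray</p></body></html>",
    ]
    template = deduce_template(pages)
    for page in pages:
        print(extract_values(template, page))
```

This shortcut has clear limits that the formal template model in the paper is meant to address: a value that happens to be identical on every page is absorbed into the template, a value word that coincides with a template token can misalign the scan, and optional or repeated fields are not handled.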

Arvind Arasu, Hector Garcia-Molina


Book Pages | Database | Page | SIGMOD 2003 | Template-generated Web Pages


Related Content

» Data Extraction from Web Data Sources

» Syntactic Folding and its Application to the Information Extraction from Web Pages

» Deep web data extraction

» NET - A System for Extracting Web Data from Flat and Nested Data Records

» Incorporating site-level knowledge to extract structured data from web forums

» Extracting Web Data Using Instance-Based Learning

» FiVaTech: Page-Level Web Data Extraction from Template Pages

» Learning Page-Independent Heuristics for Extracting Data from Web Pages

» GeneWebEx: Gene Annotation Web Extraction, Aggregation and Updating from Web-Based Biomolecul...

Post Info

Added	05 Jul 2010
Updated	05 Jul 2010
Type	Conference
Year	2003
Where	SIGMOD
Authors	Arvind Arasu, Hector Garcia-Molina
