Redundancy-Driven Web Data Extraction and Integration

Download www.dia.uniroma3.it

A large number of web sites publish pages containing structured information about recognizable concepts, but these data are only partially used by current applications. Although such information is spread across a myriad of sources, the scale of the web implies significant redundancy. We present a domain-independent system that exploits the redundancy of information to automatically extract and integrate data from the Web. Our solution concentrates on sources that provide structured data about multiple instances from the same conceptual domain, e.g., financial data or product information. Our proposal is based on an original approach that exploits the mutual dependency between the data extraction and the data integration tasks. Experiments on a sample of 175,000 pages confirm the feasibility and quality of the approach.

Paolo Papotti, Valter Crescenzi, Paolo Merialdo, Mirko Bronzi, Lorenzo Blanco

Data Integration Tasks | Internet Technology | Relevant Redundancy | Sites Publish Pages | WEBDB 2010 |

» HyLiEn: a hybrid approach to general list extraction on the web

» ObjectRunner: Lightweight, Targeted Extraction and Querying of Structured Web Data

» Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction

» iCube A ToolSet for the Dynamic Extraction and Integration of Web Data Content

» Extracting Objects from the Web

» Ontology-based information extraction and integration from heterogeneous data sources

» Extracting Personalised Ontology from Data-Intensive Web Application an HTML Forms-Based Rev...

» GeneWebEx Gene Annotation Web Extraction Aggregation and Updating from Web-Based Biomolecul...

Post Info

Added	11 Jul 2010
Updated	11 Jul 2010
Type	Conference
Year	2010
Where	WEBDB
Authors	Paolo Papotti, Valter Crescenzi, Paolo Merialdo, Mirko Bronzi, Lorenzo Blanco
