
ICDE
2009
IEEE

Large-Scale Deduplication with Constraints Using Dedupalog

We present a declarative framework for collective deduplication of entity references in the presence of constraints. Constraints occur naturally in many data cleaning domains and can improve the quality of deduplication. An example of a constraint is "each paper has a unique publication venue": if two paper references are duplicates, then their associated conference references must be duplicates as well. Our framework supports collective deduplication, meaning that we can deduplicate both paper references and conference references collectively in the example above. Our framework is based on a simple declarative Datalog-style language with precise semantics. Most previous work on deduplication either ignores constraints or uses them in an ad hoc, domain-specific manner. We also present efficient algorithms to support the framework. Our algorithms have precise theoretical guarantees for a large subclass of our framework. We show, using a prototype implementation, that our algorithms sca...
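To make the example constraint concrete, the following is a minimal Python sketch of how "each paper has a unique publication venue" forces venue references to merge once paper references are merged. It is an illustration only: it is neither Dedupalog syntax nor the authors' algorithm, and the names `UnionFind`, `propagate_venue_constraint`, `paper_pairs`, and `venue_of` are hypothetical. It assumes the duplicate paper pairs are already known and uses a plain union-find structure to propagate the constraint.

```python
class UnionFind:
    """Disjoint-set structure; each set is one cluster of duplicate references."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Register unseen items as their own singleton cluster.
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving keeps the trees shallow.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb


def propagate_venue_constraint(paper_pairs, venue_of):
    """Hypothetical helper: merge venue references implied by paper duplicates.

    paper_pairs: iterable of (paper_ref, paper_ref) pairs assumed to be duplicates.
    venue_of:    dict mapping each paper reference to its venue reference.
    """
    papers, venues = UnionFind(), UnionFind()
    for p, q in paper_pairs:
        papers.union(p, q)

    # Papers in the same cluster must share a venue, so the venue
    # references of every paper in a cluster are merged together.
    first_venue = {}
    for paper, venue in venue_of.items():
        root = papers.find(paper)
        if root in first_venue:
            venues.union(first_venue[root], venue)
        else:
            first_venue[root] = venue
    return papers, venues


if __name__ == "__main__":
    # Made-up references for illustration.
    pairs = [("p1", "p2")]
    venue_of = {"p1": "ICDE", "p2": "Intl. Conf. on Data Engineering", "p3": "VLDB"}
    _, venues = propagate_venue_constraint(pairs, venue_of)
    print(venues.find("ICDE") == venues.find("Intl. Conf. on Data Engineering"))
    print(venues.find("ICDE") == venues.find("VLDB"))
```

The sketch only propagates in one direction (paper duplicates imply venue duplicates) and treats the paper pairs as given. The collective setting described in the abstract is harder, because decisions about one entity type can feed back into decisions about the other.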
Arvind Arasu, Christopher Ré, Dan Suciu
Added 20 Oct 2009
Updated 20 Oct 2009
Type Conference
Year 2009
Where ICDE