
CIKM
2005
Springer

Joint deduplication of multiple record types in relational data

Record deduplication is the task of merging database records that refer to the same underlying entity. In relational databases, accurate deduplication for records of one type is often dependent on the decisions made for records of other types. Whereas nearly all previous approaches have merged records of different types independently, this work models these inter-dependencies explicitly to collectively deduplicate records of multiple types. We construct a conditional random field model of deduplication that captures these relational dependencies, and then employ a novel relational partitioning algorithm to jointly deduplicate records. For two citation matching datasets, we show that collectively deduplicating paper and venue records results in up to a 30% error reduction in venue deduplication, and up to a 20% error reduction in paper deduplication. Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—clustering General Te...
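To make the idea of inter-dependent merge decisions concrete, here is a minimal sketch in Python. It is not the paper's conditional random field or its relational partitioning algorithm; it swaps in a simple greedy, threshold-based merge in which a paper merge raises the score of the corresponding venue pair, and vice versa. The names `joint_dedup`, `sim`, `threshold`, and `bonus`, the sample citations, and the numeric defaults are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def sim(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


class UnionFind:
    """Tracks which mentions have been merged into the same cluster."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)


def joint_dedup(citations, threshold=0.75, bonus=0.4, max_rounds=10):
    """Greedy stand-in for joint inference (hypothetical, not the paper's method).

    citations: list of (title, venue) strings, one pair per citation.
    Returns (paper_labels, venue_labels); equal labels mean "same entity".
    """
    n = len(citations)
    papers, venues = UnionFind(n), UnionFind(n)
    for _ in range(max_rounds):
        changed = False
        for i, j in combinations(range(n), 2):
            same_paper = papers.find(i) == papers.find(j)
            same_venue = venues.find(i) == venues.find(j)
            if not same_paper:
                # Title similarity, boosted if the two venues are already merged.
                score = sim(citations[i][0], citations[j][0])
                score += bonus if same_venue else 0.0
                if score >= threshold:
                    papers.union(i, j)
                    same_paper = changed = True
            if not same_venue:
                # Venue-name similarity, boosted if the two papers are already merged.
                score = sim(citations[i][1], citations[j][1])
                score += bonus if same_paper else 0.0
                if score >= threshold:
                    venues.union(i, j)
                    changed = True
        if not changed:  # stop once a full pass makes no new merges
            break
    return ([papers.find(i) for i in range(n)],
            [venues.find(i) for i in range(n)])


# Made-up sample data: two mentions of one paper, plus an unrelated citation.
citations = [
    ("Joint Deduplication of Multiple Record Types in Relational Data", "CIKM"),
    ("Joint deduplication of multiple record types in relational data", "Proc. CIKM 2005"),
    ("Protein folding kinetics", "Nature"),
]
paper_labels, venue_labels = joint_dedup(citations)
```

In this toy run, the venue strings "CIKM" and "Proc. CIKM 2005" are too dissimilar to merge on string similarity alone, but they do merge once their papers have been merged. Calling `joint_dedup(citations, bonus=0.0)` removes the cross-type evidence and leaves the two venue mentions separate, which is the independent-deduplication baseline the abstract contrasts against.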
Aron Culotta, Andrew McCallum
Added: 26 Jun 2010
Updated: 26 Jun 2010
Type: Conference
Year: 2005
Where: CIKM
Authors: Aron Culotta, Andrew McCallum