
KDD
2002
ACM

Interactive deduplication using active learning

Deduplication is a key operation in integrating data from multiple sources. The main challenge in this task is designing a function that can resolve when a pair of records refer to the same entity in spite of various data inconsistencies. Most existing systems use hand-coded functions. One way to overcome the tedium of hand-coding is to train a classifier to distinguish between duplicates and non-duplicates. The success of this method critically hinges on being able to provide a covering and challenging set of training pairs that bring out the subtlety of the deduplication function. This is non-trivial because it requires manually searching for various data inconsistencies between any two records spread apart in large lists. We present our design of a learning-based deduplication system that uses a novel method of interactively discovering challenging training pairs using active learning. Our experiments on real-life datasets show that active learning significantly reduces the number ...
Sunita Sarawagi, Anuradha Bhamidipaty
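
The idea in the abstract, training a duplicate/non-duplicate classifier while letting the system choose which pairs a person labels, can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the paper: the single character-level similarity feature, the one-threshold classifier, the uncertainty-sampling selection rule (a common active learning strategy, swapped in because the abstract does not say which criterion the authors use), and the names `similarity`, `fit_threshold`, `active_learn`, `records`, and `entity`.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for richer per-field features."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fit_threshold(labeled_scores):
    """Place the decision threshold midway between the two labeled classes.

    `labeled_scores` is a list of (score, is_duplicate) tuples.
    """
    dup = [s for s, y in labeled_scores if y]
    non = [s for s, y in labeled_scores if not y]
    if not dup or not non:
        return 0.5  # fallback until both classes have been seen
    return (min(dup) + max(non)) / 2


def active_learn(records, oracle, rounds=5):
    """Uncertainty sampling: repeatedly ask the oracle about the pair
    whose score lies closest to the current decision threshold."""
    pairs = list(combinations(range(len(records)), 2))
    scores = {p: similarity(records[p[0]], records[p[1]]) for p in pairs}

    # Seed the training set with the least and the most similar pair.
    ranked = sorted(pairs, key=scores.get)
    labeled = {ranked[0]: oracle(ranked[0]), ranked[-1]: oracle(ranked[-1])}

    for _ in range(rounds):
        unlabeled = [p for p in pairs if p not in labeled]
        if not unlabeled:
            break
        threshold = fit_threshold([(scores[p], y) for p, y in labeled.items()])
        # The most "challenging" pair is the one the classifier is least sure about.
        query = min(unlabeled, key=lambda p: abs(scores[p] - threshold))
        labeled[query] = oracle(query)

    threshold = fit_threshold([(scores[p], y) for p, y in labeled.items()])
    return threshold, labeled


# Made-up records; `entity` is hypothetical ground truth used only to
# simulate the human labeler.
records = [
    "John Smith, 12 Main Street",
    "J. Smith, 12 Main St.",
    "Mary Jones, 4 Oak Avenue",
    "Mary Jones, 4 Oak Ave",
]
entity = [0, 0, 1, 1]


def oracle(pair):
    return entity[pair[0]] == entity[pair[1]]


threshold, labeled = active_learn(records, oracle, rounds=3)
```

Here the labeler is asked about five of the six candidate pairs at most; on larger lists the point of this kind of loop is that only pairs near the decision boundary are sent for labeling, rather than pairs found by manual search.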
Added: 30 Nov 2009
Updated: 30 Nov 2009
Type: Conference
Year: 2002
Where: KDD
Authors: Sunita Sarawagi, Anuradha Bhamidipaty