Sciweavers

DEXA
2004
Springer

PC-Filter: A Robust Filtering Technique for Duplicate Record Detection in Large Databases

In this paper, we propose PC-Filter (PC stands for Partition Comparison), a robust data filter for approximate duplicate record detection in large databases. PC-Filter distinguishes itself from existing methods by using the notion of partitions in duplicate detection. It first sorts the whole database and splits the sorted database into a number of record partitions. The Partition Comparison Graph (PCG) is then constructed by performing fast partition pruning. Finally, duplicate records are detected by internal and external partition comparison based on the PCG. Four properties, used as heuristics and based on the triangle inequality of record similarity, have been devised to make the filter remarkably efficient. PC-Filter is insensitive to the key used to sort the database, and can achieve a very good recall level, comparable to that of the pair-wise record comparison method, but with a complexity of only O(N^(4/3)). Equipping existing detection methods ...
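
The pipeline outlined in the abstract (sort, partition, build a graph of partition pairs that survive pruning, then compare within and across partitions) can be illustrated with a small sketch. This is not the paper's algorithm: the abstract does not spell out the four pruning properties, so the sketch substitutes a simple anchor-and-radius bound derived from the triangle inequality. The function names, the use of edit distance as the record distance, and the `partition_size` and `threshold` parameters are all assumptions made for illustration.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance. It is a true metric, so the triangle inequality holds."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute (free if equal)
        prev = curr
    return prev[-1]


def find_duplicates(records, partition_size=3, threshold=2):
    """Return (duplicate index pairs, surviving partition pairs).

    Hypothetical simplification: each partition is summarised by an anchor
    record and a radius, and a partition pair is pruned when the triangle
    inequality proves that no cross-partition pair can be within `threshold`.
    """
    # Sort record indices by record value and split into fixed-size partitions.
    order = sorted(range(len(records)), key=lambda i: records[i])
    parts = [order[i:i + partition_size]
             for i in range(0, len(order), partition_size)]

    # Anchor = first record of the partition; radius = max distance to the anchor.
    anchors = [p[0] for p in parts]
    radii = [max(edit_distance(records[a], records[i]) for i in p)
             for a, p in zip(anchors, parts)]

    # Partition comparison graph: keep an edge only if pruning cannot rule it out.
    # For x in P and y in Q: d(x, y) >= d(anchor_P, anchor_Q) - radius_P - radius_Q.
    pcg = []
    for p, q in combinations(range(len(parts)), 2):
        lower_bound = (edit_distance(records[anchors[p]], records[anchors[q]])
                       - radii[p] - radii[q])
        if lower_bound <= threshold:
            pcg.append((p, q))

    dups = set()
    # Internal comparison: all pairs inside each partition.
    for part in parts:
        for i, j in combinations(part, 2):
            if edit_distance(records[i], records[j]) <= threshold:
                dups.add((min(i, j), max(i, j)))
    # External comparison: only partition pairs that are edges of the graph.
    for p, q in pcg:
        for i in parts[p]:
            for j in parts[q]:
                if edit_distance(records[i], records[j]) <= threshold:
                    dups.add((min(i, j), max(i, j)))
    return dups, pcg
```

Because the bound only discards partition pairs that provably contain no pair within `threshold`, this sketch returns the same duplicate pairs as exhaustive pair-wise comparison, whatever the sort key. How much work it saves depends on the data and on the partition size, and the sketch makes no claim to reach the complexity reported in the abstract.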
Added 01 Jul 2010
Updated 01 Jul 2010
Type Conference
Year 2004
Where DEXA
Authors Ji Zhang, Tok Wang Ling, Robert M. Bruckner, Han Liu