Eliminating Fuzzy Duplicates in Data Warehouses

Duplicate elimination, the problem of detecting multiple tuples that describe the same real-world entity, is an important data cleaning problem. Previous domain-independent solutions to this problem relied on standard textual similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such approaches result in large numbers of false positives if we want to identify domain-specific abbreviations and conventions. In this paper, we develop an algorithm for eliminating duplicates in dimensional tables in a data warehouse, which are usually associated with hierarchies. We exploit these hierarchies to develop a high-quality, scalable duplicate elimination algorithm, and evaluate it on real datasets from an operational data warehouse.
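To make the limitation of purely textual similarity concrete, the following is a minimal sketch of an edit-distance comparison between multi-attribute tuples. It is not the algorithm from the paper: the function names, the averaging scheme, and the sample tuples are all illustrative assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return edit_distance(a.lower(), b.lower()) / longest if longest else 0.0


def tuple_distance(t1, t2) -> float:
    """Average per-attribute distance (an assumed, simplistic combination rule)."""
    return sum(normalized_distance(x, y) for x, y in zip(t1, t2)) / len(t1)


# Hypothetical tuples: the first pair describes one entity using abbreviations,
# the second pair describes two different entities that differ by one letter.
same_a = ("Seattle", "WA", "USA")
same_b = ("Seattle", "Washington", "United States")
diff_a = ("Springfield", "IL", "USA")
diff_b = ("Springfield", "IN", "USA")

print(tuple_distance(same_a, same_b))  # abbreviated duplicates look far apart
print(tuple_distance(diff_a, diff_b))  # distinct entities look close together
```

In this toy data the true duplicates score as farther apart than the distinct pair, so any threshold loose enough to catch the abbreviations would also merge the distinct tuples. That is the kind of false positive the abstract attributes to standard textual similarity functions.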

Rohit Ananthakrishna, Surajit Chaudhuri, Venkatesh Ganti

Data Warehouse | Database | Duplicate Elimination | Duplicate Elimination Problem | VLDB 2002 |

» Robust Identification of Fuzzy Duplicates

Post Info

Added	23 Dec 2010
Updated	23 Dec 2010
Type	Journal
Year	2002
Where	VLDB
Authors	Rohit Ananthakrishna, Surajit Chaudhuri, Venkatesh Ganti

Sciweavers
