CLEANDB
2006
ACM

Column Heterogeneity as a Measure of Data Quality

Data quality is a serious concern in every data management application, and a variety of quality measures have been proposed, including accuracy, freshness and completeness, to capture the common sources of data quality degradation. We identify and focus attention on a novel measure, column heterogeneity, that seeks to quantify the data quality problems that can arise when merging data from different sources. We identify desiderata that a column heterogeneity measure should intuitively satisfy, and discuss a promising direction of research to quantify database column heterogeneity based on using a novel combination of cluster entropy and soft clustering. Finally, we present a few preliminary experimental results, using diverse data sets of semantically different types, to demonstrate that this approach appears to provide a robust mechanism for identifying and quantifying database column heterogeneity.
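
The abstract names cluster entropy and soft clustering as the ingredients of the measure but does not give a formula. As a rough illustration only, and not the authors' actual method, the sketch below assumes that each value in a column already has soft membership probabilities over a fixed set of clusters, weights all values uniformly, and reports the entropy of the resulting cluster marginal. The function name `cluster_entropy` and the toy membership tables are hypothetical.

```python
import math


def cluster_entropy(memberships):
    """Entropy (in bits) of the cluster marginal under soft assignments.

    memberships: one row per column value; each row holds the assumed
    probabilities p(cluster | value) and should sum to 1.
    Values are weighted uniformly (an assumption of this sketch).
    """
    if not memberships:
        raise ValueError("need at least one value")
    n = len(memberships)
    k = len(memberships[0])
    # Marginal probability of each cluster, averaged over all values.
    marginal = [sum(row[c] for row in memberships) / n for c in range(k)]
    # Shannon entropy; clusters with zero mass contribute nothing.
    return sum(-p * math.log2(p) for p in marginal if p > 0)


# Toy inputs (hypothetical): every value in one cluster vs. an even split.
homogeneous = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
mixed = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

print(cluster_entropy(homogeneous))
print(cluster_entropy(mixed))
```

Under these assumptions a column whose values all fall into one cluster scores zero, while a column split evenly between two clusters scores one bit, which matches the intuition that merged data of different types should look more heterogeneous.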
Added 13 Jun 2010
Updated 13 Jun 2010
Type Conference
Year 2006
Where CLEANDB
Authors Bing Tian Dai, Nick Koudas, Beng Chin Ooi, Divesh Srivastava, Suresh Venkatasubramanian