Sampling dirty data for matching attributes

We investigate the problem of creating and analyzing samples of relational databases to find relationships between string-valued attributes. Our focus is on identifying attribute pairs whose value sets overlap, a precondition for typical joins over such attributes. However, real-world data sets are often ‘dirty’, especially when integrating data from different sources. To deal with this issue, we propose new similarity measures between sets of strings, which consider not only set-based similarity but also similarity between individual string instances. To make the measures effective, we develop efficient algorithms for distributed sample creation and similarity computation. Test results show that, for dirty data, our measures capture value overlap more accurately than existing sample-based methods, but we also observe a clear tradeoff between accuracy and speed. This motivates a two-stage filtering approach, with both measures operating on the same samples. Catego...
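The contrast between exact set overlap and overlap that tolerates string-level differences can be illustrated with a small sketch. This is not the paper's measure: the function names, the 0.8 threshold, and the use of Python's `difflib.SequenceMatcher` as the string similarity are all assumptions chosen for illustration, and sampling is omitted entirely.

```python
from difflib import SequenceMatcher


def string_sim(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] (illustrative choice, not the paper's)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact_overlap(xs, ys) -> float:
    """Jaccard similarity of the two value sets; only identical strings count."""
    sx, sy = set(xs), set(ys)
    if not sx and not sy:
        return 0.0
    return len(sx & sy) / len(sx | sy)


def fuzzy_containment(xs, ys, threshold: float = 0.8) -> float:
    """Fraction of distinct values in xs that have at least one value in ys
    whose string similarity reaches the (hypothetical) threshold."""
    sx, sy = set(xs), set(ys)
    if not sx:
        return 0.0
    matched = sum(
        1 for x in sx if any(string_sim(x, y) >= threshold for y in sy)
    )
    return matched / len(sx)
```

With values such as "Sydney" in one column and "Sidney" in the other, `exact_overlap` sees no common value, while `fuzzy_containment` counts the pair as a match. The fuzzy variant compares every pair of distinct values, so it is far slower than the exact one, which mirrors the accuracy-versus-speed tradeoff the abstract describes.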
Added 18 Jul 2010
Updated 18 Jul 2010
Type Conference
Year 2010
Authors Henning Köhler, Xiaofang Zhou, Shazia Wasim Sadiq, Yanfeng Shu, Kerry L. Taylor