A grammar-based entity representation framework for data cleaning

Fundamental to data cleaning is the need to account for multiple data representations. We propose a formal framework that can be used to reason about and manipulate data representations. The framework is declarative and combines elements of a generative grammar with database querying. It also incorporates actions in the spirit of programming language compilers. This framework has multiple applications such as parsing and data normalization. Data normalization is interesting in its own right in preparing data for analysis as well as in pre-processing data for further cleansing. We empirically study the utility of the framework over several real-world data cleaning scenarios and find that with the right normalization, often the need for further cleansing is minimized.
Categories and Subject Descriptors: H.2 [Database Management]: Systems
General Terms: Design, Algorithms, Experimentation
Keywords: Data Cleaning, Entity Resolution, Deduplication
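As a rough illustration only, and not the authors' framework, the idea of combining a grammar rule, a lookup against stored data, and an action that emits a canonical form might be sketched as follows. The rule shape, the `STREET_TYPES` table, and the function name `normalize_address` are all hypothetical, chosen here for the example.

```python
# Hypothetical lookup table standing in for a database relation that maps
# known surface variants of a street type to one canonical representation.
STREET_TYPES = {
    "st": "Street", "st.": "Street", "street": "Street",
    "ave": "Avenue", "ave.": "Avenue", "avenue": "Avenue",
}


def normalize_address(text):
    """Apply one toy rule: Address -> Number Name+ StreetType.

    The "action" attached to the rule builds a normalized string.
    Returns None when the input does not match the rule.
    """
    tokens = text.split()
    # The rule needs a leading number, at least one name token, and a type.
    if len(tokens) < 3 or not tokens[0].isdigit():
        return None
    # The last token must be a known variant (the lookup plays the role
    # of the database query in this sketch).
    street_type = STREET_TYPES.get(tokens[-1].lower())
    if street_type is None:
        return None
    # Action: emit the canonical representation.
    name = " ".join(t.capitalize() for t in tokens[1:-1])
    return f"{tokens[0]} {name} {street_type}"


if __name__ == "__main__":
    for raw in ["12 main st.", "12 MAIN Street", "main st"]:
        print(repr(raw), "->", normalize_address(raw))
```

In this sketch, two different representations of the same address normalize to the same string, which is the kind of effect that can reduce the need for later deduplication.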
Arvind Arasu, Raghav Kaushik
Added 05 Dec 2009
Updated 05 Dec 2009
Type Conference
Year 2009
Authors Arvind Arasu, Raghav Kaushik