Sciweavers

ICDE
2003
IEEE

Text Joins for Data Cleansing and Integration in an RDBMS

An organization's data records are often noisy because of transcription errors, incomplete information, lack of standard formats for textual data, or combinations thereof. A fundamental task in a data cleaning system is matching textual attributes that refer to the same entity (e.g., organization name or address). This matching can be effectively performed via the cosine similarity metric from the information retrieval field. For robustness and scalability, these "text joins" are best done inside an RDBMS, which is where the data is likely to reside. Unfortunately, computing an exact answer to a text join can be expensive. In this paper, we propose an approximate, sampling-based text join execution strategy that can be robustly executed in a standard, unmodified RDBMS.
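To make the idea of an exact text join concrete, here is a minimal Python sketch that matches strings from two lists by cosine similarity. It is an illustration only, not the paper's method: the abstract does not specify the tokenization or weighting, so the whitespace tokens, the TF-IDF formula, and the names `tokenize`, `tfidf_vectors`, `cosine`, and `text_join` are all assumptions. It runs in memory rather than inside an RDBMS and does not implement the sampling-based approximation.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokens; the abstract does not say
    # which token unit the paper uses.
    return text.lower().split()


def tfidf_vectors(strings):
    """Return one unit-length TF-IDF vector (a dict) per input string."""
    docs = [Counter(tokenize(s)) for s in strings]
    n = len(docs)
    df = Counter()  # document frequency of each token
    for doc in docs:
        df.update(doc.keys())
    vectors = []
    for doc in docs:
        # Assumed weighting: tf * log(1 + n / df), which is always positive.
        weights = {t: tf * math.log(1 + n / df[t]) for t, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


def cosine(u, v):
    # For unit-length vectors, cosine similarity is just the dot product.
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def text_join(left, right, threshold):
    """Exact join: every (left, right) pair with similarity >= threshold."""
    vecs = tfidf_vectors(left + right)
    left_vecs, right_vecs = vecs[:len(left)], vecs[len(left):]
    matches = []
    for l, lv in zip(left, left_vecs):
        for r, rv in zip(right, right_vecs):
            score = cosine(lv, rv)
            if score >= threshold:
                matches.append((l, r, score))
    return matches


if __name__ == "__main__":
    for row in text_join(["AT&T Corporation"],
                         ["AT&T Corp", "Intl Business Machines"], 0.2):
        print(row)
```

The nested loop compares every pair of strings, so the cost grows with the product of the two input sizes. That quadratic cost is the kind of expense an approximate strategy is meant to reduce.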
Luis Gravano, Panagiotis G. Ipeirotis, Nick Koudas, Divesh Srivastava
Added 01 Nov 2009
Updated 01 Nov 2009
Type Conference
Year 2003
Where ICDE
Authors Luis Gravano, Panagiotis G. Ipeirotis, Nick Koudas, Divesh Srivastava