Text joins in an RDBMS for web data integration

The integration of data produced and collected across autonomous, heterogeneous web services is an increasingly important and challenging problem. Due to the lack of global identifiers, the same entity (e.g., a product) might have different textual representations across databases. Textual data is also often noisy because of transcription errors, incomplete information, and lack of standard formats. A fundamental task during data integration is matching of strings that refer to the same entity. In this paper, we adopt the widely used and established cosine similarity metric from the information retrieval field in order to identify potential string matches across web sources. We then use this similarity metric to characterize this key aspect of data integration as a join between relations on textual attributes, where the similarity of matches exceeds a specified threshold. Computing an exact answer to the text join can be expensive. For query processing efficiency, we propose a samplin...
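The text join described in the abstract can be illustrated with a minimal sketch. It computes the exact (naive) join, not the sampling-based approach the paper goes on to propose, and it runs in plain Python rather than inside an RDBMS. The word-level tokenization, the particular TF-IDF weighting, the choice to compute document frequencies over both relations combined, and the names `tokenize`, `tfidf_vectors`, and `text_join` are all assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def tokenize(s):
    # Assumption: lowercase, whitespace-separated word tokens.
    return s.lower().split()


def tfidf_vectors(strings):
    """Return one unit-length TF-IDF vector (dict token -> weight) per string."""
    docs = [Counter(tokenize(s)) for s in strings]
    n = len(docs)
    # Document frequency: number of strings containing each token.
    df = Counter(tok for d in docs for tok in d)
    vectors = []
    for d in docs:
        # Assumed weighting: tf * log(1 + n / df); other variants exist.
        w = {t: tf * math.log(1 + n / df[t]) for t, tf in d.items()}
        norm = math.sqrt(sum(x * x for x in w.values()))
        vectors.append({t: x / norm for t, x in w.items()} if norm else {})
    return vectors


def text_join(r1, r2, threshold):
    """Naive exact text join: all pairs with cosine similarity >= threshold."""
    # Assumption: weights are computed over the union of both relations.
    vecs = tfidf_vectors(list(r1) + list(r2))
    v1, v2 = vecs[:len(r1)], vecs[len(r1):]
    matches = []
    for i, a in enumerate(v1):
        for j, b in enumerate(v2):
            # Vectors are unit length, so the dot product is the cosine.
            sim = sum(w * b.get(t, 0.0) for t, w in a.items())
            if sim >= threshold:
                matches.append((r1[i], r2[j], sim))
    return matches


if __name__ == "__main__":
    products_a = ["Acme Widget Pro 2000", "Blue Garden Hose"]
    products_b = ["ACME widget pro", "garden hose blue 50ft", "Red Kettle"]
    for left, right, sim in text_join(products_a, products_b, 0.5):
        print(f"{left!r} ~ {right!r} (similarity {sim:.2f})")
```

The nested loop compares every pair of strings, which is the cost the abstract refers to when it says that computing an exact answer can be expensive.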
Luis Gravano, Panagiotis G. Ipeirotis, Nick Koudas, Divesh Srivastava
Added 22 Nov 2009
Updated 22 Nov 2009
Type Conference
Year 2003
Where WWW
Authors Luis Gravano, Panagiotis G. Ipeirotis, Nick Koudas, Divesh Srivastava