Web data integration using approximate string join

13 years 6 months ago
Web data integration using approximate string join
Web data integration is an important preprocessing step for web mining. It is highly likely that several records on the web whose textual representations differ may represent the same real world entity. These records are called approximate duplicates. Data integration seeks to identify such approximate duplicates and merge them into integrated records. Many existing data integration algorithms make use of approximate string join, which seeks to (approximately) find all pairs of strings whose distances are less than a certain threshold. In this paper, we propose a new mapping method to detect pairs of strings with similarity above a certain threshold. In our method, each string is first mapped to a point in a high dimensional grid space, then pairs of points whose distances are 1 are identified. We implement it using Oracle SQL and PL/SQL. Finally, we evaluate this method using real data sets. Experimental results suggest that our method is both accurate and efficient. Categories and S...
Yingping Huang, Gregory R. Madey
Added 22 Nov 2009
Updated 22 Nov 2009
Type Conference
Year 2004
Where WWW
Authors Yingping Huang, Gregory R. Madey
Comments (0)