
EDBT
2010
ACM

HARRA: fast iterative hashed record linkage for large-scale data collections

We study the performance of the “iterative” record linkage (RL) problem, where match and merge operations may occur together across iterations until convergence. We first propose Iterative Locality-Sensitive Hashing (I-LSH), which dynamically merges LSH-based hash tables for quick and accurate blocking. Then, by exploiting inherent characteristics within and across data sets, we develop a suite of I-LSH-based RL algorithms named HARRA (HAshed RecoRd linkAge). The speed advantage of HARRA over competing RL solutions is thoroughly validated on various real data sets. For instance, while maintaining equivalent or comparable accuracy, HARRA runs (1) 4.5 and 10.5 times faster than StringMap and R-Swoosh, respectively, in iteratively linking 4,000 × 4,000 short records (one of the small test cases), and (2) 5.6 and 3.4 times faster than the basic LSH and Multi-Probe LSH algorithms, respectively, in iteratively linking 400,000 × 400,000 long records (the largest test case).
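The block-match-merge loop described in the abstract can be illustrated with a minimal sketch. This is not the HARRA or I-LSH algorithm from the paper: it assumes records are token sets, uses MinHash banding as the LSH scheme, Jaccard similarity as the match test, and token-set union as the merge, and it rebuilds the hash tables on every pass instead of merging them dynamically. All names (`iterative_link`, `minhash`, `threshold`, `bands`, `rows`) are hypothetical.

```python
import hashlib
from itertools import combinations


def _h(seed, token):
    # Deterministic 64-bit hash of (seed, token); one seed per MinHash function.
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(tokens, num_hashes):
    # MinHash signature: the minimum hash value per seed over all tokens.
    return tuple(min(_h(seed, t) for t in tokens) for seed in range(num_hashes))


def jaccard(a, b):
    return len(a & b) / len(a | b)


def iterative_link(records, threshold=0.5, bands=8, rows=2):
    """Repeat block -> match -> merge until a pass produces no merge."""
    clusters = [frozenset(r) for r in records if r]  # skip empty records
    merged_any = True
    while merged_any:
        merged_any = False

        # Blocking: records sharing any band of their signature share a bucket.
        buckets = {}
        for idx, rec in enumerate(clusters):
            sig = minhash(rec, bands * rows)
            for b in range(bands):
                key = (b, sig[b * rows:(b + 1) * rows])
                buckets.setdefault(key, []).append(idx)

        # Matching: compare only pairs inside a bucket; track groups
        # of matched records with a small union-find structure.
        parent = list(range(len(clusters)))

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        for members in buckets.values():
            for i, j in combinations(members, 2):
                ri, rj = find(i), find(j)
                if ri != rj and jaccard(clusters[i], clusters[j]) >= threshold:
                    parent[rj] = ri
                    merged_any = True

        # Merging: each matched group becomes one record (union of tokens),
        # which is re-hashed on the next pass and may match new records.
        groups = {}
        for idx, rec in enumerate(clusters):
            root = find(idx)
            groups[root] = groups.get(root, frozenset()) | rec
        clusters = list(groups.values())
    return clusters
```

Calling `iterative_link([{"a", "b"}, {"a", "b"}, {"x", "y"}])` returns two records, since the identical pair always lands in the same bucket and merges while the disjoint record can never pass the similarity test. The loop terminates because every pass that reports a merge reduces the number of records by at least one.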
Hung-sik Kim, Dongwon Lee
Added 18 May 2010
Updated 18 May 2010
Type Conference
Year 2010
Where EDBT