Sciweavers

PVLDB
2008

Ed-Join: an efficient algorithm for similarity joins with edit distance constraints

13 years 3 months ago
Ed-Join: an efficient algorithm for similarity joins with edit distance constraints
There has been considerable interest in similarity join in the research community recently. Similarity join is a fundamental operation in many application areas, such as data integration and cleaning, bioinformatics, and pattern recognition. We focus on efficient algorithms for similarity join with edit distance constraints. Existing approaches are mainly based on converting the edit distance constraint to a weaker constraint on the number of matching q-grams between pair of strings. In this paper, we propose the novel perspective of investigating mismatching q-grams. Technically, we derive two new edit distance lower bounds by analyzing the locations and contents of mismatching q-grams. A new algorithm, EdJoin, is proposed that exploits the new mismatch-based filtering methods; it achieves substantial reduction of the candidate sizes and hence saves computation time. We demonstrate experimentally that the new algorithm outperforms alternative methods on large-scale real datasets unde...
Chuan Xiao, Wei Wang 0011, Xuemin Lin
Added 28 Dec 2010
Updated 28 Dec 2010
Type Journal
Year 2008
Where PVLDB
Authors Chuan Xiao, Wei Wang 0011, Xuemin Lin
Comments (0)