Approximate substring selectivity estimation

We study the problem of estimating the selectivity of approximate substring queries. The problem is increasingly important in databases, as more and more data are entered by users and integrated with many typographical errors and differing spelling conventions. We first consider edit distance as the similarity measure between a pair of strings. Based on information stored in an extended N-gram table, we propose two estimation algorithms for the task, MOF and LBS. The latter extends the former with ideas from set hashing signatures. Experimental results show that MOF is a lightweight algorithm that gives fairly accurate estimates. If more space is available, however, LBS can give better accuracy than MOF and other baseline methods. Finally, we extend the proposed solution to other similarity predicates: the SQL LIKE operator and Jaccard similarity.
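To make the quantity being estimated concrete, the following minimal sketch computes the exact selectivity of an approximate substring query by brute force: the fraction of strings that contain some substring within a given edit distance of the query. This is not MOF or LBS, and it does not use an N-gram table. It is only the ground truth that such estimators try to approximate without scanning the data. The function names, the sample strings, and the threshold are assumptions chosen for illustration.

```python
from typing import Sequence


def min_substring_edit_distance(query: str, text: str) -> int:
    """Smallest edit distance between `query` and any substring of `text`.

    Standard dynamic programme for approximate substring matching:
    row 0 is all zeros, so a match may start at any position in `text`.
    """
    # prev[j]: best distance between the query prefix processed so far
    # and some substring of `text` ending at position j.
    prev = [0] * (len(text) + 1)
    for i, qc in enumerate(query, start=1):
        curr = [i] + [0] * len(text)
        for j, tc in enumerate(text, start=1):
            cost = 0 if qc == tc else 1
            curr[j] = min(
                prev[j - 1] + cost,  # match or substitution
                prev[j] + 1,         # query character has no counterpart
                curr[j - 1] + 1,     # extra character in the text
            )
        prev = curr
    # The match may end at any position in `text`.
    return min(prev)


def true_selectivity(query: str, strings: Sequence[str], threshold: int) -> float:
    """Fraction of `strings` with a substring within `threshold` edits of `query`."""
    if not strings:
        return 0.0
    matches = sum(
        1 for s in strings
        if min_substring_edit_distance(query, s) <= threshold
    )
    return matches / len(strings)


# Hypothetical sample data containing one misspelling of "database".
sample = ["database systems", "databse design", "data mining", "operating systems"]
selectivity = true_selectivity("database", sample, threshold=1)
```

With a threshold of 1, the misspelled "databse design" counts as a match alongside "database systems". The scan costs time proportional to the total length of the data for every query, which is why a compact summary structure is attractive for selectivity estimation.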
Hongrae Lee, Raymond T. Ng, Kyuseok Shim
Added 19 May 2010
Updated 19 May 2010
Type Conference
Year 2009
Where EDBT