SSDBM
2010
IEEE

Prefix Tree Indexing for Similarity Search and Similarity Joins on Genomic Data

Similarity search and similarity join on strings are important for applications such as duplicate detection, error detection, data cleansing, or comparison of biological sequences. DNA sequencing in particular produces large collections of erroneous strings that need to be searched, compared, and merged. However, current RDBMS offer similarity operations only in a very limited and inefficient form that does not scale to the amount of data produced in Life Science projects. We present PETER, a prefix tree based indexing algorithm supporting approximate search and approximate joins. Our tool supports Hamming and edit distance as similarity measures and is available as a C++ library, as a Unix command line tool, and as a cartridge for a commercial database. It combines an efficient implementation of compressed prefix trees with advanced pre-filtering techniques that exclude many candidate strings early. The achieved speed-ups are dramatic, especially for DNA with its small alphabet. We evaluate our...
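To make the idea of approximate search on a prefix tree concrete, the following is a minimal sketch in Python. It is not PETER's implementation: it uses a plain, uncompressed prefix tree, supports only Hamming distance, and omits the pre-filtering techniques mentioned in the abstract. The names `TrieNode`, `build_trie`, and `hamming_search` are hypothetical and chosen only for illustration. The point it shows is that a whole subtree can be skipped as soon as the mismatches accumulated along its prefix exceed the threshold `k`.

```python
class TrieNode:
    """One node of a plain (uncompressed) prefix tree."""

    def __init__(self):
        self.children = {}     # character -> TrieNode
        self.terminal = False  # True if an indexed string ends here


def build_trie(strings):
    """Insert every string into a prefix tree and return its root."""
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.terminal = True
    return root


def hamming_search(root, query, k):
    """Return (string, distance) pairs for all indexed strings that have
    the same length as `query` and Hamming distance at most `k`."""
    results = []
    # Each stack entry: (node, depth, mismatches so far, prefix spelled so far)
    stack = [(root, 0, 0, "")]
    while stack:
        node, depth, mismatches, prefix = stack.pop()
        if depth == len(query):
            if node.terminal:
                results.append((prefix, mismatches))
            continue  # longer strings cannot match under Hamming distance
        for ch, child in node.children.items():
            cost = mismatches + (ch != query[depth])
            # Prune: every string below `child` shares this prefix,
            # so none of them can match once cost exceeds k.
            if cost <= k:
                stack.append((child, depth + 1, cost, prefix + ch))
    return sorted(results)


reads = ["ACGT", "ACGA", "TCGA", "GGGG", "ACG"]
index = build_trie(reads)
print(hamming_search(index, "ACGT", 1))  # [('ACGA', 1), ('ACGT', 0)]
```

With a four-letter alphabet such as DNA, many indexed strings share long prefixes, so each pruned branch discards many candidates at once. This is one plausible reading of why the abstract reports especially large speed-ups for DNA.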
Added 02 Aug 2010
Updated 02 Aug 2010
Type Conference
Year 2010
Where SSDBM
Authors Astrid Rheinländer, Martin Knobloch, Nicky Hochmuth, Ulf Leser