Quality-Based Similarity Search for Biological Sequence Databases

Low-Complexity Regions (LCRs) of biological sequences are the main source of false positives in similarity searches for biological sequence databases. We consider the problem of finding similar sequences when the locations of the LCRs are not known precisely. We develop a formulation to measure the quality of each letter in a sequence. The quality value of a letter is the probability that the letter lies in a non-LCR. We show that the quality values can be employed in two fundamental approaches to the sequence search problem to significantly reduce the number of false positives they produce. The first finds the optimal alignment of two sequences using dynamic programming. The second computes a suboptimal alignment using a hash table. For the second, we also develop a randomized memory-resident hash table that indexes k-grams (sequences of length k) probabilistically. The k-grams that are likely to contain LCRs are indexed with lower probabilities. As a result, memory usage a...
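The probabilistic k-gram indexing described in the abstract can be illustrated with a minimal sketch. The abstract states only that k-grams likely to contain LCRs are indexed with lower probabilities; the specific rule used here (index a k-gram with probability equal to the product of its letters' quality values) is an assumption made for illustration, and the function name build_probabilistic_index and its parameters are hypothetical rather than taken from the paper.

```python
import random
from collections import defaultdict


def build_probabilistic_index(sequence, quality, k, rng=None):
    """Build an in-memory hash table mapping k-grams to start positions.

    ASSUMPTION (not from the paper): a k-gram is indexed with probability
    equal to the product of the quality values of its k letters, so k-grams
    that probably overlap an LCR are rarely stored.
    """
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    if k <= 0:
        raise ValueError("k must be positive")
    # A seeded generator keeps the sketch reproducible.
    rng = rng or random.Random(0)
    index = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        keep_probability = 1.0
        for q in quality[start:start + k]:
            keep_probability *= q
        # random() is in [0, 1): probability 1.0 always keeps, 0.0 never keeps.
        if rng.random() < keep_probability:
            index[sequence[start:start + k]].append(start)
    return dict(index)


# Usage: the letter at position 3 is certainly inside an LCR (quality 0.0),
# so every 3-gram covering it is skipped.
example = build_probabilistic_index(
    "ACGTACGT", [1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0], k=3
)
```

Under this assumed rule the table shrinks as more letters receive low quality values, which is one way to read the abstract's remark about memory usage.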
Xuehui Li, Tamer Kahveci
Added 29 Oct 2010
Updated 29 Oct 2010
Type Conference
Year 2007