Quality-Based Similarity Search for Biological Sequence Databases

15 years 8 days ago

Download www.cise.ufl.edu

Low-Complexity Regions (LCRs) of biological sequences are the main source of false positives in similarity searches of biological sequence databases. We consider the problem of finding similar sequences when the locations of the LCRs are not known precisely. We develop a formulation to measure the quality of each letter in a sequence. The quality value of a letter is the probability that the letter lies in a non-LCR. We show that the quality values can be employed in two fundamental approaches to the sequence search problem to significantly reduce the number of false positives they produce. The first finds the optimal alignment of two sequences using dynamic programming. The second computes a suboptimal alignment using a hash table. For the second, we also develop a randomized memory-resident hash table that indexes k-grams (sequences of length k) probabilistically. The k-grams that are likely to contain LCRs are indexed with lower probabilities. As a result, memory usage a...
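
The probabilistic k-gram indexing described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the indexing probability is derived from the quality values, so this sketch makes an assumption: a k-gram is kept with probability equal to the product of its letters' quality values, that is, the chance that all k letters lie outside an LCR if the letters are treated as independent. The function name `build_probabilistic_index` and its parameters are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def build_probabilistic_index(sequence, quality, k, rng=None):
    """Index k-grams of `sequence` in a hash table, keeping each one at random.

    ASSUMPTION (not from the paper): the keep probability of a k-gram is the
    product of the quality values of its k letters. quality[i] is taken to be
    the probability that sequence[i] lies in a non-LCR.
    """
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    if k <= 0:
        raise ValueError("k must be positive")

    rng = rng or random.Random()
    index = defaultdict(list)  # k-gram -> list of start positions

    for start in range(len(sequence) - k + 1):
        # Probability that every letter of this k-gram is outside an LCR,
        # under the independence assumption stated above.
        p_keep = 1.0
        for q in quality[start:start + k]:
            p_keep *= q

        # Low-quality k-grams (likely LCRs) are indexed less often, so the
        # table is smaller than a full k-gram index.
        if rng.random() < p_keep:
            index[sequence[start:start + k]].append(start)

    return dict(index)


if __name__ == "__main__":
    seq = "ACGTAAAAAAAAACGT"
    # Hypothetical quality values: low inside the poly-A run, high elsewhere.
    qual = [0.9] * 4 + [0.1] * 8 + [0.9] * 4
    print(build_probabilistic_index(seq, qual, k=3, rng=random.Random(0)))
```

With all quality values equal to 1 the sketch degenerates to an ordinary k-gram index, and with all values equal to 0 nothing is indexed. In between, k-grams that overlap low-quality letters are stored less often, which is how a randomized table of this kind would save memory relative to a full index.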

Xuehui Li, Tamer Kahveci
