BIOCOMP

2007

Low-Complexity Regions (LCRs) of biological sequences are the main source of false positives in similarity searches of biological sequence databases. We consider the problem of finding similar sequences when the locations of the LCRs are not known precisely. We develop a formulation to measure the quality of each letter in a sequence. The quality value of a letter is the probability that the letter lies in a non-LCR. We show that the quality values can be employed in two fundamental approaches to the sequence search problem to significantly reduce the number of false positives they produce. The first finds the optimal alignment of two sequences using dynamic programming. The second computes a suboptimal alignment using a hash table. For the second, we also develop a randomized memory-resident hash table that indexes k-grams (sequences of length k) probabilistically. The k-grams that are likely to contain LCRs are indexed with lower probabilities. As a result, memory usage a...
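
The probabilistic k-gram indexing described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how a k-gram's indexing probability is derived from its letters, so the product rule in `kgram_keep_probability` is an assumption, and the function names, the `rng` parameter, and the plain dictionary used as the hash table are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict


def kgram_keep_probability(qualities):
    """Probability of indexing a k-gram, given its letters' quality values.

    ASSUMPTION: the product of the per-letter quality values, i.e. the
    chance that every letter is outside an LCR if letters were independent.
    The abstract does not specify the actual rule.
    """
    p = 1.0
    for q in qualities:
        p *= q
    return p


def build_probabilistic_index(sequence, qualities, k, rng=None):
    """Map each sampled k-gram to the list of its start positions.

    Each k-gram is inserted with the probability returned by
    kgram_keep_probability, so k-grams that likely overlap an LCR
    are indexed less often and the table stays smaller.
    """
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities must have the same length")
    rng = rng or random.Random()
    index = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        p = kgram_keep_probability(qualities[start:start + k])
        # random() is in [0, 1), so p == 1.0 always keeps and p == 0.0 never does.
        if rng.random() < p:
            index[sequence[start:start + k]].append(start)
    return dict(index)
```

Under these assumptions, a k-gram whose letters all have quality 1.0 is always indexed, and a k-gram containing any letter of quality 0.0 is never indexed; intermediate values trade recall against memory.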

Added | 29 Oct 2010
Updated | 29 Oct 2010
Type | Conference
Year | 2007
Where | BIOCOMP
Authors | Xuehui Li, Tamer Kahveci
