Genome-scale disk-based suffix tree indexing

With the exponential growth of biological sequence databases, it has become critical to develop effective techniques for storing, querying, and analyzing this massive volume of data. Suffix trees are widely used to solve many sequence-based problems, and they can be built in linear time and space, provided the resulting tree fits in main memory. To index larger sequences, several external suffix tree algorithms have been proposed in recent years. However, they suffer from several problems, such as susceptibility to data skew, inability to scale to genome-scale sequences, and lack of suffix links, which are crucial in various suffix-tree-based algorithms. In this paper, we target DNA sequences and propose a novel disk-based suffix tree algorithm called Trellis, which effectively scales up to genome-scale sequences. Specifically, it can index the entire human genome using 2 GB of memory in about 4 hours, and can recover all its suffix links within 2 hours. Trellis was compared to various stat...
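For readers unfamiliar with the data structure, the following is a minimal sketch of what a suffix index over a DNA string provides. It is not the Trellis algorithm: it builds a naive, uncompressed suffix trie entirely in memory in quadratic time, whereas the abstract concerns linear-time suffix trees and disk-based construction. The names `Node`, `build_suffix_trie`, and `find_occurrences` are hypothetical and chosen only for illustration.

```python
class Node:
    def __init__(self):
        self.children = {}   # maps one character to a child Node
        self.starts = []     # start positions of all suffixes that pass through this node


def build_suffix_trie(text):
    """Build a naive suffix trie of text (quadratic time and space).

    A terminal '$' is appended so that every suffix ends at its own leaf.
    """
    text = text + "$"
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
            node.starts.append(start)
    return root


def find_occurrences(root, pattern):
    """Return the sorted start positions at which pattern occurs in the indexed text."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return []        # pattern is not a substring of the text
    return sorted(node.starts)


root = build_suffix_trie("GATTACA")
print(find_occurrences(root, "TA"))
```

Once the index is built, a lookup costs time proportional to the pattern length plus the number of matches, independent of the text length. That is the property that makes suffix structures attractive for sequence search, and the reason the tree's memory footprint becomes the bottleneck at genome scale.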
Benjarath Phoophakdee, Mohammed J. Zaki
Added 08 Dec 2009
Updated 08 Dec 2009
Type Conference
Year 2007
Authors Benjarath Phoophakdee, Mohammed J. Zaki