Sciweavers

SIGIR
1998
ACM

Fast Searching on Compressed Text Allowing Errors

Abstract We present a fast compression and decompression scheme for natural language texts that allows efficient and flexible string matching by searching the compressed text directly. The compression scheme uses a word-based Huffman encoding and the coding alphabet is byte-oriented rather than bit-oriented. We compress typical English texts to about 30% of their original size, against 40% and 35% for Compress and Gzip, respectively. Compression times are close to the times of Compress and approximately half the times of Gzip, and decompression times are lower than those of Gzip and one third of those of Compress. The searching algorithm allows a large number of variations of the exact and approximate compressed string matching problem, such as phrases, ranges, complements, wild cards and arbitrary regular expressions. Separators and stopwords can be discarded at search time without significantly increasing the cost. The algorithm is based on a word-oriented shift-or algorithm and a fast Boy...
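To make the "word-oriented shift-or" idea in the abstract more concrete, here is a minimal sketch of the classic Shift-Or bit-parallel matcher applied to words instead of characters. It is an illustration only, not the authors' implementation: it runs over plain word tokens rather than over compressed byte codes, it handles exact phrase matching only, and the names `tokenize` and `shift_or_phrase_search`, as well as the tokenization rule, are assumptions introduced for this example.

```python
import re


def tokenize(text):
    # Assumption for this sketch: a "word" is a maximal run of letters or
    # digits, lowercased. Separators are simply dropped.
    return re.findall(r"[a-z0-9]+", text.lower())


def shift_or_phrase_search(text_words, phrase_words):
    """Return the start positions (word indices) of exact phrase matches.

    Classic Shift-Or with words as the alphabet: bit i of the state is 0
    when the first i+1 phrase words match the text words ending here.
    """
    m = len(phrase_words)
    if m == 0:
        return []
    all_ones = (1 << m) - 1

    # masks[w] has bit i cleared wherever phrase_words[i] == w.
    masks = {}
    for i, w in enumerate(phrase_words):
        masks[w] = masks.get(w, all_ones) & ~(1 << i)

    state = all_ones
    hits = []
    for pos, w in enumerate(text_words):
        # Words that do not occur in the phrase use the all-ones mask.
        state = ((state << 1) | masks.get(w, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0:
            hits.append(pos - m + 1)
    return hits


words = tokenize("The quick brown fox jumps over the quick brown dog")
print(shift_or_phrase_search(words, tokenize("quick brown")))  # [1, 7]
```

Each text word costs one shift, one OR and one dictionary lookup, regardless of phrase length, which is what makes this family of algorithms attractive for phrase search.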
Added 05 Aug 2010
Updated 05 Aug 2010
Type Conference
Year 1998
Where SIGIR
Authors Edleno Silva de Moura, Gonzalo Navarro, Nivio Ziviani, Ricardo A. Baeza-Yates