Sciweavers

DCC
2008
IEEE

Word-Based Statistical Compressors as Natural Language Compression Boosters

Semistatic word-based byte-oriented compression codes are known to be attractive alternatives for compressing natural language texts. With compression ratios around 30%, they allow direct pattern searching on the compressed text up to 8 times faster than on its uncompressed version. In this paper we reveal that these compressors have even more benefits. We show that most state-of-the-art compressors, such as the block-wise bzip2, those from the Ziv-Lempel family, and the predictive ppm-based ones, can benefit from compressing not the original text but its compressed representation obtained by a word-based byte-oriented statistical compressor. In particular, our experimental results show that using Dense-Code-based compression as a preprocessing step to classical compressors like bzip2, gzip, or ppmdi yields several important benefits. For example, the ppm family is known for achieving the best compression ratios. With Dense coding as a preprocessing step, ppmdi achieves even better compre...
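The two-stage idea in the abstract (word-based dense coding first, a classical compressor second) can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: it assumes End-Tagged Dense Code as the dense code variant, uses a simplified regex tokenizer, keeps the vocabulary in memory instead of serializing it, and uses Python's standard `bz2` module as the second stage. The function names `etdc_codeword`, `dense_encode`, and `dense_decode` are hypothetical.

```python
import bz2
import re
from collections import Counter


def etdc_codeword(rank: int) -> bytes:
    """Byte codeword for a frequency rank (0 = most frequent token).

    Assumed scheme: End-Tagged Dense Code. Only the last byte of a
    codeword has its high bit set; all earlier bytes are below 128.
    """
    out = [128 + rank % 128]          # final byte, flagged with the high bit
    rank //= 128
    while rank > 0:
        rank -= 1
        out.append(rank % 128)        # continuation bytes stay below 128
        rank //= 128
    return bytes(reversed(out))


def dense_encode(text: str):
    """Replace each token with the codeword of its frequency rank."""
    # Simplified tokenizer (an assumption): runs of word characters and
    # runs of non-word characters are both treated as tokens.
    tokens = re.findall(r"\w+|\W+", text)
    vocab = [tok for tok, _ in Counter(tokens).most_common()]
    rank = {tok: i for i, tok in enumerate(vocab)}
    encoded = b"".join(etdc_codeword(rank[tok]) for tok in tokens)
    return encoded, vocab


def dense_decode(encoded: bytes, vocab) -> str:
    """Invert dense_encode using the same vocabulary order."""
    tokens, acc = [], 0
    for byte in encoded:
        if byte < 128:                # continuation byte
            acc = acc * 128 + byte + 1
        else:                         # flagged byte closes the codeword
            tokens.append(vocab[acc * 128 + (byte - 128)])
            acc = 0
    return "".join(tokens)


text = "the cat sat on the mat and the dog sat on the log. " * 50

# Stage 1: word-based byte-oriented code. Stage 2: classical compressor.
dense, vocab = dense_encode(text)
boosted = bz2.compress(dense)

# The pipeline is lossless as long as the vocabulary is kept alongside.
restored = dense_decode(bz2.decompress(boosted), vocab)
```

A complete compressor would also have to store the vocabulary with the output, and sizes measured on a toy input like this say nothing about the ratios reported in the paper.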
Added 25 Dec 2009
Updated 25 Dec 2009
Type Conference
Year 2008
Where DCC
Authors Antonio Fariña, Gonzalo Navarro, José R. Paramá