Word-Based Statistical Compressors as Natural Language Compression Boosters

Semistatic word-based byte-oriented compression codes are known to be attractive alternatives for compressing natural language texts. With compression ratios around 30%, they allow direct pattern searching on the compressed text up to 8 times faster than on its uncompressed version. In this paper we reveal that these compressors have even more benefits. We show that most state-of-the-art compressors, such as the block-wise bzip2, those from the Ziv-Lempel family, and the predictive ppm-based ones, can benefit from compressing not the original text but its compressed representation obtained by a word-based byte-oriented statistical compressor. In particular, our experimental results show that using Dense-Code-based compression as a preprocessing step for classical compressors like bzip2, gzip, or ppmdi yields several important benefits. For example, the ppm family is known for achieving the best compression ratios. With Dense coding as a preprocessing step, ppmdi achieves even better compre...
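The preprocessing idea in the abstract can be illustrated with a minimal sketch. It assumes End-Tagged Dense Code (one member of the Dense Code family) as the word-based byte-oriented coder and Python's standard bz2 module as the second-stage compressor; the abstract does not say which Dense Code variant or which settings the authors used. The tokenizer and the function names etdc_codeword, dense_encode, dense_decode, and boosted_compress are hypothetical and chosen only for illustration.

```python
import bz2
import re
from collections import Counter


def etdc_codeword(rank: int) -> bytes:
    """End-Tagged Dense Code for a 0-based frequency rank.

    The last byte has its high bit set (128..255); earlier bytes are 0..127.
    """
    out = [128 + rank % 128]
    rank = rank // 128 - 1
    while rank >= 0:
        out.append(rank % 128)
        rank = rank // 128 - 1
    return bytes(reversed(out))


def dense_encode(text: str):
    """Semistatic two-pass coding: count tokens first, then encode them."""
    # Assumption: a simple lossless tokenizer (runs of word characters and
    # runs of everything else), not the word model used in the paper.
    tokens = re.findall(r"\w+|\W+", text)
    # First pass: rank tokens by decreasing frequency.
    vocab = [tok for tok, _ in Counter(tokens).most_common()]
    rank_of = {tok: i for i, tok in enumerate(vocab)}
    # Second pass: replace each token by the codeword of its rank.
    payload = b"".join(etdc_codeword(rank_of[tok]) for tok in tokens)
    return vocab, payload


def dense_decode(vocab, payload: bytes) -> str:
    """Invert dense_encode given the vocabulary and the byte stream."""
    tokens = []
    acc = 0
    for b in payload:
        if b < 128:
            # Continuation byte: keep accumulating.
            acc = acc * 128 + b + 1
        else:
            # High bit set: this byte ends the codeword.
            tokens.append(vocab[acc * 128 + (b - 128)])
            acc = 0
    return "".join(tokens)


def boosted_compress(text: str):
    """Feed the dense-coded bytes, not the raw text, to bzip2."""
    vocab, payload = dense_encode(text)
    # A real system would also have to store the vocabulary; here it is
    # simply returned alongside the compressed payload.
    return vocab, bz2.compress(payload)
```

A round trip would look like `vocab, packed = boosted_compress(text)` followed by `dense_decode(vocab, bz2.decompress(packed))`. On very short inputs the second stage will not show any gain, so this sketch illustrates only the pipeline, not the compression ratios reported in the paper.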

Antonio Fariña, Gonzalo Navarro, José R. Paramá

Better Compression Ratios | Computer Graphics | DCC 2008 | Semistatic Word-based Compression | Word-based Byte-oriented Compression |

Post Info

Added	25 Dec 2009
Updated	25 Dec 2009
Type	Conference
Year	2008
Where	DCC
Authors	Antonio Fariña, Gonzalo Navarro, José R. Paramá
