Sciweavers

Improving semistatic compression via phrase-based modeling

In recent years, new semistatic word-based byte-oriented text compressors, such as Tagged Huffman and those based on Dense Codes, have shown that it is possible to search compressed text directly and quickly, and to decompress arbitrary text passages, over collections reduced to around 30-35% of their original size. Much of their success is due to the use of words as source symbols and a byte-oriented target alphabet. This approach broke with traditional statistical compressors, which use characters as source symbols and a bit-oriented target alphabet. In this work we go one step further by using phrases as source symbols. We present two new semistatic modelers that we combine with a dense coding scheme to obtain two new compressors: Pair-Based End-Tagged Dense Code (PETDC), where source symbols can be either words or pairs of words, and Phrase-Based End-Tagged Dense Code (PhETDC), which considers words and sequences of words (phrases). PETDC compresses English texts to 28-29% a...
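To make the word-based baseline concrete, here is a minimal sketch of a semistatic word model combined with End-Tagged Dense Code, the coding scheme the abstract builds on. It is not the authors' PETDC or PhETDC modeler: it covers only plain words, and the function names (`etdc_encode`, `build_vocabulary`, `compress`, `decompress`) are hypothetical. It also assumes whitespace tokenization and does not preserve the original spacing, which a real compressor would have to handle.

```python
from collections import Counter


def etdc_encode(rank):
    """Codeword for a 0-based frequency rank.

    The last byte has its high bit set (128-255); every earlier
    byte stays in 0-127, so codeword ends are visible in the stream.
    """
    out = [128 + rank % 128]          # final (tagged) byte
    rank = rank // 128 - 1
    while rank >= 0:                  # remaining bytes, least significant first
        out.append(rank % 128)
        rank = rank // 128 - 1
    return bytes(reversed(out))


def build_vocabulary(words):
    """First pass: sort distinct words by decreasing frequency.

    Ties are broken alphabetically (an assumption, for determinism).
    """
    counts = Counter(words)
    return sorted(counts, key=lambda w: (-counts[w], w))


def compress(text):
    """Second pass: replace each word by the codeword of its rank."""
    words = text.split()
    vocab = build_vocabulary(words)
    rank_of = {w: i for i, w in enumerate(vocab)}
    data = b"".join(etdc_encode(rank_of[w]) for w in words)
    return vocab, data


def decompress(vocab, data):
    """Rebuild the word sequence; a byte >= 128 closes a codeword."""
    words, acc = [], 0
    for b in data:
        if b >= 128:
            words.append(vocab[acc * 128 + (b - 128)])
            acc = 0
        else:
            acc = acc * 128 + b + 1
    return " ".join(words)


if __name__ == "__main__":
    vocab, data = compress("to be or not to be")
    print(vocab, list(data))
    print(decompress(vocab, data))
```

Because the tag bit marks the end of every codeword, a byte-wise pattern matcher can check that a match is aligned to codeword boundaries, which is what allows searching the compressed text directly. A pair- or phrase-based modeler would add multi-word entries to the vocabulary before ranks are assigned; how the paper selects those entries is not described in the truncated abstract above.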
Added 30 Aug 2011
Updated 30 Aug 2011
Type Journal
Year 2011
Where IPM
Authors Nieves R. Brisaboa, Antonio Fariña, Gonzalo Navarro, José R. Paramá