Sciweavers

IPM
2007

Using structural contexts to compress semistructured text collections

We describe a compression model for semistructured documents, called the Structural Contexts Model (SCM), which takes advantage of the context information usually implicit in the structure of the text. The idea is to use a separate model to compress the text that lies inside each different structure type (e.g., each distinct XML tag). The intuition behind SCM is that the texts belonging to a given structure type should have similar distributions, and that these should differ from the distributions of other structure types. We mainly focus on semistatic models, and test our idea using a word-based Huffman method. This is the standard for compressing large natural language text databases, because random access, partial decompression, and direct search of the compressed collection are possible. This variant, dubbed SCMHuff, retains those features and improves Huffman’s compression ratios. We consider the possibility that storing separate models may not pay off if the distribution of different structure t...
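The core idea in the abstract, one word-based Huffman model per structure type instead of a single model for the whole collection, can be illustrated with a minimal sketch. This is not the authors' SCMHuff implementation: the function names (`huffman_code_lengths`, `words_by_tag`, `encoded_bits`, `compare`) and the toy document are assumptions made for illustration. The sketch counts only the bits of the encoded words, ignores the cost of storing each model's vocabulary (the trade-off the abstract raises), and ignores text that follows a closing tag.

```python
import heapq
from collections import Counter, defaultdict
from itertools import count
import xml.etree.ElementTree as ET


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    if not freqs:
        return {}
    if len(freqs) == 1:
        # A lone symbol still needs one bit per occurrence.
        return {sym: 1 for sym in freqs}
    tie = count()  # unique tie-breaker so the heap never compares dicts
    heap = [(f, next(tie), {sym: 0}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in d1.items()}
        merged.update({s: depth + 1 for s, depth in d2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


def words_by_tag(xml_text):
    """Group the words directly inside each element by that element's tag."""
    groups = defaultdict(list)
    for elem in ET.fromstring(xml_text).iter():
        if elem.text and elem.text.strip():
            groups[elem.tag].extend(elem.text.split())
    return groups


def encoded_bits(words):
    """Total bits needed to encode words with a Huffman code built on them."""
    freqs = Counter(words)
    lengths = huffman_code_lengths(freqs)
    return sum(freqs[w] * lengths[w] for w in freqs)


def compare(xml_text):
    """Return (bits with one global model, bits with one model per tag)."""
    groups = words_by_tag(xml_text)
    all_words = [w for ws in groups.values() for w in ws]
    single = encoded_bits(all_words)
    separate = sum(encoded_bits(ws) for ws in groups.values())
    return single, separate


# Hypothetical toy document: each tag has its own small vocabulary.
doc = "<mail><from>ann ann bob</from><body>hello hello hello world</body></mail>"
single_bits, separate_bits = compare(doc)
print(f"single model: {single_bits} bits, per-tag models: {separate_bits} bits")
```

On this toy input the per-tag models need fewer bits because each tag's vocabulary is small and skewed. On real collections the saving has to be weighed against the space taken by the extra models, which this sketch does not measure.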
Added 15 Dec 2010
Updated 15 Dec 2010
Type Journal
Year 2007
Where IPM
Authors Joaquín Adiego, Gonzalo Navarro, Pablo de la Fuente