Sciweavers

CIKM
2011
Springer

Factorization-based lossless compression of inverted indices

12 years 4 months ago
Factorization-based lossless compression of inverted indices
Many large-scale Web applications that require ranked top-k retrieval are implemented using inverted indices. An inverted index represents a sparse term-document matrix, where non-zero elements indicate the strength of term-document associations. In this work, we present an approach for lossless compression of inverted indices. Our approach maps terms in a document corpus to a new term space in order to reduce the number of non-zero elements in the term-document matrix, resulting in a more compact inverted index. We formulate the problem of selecting a new term space as a matrix factorization problem, and prove that finding the optimal solution is an NP-hard problem. We develop a greedy algorithm for finding an approximate solution. A side effect of our approach is increasing the number of terms in the index, which may negatively affect query evaluation performance. To eliminate such effect, we develop a methodology for modifying query evaluation algorithms by exploiting specific p...
George Beskales, Marcus Fontoura, Maxim Gurevich,
Added 13 Dec 2011
Updated 13 Dec 2011
Type Journal
Year 2011
Where CIKM
Authors George Beskales, Marcus Fontoura, Maxim Gurevich, Sergei Vassilvitskii, Vanja Josifovski
Comments (0)