FAST
2009

Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality

We present sparse indexing, a technique that uses sampling and exploits the inherent locality within backup streams to solve the chunk-lookup disk bottleneck problem that inline, chunk-based deduplication schemes face in large-scale backup (e.g., hundreds of terabytes). The problem is that these schemes traditionally require a full chunk index, which indexes every chunk, in order to determine which chunks have already been stored; unfortunately, at scale it is impractical to keep such an index in RAM, and a disk-based index with one seek per incoming chunk is far too slow. We perform stream deduplication by breaking up an incoming stream into relatively large segments and deduplicating each segment against only a few of the most similar previous segments. To identify similar segments, we use sampling and a sparse index. We choose a small portion of the chunks in the stream as samples; our sparse index maps these samples to the existing segments in which they occur. Thus, we avoid the ...
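The flow described in the abstract (sample a few chunks per segment, look the samples up in a sparse index, deduplicate only against the best-matching earlier segments) can be illustrated with a minimal in-memory sketch. This is not the authors' implementation. The class name `SparseIndexStore`, the parameters `sample_mod` and `max_champions`, the use of SHA-1, the hash-modulo sampling rule, and the ranking of earlier segments by number of shared samples are all assumptions made for illustration; the abstract does not specify them.

```python
import hashlib
from collections import Counter, defaultdict


def chunk_hash(chunk: bytes) -> bytes:
    """Content hash identifying a chunk (SHA-1 is an assumption, chosen for illustration)."""
    return hashlib.sha1(chunk).digest()


class SparseIndexStore:
    """Hypothetical in-memory sketch of segment-level deduplication with a sparse index."""

    def __init__(self, sample_mod: int = 4, max_champions: int = 2):
        self.sample_mod = sample_mod        # assumed rule: roughly 1 in sample_mod chunks is a sample
        self.max_champions = max_champions  # assumed cap on similar segments consulted per segment
        self.sparse_index = defaultdict(list)  # sample hash -> ids of segments containing it
        self.manifests = []                 # segment id -> set of chunk hashes (stands in for disk)

    def _is_sample(self, h: bytes) -> bool:
        # Assumed sampling rule: pick chunks by hash value, so the same chunk
        # is always (or never) a sample regardless of where it appears.
        return int.from_bytes(h[:4], "big") % self.sample_mod == 0

    def add_segment(self, chunks) -> int:
        """Deduplicate one segment; return how many of its chunks were treated as new."""
        hashes = [chunk_hash(c) for c in chunks]
        samples = {h for h in hashes if self._is_sample(h)}

        # Rank earlier segments by how many samples they share with this one.
        votes = Counter()
        for s in samples:
            for seg_id in self.sparse_index.get(s, ()):
                votes[seg_id] += 1
        champions = [seg_id for seg_id, _ in votes.most_common(self.max_champions)]

        # Deduplicate only against the chosen segments, not a full chunk index.
        known = set()
        for seg_id in champions:
            known |= self.manifests[seg_id]

        new_chunks = 0
        for h in hashes:
            if h not in known:
                known.add(h)  # also catches duplicates within the segment itself
                new_chunks += 1

        # Record the segment and register its samples in the sparse index.
        seg_id = len(self.manifests)
        self.manifests.append(set(hashes))
        for s in samples:
            self.sparse_index[s].append(seg_id)
        return new_chunks
```

Only the sampled hashes live in `sparse_index`, so its size grows with the sampling rate rather than with the total number of chunks. The trade-off in this sketch is that a duplicate chunk is missed whenever none of the consulted segments contains it.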
Added 17 Feb 2011
Updated 17 Feb 2011
Type Conference
Year 2009
Where FAST
Authors Mark Lillibridge, Kave Eshghi, Deepavali Bhagwat, Vinay Deolalikar, Greg Trezise, Peter Camble