Tighter estimation using bottom k sketches

Summaries of massive data sets support approximate query processing over the original data. A basic aggregate over a set of records is the weight of subpopulations specified as a predicate over records' attributes. Bottom-k sketches are a powerful summarization format of weighted items that includes priority sampling [22] and the classic weighted sampling without replacement. They can be computed efficiently for many representations of the data, including distributed databases and data streams, and support coordinated and all-distances sketches. We derive novel unbiased estimators and confidence bounds for subpopulation weight. Our rank conditioning (RC) estimator is applicable when the total weight of the sketched set cannot be computed by the summarization algorithm without a significant use of additional resources (such as for sketches of network neighborhoods), and the tighter subset conditioning (SC) estimator is applicable when the total weight is available (sketches of ...
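To make the abstract's terms concrete, here is a minimal Python sketch of a bottom-k sketch with priority-style ranks and a rank-conditioning estimate of subpopulation weight. It is an illustration under stated assumptions, not the authors' implementation: the function names `bottom_k_sketch` and `rc_estimate` are hypothetical, ranks are assumed to be drawn as u/w with u uniform on [0, 1), and the adjusted weight max(w, 1/tau) is the standard priority sampling form, which is assumed here to be what the RC estimator gives for these ranks. The SC estimator is not shown.

```python
import heapq
import random


def bottom_k_sketch(items, k, rng):
    """Return (sketch, tau) for an iterable of (key, weight) pairs.

    Assumption: each item gets the rank u / weight with u ~ Uniform[0, 1),
    i.e. priority-sampling ranks. The sketch keeps the k smallest ranks and
    tau is the (k+1)-th smallest rank (None if the set has at most k items).
    """
    ranked = [(rng.random() / w, key, w) for key, w in items]
    # Compare on the rank only, so keys never need to be orderable.
    smallest = heapq.nsmallest(k + 1, ranked, key=lambda t: t[0])
    if len(smallest) <= k:
        return smallest, None  # the sketch holds the whole set
    return smallest[:k], smallest[k][0]


def rc_estimate(sketch, tau, predicate):
    """Estimate the total weight of the items whose key satisfies predicate.

    Given tau, an item of weight w is included with probability
    min(1, w * tau), so its adjusted weight is w / min(1, w * tau),
    which equals max(w, 1 / tau).
    """
    total = 0.0
    for _rank, key, w in sketch:
        if predicate(key):
            total += w if tau is None else max(w, 1.0 / tau)
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    data = [(i, 1.0 + (i % 7)) for i in range(1000)]  # made-up weights
    sketch, tau = bottom_k_sketch(data, k=100, rng=rng)
    is_even = lambda key: key % 2 == 0
    exact = sum(w for key, w in data if is_even(key))
    print("estimate:", rc_estimate(sketch, tau, is_even), "exact:", exact)
```

The estimate uses only the k sampled items and the threshold rank tau, not the total weight of the set, which matches the setting the abstract gives for the RC estimator.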
Edith Cohen, Haim Kaplan
Added 28 Dec 2010
Updated 28 Dec 2010
Type Journal
Year 2008
Authors Edith Cohen, Haim Kaplan