Stratified Reservoir Sampling over Heterogeneous Data Streams

Reservoir sampling is a well-known technique for random sampling over data streams. In many streaming applications, however, an input stream may be naturally heterogeneous, i.e., composed of substreams whose statistical properties may also vary considerably. For this class of applications, the conventional reservoir sampling technique does not guarantee that a statistically sufficient number of tuples from each substream is included in the reservoir, and this can degrade the statistical quality of the sample. In this paper, we deal with this heterogeneity problem by stratifying the reservoir sample among the underlying substreams. We particularly consider situations in which the stratified reservoir sample is needed to obtain reliable estimates at the level of either the entire data stream or individual substreams. The first challenge in this stratification is to achieve an optimal allocation of a fixed-size reservoir to individual substreams. The second challenge is to adap...
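The contrast the abstract draws between conventional and stratified reservoir sampling can be illustrated with a minimal Python sketch. It is not the paper's method: the function names, the `quotas` mapping, and the `key` function are all hypothetical, and the per-substream quotas are simply fixed by the caller, whereas the paper is concerned with allocating the reservoir optimally. The first function is the classic single-reservoir scheme (Algorithm R); the second keeps one independent reservoir per substream.

```python
import random
from collections import defaultdict


def reservoir_sample(stream, k, rng):
    """Conventional reservoir sampling (Algorithm R): a uniform sample of size k."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def stratified_reservoir_sample(stream, quotas, key, rng):
    """One independent reservoir per substream.

    Assumption for illustration: `quotas` maps each substream label to a
    fixed reservoir size chosen by the caller. No allocation is computed here.
    """
    reservoirs = defaultdict(list)
    seen = defaultdict(int)  # number of items seen so far in each substream
    for item in stream:
        label = key(item)
        k = quotas[label]
        n = seen[label]
        if n < k:
            reservoirs[label].append(item)
        else:
            j = rng.randint(0, n)
            if j < k:
                reservoirs[label][j] = item
        seen[label] += 1
    return dict(reservoirs)


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical heterogeneous stream: substream "A" dominates, "B" is rare.
    stream = [("A", i) for i in range(9900)] + [("B", i) for i in range(100)]
    rng.shuffle(stream)

    plain = reservoir_sample(stream, 20, rng)
    print("conventional, tuples from B:", sum(1 for t in plain if t[0] == "B"))

    strata = stratified_reservoir_sample(
        stream, {"A": 10, "B": 10}, key=lambda t: t[0], rng=rng
    )
    print("stratified, tuples from B:", len(strata["B"]))
```

With a single reservoir, the number of tuples drawn from the rare substream is left to chance and may be zero. With per-substream reservoirs, each substream fills its quota as long as it contains enough tuples.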
Mohammed Al-Kateb, Byung Suk Lee
Added 15 Feb 2011
Updated 15 Feb 2011
Type Journal
Year 2010