Sciweavers

SIGMOD
2007
ACM

A random walk approach to sampling hidden databases

A large part of the data on the World Wide Web is hidden behind form-like interfaces. These interfaces interact with a hidden backend database to provide answers to user queries. Generating a uniform random sample of this hidden database by using only the publicly available interface gives us access to the underlying data distribution. In this paper, we propose a random walk scheme over the query space provided by the interface to sample such databases. We discuss variants where the query space is visualized as a fixed and random ordering of attributes. We also propose techniques to further improve the sample quality by using a probabilistic rejection based approach. We conduct extensive experiments to illustrate the accuracy and efficiency of our techniques.

Categories and Subject Descriptors: H.3.3 Information Search and Retrieval
General Terms: Algorithms, Design, Performance, Measurement
Keywords: Hidden databases, sampling, top-k interfaces, random walk
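
The random walk with rejection described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes Boolean attributes, distinct tuples, a fixed attribute order, and a simulated top-k interface that reports "empty", "valid" (at most k matches) or "overflow" (more than k matches). All names here (`make_top_k_interface`, `walk_once`, `draw_samples`) are hypothetical. Each walk fixes one more attribute at random until the query stops overflowing, then accepts a returned tuple with probability proportional to the inverse of the probability of having selected it, which evens out the bias toward tuples that are reached by short walks.

```python
import random
from collections import Counter


def make_top_k_interface(rows, k):
    """Simulate a hidden database behind a top-k form (hypothetical stand-in)."""
    def query(predicates):
        # predicates: dict mapping attribute index -> required Boolean value
        matches = [r for r in rows
                   if all(r[i] == v for i, v in predicates.items())]
        if not matches:
            return "empty", []
        if len(matches) > k:
            return "overflow", matches[:k]
        return "valid", matches
    return query


def walk_once(query, m, k, rng):
    """One random walk over attributes 0..m-1; returns a tuple or None."""
    predicates = {}
    for depth in range(m + 1):
        status, result = query(predicates)
        if status == "empty":
            return None  # dead end: the caller starts a new walk
        if status == "valid":
            # A single walk reaches this query with probability 2**-depth and
            # then picks one of len(result) rows. Accepting with a probability
            # proportional to the inverse makes every tuple equally likely.
            accept = (2 ** depth * len(result)) / (2 ** m * k)
            if rng.random() < accept:
                return rng.choice(result)
            return None
        if depth == m:
            return None  # still overflowing with all attributes fixed
        # Overflow: narrow the query by fixing the next attribute at random.
        predicates[depth] = rng.choice((0, 1))
    return None


def draw_samples(query, m, k, n, rng, max_walks=1_000_000):
    """Repeat walks until n tuples are accepted or max_walks is exhausted."""
    samples = []
    for _ in range(max_walks):
        if len(samples) == n:
            break
        row = walk_once(query, m, k, rng)
        if row is not None:
            samples.append(row)
    return samples


if __name__ == "__main__":
    rows = [(0, 0, 0), (0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1)]
    query = make_top_k_interface(rows, k=2)
    samples = draw_samples(query, m=3, k=2, n=2000, rng=random.Random(0))
    print(Counter(samples))
```

With this acceptance rule every tuple is returned by a single walk with the same probability, 1 / (2^m · k), so the accepted tuples are uniform under the stated assumptions. The price is that most walks are rejected when m is large, which is the accuracy versus efficiency trade-off the abstract alludes to.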
Arjun Dasgupta, Gautam Das, Heikki Mannila
Added 08 Dec 2009
Updated 08 Dec 2009
Type Conference
Year 2007
Where SIGMOD
Authors Arjun Dasgupta, Gautam Das, Heikki Mannila