We consider the problem of efficiently sampling Web search engine query results. Using a small random sample instead of the full set of results leads to efficient approximate algorithms for several applications, such as:
- Determining the set of categories in a given taxonomy spanned by the search results;
- Finding the range of metadata values associated with the result set in order to enable "multi-faceted search";
- Estimating the size of the result set;
- Data mining associations to the query terms.
We present and analyze an efficient algorithm for obtaining uniform random samples, applicable to any search engine based on posting lists and document-at-a-time evaluation. (To our knowledge, all popular Web search engines, e.g., Google, Inktomi, AltaVista, AllTheWeb, belong to this class.) Furthermore, our algorithm can be modified to follow the modern object-oriented approach whereby posting lists are viewed as streams equipped with a next method, and the next method ...
Aris Anagnostopoulos, Andrei Z. Broder, David Carm
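To make the setting in the abstract concrete, the following is a minimal Python sketch of posting lists viewed as streams with a next method, evaluated document-at-a-time for a conjunctive query. It is not the algorithm of the paper: as a baseline it draws a uniform sample with standard reservoir sampling (Algorithm R), which has to enumerate every result, whereas the abstract's stated goal is to sample efficiently. The names `PostingList`, `conjunctive_results`, and `reservoir_sample`, the `next(min_id)` signature, and the use of non-negative integer document ids are all assumptions made for illustration.

```python
import random
from typing import Iterator, List, Optional, Sequence


class PostingList:
    """Sorted doc-id stream with a next(min_id) method (hypothetical interface)."""

    def __init__(self, doc_ids: Sequence[int]) -> None:
        self._ids = sorted(set(doc_ids))
        self._pos = 0

    def next(self, min_id: int) -> Optional[int]:
        """Advance to and return the first doc id >= min_id, or None if exhausted."""
        while self._pos < len(self._ids) and self._ids[self._pos] < min_id:
            self._pos += 1
        if self._pos == len(self._ids):
            return None
        return self._ids[self._pos]


def conjunctive_results(lists: List[PostingList]) -> Iterator[int]:
    """Document-at-a-time AND: yield doc ids present in every posting list."""
    if not lists:
        return
    candidate = 0  # assumes doc ids are non-negative integers
    while True:
        agreed = True
        for pl in lists:
            doc = pl.next(candidate)
            if doc is None:
                return  # one list is exhausted, so no further matches exist
            if doc != candidate:
                candidate = doc  # this list skipped ahead; restart the round
                agreed = False
                break
        if agreed:
            yield candidate
            candidate += 1


def reservoir_sample(stream: Iterator[int], k: int, rng: random.Random) -> List[int]:
    """Algorithm R: uniform sample of up to k items from a stream of unknown length."""
    sample: List[int] = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample
```

A usage example under the same assumptions: `reservoir_sample(conjunctive_results([PostingList([1, 3, 5, 7, 9]), PostingList([3, 4, 5, 9, 10])]), 2, random.Random(0))` returns two distinct document ids drawn from the three documents that appear in both lists. Because the posting lists are stateful streams, fresh `PostingList` objects are needed for each evaluation.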