Sampling search-engine results

We consider the problem of efficiently sampling Web search engine query results. In turn, using a small random sample instead of the full set of results leads to efficient approximate algorithms for several applications, such as:
• Determining the set of categories in a given taxonomy spanned by the search results;
• Finding the range of metadata values associated to the result set in order to enable "multi-faceted search";
• Estimating the size of the result set;
• Data mining associations to the query terms.
We present and analyze an efficient algorithm for obtaining uniform random samples applicable to any search engine based on posting lists and document-at-a-time evaluation. (To our knowledge, all popular Web search engines, e.g. Google, Inktomi, AltaVista, AllTheWeb, belong to this class.) Furthermore, our algorithm can be modified to follow the modern object-oriented approach whereby posting lists are viewed as streams equipped with a next method, and the next method ...
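To make the setting concrete, the sketch below shows a naive baseline and not the algorithm from the paper: it walks sorted posting lists document-at-a-time for a conjunctive query and keeps a uniform random sample of the matches with classic reservoir sampling. The function names `daat_intersection` and `reservoir_sample`, the use of in-memory lists of integer document IDs, and the AND semantics are all assumptions made for illustration. This baseline still visits every matching document, which is presumably the cost an efficient sampler would aim to avoid.

```python
import random
from typing import Iterable, Iterator, List, Sequence


def daat_intersection(posting_lists: Sequence[Sequence[int]]) -> Iterator[int]:
    """Yield doc IDs that occur in every sorted posting list (AND query).

    Hypothetical helper: posting lists are plain sorted sequences of ints.
    """
    if not posting_lists:
        return
    positions = [0] * len(posting_lists)
    while True:
        current = []
        for plist, pos in zip(posting_lists, positions):
            if pos >= len(plist):
                return  # one list is exhausted, so no further matches exist
            current.append(plist[pos])
        target = max(current)
        if all(doc == target for doc in current):
            yield target
            positions = [pos + 1 for pos in positions]
        else:
            # Advance every list that lags behind the largest current doc ID.
            for i, plist in enumerate(posting_lists):
                while positions[i] < len(plist) and plist[positions[i]] < target:
                    positions[i] += 1


def reservoir_sample(stream: Iterable[int], k: int, rng: random.Random) -> List[int]:
    """Uniform sample of up to k items from a stream of unknown length."""
    reservoir: List[int] = []
    for seen, doc in enumerate(stream, start=1):
        if len(reservoir) < k:
            reservoir.append(doc)
        else:
            j = rng.randrange(seen)  # uniform in [0, seen)
            if j < k:
                reservoir[j] = doc
    return reservoir


# Toy posting lists for three query terms (made-up document IDs).
lists = [
    [1, 3, 5, 7, 9, 11],
    [3, 4, 5, 9, 11, 12],
    [0, 3, 9, 11, 20],
]
sample = reservoir_sample(daat_intersection(lists), k=2, rng=random.Random(0))
```

Here `sample` holds two document IDs drawn uniformly from the documents that contain all three terms. Such a sample could then feed the applications listed above, for instance by tallying the categories or metadata values of the sampled documents.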
Added 22 Nov 2009
Updated 22 Nov 2009
Type Conference
Year 2005
Where WWW
Authors Aris Anagnostopoulos, Andrei Z. Broder, David Carmel