Approximating the top-m passages in a parallel question answering system

13 years 10 months ago

Download plg.uwaterloo.ca

We examine the problem of retrieving the top-m ranked items from a large collection, randomly distributed across an n-node system. In order to retrieve the top m overall, we must retrieve the top m from the subcollection stored on each node and merge the results. However, if we are willing to accept a small probability that one or more of the top-m items may be missed, it is possible to reduce computation time by retrieving only the top k < m from each node. In this paper, we demonstrate that this simple observation can be exploited in a realistic application to produce a substantial efficiency improvement without compromising the quality of the retrieved results. To support our claim, we present a statistical model that predicts the impact of the optimization. The paper is structured around a specific application (passage retrieval for question answering), but the primary results are more broadly applicable.

Categories and Subject Descriptors H.3.4 [Information Systems]: Inf...
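The trade-off described in the abstract can be illustrated with a small Monte Carlo sketch. This is not the paper's statistical model or implementation. It only assumes that items carry independent uniform random scores and are assigned to nodes uniformly at random. All function names and parameters (`top_m_exact`, `top_m_approx`, `miss_rate`, `n_items`, `trials`, `seed`) are hypothetical and chosen for illustration.

```python
import heapq
import random


def top_m_exact(scores, m):
    """Return the m highest scores in descending order."""
    return heapq.nlargest(m, scores)


def top_m_approx(scores, n_nodes, m, k, rng):
    """Scatter scores over n_nodes at random, take the top k on each node,
    merge the per-node results, and keep the best m of the merged list."""
    nodes = [[] for _ in range(n_nodes)]
    for s in scores:
        nodes[rng.randrange(n_nodes)].append(s)  # assumed: uniform placement
    candidates = []
    for node in nodes:
        candidates.extend(heapq.nlargest(k, node))
    return heapq.nlargest(m, candidates)


def miss_rate(n_items, n_nodes, m, k, trials, seed=0):
    """Estimate how often the merged result differs from the true top m."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        scores = [rng.random() for _ in range(n_items)]  # assumed: uniform scores
        if top_m_approx(scores, n_nodes, m, k, rng) != top_m_exact(scores, m):
            misses += 1
    return misses / trials


if __name__ == "__main__":
    # Compare several per-node cutoffs k for m = 20 on an 8-node system.
    for k in (20, 10, 8, 6, 4):
        print(k, miss_rate(n_items=10_000, n_nodes=8, m=20, k=k, trials=500))
```

With k = m the merged result is always exact, so the estimated miss rate is zero. Lowering k reduces the work done on each node, and the estimate shows how quickly the chance of missing a top-m item grows under these assumptions.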

Charles L. A. Clarke, Egidio L. Terra
