Brute force and indexed approaches to pairwise document similarity comparisons with MapReduce


Download www.umiacs.umd.edu

This paper explores the problem of computing pairwise similarity on document collections, focusing on the application of "more like this" queries in the life sciences domain. Three MapReduce algorithms are introduced: one based on brute force, a second where the problem is treated as large-scale ad hoc retrieval, and a third based on the Cartesian product of postings lists. Each algorithm supports one or more approximations that trade effectiveness for efficiency, the characteristics of which are studied experimentally. Results show that the brute force algorithm is the most efficient of the three when exact similarity is desired. However, the other two algorithms support approximations that yield large efficiency gains without significant loss of effectiveness.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval
General Terms: Algorithms, Performance
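To make the contrast between the brute force and postings-list approaches concrete, here is a minimal single-process sketch in Python. It is not the paper's implementation and does not use MapReduce itself: the "map" and "reduce" steps are only imitated with in-memory dictionaries. The function names, the whitespace tokenizer, and the use of a raw term-frequency dot product as the similarity score are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def term_weights(text):
    """Hypothetical weighting: raw term frequency of whitespace tokens."""
    return Counter(text.lower().split())


def brute_force(docs):
    """Score every document pair directly with a dot product."""
    vecs = {doc_id: term_weights(text) for doc_id, text in docs.items()}
    scores = {}
    for a, b in combinations(sorted(vecs), 2):
        s = sum(w * vecs[b][t] for t, w in vecs[a].items() if t in vecs[b])
        if s:  # keep only pairs that share at least one term
            scores[(a, b)] = s
    return scores


def postings_product(docs):
    """Build postings lists, then sum partial products per document pair."""
    # Imitated "map" + shuffle: group (doc_id, weight) entries by term.
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        for term, w in term_weights(text).items():
            postings[term].append((doc_id, w))
    # Pair up the entries within each postings list, then imitate the
    # "reduce" step by summing the per-term contributions for each pair.
    scores = defaultdict(int)
    for plist in postings.values():
        for (a, wa), (b, wb) in combinations(sorted(plist), 2):
            scores[(a, b)] += wa * wb
    return dict(scores)


# Toy collection (made up for this sketch).
docs = {
    "d1": "protein binding site",
    "d2": "protein folding",
    "d3": "binding site analysis site",
}
```

On this toy collection both functions produce the same scores. The postings-based version only ever touches pairs that share a term, which is where approximations such as dropping very long postings lists could plausibly be applied; whether that matches the approximations studied in the paper is not stated in the abstract.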

Jimmy J. Lin


Ad Hoc Retrieval | Algorithms | Brute Force | Information Retrieval | SIGIR 2009


Added	28 May 2010
Updated	28 May 2010
Type	Conference
Year	2009
Where	SIGIR
Authors	Jimmy J. Lin

Sciweavers

