ACL
2008


Pairwise Document Similarity in Large Collections with MapReduce



This paper presents a MapReduce algorithm for computing pairwise document similarity in large document collections. MapReduce is an attractive framework because it allows us to decompose the inner products involved in computing document similarity into separate multiplication and summation stages in a way that is well matched to efficient disk access patterns across several machines. On a collection consisting of approximately 900,000 newswire articles, our algorithm exhibits linear growth in running time and space in terms of the number of documents.
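The decomposition described in the abstract can be illustrated with a small single-machine sketch. This is not the authors' implementation: it assumes documents are given as term-weight dictionaries, it simulates the map and reduce steps with plain Python functions, and all names (`build_postings`, `map_products`, `reduce_sums`, `pairwise_similarity`) are hypothetical. The similarity here is the plain inner product of term weights.

```python
from collections import defaultdict
from itertools import combinations


def build_postings(docs):
    """Indexing step (assumed): map each term to its (doc_id, weight) postings."""
    postings = defaultdict(list)
    for doc_id, weights in docs.items():
        for term, weight in weights.items():
            postings[term].append((doc_id, weight))
    return postings


def map_products(postings):
    """Multiplication stage: emit one partial product per document pair per shared term."""
    for plist in postings.values():
        # Sorting gives each pair a canonical (smaller_id, larger_id) key.
        for (d1, w1), (d2, w2) in combinations(sorted(plist), 2):
            yield (d1, d2), w1 * w2


def reduce_sums(partial_products):
    """Summation stage: add up the partial products that share a document pair."""
    sims = defaultdict(float)
    for pair, value in partial_products:
        sims[pair] += value
    return dict(sims)


def pairwise_similarity(docs):
    """Inner-product similarity for every pair of documents sharing at least one term."""
    return reduce_sums(map_products(build_postings(docs)))


if __name__ == "__main__":
    # Toy collection with made-up term weights.
    docs = {
        "d1": {"a": 2, "b": 1},
        "d2": {"a": 1, "c": 3},
        "d3": {"b": 4, "c": 1},
    }
    for pair, score in sorted(pairwise_similarity(docs).items()):
        print(pair, score)
```

Only pairs that share a term ever produce a key, so pairs with zero similarity are never materialized. In a real MapReduce job the grouping by document pair would be done by the framework's shuffle step rather than by an in-memory dictionary.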

Tamer Elsayed, Jimmy J. Lin, Douglas W. Oard


ACL 2008 | Computational Linguistics | Document Similarity | Large Document Collections | Pairwise Document Similarity


Related Content

» Brute force and indexed approaches to pairwise document similarity comparisons with MapRed...

» No free lunch: brute force vs locality-sensitive hashing for cross-lingual pairwise similarit...

» Web-Scale Distributional Similarity and Entity Set Expansion

» A Method for Calculating Term Similarity on Large Document Collections

» Learning Pairwise Similarity for Data Clustering

» Efficient partial-duplicate detection based on sequence matching

» Learning optimally diverse rankings over large document collections

» A tool set for the quick and efficient exploration of large document collections

» Computational Models of Information Scent-Following in a Very Large Browsable Text Collecti...

Post Info

Added	29 Oct 2010
Updated	29 Oct 2010
Type	Conference
Year	2008
Where	ACL
Authors	Tamer Elsayed, Jimmy J. Lin, Douglas W. Oard
