Probabilistic near-duplicate detection using simhash

This paper offers a novel look at using a dimensionality-reduction technique called simhash [8] to detect similar document pairs in large-scale collections. We show that this algorithm produces interesting intermediate data, which is normally discarded, that can be used to predict which of the bits in the final hash are more susceptible to being flipped in similar documents. This paves the way for a probabilistic search technique in the Hamming space of simhashes that can be significantly faster and more space-efficient than the existing simhash approaches. We show that with 95% recall compared to deterministic search of prior work [16], our method exhibits 4-14 times faster lookup and requires 2-10 times less RAM on our collection of 70M web pages.

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Clustering
General Terms: Algorithms
Keywords: Hamming distance, similarity, simhash, clustering
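
To make the idea in the abstract concrete, here is a minimal sketch of a generic simhash computation that keeps the per-bit weight vector instead of discarding it. It is not the authors' implementation: the 64-bit width, the MD5-based `feature_hash`, the unit feature weights, and the helpers `likely_flips` and `probe_candidates` are all assumptions made for illustration. The heuristic shown (bits whose accumulated weight is closest to zero are treated as the most likely to flip) is one plausible reading of "intermediate data", not the paper's actual probability model.

```python
import hashlib

BITS = 64  # assumed fingerprint width; the abstract does not state one


def feature_hash(token: str) -> int:
    """Map a token to a 64-bit integer (MD5 is an arbitrary, assumed choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash_with_weights(tokens):
    """Return (fingerprint, weights).

    weights[i] is the running +1/-1 sum for bit i over all tokens. A plain
    simhash keeps only the sign of each entry; here the sums are returned too.
    """
    weights = [0] * BITS
    for tok in tokens:
        h = feature_hash(tok)
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint, weights


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def likely_flips(weights, k):
    """Hypothetical heuristic: the k bit positions with the smallest |weight|."""
    return sorted(range(BITS), key=lambda i: abs(weights[i]))[:k]


def probe_candidates(fingerprint, weights, k):
    """Yield fingerprints with one low-confidence bit flipped, weakest first."""
    for i in likely_flips(weights, k):
        yield fingerprint ^ (1 << i)
```

A lookup built on this sketch would probe an index with the original fingerprint first and then with the candidates from `probe_candidates`, stopping early once enough matches are found. That early stop is what trades a small loss in recall for fewer probes compared with enumerating every fingerprint within a fixed Hamming distance.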

Sadhan Sood, Dmitri Loguinov

CIKM 2011 | Deterministic Search | Hamming Distance | Information Technology | Probabilistic Search

Added	13 Dec 2011
Updated	13 Dec 2011
Type	Journal
Year	2011
Where	CIKM
Authors	Sadhan Sood, Dmitri Loguinov
