Sciweavers

SIGIR
2010
ACM

Adaptive near-duplicate detection via similarity learning

In this paper, we present a novel near-duplicate document detection method that can easily be tuned for a particular domain. Our method represents each document as a real-valued sparse k-gram vector, where the weights are learned to optimize for a specified similarity function, such as the cosine similarity or the Jaccard coefficient. Near-duplicate documents can be reliably detected through this improved similarity measure. In addition, these vectors can be mapped to a small number of hash values that serve as document signatures, using a locality-sensitive hashing scheme for efficient similarity computation. We demonstrate our approach in two target domains: Web news articles and email messages. Our method is not only more accurate than commonly used methods such as Shingles and I-Match, but also shows consistent improvement across the domains, a desirable property that existing methods lack.
Categories and Subject Descriptors: H.3.1 [Information Storage and Retrieval]: Content...
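The pipeline described in the abstract (weighted k-gram vectors, a similarity function, compact hash signatures) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the word-level k-grams, and the `weights` dictionary are all assumptions made for illustration. In the paper the weights are learned, whereas here they default to plain term frequency. Random-hyperplane hashing is used as one standard locality-sensitive hashing family for cosine similarity, since the abstract does not say which scheme the authors use.

```python
import math
import random
from collections import Counter


def kgram_vector(text, k=3, weights=None):
    """Map a document to a sparse vector over word k-grams.

    `weights` is a hypothetical dict from k-gram to a real-valued weight;
    k-grams without an entry default to 1.0 (plain term frequency).
    """
    tokens = text.lower().split()
    grams = Counter(tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1))
    weights = weights or {}
    return {g: tf * weights.get(g, 1.0) for g, tf in grams.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[g] for g, w in u.items() if g in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def weighted_jaccard(u, v):
    """Generalized Jaccard: sum of minima over sum of maxima.

    Assumes non-negative weights.
    """
    keys = set(u) | set(v)
    num = sum(min(u.get(g, 0.0), v.get(g, 0.0)) for g in keys)
    den = sum(max(u.get(g, 0.0), v.get(g, 0.0)) for g in keys)
    return num / den if den else 0.0


def hyperplane_signature(vec, num_bits=64, seed=0):
    """Random-hyperplane LSH signature (a list of 0/1 bits) for cosine similarity."""
    bits = []
    for i in range(num_bits):
        total = 0.0
        for g, w in vec.items():
            # Deterministic pseudo-random Gaussian coordinate per (bit, k-gram).
            rng = random.Random(f"{seed}|{i}|{g}")
            total += w * rng.gauss(0.0, 1.0)
        bits.append(1 if total >= 0 else 0)
    return bits
```

Two documents can then be compared either directly, with `cosine` or `weighted_jaccard`, or approximately, by counting the bit positions on which their signatures agree. A higher agreement rate indicates a smaller angle between the underlying vectors.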
Hannaneh Hajishirzi, Wen-tau Yih, Aleksander Kolcz
Added 16 Aug 2010
Updated 16 Aug 2010
Type Conference
Year 2010
Where SIGIR