Improved robustness of signature-based near-replica detection via lexicon randomization



Detection of near-duplicate documents is an important problem in many data mining and information filtering applications. When faced with massive quantities of data, traditional duplicate detection techniques relying on direct inter-document similarity computation (e.g., using the cosine measure) are often not feasible given the time and memory performance constraints. On the other hand, fingerprint-based methods, such as I-Match, are very attractive computationally but may be brittle with respect to small changes to document content. We focus on approaches to near-replica detection that are based upon large-collection statistics and present a general technique of increasing their robustness via multiple lexicon randomization. In experiments with large web-page and spam-email datasets the proposed method is shown to consistently outperform traditional I-Match, with the relative improvement in duplicate-document recall reaching as high as 40-60%. The large gains in detection accuracy are ...
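
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an I-Match-style signature is a SHA-1 hash over the sorted set of document terms that also occur in a lexicon, and that randomization means deriving extra lexicons by dropping a random fraction of lexicon terms. The function names, the parameters `k` and `drop_fraction`, and the "match on any shared signature" rule are all hypothetical choices made for illustration.

```python
import hashlib
import random


def imatch_signature(text, lexicon):
    """Hash the sorted set of document terms that also appear in the lexicon.

    Returns None when no lexicon term occurs in the document, so that
    empty documents never match each other by accident.
    """
    terms = sorted(set(text.lower().split()) & lexicon)
    if not terms:
        return None
    return hashlib.sha1(" ".join(terms).encode("utf-8")).hexdigest()


def randomized_lexicons(lexicon, k=3, drop_fraction=0.3, seed=0):
    """Return the original lexicon plus k random sub-lexicons (assumed scheme)."""
    rng = random.Random(seed)
    ordered = sorted(lexicon)  # fixed order keeps sampling reproducible
    keep = len(ordered) - int(len(ordered) * drop_fraction)
    extras = [frozenset(rng.sample(ordered, keep)) for _ in range(k)]
    return [frozenset(lexicon)] + extras


def signatures(text, lexicons):
    """One signature per lexicon, in lexicon order."""
    return [imatch_signature(text, lex) for lex in lexicons]


def near_duplicates(doc_a, doc_b, lexicons):
    """Treat two documents as near-replicas if any per-lexicon signature matches."""
    pairs = zip(signatures(doc_a, lexicons), signatures(doc_b, lexicons))
    return any(sa is not None and sa == sb for sa, sb in pairs)
```

With a single lexicon, inserting or removing one lexicon term changes the only signature and the match is lost. With several randomized lexicons, the changed term may be absent from at least one of them, so that lexicon's signature can still agree. This is the intuition behind the robustness gain the abstract reports, under the assumptions stated above.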

Aleksander Kolcz, Abdur Chowdhury, Joshua Alspector


Data Mining | Detection Accuracy | KDD 2004 | Multiple Lexicon Randomization | Traditional Duplicate Detection |



Added	30 Nov 2009
Updated	30 Nov 2009
Type	Conference
Year	2004
Where	KDD
Authors	Aleksander Kolcz, Abdur Chowdhury, Joshua Alspector
