Near-duplicate detection by instance-level constrained clustering

For near-duplicate document detection, neither the traditional fingerprinting techniques used in the database community nor the bag-of-words comparison approaches used in the information retrieval community are sufficiently accurate. This is because the characteristics of near-duplicate documents differ from those of both “almost-identical” documents in the data cleaning task and “relevant” documents in the search task. This paper presents an instance-level constrained clustering approach for near-duplicate detection. The framework incorporates information such as document attributes and content structure into the clustering process to form near-duplicate clusters. Experimental results on several collections of public comments sent to U.S. government agencies on proposed new regulations demonstrate that our approach outperforms other near-duplicate detection algorithms and is about as effective as human assessors. Categories and Subject Descrip...
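The abstract does not spell out how the constraints enter the clustering process. In the constrained clustering literature, instance-level constraints are usually expressed as must-link and cannot-link pairs of items, and the sketch below illustrates that general idea only; it is not the authors' algorithm. The function names, the Jaccard similarity on word sets, the greedy merge order, and the threshold value are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def constrained_clusters(docs, must_link, cannot_link, threshold=0.55):
    """Greedy similarity-based merging under pairwise constraints.

    Illustrative sketch only (hypothetical helper, not from the paper).
    docs        -- list of document strings
    must_link   -- index pairs (i, j) forced into the same cluster
    cannot_link -- index pairs (i, j) never allowed to share a cluster
    threshold   -- assumed similarity cutoff for merging
    The constraints are assumed to be mutually consistent.
    """
    tokens = [set(d.lower().split()) for d in docs]
    parent = list(range(len(docs)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        parent[find(j)] = find(i)

    def violates(i, j):
        # Would merging the clusters of i and j join a cannot-link pair?
        ri, rj = find(i), find(j)
        for a, b in cannot_link:
            ra, rb = find(a), find(b)
            if (ra == ri and rb == rj) or (ra == rj and rb == ri):
                return True
        return False

    # Must-link pairs are merged first, regardless of similarity.
    for i, j in must_link:
        union(i, j)

    # Visit remaining pairs from most to least similar.
    pairs = sorted(
        combinations(range(len(docs)), 2),
        key=lambda p: jaccard(tokens[p[0]], tokens[p[1]]),
        reverse=True,
    )
    for i, j in pairs:
        if jaccard(tokens[i], tokens[j]) < threshold:
            break
        if find(i) != find(j) and not violates(i, j):
            union(i, j)

    clusters = {}
    for i in range(len(docs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

In this sketch, a must-link pair could stand for two comments known to come from the same form letter, and a cannot-link pair for two comments whose attributes rule out a shared origin; which document attributes actually yield such constraints is a question for the paper itself.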
Hui Yang, James P. Callan
Added 14 Jun 2010
Updated 14 Jun 2010
Type Conference
Year 2006
Authors Hui Yang, James P. Callan