Sciweavers

SIGIR
2010
ACM

Efficient partial-duplicate detection based on sequence matching

With the ever-increasing growth of the Internet, numerous copies of documents have become a serious problem for search engines, opinion mining, and many other web applications. Because partial-duplicates contain only a small piece of text taken from other sources, and most existing near-duplicate detection approaches work at the document level, partial-duplicates cannot be handled well. In this paper, we propose a novel algorithm for the partial-duplicate detection task. Besides measuring the similarity between documents, the proposed algorithm simultaneously locates the duplicated parts. The main idea is to divide the task into two subtasks: sentence-level near-duplicate detection and sequence matching. For evaluation, we compare the proposed method with other approaches on both English and Chinese web collections. Experimental results appear to support that our proposed method is effective and efficient at detecting both partial-duplicates on large web collection...
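
The two-subtask idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not say how sentences are compared or how sequences are matched, so everything below is an assumption made for illustration. Sentence-level detection is approximated by hashing each sentence after lowercasing and stripping punctuation (which only catches sentences that are identical after normalization), and sequence matching is approximated by finding the longest contiguous run of matching sentences with dynamic programming. The names `split_sentences`, `sentence_key`, and `longest_shared_run` are hypothetical.

```python
import hashlib
import re


def split_sentences(text):
    """Naive splitter on '.', '!' and '?' (an assumption, not from the paper)."""
    parts = re.split(r"[.!?]+", text)
    return [p.strip() for p in parts if p.strip()]


def sentence_key(sentence):
    """Fingerprint of a sentence after lowercasing and dropping punctuation.

    Stand-in for real sentence-level near-duplicate detection: two sentences
    get the same key only if their word sequences are identical.
    """
    words = re.findall(r"\w+", sentence.lower())
    return hashlib.sha256(" ".join(words).encode("utf-8")).hexdigest()


def longest_shared_run(doc_a, doc_b):
    """Return (start_a, start_b, length) of the longest contiguous run of
    matching sentences, using 0-based sentence indices.

    Returns (0, 0, 0) when no sentence matches.
    """
    keys_a = [sentence_key(s) for s in split_sentences(doc_a)]
    keys_b = [sentence_key(s) for s in split_sentences(doc_b)]
    best = (0, 0, 0)
    # prev[j] = length of the matching run ending at the previous sentence
    # of doc_a and sentence j of doc_b (1-based j).
    prev = [0] * (len(keys_b) + 1)
    for i, key_a in enumerate(keys_a, start=1):
        curr = [0] * (len(keys_b) + 1)
        for j, key_b in enumerate(keys_b, start=1):
            if key_a == key_b:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best[2]:
                    best = (i - curr[j], j - curr[j], curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    a = "Intro text here. The cat sat on the mat. It was a sunny day. Goodbye now."
    b = "Something else entirely. Another line. the cat sat on  the mat! It was a sunny day? End."
    print(longest_shared_run(a, b))
```

The returned triple gives both a crude similarity signal (the run length) and the location of the shared passage in each document, which mirrors the abstract's point that the duplicated parts are located alongside the similarity. The double loop is quadratic in the number of sentences, so this sketch says nothing about the efficiency the paper reports.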
Qi Zhang, Yue Zhang, Haomin Yu, Xuanjing Huang
Added 21 May 2011
Updated 21 May 2011
Type Journal
Year 2010
Where SIGIR
Authors Qi Zhang, Yue Zhang, Haomin Yu, Xuanjing Huang