Sciweavers

103 search results - page 2 / 21
» Models and Algorithms for Duplicate Document Detection
Sort
View
ADC
2007
Springer
108views Database» more  ADC 2007»
13 years 11 months ago
Distributed Text Retrieval From Overlapping Collections
In standard text retrieval systems, the documents are gathered and indexed on a single server. In distributed information retrieval (DIR), the documents are held in multiple colle...
Milad Shokouhi, Justin Zobel, Yaniv Bernstein
SIGIR
2008
ACM
13 years 5 months ago
SpotSigs: robust and efficient near duplicate detection in large web collections
Motivated by our work with political scientists who need to manually analyze large Web archives of news sites, we present SpotSigs, a new algorithm for extracting and matching sig...
Martin Theobald, Jonathan Siddharth, Andreas Paepc...
OOPSLA
2005
Springer
13 years 10 months ago
SDD: high performance code clone detection system for large scale source code
Code clones in software increase maintenance cost and lower software quality. We have devised a new algorithm to detect duplicated parts of source code in large software. Our algo...
Seunghak Lee, Iryoung Jeong
DGO
2006
134views Education» more  DGO 2006»
13 years 6 months ago
Next steps in near-duplicate detection for eRulemaking
Large volume public comment campaigns and web portals that encourage the public to customize form letters produce many near-duplicate documents, which increases processing and sto...
Hui Yang, Jamie Callan, Stuart W. Shulman
SIGIR
2010
ACM
13 years 22 hour ago
Efficient partial-duplicate detection based on sequence matching
With the ever-increasing growth of the Internet, numerous copies of documents become serious problem for search engine, opinion mining and many other web applications. Since parti...
Qi Zhang, Yue Zhang, Haomin Yu, Xuanjing Huang