Sciweavers

Search results for: Identifying and Filtering Near-Duplicate Documents
CPM
2000
Springer
Identifying and Filtering Near-Duplicate Documents
Abstract. The mathematical concept of document resemblance captures well the informal notion of syntactic similarity. The resemblance can be estimated using a fixed size “sketch...
Andrei Z. Broder
SIGIR
2008
ACM
SpotSigs: robust and efficient near duplicate detection in large web collections
Motivated by our work with political scientists who need to manually analyze large Web archives of news sites, we present SpotSigs, a new algorithm for extracting and matching sig...
Martin Theobald, Jonathan Siddharth, Andreas Paepc...
SIGIR
2004
ACM
Constructing a text corpus for inexact duplicate detection
As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. The goal of this work i...
Jack G. Conrad, Cindy P. Schriber
ICSE
2011
IEEE-ACM
StakeSource2.0: using social networks of stakeholders to identify and prioritise requirements
Software projects typically rely on system analysts to conduct requirements elicitation, an approach potentially costly for large projects with many stakeholders and requirements....
Soo Ling Lim, Daniela Damian, Anthony Finkelstein
KDD
2004
ACM
Improved robustness of signature-based near-replica detection via lexicon randomization
Detection of near duplicate documents is an important problem in many data mining and information filtering applications. When faced with massive quantities of data, traditional d...
Aleksander Kolcz, Abdur Chowdhury, Joshua Alspecto...