Sciweavers

103 search results - page 1 / 21
» Models and Algorithms for Duplicate Document Detection
Sort
View
ICDAR
1999
IEEE
13 years 9 months ago
Models and Algorithms for Duplicate Document Detection
This paper introduces a framework for clarifying and formalizing the duplicate document detection problem. Four distinct models are presented, each with a corresponding algorithm ...
Daniel P. Lopresti
SIGMOD
2005
ACM
119views Database» more  SIGMOD 2005»
14 years 5 months ago
DogmatiX Tracks down Duplicates in XML
Duplicate detection is the problem of detecting different entries in a data source representing the same real-world entity. While research abounds in the realm of duplicate detect...
Melanie Weis, Felix Naumann
LREC
2008
130views Education» more  LREC 2008»
13 years 6 months ago
Detecting Co-Derivative Documents in Large Text Collections
We have analyzed the SPEX algorithm by Bernstein and Zobel (2004) for detecting co-derivative documents using duplicate n-grams. Although we totally agree with the claim that not ...
Jan Pomikálek, Pavel Rychlý
SIGIR
2004
ACM
13 years 10 months ago
Constructing a text corpus for inexact duplicate detection
As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. The goal of this work i...
Jack G. Conrad, Cindy P. Schriber
ICDE
2010
IEEE
204views Database» more  ICDE 2010»
13 years 11 months ago
ProbClean: A probabilistic duplicate detection system
— One of the most prominent data quality problems is the existence of duplicate records. Current data cleaning systems usually produce one clean instance (repair) of the input da...
George Beskales, Mohamed A. Soliman, Ihab F. Ilyas...