APWEB
2006
Springer

The Case of the Duplicate Documents: Measurement, Search, and Science

Many of the documents in large text collections are duplicates and versions of each other. In recent research, we developed new methods for finding such duplicates; however, as there was no directly comparable prior work, we had no measure of whether we had succeeded. Worse, the concept of "duplicate" not only proved difficult to define, but on reflection was not logically defensible. Our investigation highlighted a paradox of computer science research: objective measurement of outcomes involves a subjective choice of preferred measure; and attempts to define measures can easily founder in circular reasoning. Also, some measures are abstractions that simplify complex real-world phenomena, so success by a measure may not be meaningful outside the context of the research. These are not merely academic concerns, but are significant problems in the design of research projects. In this paper, the case of the duplicate documents is used to explore whether and when it is reasonable ...

Justin Zobel, Yaniv Bernstein

APWEB 2006 | Comparable Prior Work | Complex Real-world Phenomena | Internet Technology | Large Text Collections |

Post Info

Added	20 Aug 2010
Updated	20 Aug 2010
Type	Conference
Year	2006
Where	APWEB
Authors	Justin Zobel, Yaniv Bernstein
