Near Similarity Search and Plagiarism Analysis

8 years 9 months ago
Near Similarity Search and Plagiarism Analysis
Abstract. Existing methods to text plagiarism analysis mainly base on “chunking”, a process of grouping a text into meaningful units each of which gets encoded by an integer number. Together theses numbers form a document’s signature or fingerprint. An overlap of two documents’ fingerprints indicate a possibly plagiarized text passage. Most approaches use MD5 hashes to construct fingerprints, which is bound up with two problems: (i) it is computationally expensive, (ii) a small chunk size must be chosen to identify matching passages, which additionally increases the effort for fingerprint computation, fingerprint comparison, and fingerprint storage. This paper proposes a new class of fingerprints that can be considered as an abstraction of the classical vector space model. These fingerprints operationalize the concept of “near similarity” and enable one to quickly identify candidate passages for plagiarism. Experiments show that a plagiarism analysis based on our ...
Benno Stein, Sven Meyer zu Eissen
Added 27 Jun 2010
Updated 27 Jun 2010
Type Conference
Year 2005
Where GFKL
Authors Benno Stein, Sven Meyer zu Eissen
Comments (0)