Near Similarity Search and Plagiarism Analysis

15 years 11 months ago


Abstract. Existing methods for text plagiarism analysis are mainly based on “chunking”, a process of grouping a text into meaningful units, each of which is encoded by an integer number. Together these numbers form a document’s signature or fingerprint. An overlap of two documents’ fingerprints indicates a possibly plagiarized text passage. Most approaches use MD5 hashes to construct fingerprints, which entails two problems: (i) it is computationally expensive, and (ii) a small chunk size must be chosen to identify matching passages, which additionally increases the effort for fingerprint computation, fingerprint comparison, and fingerprint storage. This paper proposes a new class of fingerprints that can be considered an abstraction of the classical vector space model. These fingerprints operationalize the concept of “near similarity” and enable one to quickly identify candidate passages for plagiarism. Experiments show that a plagiarism analysis based on our ...
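The classical chunking approach described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the use of overlapping word n-grams as chunks, the chunk size of 3, and the helper names chunks, fingerprint, and overlap are all assumptions made for illustration. The abstract only states that chunks are encoded as integers, typically via MD5.

```python
import hashlib


def chunks(text, size=3):
    """Split text into overlapping word n-grams (an assumed chunking strategy)."""
    words = text.lower().split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


def fingerprint(text, size=3):
    """Encode each chunk as an integer via MD5; the set of integers is the fingerprint."""
    return {
        int(hashlib.md5(chunk.encode("utf-8")).hexdigest(), 16)
        for chunk in chunks(text, size)
    }


def overlap(text_a, text_b, size=3):
    """Fraction of the smaller fingerprint that also occurs in the other one."""
    fp_a, fp_b = fingerprint(text_a, size), fingerprint(text_b, size)
    if not fp_a or not fp_b:
        return 0.0
    return len(fp_a & fp_b) / min(len(fp_a), len(fp_b))


if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog"
    suspect = "a quick brown fox jumps over a fence"
    print(overlap(original, suspect))
```

The sketch also shows the cost the abstract points out: a small chunk size is needed to catch short matching passages, but it produces roughly one hash per word, all of which must be computed, stored, and compared.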

Benno Stein, Sven Meyer zu Eissen
