
CLEF
2011
Springer

Intrinsic Plagiarism Detection Using Character Trigram Distance Scores - Notebook for PAN at CLEF 2011

Abstract: In this paper, we describe a novel approach to intrinsic plagiarism detection. Each suspicious document is divided into a series of consecutive, potentially overlapping ‘windows’ of equal size. Each window is represented by a vector containing the relative frequencies of a predetermined set of high-frequency character trigrams. Subsequently, a distance matrix is set up in which each of the document’s windows is compared to every other window. The distance measure used is a symmetric adaptation of the normalized distance (nd1) proposed by Stamatatos [17]. Finally, an algorithm for outlier detection in multivariate data (based on Principal Components Analysis) is applied to the distance matrix in order to detect plagiarized sections. In the PAN-PC-2011 competition, this system placed second: it achieved a competitive recall (.4279) but reached a plagdet of only .1679 because of a disappointing precision (.1075).
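The first three steps of the abstract (windowing, trigram profiles, pairwise distances) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several choices are assumptions made for illustration only: the window size, step and number of trigrams are hypothetical defaults; the trigram set is taken from the document itself, whereas the paper's "predetermined" set may come from elsewhere; and the distance is one plausible symmetric variant in the spirit of nd1, not necessarily the formula used in the paper. The PCA-based outlier detection step is not shown.

```python
from collections import Counter


def windows(text, size, step):
    """Split text into consecutive, possibly overlapping windows of equal size.

    A trailing remainder shorter than `size` is dropped (an assumption).
    """
    if len(text) <= size:
        return [text]
    return [text[i:i + size] for i in range(0, len(text) - size + 1, step)]


def trigram_counts(text):
    """Count every character trigram in text."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def top_trigrams(text, n):
    """Return the n most frequent trigrams (assumed here to come from the document)."""
    return [g for g, _ in trigram_counts(text).most_common(n)]


def profile(window, vocab):
    """Relative frequency of each vocabulary trigram within one window."""
    counts = trigram_counts(window)
    total = sum(counts.values()) or 1  # avoid division by zero on tiny windows
    return [counts[g] / total for g in vocab]


def symmetric_distance(p, q):
    """Assumed symmetric, nd1-style distance between two profiles, in [0, 1].

    Each term (2(a - b) / (a + b))^2 lies in [0, 4]; trigrams absent from
    both windows contribute nothing.
    """
    if not p:
        return 0.0
    total = 0.0
    for a, b in zip(p, q):
        if a + b > 0:
            total += (2 * (a - b) / (a + b)) ** 2
    return total / (4 * len(p))


def distance_matrix(text, size=1000, step=500, n_trigrams=100):
    """Compare every window of the document with every other window."""
    vocab = top_trigrams(text, n_trigrams)
    profiles = [profile(w, vocab) for w in windows(text, size, step)]
    return [[symmetric_distance(p, q) for q in profiles] for p in profiles]
```

In this sketch a row of the matrix describes how one window differs from all the others, so a window whose row is unusually large overall is a candidate for the outlier detection step described in the abstract.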
Mike Kestemont, Kim Luyckx, Walter Daelemans
Added 18 Dec 2011
Updated 18 Dec 2011
Type Journal
Year 2011
Where CLEF