Detecting Co-Derivative Documents in Large Text Collections

9 years 11 months ago
Detecting Co-Derivative Documents in Large Text Collections
We have analyzed the SPEX algorithm by Bernstein and Zobel (2004) for detecting co-derivative documents using duplicate n-grams. Although we totally agree with the claim that not using unique n-grams can greatly increase the efficiency and scalability of the process of detecting co-derivative documents, we have found serious bottlenecks in the way SPEX finds the duplicate n-grams. While the memory requirements for computing co-derivative documents can be reduced to up to 1 % by only using duplicate n-grams, SPEX needs about 40 times more memory for computing the list of duplicate n-grams itself. Therefore the memory requirements of the whole process are not reduced enough to make the algorithm practical for very large collections. We propose a solution for this problem using an external sort with the suffix array in-memory sorting and temporary file compression. The proposed algorithm for computing duplicate n-grams uses a fixed amount of memory for any input size.
Jan Pomikálek, Pavel Rychlý
Added 29 Oct 2010
Updated 29 Oct 2010
Type Conference
Year 2008
Where LREC
Authors Jan Pomikálek, Pavel Rychlý
Comments (0)