Abstract. Existing methods to text plagiarism analysis mainly base on “chunking”, a process of grouping a text into meaningful units each of which gets encoded by an integer nu...
An important feature of spoken language corpora is existence of different spelling variants of words in transcription. So there is an important problem for linguist who works with...
We discuss an intrinsic generalization of the suffix tree, designed to index a string of length n which has a natural partitioning into m multicharacter substrings or words. This ...
In order to extractrigidexpressions with a high frequency of use, new algorithm that can efficientlyextract both uninterruptedand interruptedcollocationsfrom very large corpora ha...
Given a string s, the Parikh vector of s, denoted p(s), counts the multiplicity of each character in s. Searching for a match of Parikh vector q (a “jumbled string”) in the tex...
Peter Burcsi, Ferdinando Cicalese, Gabriele Fici, ...