We present in this article the methods we used for obtaining measures to ensure the quality and well-formedness of a text corpus. These measures allow us to determine the compatib...
This paper proposes a non-interactive system for reducing the level of OCR-induced typographical variation in large text collections, contemporary and historical. Text-Induced Corp...
We present an experimental framework for Entity Mention Detection in which two different classifiers are combined to exploit Data Redundancy attained through the annotation of a l...
This paper is a comparative study of feature selection methods in statistical learning of text categorization. The focus is on aggressive dimensionality reduction. Five methods we...