This paper presents an Italic/Roman word type recognition system that requires no a priori knowledge of the characters' font. This method aims at analyzing old documents in which char...
Temporal reasoners for document understanding typically assume that a document’s creation date is known. Algorithms to ground relative time expressions and order events often re...
The Dirichlet compound multinomial (DCM) distribution, also called the multivariate Polya distribution, is a model for text documents that takes into account burstiness: the fact ...
In the AllRight project, we are developing an algorithm for unsupervised table detection and segmentation that uses the visual rendition of a Web page rather than the HTML code. O...
This paper presents a generic feature selection method and its applications to some document analysis problems. The method is based on a genetic algorithm (GA), whose fitness funct...