This paper describes an algorithm for the determination of zone content type of a given zone within a document image. We take a statistical based approach and represent each zone ...
We describe an HTML web page segmentation algorithm, which is applied to segment online medical journal articles (regular HTML and PDF-Converted-HTML files). The web page content ...
Term signal is an existing text representation that depicts a term as a vector of frequencies of occurrences in a number of user-defined partitions of a document. Although term si...
Supphachai Thaicharoen, Tom Altman, Krzysztof J. C...
This paper presents a dynamic approach to document page segmentation based on inter-component relationships and their local features. State-of-the art page segmentation algorithms...
The goal of the DARPA MADCAT (Multilingual Automatic Document Classification Analysis and Translation) Program is to automatically convert foreign language text images into Englis...