Relying on the idea that back-of-the-book indexes are traditional devices for navigation through large documents, we have developed a method to build a hypertextual network that h...
Page segmentation into text and non-text components is an essential preprocessing step before OCR operation. If this is not done properly, an OCR classification engine produces g...
Syed Saqib Bukhari, Faisal Shafait, Thomas M. Breu...
The goal of the DARPA MADCAT (Multilingual Automatic Document Classification Analysis and Translation) Program is to automatically convert foreign language text images into Englis...
This paper investigates the use of the logical structure in XML documents for the retrieval of XML multimedia objects. We study different logical levels and their combinations. Our...
The performance of any OCR system heavily depends upon printing quality of the input document. Many OCRs have been designed which correctly identify fine printed documents both in...