We review the literature on automatic document formatting with an emphasis on recent work in the field. One common way to frame document formatting is as a constrained optimizatio...
This paper investigates methods to automatically infer structural information from large XML documents. Using XML as a reference format, we approach the schema generation problem ...
Table of contents (TOC) recognition has attracted a great deal of attention in recent years. After reviewing the merits and drawbacks of the existing TOC recognition methods, we h...
In this paper we present an OCR validation module implemented for the System for Preservation of Electronic Resources (SPER) developed at the U.S. National Library of Medicine. 
A variety of different scripts are used in writing languages throughout the world. In a multi-script, multilingual environment, it is essential to know the script used in writing a...