Page segmentation into text and non-text components is an essential preprocessing step before OCR. If this is not done properly, an OCR classification engine produces g...
Syed Saqib Bukhari, Faisal Shafait, Thomas M. Breu...
This paper presents an approach for identifying similar documents that can be used to assist scientists in finding related work. The approach, called Citation Proximity Analysis (C...
Jacqueline Leta, Birger Larsen, Ronald Rousseau, W...
We present a method for automated topic suggestion. Given a plain-text input document, our algorithm produces a ranking of novel topics that could enrich the input document in a m...
With large databases of document images available, a method for users to find keywords in documents would be useful. One approach is to perform Optical Character Recognition (OCR) ...
We review the literature on automatic document formatting with an emphasis on recent work in the field. One common way to frame document formatting is as a constrained optimizatio...