Large collections of scanned documents (books and journals) are now available in Digital Libraries. The most common method for retrieving relevant information from these collectio...
Conventional optical character recognition (OCR) systems operate on individual characters and words, and do not normally exploit document or collection context. We describe a Coll...
K. Pramod Sankar, C. V. Jawahar, Raghavan Manmatha
Today's knowledge workers have to spend a large amount of time and manual effort on creating, analyzing, and modifying textual content. While more advanced semantically-orient...
When humans approach the task of text categorization, they interpret the specific wording of the document in the much larger context of their background knowledge and experience. ...
It is a well documented fact that, for human readers, familiar text is more legible than unfamiliar text. Current-generation computer vision systems also are able to exploit some ...