Near-duplicate detection is not only an important pre and post processing task in Information Retrieval but also an effective spam-detection technique. Among different approache...
An enormous amount of information available via the Internet exists. Much of this data is in the form of text-based documents. These documents cover a variety of topics that are v...
This paper describes a technique for text segmentation of machine printed Gurmukhi script documents. Research in the field of segmentation of Gurmukhi script faces major problems m...
Whole-book recognition is a document image analysis strategy that operates on the complete set of a book’s page images, attempting to improve accuracy by automatic unsupervised ...
The meaning conveyed by documents and their markup often goes well beyond what can be inferred from the markup alone. It often depends on context, so that to interpret document ma...