Separating machine printed text and handwriting from overlapping text is a challenging problem in the document analysis field and no reliable algorithms have been developed thus f...
—This paper describes embedding a mathematical formula recognition module into the OCR system OCRopus aiming at developing a OCR system for scientific and technical documents wh...
Multi-document discourse analysis has emerged with the potential of improving various NLP applications. Based on the newly proposed Cross-document Structure Theory (CST), this pap...
A new approach for constructing pseudo-keywords, referred to as Sense Units, is proposed. Sense Units are obtained by a word clustering process, where the underlying similarity re...
PLANDoc, a system under joint development by Columbia and Bellcore, documents the activity of planning engineers as they study telephone routes. It takes as input a trace of the e...