This paper presents an efficient compression-oriented segmentation algorithm for computer-generated document images. In this algorithm, a document image is represented in a block-...
An off-line hand-written Chinese character recognizer based on Contextual Vector Quantization (CVQ) supporting a vocabulary of 4,616 Chinese characters, alphanumerics and punctua...
What makes template content in the Web so special that we need to remove it? In this paper I present a large-scale aggregate analysis of textual Web content, corroborating statist...
This paper presents SOFIE, a system for automated ontology extension. SOFIE can parse natural language documents, extract ontological facts from them and link the facts into an on...
In a new model for answer retrieval, document collections are distilled offline into large repositories of facts. Each fact constitutes a potential direct answer to questions seek...