When scanning documents with a large number of pages, such as books, it is often feasible to provide a minimal number of training samples to personalize the system to compensate fo...
With large databases of document images available, a method for users to find keywords in documents will be useful. One approach is to perform Optical Character Recognition (OCR) ...
With the wide adoption of XML as a standard data representation and exchange format, querying XML documents becomes increasingly important. However, relational database systems co...
Typographic and visual information is an integral part of textual documents. Most information extraction systems ignore much of this visual information, processing the text as a l...
(Automatic) document classification is generally defined as content-based assignment of one or more predefined categories to documents. Usually, machine learning, statistical patt...