Indexing and retrieval of words in old documents



This paper describes a system for efficient indexing and retrieval of words in collections of document images. The proposed method is based on two main principles: unsupervised prototype clustering, and string encoding for efficient string matching. During indexing, a self-organizing map (SOM) is trained to cluster similar symbols (character-like objects) drawn from a subset of the documents to be stored. Using the trained SOM, the words in the whole collection can be stored and represented with a fixed-length description, which can be easily compared in order to rank the most similar words in response to a user query. The system can be automatically adapted to different languages and font styles. The most appropriate applications are in the processing of old documents (18th and 19th centuries), where current OCR systems have more difficulty. Experimental results cover three application scenarios with varying levels of difficulty for current OCR systems.
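The indexing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the 2-D toy glyph features, the 3x3 map size, the linear decay schedules, the resampling used to reach a fixed length, and every function name (`train_som`, `best_unit`, `encode_word`, `rank`, `word_symbols`) are assumptions made for illustration only.

```python
import math
import random

def best_unit(som, x):
    """Return the (row, col) of the SOM unit whose weights are closest to x."""
    return min(som, key=lambda u: sum((a - b) ** 2 for a, b in zip(som[u], x)))

def train_som(samples, rows, cols, epochs=30, seed=0):
    """Train a small rectangular SOM; returns {(row, col): weight_vector}."""
    rng = random.Random(seed)
    dim = len(samples[0])
    som = {(r, c): [rng.random() for _ in range(dim)]
           for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs
        lr = 0.5 * frac + 0.01                      # assumed linear decay
        radius = max(rows, cols) / 2 * frac + 0.5   # assumed linear decay
        for x in rng.sample(samples, len(samples)):  # shuffled pass
            br, bc = best_unit(som, x)
            for (r, c), w in som.items():
                # Gaussian neighbourhood around the best-matching unit
                h = math.exp(-((r - br) ** 2 + (c - bc) ** 2) / (2 * radius ** 2))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
    return som

def encode_word(som, symbols, length=8):
    """Map each symbol to SOM coordinates, then resample to `length` slots.

    The resampling step is an assumption; it only serves to make every
    word code the same size (2 * length numbers). `symbols` must be non-empty.
    """
    units = [best_unit(som, s) for s in symbols]
    code = []
    for k in range(length):
        code.extend(units[k * len(units) // length])
    return code

def rank(query_code, index):
    """Sort indexed words by Euclidean distance to the query code."""
    return sorted(index, key=lambda name: math.dist(query_code, index[name]))

# Toy data: three hypothetical glyph classes with 2-D features.
GLYPHS = {"A": [0.1, 0.1], "B": [0.9, 0.1], "C": [0.5, 0.9]}

def word_symbols(word):
    return [GLYPHS[ch] for ch in word]

noise = random.Random(1)
training = [[v + noise.uniform(-0.05, 0.05) for v in GLYPHS[g]]
            for g in "ABC" for _ in range(20)]
som = train_som(training, rows=3, cols=3)

index = {w: encode_word(som, word_symbols(w)) for w in ["ABCA", "CCBA", "BAB"]}
query = encode_word(som, word_symbols("ABCA"))
print(rank(query, index))
```

Because every word code has the same length, a query reduces to a vector comparison against the stored codes, which is the property the abstract relies on for efficient matching.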

Simone Marinai, Emanuele Marino, Giovanni Soda
