Retrieval of machine-printed Latin documents through Word Shape Coding

This paper reports a document retrieval technique that retrieves machine-printed Latin-based document images through word shape coding. Adopting the idea of image annotation, a word shape coding scheme is proposed, which converts each word image into a word shape code by using a few shape features. The text contents of imaged documents are thus captured by a document vector constructed with the converted word shape code and word frequency information. Similarities between different document images are then gauged based on the constructed document vectors. We divide the retrieval process into two stages. Based on the observation that documents of the same language share a large number of high-frequency language-specific stop words, the first stage retrieves documents with the same underlying language as that of the query document. The second stage then re-ranks the documents retrieved in the first stage based on the topic similarity. Experiments show that document images of different la...
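The two-stage idea in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' method: the abstract does not list the actual shape features, so the sketch substitutes a hypothetical three-symbol code (ascender, descender, x-height) computed from plain text rather than from word images. The names `shape_code`, `document_vector`, `restrict`, `retrieve`, and the `lang_threshold` parameter are all invented for this example, and cosine similarity is an assumed choice of measure.

```python
import math
from collections import Counter

# Hypothetical stand-in for image shape features: classify each letter
# by its rough vertical extent. The paper's real features are not given
# in the abstract.
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gjpqy")


def shape_code(word):
    """Map a word to a coarse shape code (illustrative scheme only)."""
    out = []
    for ch in word:
        if ch.isupper() or ch in ASCENDERS:
            out.append("A")      # tall glyph
        elif ch in DESCENDERS:
            out.append("D")      # glyph reaching below the baseline
        elif ch.isalpha():
            out.append("x")      # x-height glyph
    return "".join(out)


def document_vector(text):
    """Count how often each word shape code occurs in a document."""
    codes = (shape_code(w) for w in text.split())
    return Counter(c for c in codes if c)


def restrict(vec, codes, keep):
    """Keep only the entries inside (keep=True) or outside the code set."""
    return Counter({c: n for c, n in vec.items() if (c in codes) == keep})


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, docs, stop_codes, lang_threshold=0.5):
    """Stage 1: filter by stop-word-code similarity (language match).
    Stage 2: rank the survivors by similarity on the remaining codes."""
    q = document_vector(query)
    q_stop = restrict(q, stop_codes, True)
    q_topic = restrict(q, stop_codes, False)
    ranked = []
    for name, text in docs.items():
        d = document_vector(text)
        if cosine(q_stop, restrict(d, stop_codes, True)) < lang_threshold:
            continue  # assumed to be in a different language
        ranked.append((cosine(q_topic, restrict(d, stop_codes, False)), name))
    ranked.sort(reverse=True)
    return [name for _, name in ranked]
```

In this sketch the caller supplies the stop-word codes, for example `{shape_code(w) for w in ["the", "of", "and", "in"]}` for English. Distinct words can collide on one code (here "and" and "est" both map to `xxA`), which is why the first stage compares whole frequency profiles rather than testing for a single code.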
Shijian Lu, Chew Lim Tan
Added 14 Dec 2010
Updated 14 Dec 2010
Type Journal
Year 2008
Where PR (Pattern Recognition)