OCR with No Shape Training

We present a document-specific OCR system and apply it to a corpus of faxed business letters. Unsupervised classification of the segmented character bitmaps on each page, using a "clump" metric, typically yields several hundred clusters with highly skewed populations. Letter identities are assigned to each cluster by maximizing matches with a lexicon of English words. We found that, for two-thirds of the pages, we can identify almost 80% of the words included in the lexicon without any shape training. Residual errors are caused by mis-segmentation, including missed lines and punctuation. This research differs from earlier attempts to apply cipher decoding to OCR in (1) using real data, (2) using a more appropriate clustering algorithm, and (3) decoding a many-to-many instead of a one-to-one mapping between clusters and letters.
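The lexicon-matching step described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: it assumes each cluster maps to exactly one letter (several clusters may share a letter), whereas the paper decodes a many-to-many mapping, and it uses a plain exhaustive backtracking search that only suits tiny inputs. The function name `decode_clusters`, the integer cluster IDs, and the toy lexicon are all hypothetical.

```python
def decode_clusters(coded_words, lexicon):
    """Assign one letter to each cluster ID so that as many coded words
    as possible spell a lexicon word (simplified, exhaustive search)."""
    # Index the lexicon by word length to narrow the candidates per coded word.
    by_length = {}
    for word in lexicon:
        by_length.setdefault(len(word), []).append(word)

    best = {"score": -1, "mapping": {}}

    def search(index, mapping, score):
        # Prune: even if every remaining word matched, we could not beat the best.
        if score + (len(coded_words) - index) <= best["score"]:
            return
        if index == len(coded_words):
            best["score"], best["mapping"] = score, dict(mapping)
            return
        coded = coded_words[index]
        for candidate in by_length.get(len(coded), []):
            added = []  # clusters newly bound while trying this candidate
            consistent = True
            for cluster, letter in zip(coded, candidate):
                if cluster in mapping:
                    if mapping[cluster] != letter:
                        consistent = False
                        break
                else:
                    mapping[cluster] = letter
                    added.append(cluster)
            if consistent:
                search(index + 1, mapping, score + 1)
            for cluster in added:  # undo the tentative bindings
                del mapping[cluster]
        # Also allow this word to stay unmatched (e.g. it is not in the lexicon).
        search(index + 1, mapping, score)

    search(0, {}, 0)
    return best["mapping"], best["score"]


# Toy page: each word is a tuple of cluster IDs; clusters 1 and 5 are
# assumed to be two differently shaped instances of the same letter.
page = [(1, 2, 3), (1, 2, 3, 4), (2, 3, 4), (5, 2, 3)]
mapping, matched = decode_clusters(page, ["the", "hen", "ten", "then"])
```

On a real page with several hundred clusters, an exhaustive search like this would be impractical; the sketch only shows how word-level lexicon constraints can pin down cluster identities without any shape information.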
Tin Kam Ho, George Nagy
Added 09 Nov 2009
Updated 09 Nov 2009
Type Conference
Year 2000
Where ICPR