Sciweavers

JMLR
2012

Bounding the Probability of Error for High Precision Optical Character Recognition

We consider a model for which it is important, early in processing, to estimate some variables with high precision, but perhaps at relatively low recall. If some variables can be identified with near certainty, they can be conditioned upon, allowing further inference to be done efficiently. Specifically, we consider optical character recognition (OCR) systems that can be bootstrapped by identifying a subset of correctly translated document words with very high precision. This “clean set” is subsequently used as document-specific training data. While OCR systems produce confidence measures for the identity of each letter or word, thresholding these values still produces a significant number of errors. We introduce a novel technique for identifying a set of correct words with very high precision. Rather than estimating posterior probabilities, we bound the probability that any given word is incorrect using an approximate worst case analysis. We give empirical results on a data...
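To make the precision/recall trade-off concrete, here is a minimal sketch of the confidence-thresholding baseline that the abstract describes as still producing errors. It is not the authors' bounding technique. The `OcrWord` type, the confidence scores, and the sample words are all hypothetical and made up for illustration.

```python
from typing import List, NamedTuple, Tuple


class OcrWord(NamedTuple):
    text: str          # word as emitted by a (hypothetical) OCR system
    confidence: float  # hypothetical recognizer confidence in [0, 1]
    correct: bool      # ground truth, available only when evaluating


def select_by_threshold(words: List[OcrWord], threshold: float) -> List[OcrWord]:
    """Keep only the words whose confidence is at least `threshold`."""
    return [w for w in words if w.confidence >= threshold]


def precision_recall(selected: List[OcrWord], words: List[OcrWord]) -> Tuple[float, float]:
    """Precision of the selected set, and recall relative to all correct words."""
    total_correct = sum(w.correct for w in words)
    hits = sum(w.correct for w in selected)
    precision = hits / len(selected) if selected else 1.0
    recall = hits / total_correct if total_correct else 0.0
    return precision, recall


# Made-up sample: one misrecognized word carries a high confidence score.
words = [
    OcrWord("the", 0.99, True),
    OcrWord("quick", 0.97, True),
    OcrWord("brovvn", 0.96, False),
    OcrWord("fox", 0.90, True),
    OcrWord("jumps", 0.80, True),
    OcrWord("ovcr", 0.60, False),
]

for threshold in (0.95, 0.98):
    p, r = precision_recall(select_by_threshold(words, threshold), words)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

In this toy sample, a threshold of 0.95 still admits an incorrect word, and raising it to 0.98 removes the error only by discarding most of the correct words. This is the kind of behavior that motivates looking for a "clean set" by other means.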
Added 27 Sep 2012
Updated 27 Sep 2012
Type Journal
Year 2012
Where JMLR
Authors Gary B. Huang, Andrew Kae, Carl Doersch, Erik G. Learned-Miller