Representing OCRed documents in HTML

ABSTRACT: OCR is an error-prone process, and manually proofreading OCR results is time-consuming and expensive. The errors remaining in OCRed text can cause serious problems in reading and understanding when the reader cannot refer to the original image representation. As demonstrated in this paper, a hybrid document that combines symbolic representation and image representation may relieve the problem. If we represent an OCRed document properly in HTML, OCR errors will not have much negative effect on the human reading process in an HTML browser and can be corrected using an HTML authoring tool. An experiment under this approach, evaluating a Japanese OCR system developed at CEDAR, is also reported in this paper.

1 Overview of the Approach

OCR is a process that transforms a given document from its image representation into its symbolic representation. After this process, we obtain a text document which is electronically searchable, indexable and reusable. However, the transformation is error-prone...
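The excerpt above does not show how the paper builds its hybrid documents, so the following is only a minimal sketch of one way the idea could look in practice. It assumes each recognized word comes with a confidence score and a cropped image of the word. Words above a threshold are written as ordinary text, and the rest are embedded as images with the OCR guess as alternate text. The names `OcrWord`, `render_hybrid_html`, the `threshold` parameter, and the image file names are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class OcrWord:
    text: str          # recognizer's best guess for the word
    confidence: float  # hypothetical score in [0, 1]
    image_src: str     # hypothetical path to the cropped word image


def render_hybrid_html(words, threshold=0.8):
    """Return one HTML paragraph mixing text and word images.

    Assumption: words at or above `threshold` are trusted and emitted as
    escaped text; the others are shown as their original image, with the
    OCR guess kept in the alt attribute so it can still be edited later.
    """
    parts = []
    for w in words:
        if w.confidence >= threshold:
            parts.append(escape(w.text))
        else:
            parts.append(
                '<img src="{}" alt="{}">'.format(
                    escape(w.image_src, quote=True),
                    escape(w.text, quote=True),
                )
            )
    return "<p>" + " ".join(parts) + "</p>"


if __name__ == "__main__":
    sample = [
        OcrWord("OCR", 0.95, "w0.png"),
        OcrWord("errar", 0.40, "w1.png"),  # low confidence, shown as image
    ]
    print(render_hybrid_html(sample))
```

In a browser, the reader would see the original pixels wherever the recognizer was unsure, while the trusted words remain searchable text. A person correcting the page in an HTML editor could replace an `<img>` element with the right word.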
Tao Hong, Sargur N. Srihari
Added 06 Aug 2010
Updated 06 Aug 2010
Type Conference
Year 1997