A Self-Adaptive Method for Extraction of Document-Specific Alphabets

Recognizing and encoding digitized historical documents remains a challenging task. A major problem is the occurrence of unknown glyphs and symbols, which might not even exist in modern alphabets. Current pre-trained OCR methods rarely deliver usable results for such documents. This paper describes an alternative approach and framework for handling printed historical documents without restrictions on the alphabets or fonts they contain. The basic idea is to derive all information required for encoding directly from the document itself. This is achieved by extracting a document-specific prototype alphabet of locatable glyphs. The core of the system is a customized clustering method that adapts automatically to new documents by determining appropriate threshold parameters from the characteristics of the glyphs. As a result, the system can run without manual intervention and can be integrated into automated mass-digitization workflows.
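The abstract does not spell out the clustering method or how its thresholds are chosen, so the following is only a minimal sketch of the general idea under stated assumptions, not the paper's algorithm. It assumes glyphs are already segmented into equal-sized flat binary bitmaps, uses a normalized Hamming distance, picks the threshold at the widest gap in the sorted pairwise distances (an assumed heuristic), and groups glyphs with a simple greedy leader clustering. All function names are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Fraction of differing pixels between two equal-sized flat bitmaps."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def derive_threshold(glyphs):
    """Assumed heuristic: place the threshold in the widest gap between
    consecutive sorted pairwise distances, so it is taken from the
    document's own glyphs rather than set by hand."""
    dists = sorted(hamming(a, b) for a, b in combinations(glyphs, 2))
    if len(dists) < 2:
        return 0.0
    gap, i = max((dists[k + 1] - dists[k], k) for k in range(len(dists) - 1))
    return (dists[i] + dists[i + 1]) / 2


def extract_alphabet(glyphs):
    """Greedy leader clustering: a glyph joins the first prototype within
    the threshold, otherwise it becomes a new prototype."""
    threshold = derive_threshold(glyphs)
    prototypes, labels = [], []
    for g in glyphs:
        for idx, p in enumerate(prototypes):
            if hamming(g, p) <= threshold:
                labels.append(idx)
                break
        else:
            prototypes.append(g)
            labels.append(len(prototypes) - 1)
    return prototypes, labels


# Toy 3x3 glyphs: two noisy ring shapes and two noisy vertical strokes.
ring = (1, 1, 1, 1, 0, 1, 1, 1, 1)
ring_noisy = (1, 1, 1, 1, 0, 1, 1, 1, 0)
stroke = (0, 1, 0, 0, 1, 0, 0, 1, 0)
stroke_noisy = (0, 1, 0, 0, 1, 0, 0, 1, 1)

prototypes, labels = extract_alphabet([ring, stroke, ring_noisy, stroke_noisy])
print(len(prototypes), labels)
```

In this toy input the within-shape distances (1/9) and between-shape distances (6/9 and 7/9) are well separated, so the gap heuristic finds a usable threshold. Real scans would need a more robust distance and threshold estimate than this sketch provides.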

Stefan Pletschacher

Digitized Historical Documents | Document Analysis | Historical Documents | ICDAR 2009 | Unknown Glyphs


Added	21 May 2010
Updated	21 May 2010
Type	Conference
Year	2009
Where	ICDAR
Authors	Stefan Pletschacher
