Sciweavers

ICPR
2008
IEEE

A robust technique for text extraction in mixed-type binary documents

Text extraction from mixed-type documents is a crucial preprocessing stage in applications such as OCR. Unlike most earlier approaches, the present work handles text of varying orientation and size. The technique first identifies marks with a contour-following procedure, then applies principal component analysis (PCA) to determine the direction of the main axis of each mark. Next, a nearest-neighbor technique finds the shortest distances between marks, and a feature vector built from the calculated mark dimensions and distances is fed into a self-organizing feature map (SOFM), which defines homogeneous mark clusters. The resulting cluster weights and variances are used to form a set of fuzzy rules, and a fuzzy classification scheme labels each mark as a character or non-character. The technique correctly and quickly extracts text areas in a variety of mixed-type documents.
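As a rough illustration of two of the steps named in the abstract (main-axis direction per mark and shortest distances between marks), the following minimal Python sketch may help. It is not the authors' implementation: the function names `main_axis_angle` and `nearest_neighbor_distances` are hypothetical, a mark is assumed to be a list of (x, y) pixel coordinates, and distances are assumed to be measured between mark centroids, which the abstract does not specify.

```python
import math


def main_axis_angle(points):
    """Angle (radians, in [-pi/2, pi/2]) of the principal axis of 2-D points.

    Uses the closed-form orientation of the leading eigenvector of the
    2x2 covariance matrix, which is what PCA yields in two dimensions.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    return 0.5 * math.atan2(2.0 * sxy, sxx - syy)


def nearest_neighbor_distances(centroids):
    """For each centroid, the Euclidean distance to its closest other centroid.

    Assumption: marks are compared by centroid; brute force, O(n^2).
    Requires at least two centroids.
    """
    result = []
    for i, (xi, yi) in enumerate(centroids):
        best = min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(centroids)
            if j != i
        )
        result.append(best)
    return result


if __name__ == "__main__":
    # A hypothetical mark whose pixels lie along a diagonal stroke.
    diagonal_mark = [(0, 0), (1, 1), (2, 2), (3, 3)]
    print(math.degrees(main_axis_angle(diagonal_mark)))

    # Hypothetical centroids of three marks.
    print(nearest_neighbor_distances([(0, 0), (3, 4), (10, 0)]))
```

In a pipeline like the one described, values such as these (together with mark dimensions) would be assembled into the per-mark feature vector; the clustering and fuzzy classification stages are not sketched here.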
Charalambos Strouthopoulos, Athanasios Nikolaidis
Added 30 May 2010
Updated 30 May 2010
Type Conference
Year 2008
Where ICPR
Authors Charalambos Strouthopoulos, Athanasios Nikolaidis