
AAAI
2006

Script and Language Identification in Degraded and Distorted Document Images

This paper reports a statistical identification technique that differentiates scripts and languages in degraded and distorted document images. We identify scripts and languages through document vectorization, which transforms each document image into an electronic document vector that characterizes the shape and frequency of the character and word images it contains. We first identify scripts based on the density and distribution of vertical runs between character strokes and a vertical scan line. Latin-based languages are then differentiated using a set of word shape codes constructed from horizontal word runs and character extremum points. Experimental results show that our method is tolerant of noise, document degradation, and slight document skew, and attains an average identification rate of over 95%.
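As a rough illustration of the vertical-run idea, the sketch below counts, for each vertical scan line (image column), how many separate ink runs the line crosses, and turns those counts into a normalized histogram. This is not the authors' implementation: the function names, the `max_runs` cap, and the reading of a "vertical run" as a maximal run of ink pixels along a column are all assumptions made for illustration, and the input is assumed to be an already binarized image given as a list of rows with 1 for ink and 0 for background.

```python
from collections import Counter


def vertical_run_counts(image):
    """For each column, count the maximal vertical runs of ink pixels (1s).

    `image` is a list of equal-length rows; 1 = ink, 0 = background.
    (Hypothetical helper, not taken from the paper.)
    """
    if not image:
        return []
    height, width = len(image), len(image[0])
    counts = []
    for x in range(width):
        runs = 0
        previous = 0
        for y in range(height):
            pixel = image[y][x]
            # A new run starts wherever ink follows background.
            if pixel == 1 and previous == 0:
                runs += 1
            previous = pixel
        counts.append(runs)
    return counts


def run_count_vector(image, max_runs=4):
    """Normalized histogram of run counts over columns that contain ink.

    Entry k-1 holds the fraction of inked columns crossing exactly k runs;
    counts above `max_runs` are folded into the last bin (an assumption).
    """
    counts = [c for c in vertical_run_counts(image) if c > 0]
    if not counts:
        return [0.0] * max_runs
    histogram = Counter(min(c, max_runs) for c in counts)
    total = len(counts)
    return [histogram.get(k, 0) / total for k in range(1, max_runs + 1)]


if __name__ == "__main__":
    sample = [
        [1, 0, 1, 0],
        [0, 0, 1, 0],
        [1, 0, 0, 0],
        [1, 0, 1, 0],
    ]
    print(vertical_run_counts(sample))
    print(run_count_vector(sample, max_runs=3))
```

A vector of this kind could then be compared against per-script reference vectors, for example with a nearest-neighbor rule. The abstract does not say how the comparison is done, so that step is left out here.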
Shijian Lu, Chew Lim Tan
Added 30 Oct 2010
Updated 30 Oct 2010
Type Conference
Year 2006
Where AAAI
Authors Shijian Lu, Chew Lim Tan