
ANLP
1994

Language Determination: Natural Language Processing from Scanned Document Images

Many documents are available to a computer only as images from paper. However, most natural language processing systems expect their input as character-coded text, which may be difficult or expensive to extract accurately from the page. We describe a method for converting a document image into character shape codes and word shape tokens. We believe that this representation, which is both cheap and robust, is sufficient for many NLP tasks. In this paper, we show that the representation is sufficient for determining which of 23 languages the document is written in, using only a small number of features, with greater than 90% accuracy overall.
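To make the idea of character shape codes and word shape tokens concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not list the actual code set, so the classes below (ascender, descender, x-height, dotted) and the names `shape_code`, `word_shape_token`, and `token_profile` are hypothetical. The sketch also starts from already-typed text, whereas the method described above derives the codes directly from the page image.

```python
from collections import Counter

# Hypothetical, simplified shape classes (an assumption for illustration;
# the abstract does not specify the paper's actual code set).
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gpqy")


def shape_code(ch: str) -> str:
    """Map one character to a coarse shape code."""
    if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
        return "A"  # rises above the x-height
    if ch in DESCENDERS:
        return "g"  # drops below the baseline
    if ch in "ij":
        return ch   # dotted characters keep their own code
    if ch.isalpha():
        return "x"  # fits within the x-height
    return ch       # punctuation and other symbols pass through


def word_shape_token(word: str) -> str:
    """Concatenate the shape codes of a word into a single token."""
    return "".join(shape_code(c) for c in word)


def token_profile(text: str) -> Counter:
    """Count word shape tokens in whitespace-separated text."""
    return Counter(word_shape_token(w) for w in text.split())


if __name__ == "__main__":
    print(word_shape_token("the"))
    print(token_profile("the cat and the dog").most_common(2))
```

Token counts of this kind are one plausible source for the "small number of features" the abstract mentions, since very frequent short words tend to differ in shape from one language to another; which features the authors actually used is not stated here.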
Penelope Sibun, A. Lawrence Spitz
Added 02 Nov 2010
Updated 02 Nov 2010
Type Conference
Year 1994
Where ANLP
Authors Penelope Sibun, A. Lawrence Spitz