ANLP
1994

Language Determination: Natural Language Processing from Scanned Document Images

Many documents are available to a computer only as images from paper. However, most natural language processing systems expect their input as character-coded text, which may be difficult or expensive to extract accurately from the page. We describe a method for converting a document image into character shape codes and word shape tokens. We believe that this representation, which is both cheap and robust, is sufficient for many NLP tasks. In this paper, we show that the representation is sufficient for determining which of 23 languages the document is written in, using only a small number of features, with greater than 90% accuracy overall.
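To make the idea of character shape codes and word shape tokens concrete, here is a toy sketch in Python. It is not the authors' method: the abstract does not list the actual shape codes, and the real system derives them from the page image rather than from known characters. The class labels (`A`, `x`, `g`, `i`, `j`), the letter sets, and the function names are all assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical shape classes for Latin letters (assumed, not from the paper).
ASCENDERS = set("bdfhklt")   # letters that rise above the x-height
DESCENDERS = set("gpqy")     # letters that drop below the baseline
DOTTED = set("i")            # x-height letter with a dot
DOTTED_DESCENDER = set("j")  # descender with a dot


def shape_code(ch: str) -> str:
    """Map one character to an assumed coarse shape class."""
    if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
        return "A"
    if ch in DESCENDERS:
        return "g"
    if ch in DOTTED:
        return "i"
    if ch in DOTTED_DESCENDER:
        return "j"
    if ch.isalpha():
        return "x"  # remaining letters sit within the x-height
    return ch       # punctuation and other symbols pass through unchanged


def word_shape_token(word: str) -> str:
    """Concatenate the shape codes of a word into one token."""
    return "".join(shape_code(c) for c in word)


def token_counts(text: str) -> Counter:
    """Count word shape tokens in whitespace-separated text."""
    return Counter(word_shape_token(w) for w in text.split())


if __name__ == "__main__":
    print(word_shape_token("language"))
    print(token_counts("the cat and the dog").most_common())
```

Under these assumptions, many distinct words collapse into the same token, yet frequent function words still leave a recognizable pattern. A language classifier could plausibly use the relative frequencies of a few such tokens as features, which is in the spirit of the "small number of features" the abstract mentions.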

Penelope Sibun, A. Lawrence Spitz

ANLP 1994 | Document | Natural Language Processing | Word Shape Tokens |

Post Info

Added	02 Nov 2010
Updated	02 Nov 2010
Type	Conference
Year	1994
Where	ANLP
Authors	Penelope Sibun, A. Lawrence Spitz
