Sciweavers

ICDAR
2003
IEEE

Lexical Postcorrection of OCR-Results: The Web as a Dynamic Secondary Dictionary?

Postcorrection of OCR-results for text documents is usually based on electronic dictionaries. When scanning texts from a specific thematic area, conventional dictionaries often miss a considerable number of tokens. Furthermore, if word frequencies are stored with the entries, these frequencies will not properly reflect the frequencies found in the given thematic area. Correction adequacy suffers from these two shortcomings. We report on a series of experiments where we compare (1) the use of fixed, static large-scale dictionaries (including proper names and abbreviations) with (2) the use of dynamic dictionaries retrieved via an automated analysis of the vocabulary of web pages from a given domain, and (3) the use of mixed dictionaries. Our experiments, which address English and German document collections from a variety of fields, show that dynamic dictionaries of the above-mentioned form can improve the coverage for the given thematic area in a significant way and help to improv...
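
As a rough illustration of the three-way comparison described in the abstract, the sketch below measures lexical coverage (the share of tokens found in a dictionary) for a static, a dynamic, and a mixed dictionary. It is not the authors' implementation: the function name `coverage`, the word lists, and the token sample are all hypothetical toy data, and the "dynamic" set merely stands in for vocabulary that would be harvested from domain web pages.

```python
def coverage(tokens, dictionary):
    """Return the fraction of tokens found in the dictionary (case-insensitive)."""
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token.lower() in dictionary)
    return hits / len(tokens)


# Hypothetical toy data, invented for illustration only.
# static_dict stands in for a fixed general-purpose dictionary.
static_dict = {"the", "of", "is", "a", "patient", "treated", "with"}
# dynamic_dict stands in for vocabulary collected from web pages of one domain.
dynamic_dict = {"ibuprofen", "analgesic", "patient", "dosage"}
# A mixed dictionary is modeled here simply as the union of both sets.
mixed_dict = static_dict | dynamic_dict

# Tokens from an imagined domain text; "xqz" plays the role of an OCR error.
tokens = ["the", "patient", "is", "treated", "with", "ibuprofen", "dosage", "xqz"]

static_cov = coverage(tokens, static_dict)
dynamic_cov = coverage(tokens, dynamic_dict)
mixed_cov = coverage(tokens, mixed_dict)

print(f"static:  {static_cov:.3f}")
print(f"dynamic: {dynamic_cov:.3f}")
print(f"mixed:   {mixed_cov:.3f}")
```

In this toy sample the mixed dictionary covers every token except the simulated OCR error, which reflects the intuition behind combining a general dictionary with domain vocabulary. Real experiments would also need frequency information and a candidate-ranking step, which this sketch leaves out.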
Added 04 Jul 2010
Updated 04 Jul 2010
Type Conference
Year 2003
Where ICDAR
Authors Christian M. Strohmaier, Christoph Ringlstetter, Klaus U. Schulz, Stoyan Mihov