
ICDAR
2005
IEEE

A Corpus for Comparative Evaluation of OCR Software and Postcorrection Techniques

We describe a new corpus collected for the comparative evaluation of OCR software and postcorrection techniques. The corpus is freely available to academic groups and for academic use. The major part of the corpus (2306 files) consists of Bulgarian documents, many of which contain both Cyrillic and Latin symbols. A smaller corpus of German documents has been added. All originals are real-life paper documents collected from enterprises and organizations. Most genres of written language and various document types are covered. The corpus contains the corresponding image files, rich metadata, text files obtained via OCR, ground truth data for hundreds of example pages, and alignment software for experiments.
Added 24 Jun 2010
Updated 24 Jun 2010
Type Conference
Year 2005
Where ICDAR
Authors Stoyan Mihov, Klaus U. Schulz, Christoph Ringlstetter, Veselka Dojchinova, Vanja Nakova