DRR
2003

Correcting OCR text by association with historical datasets

The Medical Article Records System (MARS) developed by the Lister Hill National Center for Biomedical Communications uses scanning, OCR and automated recognition and reformatting algorithms to generate electronic bibliographic citation data from paper biomedical journal articles. The multi-engine OCR server incorporated in MARS performs well in general, but fares less well with text printed in small or italic fonts. Affiliations are often printed in small italic fonts in the journals processed by MARS. Consequently, although the automatic processes generate much of the citation data correctly, the affiliation field frequently contains incorrect data, which must be manually corrected by verification operators. In contrast, author names are usually printed in large, normal fonts that are correctly converted to text by the OCR server. The National Library of Medicine's MEDLINE® database contains 11 million indexed citations for biomedical journal articles. This paper documents our ...
Added 31 Oct 2010
Updated 31 Oct 2010
Type Conference
Year 2003
Where DRR
Authors Susan E. Hauser, Jonathan Schlaifer, Tehseen F. Sabir, Dina Demner-Fushman, Scott Straughan, George R. Thoma