Digital weight watching: reconstruction of scanned documents

A web portal providing access to over 250,000 scanned and OCRed cultural heritage documents is analyzed. The collection consists of the complete Dutch Hansard from 1917 to 1995. Each web document consists of facsimile images of the original pages plus hidden OCRed text. Including an image for every page yields large files, of which less than 2% is the actual text. The portal's search interface provides poor ranking and uninformative document summaries (snippets). Users therefore have to weed out nonrelevant results themselves, and to do so they must assess the complete documents. This is a time-consuming and frustrating process because the large files take a long time to download and process. Instead of using the complete document for relevance assessment, we propose to use extended document summaries based on the OCRed text alone. We describe three kinds of summaries of increasing complexity. We elaborate on the most complex summary, a reconstruction of the or...
Tim Gielissen, Maarten Marx
Added 16 Feb 2011
Updated 16 Feb 2011
Type Journal
Year 2009
Where AND
Authors Tim Gielissen, Maarten Marx