Recent initiatives like the Million Book Project and Google Print Library Project have already archived several million books in digital format, and within a few years a significa...
Xiaoyue Wang, Lexiang Ye, Eamonn J. Keogh, Christi...
We present Opal, a light-weight framework for interactively locating missing web pages (HTTP status code 404). Opal is an example of “in vivo” preservation: harnessing the col...
We describe an HTML web page segmentation algorithm, which is applied to segment online medical journal articles (regular HTML and PDF-Converted-HTML files). The web page content ...
Large-scale digitization projects aimed at periodicals often have as input streams of completely unlabeled document images. In such situations, the results produced by the automat...
Iuliu Vasile Konya, Christoph Seibert, Sebastian G...
The standard method for making the full content of audio and video material searchable is to annotate it with human-generated meta-data that describes the content in a way that...