Commercial OCR packages work best with high-quality scanned images. They often produce poor results when the image is degraded, either because the original itself was of poor quality,...
Large quantities of documents on the Internet and in digital libraries are simply scanned and archived in image format, and many of them are packed in PDF files. The word search tool pr...
We describe an HTML web page segmentation algorithm, which is applied to segment online medical journal articles (regular HTML and PDF-Converted-HTML files). The web page content ...
Abstract: The data on the web, in digital libraries, in scientific repositories, etc. continues to grow at an increasing rate. Distribution is a key solution to overcome this data...
Fabian Groffen, Martin L. Kersten, Stefan Manegold
Abstract. Recent work on analyzing query logs shows that a significant fraction of queries are temporal, i.e., relevancy is dependent on time, and temporal queries play an importan...