The domain of Digital Libraries presents specific challenges for unsupervised information extraction to support both the automatic classification of documents and the enhancement ...
Mikalai Krapivin, Maurizio Marchese, Andrei Yadran...
In addition to the actual content, Web pages consist of navigational elements, templates, and advertisements. This boilerplate text typically is not related to the main content, ma...
This paper presents a new enhanced text extraction algorithm for degraded document images based on probabilistic models. The observed document image is considered as a...
In this paper, we propose a new application of a Bayesian language model based on the Pitman-Yor process for information retrieval. This model is a generalization of the Dirichlet distr...
This paper describes the first large-scale evaluation of information retrieval systems using Dutch documents and queries. We describe in detail the characteristics of the Dutch te...