We describe a joint probabilistic model for modeling the contents and inter-connectivity of document collections such as sets of web pages or research paper archives. The model is...
: This article is a revised and extended version of [VBG, 07]. We conjecture that the digitalization of historical text documents as a basis of data mining and information retrieva...
This paper studies the problem of extracting data from a Web page that contains several structured data records. The objective is to segment these data records, extract data items...
Exact substring matching queries on large data collections can be answered using q-gram indices, that store for each occurring q-byte pattern an (ordered) posting list with the po...
Ontology-driven search applications use ontological concepts either to index documents or to guide and understand the users. Since ontologies by nature are domain-dependent and app...