In this work we present an approach to capture the total semantics in multimedia-multimodal web pages. Our research improves upon the state-ofthe-art with two key features: (1) cap...
We present a probabilistic model for a document corpus that combines many of the desirable features of previous models. The model is called “GaP” for Gamma-Poisson, the distri...
This paper describes a novel Bayesian approach to unsupervised topic segmentation. Unsupervised systems for this task are driven by lexical cohesion: the tendency of wellformed se...
The research field of "extracting knowledge bases from text collections" seems to be mature: its target and its working hypotheses are clear. In this paper we propose a ...
Certain distinctions made in the lexicon of one language may be redundant when translating into another language. We quantify redundancy among source types by the similarity of th...