Text categorization by boosting automatically extracted concepts

Term-based representations of documents have found widespread use in information retrieval. However, one of the main shortcomings of such methods is that they largely disregard lexical semantics and, as a consequence, are not sufficiently robust with respect to variations in word usage. In this paper we investigate the use of concept-based document representations to supplement word- or phrase-based features. The utilized concepts are automatically extracted from documents via probabilistic latent semantic analysis. We propose to use AdaBoost to optimally combine weak hypotheses based on both types of features. Experimental results on standard benchmarks confirm the validity of our approach, showing that AdaBoost achieves consistent improvements by including additional semantic features in the learned ensemble.

Categories and Subject Descriptors: H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing—Indexing Methods; H.3.3 [Information Storage and Retrieval]: Inf...
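The idea in the abstract, boosting weak hypotheses drawn from both term features and concept features, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the weak hypotheses are simple threshold stumps, the function names (`stump`, `train_adaboost`, `predict`) are hypothetical, and the single "concept" score per document is hand-assigned as a stand-in for a value that probabilistic latent semantic analysis would estimate.

```python
import math

def stump(x, j, thr, pol):
    """Weak hypothesis: vote pol if feature j reaches thr, else -pol."""
    return pol if x[j] >= thr else -pol

def train_adaboost(X, y, rounds=10):
    """Discrete AdaBoost over threshold stumps (illustrative sketch).

    X: feature vectors, term features followed by concept features.
    y: labels in {-1, +1}.
    Returns a list of (alpha, feature_index, threshold, polarity).
    """
    n = len(X)
    w = [1.0 / n] * n  # uniform example weights to start
    ensemble = []
    for _ in range(rounds):
        best = None
        # Exhaustively pick the stump with the lowest weighted error.
        for j in range(len(X[0])):
            for thr in sorted({x[j] for x in X}):
                for pol in (1, -1):
                    err = sum(wi for wi, x, yi in zip(w, X, y)
                              if stump(x, j, thr, pol) != yi)
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol)
        err, j, thr, pol = best
        if err >= 0.5:
            break  # no stump beats chance on the current weights
        perfect = err <= 1e-10
        err = max(err, 1e-10)  # avoid division by zero below
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, j, thr, pol))
        if perfect:
            break  # training set already separated
        # Up-weight misclassified examples, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump(x, j, thr, pol))
             for wi, x, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * stump(x, j, t, p) for a, j, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy data (made up): columns are count("car"), count("automobile"),
# and a hand-assigned "vehicle" concept score standing in for pLSA output.
X = [[1, 0, 0.9], [0, 1, 0.8], [0, 0, 0.1], [0, 0, 0.2]]
y = [1, 1, -1, -1]
ensemble = train_adaboost(X, y, rounds=5)
```

In this toy set neither term feature alone separates the classes, because the two positive documents use different words. The concept feature does, so the learner selects it, which mirrors the robustness to word-usage variation that the abstract motivates.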
Lijuan Cai, Thomas Hofmann
Added 05 Jul 2010
Updated 05 Jul 2010
Type Conference
Year 2003