Towards Using Fewer Features for Text Classification

Abstract-- Text classification, or categorization, is a conventional classification problem applied to the text domain. When statistical classification methods are used, an important research issue is the selection of features from the training texts, each of which is treated as a feature vector. In this paper, we propose an approach to feature selection in text classification tasks based on exploiting external information that summarizes the text to be classified. In particular, we study the use of citation contexts in the categorization of academic publications with the Naive Bayesian method. A series of experiments has been performed on a corpus of Computer Science publications, from which we observe that the citation contexts of a publication can serve as a reliable and effective source for feature selection. We also derive some useful hints on reducing the number of features with a negligible effect on accuracy.
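The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how words are picked from citation contexts, so the sketch assumes the simplest rule (keep every word that occurs in any citation context) and pairs it with a standard multinomial Naive Bayes classifier with add-one smoothing. The function names and the toy documents, labels, and citation contexts are all invented for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for a toy example.
    return text.lower().split()


def select_features(citation_contexts):
    # Assumed selection rule: the vocabulary is every word seen in a citation context.
    vocab = set()
    for context in citation_contexts:
        vocab.update(tokenize(context))
    return vocab


def train_nb(docs, labels, vocab):
    # Multinomial Naive Bayes with add-one (Laplace) smoothing, restricted to vocab.
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(docs, labels):
        word_counts[label].update(t for t in tokenize(text) if t in vocab)
    model = {}
    for label, n_docs in class_counts.items():
        total = sum(word_counts[label].values())
        log_prior = math.log(n_docs / len(docs))
        log_likelihood = {
            w: math.log((word_counts[label][w] + 1) / (total + len(vocab)))
            for w in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model


def classify(model, text, vocab):
    # Words outside the selected vocabulary are ignored at prediction time too.
    tokens = [t for t in tokenize(text) if t in vocab]
    scores = {
        label: log_prior + sum(log_likelihood[t] for t in tokens)
        for label, (log_prior, log_likelihood) in model.items()
    }
    return max(scores, key=scores.get)


# Invented toy data: two "publications" and the sentences that cite them.
docs = [
    "we train a neural network classifier on labelled images",
    "the query optimizer chooses an index for each relational join",
]
labels = ["ML", "DB"]
citation_contexts = [
    "a neural network classifier",
    "index selection in the query optimizer",
]

vocab = select_features(citation_contexts)
model = train_nb(docs, labels, vocab)
prediction = classify(model, "a neural classifier for images", vocab)
```

In this toy setup the two full texts contain 19 distinct words, while the vocabulary drawn from the citation contexts has 10, so the classifier works with a smaller feature set chosen by what citing authors say about each paper.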
Yuan Yuan, Tianyang Gu
Added 30 Oct 2010
Updated 30 Oct 2010
Type Conference
Year 2006
Where DMIN