A Comparative Study on Feature Selection in Text Categorization

This paper is a comparative study of feature selection methods in statistical learning of text categorization. The focus is on aggressive dimensionality reduction. Five methods were evaluated, including term selection based on document frequency (DF), information gain (IG), mutual information (MI), a χ² test (CHI), and term strength (TS). We found IG and CHI most effective in our experiments. Using IG thresholding with a k-nearest neighbor classifier on the Reuters corpus, removal of up to 98% of unique terms actually yielded an improved classification accuracy (measured by average precision). DF thresholding performed similarly. Indeed we found strong correlations between the DF, IG and CHI values of a term. This suggests that DF thresholding, the simplest method with the lowest cost in computation, can be reliably used instead of IG or CHI when the computation of these measures is too expensive. TS compares favorably with the other methods with up to 50% vocabulary reduction but is n...
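To make three of the scores named in the abstract concrete, the following is a minimal sketch that computes DF, IG and CHI per term on a toy corpus. It is not the authors' code: the function names (`entropy`, `score_terms`), the four-document corpus and its labels are hypothetical, and the formulas are the commonly used definitions (IG as the reduction in category entropy from knowing whether a term is present, CHI from the 2x2 term/category contingency table, taking the maximum over categories), which are assumed here rather than taken from the truncated abstract.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (bits) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def score_terms(docs, labels):
    """Return {term: {"df", "ig", "chi"}} for tokenized docs and their labels."""
    n = len(docs)
    class_counts = Counter(labels)
    classes = sorted(class_counts)
    doc_sets = [set(d) for d in docs]
    vocab = set().union(*doc_sets)
    prior_entropy = entropy(list(class_counts.values()))
    scores = {}
    for term in vocab:
        # Per-category counts of documents that contain the term.
        with_t = Counter(l for s, l in zip(doc_sets, labels) if term in s)
        df = sum(with_t.values())
        without_t = [class_counts[c] - with_t[c] for c in classes]
        # IG = H(C) - P(t) H(C | t) - P(not t) H(C | not t)
        ig = (prior_entropy
              - (df / n) * entropy([with_t[c] for c in classes])
              - ((n - df) / n) * entropy(without_t))
        # CHI per category from the 2x2 table, then the maximum over categories.
        chi_max = 0.0
        for c in classes:
            a = with_t[c]              # term present, category c
            b = df - a                 # term present, other category
            c_ = class_counts[c] - a   # term absent, category c
            d = n - a - b - c_         # term absent, other category
            denom = (a + c_) * (b + d) * (a + b) * (c_ + d)
            chi = n * (a * d - c_ * b) ** 2 / denom if denom else 0.0
            chi_max = max(chi_max, chi)
        scores[term] = {"df": df, "ig": ig, "chi": chi_max}
    return scores

# Hypothetical toy corpus: two "grain" and two "crude" documents.
docs = [
    ["wheat", "price", "rise"],
    ["wheat", "export", "price"],
    ["oil", "price", "fall"],
    ["oil", "export", "rise"],
]
labels = ["grain", "grain", "crude", "crude"]
scores = score_terms(docs, labels)

# Rank terms by information gain, highest first.
for term in sorted(scores, key=lambda t: -scores[t]["ig"]):
    s = scores[term]
    print(f"{term:7s} df={s['df']} ig={s['ig']:.3f} chi={s['chi']:.3f}")
```

Thresholding then amounts to keeping only the terms whose chosen score exceeds a cutoff. In this toy corpus, "wheat" and "oil" separate the two categories perfectly and receive the highest IG and CHI, while "export" appears once in each category and scores zero on both.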
Yiming Yang, Jan O. Pedersen
Added 06 Aug 2010
Updated 06 Aug 2010
Type Conference
Year 1997
Where ICML
Authors Yiming Yang, Jan O. Pedersen