Kernel PCA based clustering for inducing features in text categorization

We study dimensionality reduction, or feature selection, for the text document categorization problem. We focus on the first step in building a text categorization system: choosing an efficient numerical representation of natural-language text, which is then used by machine learning algorithms. We propose a representation based on word clusters. We build a kernel matrix from the distribution of words over the different categories and apply kernel PCA to extract a low-dimensional representation of the words. We then apply K-means clustering to this low-dimensional representation to group the words into clusters, and use these clusters in the document categorization task. We show that kernel PCA based clustering performs better than, or comparably to, several advanced clustering methods when applied to the standard Reuters corpus.
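The pipeline described in the abstract (word distributions over categories, kernel matrix, kernel PCA, K-means, cluster-based document features) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The RBF kernel, the `gamma` value, the farthest-first K-means initialization, the toy counts, and all function names are assumptions chosen for illustration; the abstract does not specify which kernel or settings were used.

```python
import numpy as np

def word_category_distributions(counts):
    """Row-normalize a (words x categories) count matrix into P(category | word)."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)

def kernel_pca(X, n_components, gamma=1.0):
    """Project the rows of X onto the leading kernel principal components.

    Assumption: an RBF kernel is used here; the source does not name a kernel.
    """
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    K = np.exp(-gamma * d2)
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    # Center the kernel matrix in feature space.
    Kc = K - one @ K - K @ one + one @ K @ one
    vals, vecs = np.linalg.eigh(Kc)
    idx = np.argsort(vals)[::-1][:n_components]
    vals, vecs = vals[idx], vecs[:, idx]
    # Coordinates of the training points along each component.
    return vecs * np.sqrt(np.maximum(vals, 0.0))

def kmeans(Z, k, n_iter=100):
    """Plain Lloyd iterations with a deterministic farthest-first start
    (an illustrative choice, not taken from the source)."""
    centers = [Z[0]]
    while len(centers) < k:
        d = np.min([((Z - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Z[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            Z[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def document_features(doc_word_counts, labels, k):
    """Sum each document's word counts within every word cluster."""
    membership = np.zeros((len(labels), k))
    membership[np.arange(len(labels)), labels] = 1.0
    return np.asarray(doc_word_counts, dtype=float) @ membership

# Hypothetical toy data: 6 words, 2 categories.
word_counts = [[9, 1], [8, 2], [10, 0], [1, 9], [0, 10], [2, 8]]
P = word_category_distributions(word_counts)
Z = kernel_pca(P, n_components=2)
labels = kmeans(Z, k=2)

# Two toy documents expressed as counts over the 6 words.
docs = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 2, 1, 0]]
features = document_features(docs, labels, k=2)
```

In this sketch each document ends up with one feature per word cluster rather than one per word, which is the dimensionality reduction the abstract refers to. The number of clusters and the number of retained components would need to be tuned on real data.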

Zsolt Minier, Lehel Csató
