KDD 2002, ACM

Combining clustering and co-training to enhance text classification using unlabelled data

In this paper, we present a new co-training strategy that makes use of unlabelled data. It trains two predictors in parallel, with each predictor labelling the unlabelled data for training the other predictor in the next round. Both predictors are support vector machines: one trained using data from the original feature space, the other trained with new features that are derived by clustering both the labelled and unlabelled data. Hence, unlike standard co-training methods, our method does not require the a priori existence of two redundant views, either of which could be used for classification, nor does it depend on the availability of two different supervised learning algorithms that complement each other. We evaluated our method with two classifiers and three text benchmarks: WebKB, Reuters newswire articles and 20 NewsGroups. Our evaluation shows that our co-training technique improves text classification accuracy, especially when the number of labelled examples is very small.
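The loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the use of k-means distances as the clustering-derived view, and the choices of k = 15 clusters, 10 pseudo-labelled examples per hand-off, and 5 rounds are all assumptions made for the sketch, and a synthetic dataset stands in for the text corpora.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic stand-in for a text corpus: 20 labelled + 380 unlabelled examples.
X, y = make_classification(n_samples=400, n_features=40, n_informative=8,
                           random_state=0)
seed = np.concatenate([np.flatnonzero(y == 0)[:10],
                       np.flatnonzero(y == 1)[:10]])
pool = set(range(len(X))) - set(seed.tolist())

# View 1: the original feature space.
# View 2: distances to k-means centroids fit on the labelled and unlabelled
# data together (k = 15 is an illustrative choice, not from the paper).
km = KMeans(n_clusters=15, n_init=10, random_state=0).fit(X)
views = [X, km.transform(X)]

# Per-predictor training sets: example indices and (pseudo-)labels.
idx = [seed.tolist(), seed.tolist()]
lab = [y[seed].tolist(), y[seed].tolist()]

for _ in range(5):                        # a few co-training rounds
    for i in (0, 1):
        clf = LinearSVC(C=1.0).fit(views[i][idx[i]], lab[i])
        cand = np.array(sorted(pool))
        # Confidence = distance from the SVM decision boundary.
        margin = clf.decision_function(views[i][cand])
        top = cand[np.argsort(-np.abs(margin))[:10]]
        # Hand the most confident pseudo-labels to the *other* predictor.
        j = 1 - i
        idx[j].extend(top.tolist())
        lab[j].extend(clf.predict(views[i][top]).tolist())
        pool -= set(top.tolist())
```

Because each predictor labels data for its peer rather than for itself, an error made in one view is only propagated if the peer's view reinforces it, which is the usual motivation for co-training.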
Type: Conference
Year: 2002
Where: KDD
Authors: Bhavani Raskutti, Herman L. Ferrá, Adam Kowalczyk