On the use of linear programming for unsupervised text classification

16 years 3 months ago

Download www.cs.cornell.edu

We propose a new algorithm for dimensionality reduction and unsupervised text classification. We use mixture models as underlying process of generating corpus and utilize a novel, L1-norm based approach introduced by Kleinberg and Sandler [19]. We show that our algorithm performs extremely well on large datasets, with peak accuracy approaching that of supervised learning based on Support Vector Machines with large training sets. The method is based on the same idea that underlies Latent Semantic Indexing (LSI). We find a good low-dimensional subspace of a feature space and project all documents into it. However our projection minimizes different error, and unlike LSI we build a basis, that in many cases corresponds to the actual topics. We present the testing results of rithm on the abstracts of arXiv- an electronic repository of scientific papers, and the 20 Newsgroup dataset - a small snapshot of 20 specific newsgroups. Categories and Subject Descriptors H.3.3 [Information Storage a...

Mark Sandler

Real-time Traffic