SIGIR
2002
ACM

Unsupervised document classification using sequential information maximization

We present a novel sequential clustering algorithm motivated by the Information Bottleneck (IB) method. In contrast to the agglomerative IB algorithm, the new sequential (sIB) approach is guaranteed to converge to a local maximum of the information, as required by the original IB principle. Moreover, its time and space complexity are significantly improved. We apply this algorithm to unsupervised document classification. In our evaluation on small and medium-sized corpora, the sIB is consistently superior to all the other clustering methods we examine, typically by a significant margin. Moreover, the sIB results are comparable to those obtained by a supervised Naive Bayes classifier. Finally, we propose a simple procedure for trading a cluster's recall for higher precision, and show how this approach can extract clusters that match the existing topics of the corpus almost perfectly.

Categories and Subject Descriptors: I.5.3 [Pattern Recognition]: Clustering...
Noam Slonim, Nir Friedman, Naftali Tishby
Added 23 Dec 2010
Updated 23 Dec 2010
Type Conference
Year 2002
Where SIGIR