Document clustering based on cluster validation

13 years 9 months ago


This paper presents a document clustering algorithm based on cluster validation that can identify both the important feature words and the true model order (the number of clusters). An important feature subset is selected by optimizing a cluster validity criterion subject to a constraint. To identify the model order, this feature selection procedure is run for each candidate number of clusters. The feature subset and cluster number that maximize the cluster validity criterion are chosen as the answer. We applied the algorithm to several datasets from the 20Newsgroup corpus. Experimental results show that it finds an important feature subset, estimates the model order, and yields higher micro-averaged precision than four other document clustering algorithms that require the cluster number to be provided.

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Clustering; I.5.2 [Design Methodology]: Feature evaluation and selec...
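The search loop described in the abstract (run feature selection for each candidate cluster number, then keep the pair that maximizes the validity criterion) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the validity score is a Calinski-Harabasz-style ratio of between-cluster to within-cluster dispersion, the clusterer is a plain k-means, the constraint is a hypothetical minimum subset size `min_features`, and feature subsets are enumerated exhaustively, which is only feasible for a handful of features. The names `select_model`, `validity`, and `kmeans` are likewise hypothetical.

```python
import itertools
import random

def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=25, seed=0):
    """Plain k-means with a fixed seed; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # leave the center of an empty cluster where it was
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def validity(points, labels):
    """Stand-in validity score (assumption): between/within dispersion ratio."""
    n = len(points)
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)
    if k < 2 or n <= k:
        return 0.0  # degenerate partition, no meaningful score
    mean = [sum(col) / n for col in zip(*points)]
    between = within = 0.0
    for members in clusters.values():
        center = [sum(col) / len(members) for col in zip(*members)]
        between += len(members) * dist2(center, mean)
        within += sum(dist2(p, center) for p in members)
    return (between / (k - 1)) / (within / (n - k) + 1e-12)

def select_model(docs, k_values, min_features):
    """Return (score, k, feature_indices) maximizing the validity score."""
    n_features = len(docs[0])
    best = (float("-inf"), None, None)
    for k in k_values:  # one feature-selection pass per candidate cluster number
        for size in range(min_features, n_features + 1):
            for feats in itertools.combinations(range(n_features), size):
                projected = [[d[f] for f in feats] for d in docs]
                score = validity(projected, kmeans(projected, k))
                if score > best[0]:
                    best = (score, k, feats)
    return best

# Toy "documents": features 0 and 1 carry cluster structure, feature 2 is noise.
rng = random.Random(1)
docs = [[cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5), rng.uniform(0, 10)]
        for cx, cy in [(0, 0), (10, 0), (0, 10)] for _ in range(10)]
score, k, feats = select_model(docs, k_values=range(2, 5), min_features=2)
print(k, feats, round(score, 2))
```

The constraint matters in this sketch: without a lower bound on subset size, a dispersion-ratio score can favor a tiny subset that separates only some of the clusters. A real document collection would also need a non-exhaustive search over feature subsets.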

Zheng-Yu Niu, Dong-Hong Ji, Chew Lim Tan
