Sciweavers

KDD
2010
ACM

Document clustering via dirichlet process mixture model with feature selection

An essential issue in document clustering is estimating the appropriate number of clusters into which a document collection should be partitioned. In this paper, we propose a novel approach, namely DPMFS, to address this issue. The proposed approach is designed 1) to group documents into a set of clusters, with the number of clusters determined automatically by the Dirichlet process mixture model; and 2) to identify the discriminative words and separate them from irrelevant noise words via a stochastic search variable selection technique. We evaluate the proposed approach on both a synthetic dataset and several realistic document datasets. The comparison between our approach and state-of-the-art document clustering approaches indicates that our approach is robust and effective for document clustering.
Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications-Data Mining; I.5.3 [Pattern Recognition]: Clustering ...
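To illustrate how a Dirichlet process prior can leave the number of clusters open, here is a minimal sketch of the Chinese restaurant process, a standard sequential view of that prior. It is not the DPMFS method from the paper: it samples cluster labels from the prior only, with no document likelihood and no feature selection. The function name `crp_assignments` and the parameters `alpha` and `seed` are assumptions made for this illustration.

```python
import random


def crp_assignments(n_docs, alpha, seed=0):
    """Draw cluster labels for n_docs items from a Chinese restaurant process.

    Hypothetical helper for illustration only; alpha (> 0) is the
    concentration parameter of the Dirichlet process prior.
    """
    rng = random.Random(seed)
    counts = []  # counts[k] = number of items currently in cluster k
    labels = []
    for _ in range(n_docs):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new cluster
        else:
            counts[k] += 1  # join an existing cluster
        labels.append(k)
    return labels, counts


if __name__ == "__main__":
    labels, counts = crp_assignments(n_docs=200, alpha=1.0)
    print("number of clusters:", len(counts))
    print("cluster sizes:", counts)
```

The number of clusters is never passed in; it emerges from the draws. On average, a larger `alpha` yields more clusters. A full mixture model would additionally weight each choice by how well the document fits the cluster.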
Guan Yu, Ruizhang Huang, Zhaojun Wang
Added 14 Feb 2011
Updated 14 Feb 2011
Type Journal
Year 2010
Where KDD