Document clustering via Dirichlet process mixture model with feature selection

15 years 1 month ago


An essential issue in document clustering is estimating the appropriate number of clusters into which a document collection should be partitioned. In this paper, we propose a novel approach, namely DPMFS, to address this issue. The proposed approach is designed 1) to group documents into a set of clusters, with the number of clusters determined automatically by the Dirichlet process mixture model; and 2) to identify the discriminative words and separate them from irrelevant noise words via a stochastic search variable selection technique. We explore the performance of our proposed approach on both a synthetic dataset and several realistic document datasets. The comparison between our proposed approach and state-of-the-art document clustering approaches indicates that our approach is robust and effective for document clustering.

Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications - Data Mining; I.5.3 [Pattern Recognition]: Clustering ...
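As a rough illustration of the first design goal in the abstract, the sketch below samples cluster labels from a Chinese restaurant process, a standard way to represent the prior that a Dirichlet process places on partitions. It is not the DPMFS algorithm: it ignores document content and feature selection entirely, and the function name `crp_assignments` and the parameter `alpha` are assumptions made for this example. It only shows why the number of clusters does not need to be fixed in advance.

```python
import random


def crp_assignments(n_docs, alpha, seed=0):
    """Sample cluster labels for n_docs items from a Chinese restaurant process.

    alpha (> 0) is the concentration parameter; larger values tend to
    produce more clusters. This is a prior-only sketch, not DPMFS.
    """
    rng = random.Random(seed)
    labels = []   # labels[i] = cluster index of document i
    counts = []   # counts[k] = number of documents currently in cluster k
    for _ in range(n_docs):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)      # open a new cluster
        else:
            counts[k] += 1        # join an existing cluster
        labels.append(k)
    return labels, counts


if __name__ == "__main__":
    labels, counts = crp_assignments(n_docs=200, alpha=1.0)
    print("number of clusters:", len(counts))
    print("cluster sizes:", counts)
```

In a full mixture model, each weight would additionally be multiplied by the likelihood of the document's words under that cluster, so the data, not only the prior, drives how many clusters emerge.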

Guan Yu, Ruizhang Huang, Zhaojun Wang
