Document Representation and Dimension Reduction for Text Clustering

Increasingly large text datasets and the high dimensionality associated with natural language pose a major challenge in text mining. This research presents a systematic study of three document representation methods for text, combined with three dimension reduction techniques (DRTs), in the context of the text clustering problem. Several standard benchmark datasets are used. The three document representation methods considered are all based on the vector space model: word, multi-word term, and character N-gram representations. The dimension reduction methods are independent component analysis (ICA), latent semantic indexing (LSI), and a feature selection technique based on document frequency (DF). Results are compared in terms of clustering performance, using the k-means clustering algorithm. Experiments show that ICA and LSI are clearly better than DF on all datasets. For word and N-gram representation, ICA generally gives better result...
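
To make the word and character N-gram representations and DF-based feature selection concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the helper names (`char_ngrams`, `select_by_df`, `to_vectors`), the toy documents, and the choice to keep the features with the highest document frequency are all assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def words(text):
    """Return whitespace-separated, lowercased word tokens."""
    return text.lower().split()


def document_frequency(token_lists):
    """Count, for each feature, how many documents contain it at least once."""
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))  # set(): each document counts a feature once
    return df


def select_by_df(token_lists, keep):
    """Assumed selection rule: keep the `keep` features with the highest
    document frequency, breaking ties alphabetically."""
    df = document_frequency(token_lists)
    ranked = sorted(df, key=lambda feature: (-df[feature], feature))
    return ranked[:keep]


def to_vectors(token_lists, vocabulary):
    """Build raw term-frequency vectors (vector space model) over a vocabulary."""
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append([counts[feature] for feature in vocabulary])
    return vectors


# Hypothetical toy corpus, not one of the benchmark datasets.
docs = ["text clustering", "text mining", "data mining"]

word_tokens = [words(d) for d in docs]
vocab = select_by_df(word_tokens, keep=2)
vectors = to_vectors(word_tokens, vocab)

# The same pipeline works for character N-grams by swapping the tokenizer.
ngram_tokens = [char_ngrams(d, n=3) for d in docs]
ngram_vocab = select_by_df(ngram_tokens, keep=5)
ngram_vectors = to_vectors(ngram_tokens, ngram_vocab)
```

The resulting vectors could then be passed to a k-means implementation; LSI and ICA would instead project the full term-frequency matrix onto a smaller number of derived dimensions rather than selecting a subset of the original features.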

M. Mahdi Shafiei, Singer Wang, Roger Zhang, Evange

Database | Dimension Reduction | Document Representation Methods | ICDE 2007 | N-gram Representation |

» Bursty Feature Representation for Clustering Text Streams

» Kernel PCA based clustering for inducing features in text categorization

» Horizontal Reduction: Instance-Level Dimensionality Reduction for Similarity Search in Large...

» Efficient Prediction-Based Validation for Document Clustering

» Mining Clustering Dimensions

» Clustered SubMatrix Singular Value Decomposition

» Concept Chain Based Text Clustering

» Text Clustering on Latent Thematic Spaces: Variants, Strengths and Weaknesses

Post Info

Added	03 Jun 2010
Updated	03 Jun 2010
Type	Conference
Year	2007
Where	ICDE
Authors	M. Mahdi Shafiei, Singer Wang, Roger Zhang, Evangelos E. Milios, Bin Tang, Jane Tougas, Raymond J. Spiteri
