Efficient Phrase-Based Document Similarity for Clustering

10 years 6 months ago
Efficient Phrase-Based Document Similarity for Clustering
Phrase has been considered as a more informative feature term for improving the effectiveness of document clustering. In this paper, we propose a phrase-based document similarity to compute the pairwise similarities of documents based on the Suffix Tree Document (STD) model. By mapping each node in the suffix tree of STD model into a unique feature term in the Vector Space Document (VSD) model, the phrase-based document similarity naturally inherits the term tf-idf weighting scheme in computing the document similarity with phrases. We apply the phrase-based document similarity to the group-average Hierarchical Agglomerative Clustering (HAC) algorithm and develop a new document clustering approach. Our evaluation experiments indicate that the new clustering approach is very effective on clustering the documents of two standard document benchmark corpora OHSUMED and
Hung Chim, Xiaotie Deng
Added 15 Dec 2010
Updated 15 Dec 2010
Type Journal
Year 2008
Where TKDE
Authors Hung Chim, Xiaotie Deng
Comments (0)