A new suffix tree similarity measure for document clustering

In this paper, we propose a new similarity measure to compute the pairwise similarity of text-based documents based on the suffix tree document model. By applying the new suffix tree similarity measure in the Group-average Agglomerative Hierarchical Clustering (GAHC) algorithm, we developed a new suffix tree document clustering algorithm (NSTC). Experimental results on two standard document clustering benchmark corpora, OHSUMED and RCV1, indicate that the new clustering algorithm is very effective. Compared with the results of the traditional word term weight tf-idf similarity measure in the same GAHC algorithm, NSTC achieved an improvement of 51% in average F-measure score. Furthermore, we apply the new clustering algorithm to analyzing Web documents in online forum communities. A topic-oriented clustering algorithm is developed to help people assess, classify and search the Web documents in a large forum community. Categories and Subject ...
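The abstract does not spell out the similarity measure itself, so the following is only a minimal sketch of the general pipeline it describes: compute pairwise document similarities, then feed them to group-average agglomerative clustering. Several parts are assumptions made for illustration and are not taken from the paper: word n-grams stand in for the phrases a suffix tree would index, the tf-idf weighting formula is a common textbook variant, "group average" is read as the mean similarity over between-cluster document pairs, and all function names (`phrase_features`, `tfidf_vectors`, `cosine`, `gahc`) are hypothetical.

```python
import math
from collections import Counter


def phrase_features(text, max_len=3):
    """Count word n-grams up to max_len (assumed stand-in for suffix tree phrases)."""
    words = text.lower().split()
    feats = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            feats[" ".join(words[i:i + n])] += 1
    return feats


def tfidf_vectors(docs, max_len=3):
    """Build sparse tf-idf vectors over phrases (weighting scheme is an assumption)."""
    counts = [phrase_features(d, max_len) for d in docs]
    df = Counter()
    for c in counts:
        df.update(c.keys())  # document frequency of each phrase
    n = len(docs)
    return [
        {p: (1 + math.log(tf)) * math.log(1 + n / df[p]) for p, tf in c.items()}
        for c in counts
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[p] for p, w in u.items() if p in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def gahc(sim, k):
    """Merge the pair of clusters with the highest average between-cluster
    similarity until only k clusters remain."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > max(k, 1):
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(sim[i][j] for i in clusters[a] for j in clusters[b])
                avg = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or avg > best[0]:
                    best = (avg, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


docs = [
    "cat sat on the mat",
    "the cat sat on a mat",
    "stock prices fell sharply",
    "stock prices rose sharply",
]
vecs = tfidf_vectors(docs)
sim = [[cosine(u, v) for v in vecs] for u in vecs]
clusters = gahc(sim, k=2)
print(clusters)
```

Setting `max_len=1` reduces the features to single words, which roughly corresponds to the word-level tf-idf baseline the abstract compares against. This naive merge loop recomputes every cluster average on each pass, so it is suitable only for small illustrative inputs.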
Hung Chim, Xiaotie Deng
Added 22 Nov 2009
Updated 22 Nov 2009
Type Conference
Year 2007
Where WWW
Authors Hung Chim, Xiaotie Deng