CLUSEQ: Efficient and Effective Sequence Clustering

14 years 4 months ago
CLUSEQ: Efficient and Effective Sequence Clustering
Analyzing sequence data has become increasingly important recently in the area of biological sequences, text documents, web access logs, etc. In this paper, we investigate the problem of clustering sequences based on their structural features. As a widely recognized technique, clustering has proven to be very useful in detecting unknown object categories and revealing hidden correlations among objects. One difficulty that prevents clustering from being performed extensively on sequence data (in categorical domain) is the lack of an effective yet efficient similarity measure. Therefore, we propose a novel model (CLUSEQ) for sequence cluster by exploring significant statistical properties possessed by the sequences. The conditional probability distribution (CPD) of the next symbol given a preceding segment is derived and used to characterize sequence behavior and to support the similarity measure. A variation of the suffix tree, namely probabilistic suffix tree, is employed to organize ...
Jiong Yang, Wei Wang 0010
Added 01 Nov 2009
Updated 01 Nov 2009
Type Conference
Year 2003
Where ICDE
Authors Jiong Yang, Wei Wang 0010
Comments (0)