Pattern discovery in sequences under a Markov assumption

9 years 9 months ago
Pattern discovery in sequences under a Markov assumption
In this paper we investigate the general problem of discovering recurrent patterns that are embedded in categorical sequences. An important real-world problem of this nature is motif discovery in DNA sequences. We investigate the fundamental aspects of this data mining problem that can make discovery "easy" or "hard." We present a general framework for characterizing learning in this context by deriving the Bayes error rate for this problem under a Markov assumption. The Bayes error framework demonstrates why certain patterns are much harder to discover than others. It also explains the role of different parameters such as pattern length and pattern frequency in sequential discovery. We demonstrate how the Bayes error can be used to calibrate existing discovery algorithms, providing a lower bound on achievable performance. We discuss a number of fundamental issues that characterize sequential pattern discovery in this context, present a variety of empirical results...
Darya Chudova, Padhraic Smyth
Added 30 Nov 2009
Updated 30 Nov 2009
Type Conference
Year 2002
Where KDD
Authors Darya Chudova, Padhraic Smyth
Comments (0)