Sciweavers

BIBM
2010
IEEE

Probabilistic topic modeling for genomic data interpretation

Recently, the concept of a species containing both core and distributed genes, known as the supra- or pangenome theory, has been introduced. In this paper, we aim to develop a new method that analyzes the genome-level composition of DNA sequences in order to characterize a set of common genomic features shared by members of the same species and to identify their functional roles. To this end, we first apply a composition-based approach that breaks DNA sequences down into sub-reads called `N-mers' and represents the sequences by their N-mer frequencies. Then, we introduce the Latent Dirichlet Allocation (LDA) model to study the genome-level statistical patterns (a.k.a. latent topics) of the N-mer features. Each estimated latent topic represents a certain component of the whole genome. With the help of the BioJava toolkit, we access the gene region information of reference sequences from the NCBI database. We use our data mining framework to investigate two areas: 1) do strains ...
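As a rough illustration of the composition-based step in the abstract, the sketch below turns a DNA string into a vector of N-mer frequencies. It is not the authors' implementation: the function name `nmer_frequencies`, the use of overlapping windows, and the choice to skip windows containing ambiguous bases such as `N` are all assumptions made for this example.

```python
from collections import Counter
from itertools import product


def nmer_frequencies(seq, n=3, alphabet="ACGT"):
    """Return relative frequencies of overlapping N-mers in seq.

    Hypothetical helper: windows containing characters outside
    `alphabet` (for example the ambiguity code N) are skipped.
    """
    seq = seq.upper()
    allowed = set(alphabet)
    counts = Counter(
        seq[i:i + n]
        for i in range(len(seq) - n + 1)
        if set(seq[i:i + n]) <= allowed
    )
    total = sum(counts.values())
    # Fixed vocabulary order so every sequence yields the same feature layout.
    vocab = ["".join(p) for p in product(alphabet, repeat=n)]
    return {w: (counts[w] / total if total else 0.0) for w in vocab}


freqs = nmer_frequencies("ACGTACGT", n=2)
print(len(freqs), round(sum(freqs.values()), 6))
```

Under these assumptions, each sequence plays the role of a "document" and each N-mer the role of a "word", so the resulting vectors (or the raw counts behind them) are the kind of input a topic model such as LDA would consume.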
Xin Chen, Xiaohua Hu, Xiajiong Shen, Gail Rosen
Added 10 Feb 2011
Updated 10 Feb 2011
Type Journal
Year 2010
Where BIBM
Authors Xin Chen, Xiaohua Hu, Xiajiong Shen, Gail Rosen