Towards noise-resilient document modeling

11 years 3 months ago
Towards noise-resilient document modeling
We introduce a generative probabilistic document model based on latent Dirichlet allocation (LDA), to deal with textual errors in the document collection. Our model is inspired by the fact that most large-scale text data are machine-generated and thus inevitably contain many types of noise. The new model, termed as TE-LDA, is developed from the traditional LDA by adding a switch variable into the term generation process in order to tackle the issue of noisy text data. Through extensive experiments, the efficacy of our proposed model is validated using both real and synthetic data sets. Categories and Subject Descriptors
Tao Yang, Dongwon Lee
Added 13 Dec 2011
Updated 13 Dec 2011
Type Journal
Year 2011
Where CIKM
Authors Tao Yang, Dongwon Lee
Comments (0)