Sciweavers

KDD
2004
ACM

Probabilistic author-topic models for information discovery

14 years 4 months ago
Probabilistic author-topic models for information discovery
We propose a new unsupervised learning technique for extracting information from large text collections. We model documents as if they were generated by a two-stage stochastic process. Each author is represented by a probability distribution over topics, and each topic is represented as a probability distribution over words for that topic. The words in a multi-author paper are assumed to be the result of a mixture of each authors' topic mixture. The topic-word and author-topic distributions are learned from data in an unsupervised manner using a Markov chain Monte Carlo algorithm. We apply the methodology to a large corpus of abstracts and 85,000 authors from the well-known CiteSeer digital library, and learn a model with 300 topics. We discuss in detail the interpretation of the results discovered by the system including specific topic and author models, ranking of authors by topic and topics by author, significant trends in the computer science literature between 2002, parsing ...
Mark Steyvers, Padhraic Smyth, Michal Rosen-Zvi, T
Added 30 Nov 2009
Updated 30 Nov 2009
Type Conference
Year 2004
Where KDD
Authors Mark Steyvers, Padhraic Smyth, Michal Rosen-Zvi, Thomas L. Griffiths
Comments (0)