Sciweavers

ECIR
2010
Springer

Extracting Multilingual Topics from Unaligned Comparable Corpora

Topic models have been studied extensively in the context of monolingual corpora. Although there have been some attempts to mine topical structure from cross-lingual corpora, they require clues about document alignments. In this paper we present a generative model called JointLDA, which uses a bilingual dictionary to mine multilingual topics from an unaligned corpus. Experiments conducted on different data sets confirm our conjecture that jointly modeling the cross-lingual corpora offers several advantages over individual monolingual models. Because the JointLDA model merges related topics in different languages into a single multilingual topic, (a) it can fit the data with relatively fewer topics, and (b) it can predict related words from a language different from that of the given document. In fact, it has better predictive power than the bag-of-words based translation model, leaving open the possibility that JointLDA may be preferred over the bag-of-words model for cross-lingual IR app...
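As a rough illustration of why a bilingual dictionary lets unaligned documents in two languages share one topic space, the sketch below maps translation pairs to a common concept id and counts concepts per document. It is not the JointLDA model itself: the toy English-Spanish dictionary, the language codes, and the helper names `build_concept_map` and `to_concepts` are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical toy bilingual dictionary (English -> Spanish); not from the paper.
BILINGUAL_DICT = {"house": "casa", "dog": "perro", "water": "agua"}


def build_concept_map(dictionary):
    """Give both words of each translation pair the same integer concept id."""
    concept_of = {}
    for concept_id, (src, tgt) in enumerate(sorted(dictionary.items())):
        concept_of[("en", src)] = concept_id
        concept_of[("es", tgt)] = concept_id
    return concept_of


def to_concepts(tokens, lang, concept_of):
    """Count concepts in one document.

    Words missing from the dictionary keep a language-tagged key,
    so they never collide across languages (an assumption of this sketch).
    """
    counts = Counter()
    for tok in tokens:
        key = (lang, tok)
        counts[concept_of.get(key, key)] += 1
    return counts


concept_of = build_concept_map(BILINGUAL_DICT)
en_doc = to_concepts(["dog", "water", "dog", "the"], "en", concept_of)
es_doc = to_concepts(["perro", "agua", "el"], "es", concept_of)

# Concept ids that occur in both documents, although no word string is shared.
shared = sorted(k for k in en_doc if k in es_doc)
print(shared)
```

Once documents are expressed over shared concept ids, a single set of topics can in principle cover both languages, which is the intuition behind fitting the data with fewer topics than two separate monolingual models would need.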
Jagadeesh Jagarlamudi, Hal Daumé III
Added 29 Oct 2010
Updated 29 Oct 2010
Type Conference
Year 2010
Where ECIR
Authors Jagadeesh Jagarlamudi, Hal Daumé III