Sciweavers

Share
IJCAI
2007

Pseudo-Aligned Multilingual Corpora

8 years 7 months ago
Pseudo-Aligned Multilingual Corpora
In machine translation, document alignment refers to finding correspondences between documents which are exact translations of each other. We define pseudo-alignment as the task of finding topical—as opposed to exact—correspondences between documents in different languages. We apply semisupervised methods to pseudo-align multilingual corpora. Specifically, we construct a topicbased graph for each language. Then, given exact correspondences between a subset of documents, we project the unaligned documents into a shared lower-dimensional space. We demonstrate that close documents in this lower-dimensional space tend to share the same topic. This has applications in machine translation and cross-lingual information analysis. Experimental results show that pseudo-alignment of multilingual corpora is feasible and that the document alignments produced are qualitatively sound. Our technique requires no linguistic knowledge of the corpus. On average when 10% of the corpus consists of ...
Fernando Diaz, Donald Metzler
Added 29 Oct 2010
Updated 29 Oct 2010
Type Conference
Year 2007
Where IJCAI
Authors Fernando Diaz, Donald Metzler
Comments (0)
books