Pseudo-Aligned Multilingual Corpora

In machine translation, document alignment refers to finding correspondences between documents that are exact translations of each other. We define pseudo-alignment as the task of finding topical, as opposed to exact, correspondences between documents in different languages. We apply semi-supervised methods to pseudo-align multilingual corpora. Specifically, we construct a topic-based graph for each language. Then, given exact correspondences between a subset of documents, we project the unaligned documents into a shared lower-dimensional space. We demonstrate that close documents in this lower-dimensional space tend to share the same topic. This has applications in machine translation and cross-lingual information analysis. Experimental results show that pseudo-alignment of multilingual corpora is feasible and that the document alignments produced are qualitatively sound. Our technique requires no linguistic knowledge of the corpus. On average when 10% of the corpus consists of ...
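The pipeline described in the abstract (one graph per language, a few known exact correspondences, projection into a shared low-dimensional space) can be illustrated with a small sketch. This is not the authors' implementation: it assumes a Laplacian-eigenmap embedding of a joint graph as one plausible way to realize the projection, and every name in it (`gaussian_graph`, `joint_embedding`, `pseudo_align`, the `coupling` weight, the toy feature vectors) is hypothetical and chosen only for illustration.

```python
import numpy as np

def gaussian_graph(features, sigma=1.0):
    """Dense similarity graph with w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum((features[:, None, :] - features[None, :, :]) ** 2, axis=-1)
    w = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(w, 0.0)  # no self-loops
    return w

def joint_embedding(w_a, w_b, aligned_pairs, dim=1, coupling=10.0):
    """Embed both languages in one space (assumed technique, not the paper's).

    The two per-language graphs are stacked into one joint graph, known
    exact correspondences are tied together with a strong edge, and the
    eigenvectors of the joint graph Laplacian with the smallest nonzero
    eigenvalues are used as shared coordinates.
    """
    n_a, n_b = w_a.shape[0], w_b.shape[0]
    w = np.zeros((n_a + n_b, n_a + n_b))
    w[:n_a, :n_a] = w_a
    w[n_a:, n_a:] = w_b
    for i, j in aligned_pairs:
        w[i, n_a + j] = w[n_a + j, i] = coupling
    laplacian = np.diag(w.sum(axis=1)) - w
    _, vecs = np.linalg.eigh(laplacian)   # eigenvalues in ascending order
    coords = vecs[:, 1:dim + 1]           # skip the constant eigenvector
    return coords[:n_a], coords[n_a:]

def pseudo_align(emb_a, emb_b):
    """For each language-A document, index of the nearest language-B document."""
    dists = np.linalg.norm(emb_a[:, None, :] - emb_b[None, :, :], axis=-1)
    return dists.argmin(axis=1)

# Toy data (made up): six documents per language, two topics each.
# Each language has its own feature space; only documents 0 and 3
# are known to be exact translations of each other.
docs_a = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2],
                   [2.0, 2.0], [2.2, 2.0], [2.0, 2.2]])
docs_b = np.array([[1.0, 0.0, 0.0], [1.0, 0.2, 0.0], [1.0, 0.0, 0.2],
                   [3.0, 2.0, 2.0], [3.0, 2.2, 2.0], [3.0, 2.0, 2.2]])
topics = np.array([0, 0, 0, 1, 1, 1])   # same topic layout in both languages
known_pairs = [(0, 0), (3, 3)]

emb_a, emb_b = joint_embedding(gaussian_graph(docs_a),
                               gaussian_graph(docs_b),
                               known_pairs)
matches = pseudo_align(emb_a, emb_b)
print(matches)
```

In this toy setting each language-A document is matched to some language-B document on the same topic, not necessarily to its exact translation, which is the distinction between pseudo-alignment and exact alignment drawn above. A real corpus would need sparse graphs and a sparse eigensolver rather than the dense matrices used here.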
Fernando Diaz, Donald Metzler
Added 29 Oct 2010
Updated 29 Oct 2010
Type Conference
Year 2007
Authors Fernando Diaz, Donald Metzler