Rare Word Translation Extraction from Aligned Comparable Documents

We present the first known result of high-precision bilingual extraction of rare words from comparable corpora, using aligned comparable documents and supervised classification. We incorporate two features, a context-vector similarity and a co-occurrence model between words in aligned documents, in a machine learning approach. We test our hypothesis on different language pairs and corpora. We obtain a very high F-measure, between 80% and 98%, for recognizing and extracting correct translations of rare terms (from 1 to 5 occurrences). Moreover, we show that our system can be trained on one pair of languages and tested on a different pair, obtaining an F-measure of 77% for the classification of Chinese-English translations using a Spanish-French training corpus. Our method is therefore applicable even to low-resource languages without training data.
Emmanuel Prochasson, Pascale Fung
Added 23 Aug 2011
Updated 23 Aug 2011
Type Journal
Year 2011
Where ACL
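
The two features named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how either feature is computed, so the cosine measure, the window size, the seed dictionary, and the Jaccard-style co-occurrence score below are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def context_vector(word, docs, window=2):
    """Count the tokens seen within `window` positions of `word`."""
    vec = Counter()
    for doc in docs:
        for i, tok in enumerate(doc):
            if tok == word:
                lo = max(0, i - window)
                vec.update(doc[lo:i] + doc[i + 1:i + 1 + window])
    return vec


def translate_vector(vec, seed):
    """Map a source context vector into the target language.

    `seed` is an assumed small bilingual dictionary; words missing
    from it are dropped.
    """
    out = Counter()
    for w, count in vec.items():
        if w in seed:
            out[seed[w]] += count
    return out


def cooccurrence(src_word, tgt_word, aligned_pairs):
    """Assumed co-occurrence score: Jaccard overlap of the aligned
    document pairs in which each word appears."""
    src_ids = {i for i, (s, _) in enumerate(aligned_pairs) if src_word in s}
    tgt_ids = {i for i, (_, t) in enumerate(aligned_pairs) if tgt_word in t}
    union = src_ids | tgt_ids
    return len(src_ids & tgt_ids) / len(union) if union else 0.0


def features(src_word, tgt_word, aligned_pairs, seed):
    """Return (context similarity, co-occurrence) for a candidate pair."""
    src_docs = [s for s, _ in aligned_pairs]
    tgt_docs = [t for _, t in aligned_pairs]
    src_vec = translate_vector(context_vector(src_word, src_docs), seed)
    tgt_vec = context_vector(tgt_word, tgt_docs)
    return cosine(src_vec, tgt_vec), cooccurrence(src_word, tgt_word, aligned_pairs)


# Toy Spanish-French data, invented purely for illustration.
aligned_pairs = [
    (["el", "gato", "come", "pescado"], ["le", "chat", "mange", "poisson"]),
    (["el", "perro", "come", "carne"], ["le", "chien", "mange", "viande"]),
]
seed = {"el": "le", "come": "mange"}

print(features("gato", "chat", aligned_pairs, seed))
print(features("gato", "chien", aligned_pairs, seed))
```

In this toy data both candidates receive the same context similarity, since a word that occurs once has a very sparse context vector, and only the co-occurrence score separates them. The resulting feature pairs could then be passed, with correct/incorrect labels, to any supervised classifier. Because neither feature depends on the identity of the languages, this is also compatible with training on one language pair and testing on another, as the abstract reports.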
