Sciweavers

LREC
2010

Automatic Acquisition of Parallel Corpora from Websites with Dynamic Content

13 years 5 months ago
Automatic Acquisition of Parallel Corpora from Websites with Dynamic Content
Parallel corpora are indispensable resources for a variety of multilingual natural language processing tasks. This paper presents a technique for fully automatic construction of constantly growing parallel corpora. We propose a simple and effective dictionary-based algorithm to extract parallel document pairs from a large collection of articles retrieved from the Internet, potentially containing manually translated texts. This algorithm was implemented and tested on Hebrew-English parallel texts. With properly selected thresholds, precision of 100% can be obtained.
Yulia Tsvetkov, Shuly Wintner
Added 29 Oct 2010
Updated 29 Oct 2010
Type Conference
Year 2010
Where LREC
Authors Yulia Tsvetkov, Shuly Wintner
Comments (0)