Sciweavers

LREC
2010

Data Collection and IPR in Multilingual Parallel Corpora. Dutch Parallel Corpus

After three years of work, the Dutch Parallel Corpus (DPC) project has come to an end. The finalized corpus is a ten-million-word, high-quality, sentence-aligned, bidirectional parallel corpus of Dutch, English and French, with Dutch as the central language. In this paper we present the corpus and formulate some basic data collection principles based on the work carried out for the project. Building a corpus is a difficult and time-consuming task, especially when copyright has to be cleared for every text sample included. The DPC is balanced across five text types (literature, journalistic texts, instructive texts, administrative texts and external communication texts) and four translation directions (Dutch-English, English-Dutch, Dutch-French and French-Dutch). Copyright was cleared for all the text material. The data collection process required the involvement of different text providers, which resulted in four different licence agreements being drawn up....
Orphée De Clercq, Maribel Montero Perez
Added 29 Oct 2010
Updated 29 Oct 2010
Type Conference
Year 2010
Where LREC
Authors Orphée De Clercq, Maribel Montero Perez