Sciweavers

EMNLP
2009

Unsupervised Tokenization for Machine Translation

13 years 2 months ago
Unsupervised Tokenization for Machine Translation
Training a statistical machine translation starts with tokenizing a parallel corpus. Some languages such as Chinese do not incorporate spacing in their writing system, which creates a challenge for tokenization. Moreover, morphologically rich languages such as Korean present an even bigger challenge, since optimal token boundaries for machine translation in these languages are often unclear. Both rule-based solutions and statistical solutions are currently used. In this paper, we present unsupervised methods to solve tokenization problem. Our methods incorporate information available from parallel corpus to determine a good tokenization for machine translation.
Tagyoung Chung, Daniel Gildea
Added 17 Feb 2011
Updated 17 Feb 2011
Type Journal
Year 2009
Where EMNLP
Authors Tagyoung Chung, Daniel Gildea
Comments (0)