Sciweavers

Share
EMNLP
2009

Unsupervised Tokenization for Machine Translation

11 years 3 months ago
Unsupervised Tokenization for Machine Translation
Training a statistical machine translation starts with tokenizing a parallel corpus. Some languages such as Chinese do not incorporate spacing in their writing system, which creates a challenge for tokenization. Moreover, morphologically rich languages such as Korean present an even bigger challenge, since optimal token boundaries for machine translation in these languages are often unclear. Both rule-based solutions and statistical solutions are currently used. In this paper, we present unsupervised methods to solve tokenization problem. Our methods incorporate information available from parallel corpus to determine a good tokenization for machine translation.
Tagyoung Chung, Daniel Gildea
Added 17 Feb 2011
Updated 17 Feb 2011
Type Journal
Year 2009
Where EMNLP
Authors Tagyoung Chung, Daniel Gildea
Comments (0)
books