Joint Tokenization and Translation

Because tokenization is ambiguous for many natural languages, such as Chinese and Korean, tokenization errors can introduce translation mistakes in translation systems that rely on 1-best tokenizations. While using lattices to offer more alternatives to translation systems has elegantly alleviated this problem, we take a further step and tokenize and translate jointly. Taking as input a sequence of atomic units that can be combined to form words in different ways, our joint decoder simultaneously produces a tokenization on the source side and a translation on the target side. By integrating tokenization and translation features in a discriminative framework, our joint decoder significantly outperforms baseline translation systems that use 1-best tokenizations and lattices on both Chinese-English and Korean-Chinese tasks. Interestingly, as a tokenizer, our joint decoder achieves significant improvements over monolingual Chinese tokenizers.
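To make the idea of choosing a tokenization and a translation together more concrete, here is a minimal toy sketch. It is not the decoder described in the paper: it assumes a monotone, word-by-word translation, a single additive score per entry, and a hypothetical hand-written table (`PHRASE_TABLE`) with invented scores. The function name `joint_decode` and the `max_len` parameter are likewise illustrative assumptions.

```python
# Hypothetical toy table: source word -> (target word, log-score).
# All entries and scores are invented for illustration only.
PHRASE_TABLE = {
    "研究": ("research", -1.0),
    "生命": ("life", -1.0),
    "研究生": ("graduate student", -1.5),
    "命": ("fate", -3.0),
    "研": ("grind", -4.0),
    "究": ("investigate", -4.0),
    "生": ("raw", -4.0),
}


def joint_decode(chars, table, max_len=3):
    """Pick the segmentation and translation with the highest summed score.

    chars is a string of atomic units (characters here). Returns a tuple
    (score, source_tokens, target_words), or None if no segmentation is
    covered by the table.
    """
    n = len(chars)
    # best[i] holds the best hypothesis covering chars[:i].
    best = [None] * (n + 1)
    best[0] = (0.0, [], [])
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            if best[start] is None:
                continue
            word = chars[start:end]
            if word not in table:
                continue
            target, score = table[word]
            total = best[start][0] + score
            if best[end] is None or total > best[end][0]:
                prev_tokens, prev_targets = best[start][1], best[start][2]
                best[end] = (total, prev_tokens + [word], prev_targets + [target])
    return best[n]


if __name__ == "__main__":
    print(joint_decode("研究生命", PHRASE_TABLE))
```

With these invented scores, the segmentation 研究 / 生命 beats 研究生 / 命 because the score of the translation of each candidate word contributes to the choice of tokenization. A real system would combine many tokenization and translation features, allow reordering, and use a language model, none of which this sketch attempts.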
Added 13 May 2011
Updated 13 May 2011
Type Journal
Year 2010
Authors Xinyan Xiao, Yang Liu, Young-Sook Hwang, Qun Liu, Shouxun Lin