Sciweavers

COLING
2010
Joint Tokenization and Translation
As tokenization is usually ambiguous for many natural languages such as Chinese and Korean, tokenization errors may introduce translation mistakes for translation sy...
Xinyan Xiao, Yang Liu, Young-Sook Hwang, Qun Liu, ...
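The ambiguity the abstract refers to can be made concrete with a small sketch (illustrative only, not code from the paper): given a toy lexicon, the same Chinese character string admits several tokenizations, and each one feeds a different input to the translation system.

def segmentations(chars, lexicon):
    # Return every way of splitting `chars` into words from `lexicon`.
    if not chars:
        return [[]]
    results = []
    for end in range(1, len(chars) + 1):
        word = chars[:end]
        if word in lexicon:
            for rest in segmentations(chars[end:], lexicon):
                results.append([word] + rest)
    return results

toy_lexicon = {"发展", "发展中", "中", "中国", "国家", "家"}
for seg in segmentations("发展中国家", toy_lexicon):
    print(" / ".join(seg))
# prints, among others, 发展中 / 国家 ("developing / countries") and
# 发展 / 中国 / 家 ("development / China / home"), which translate differently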
EMNLP
2009
Unsupervised Tokenization for Machine Translation
Training a statistical machine translation system starts with tokenizing a parallel corpus. Some languages such as Chinese do not incorporate spacing in their writing system, which creat...
Tagyoung Chung, Daniel Gildea
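In the same spirit, a generic sketch (assuming a toy unigram word model, not the unsupervised method of the paper) shows how a segmenter can choose among the possible tokenizations of an unspaced string by dynamic programming over word probabilities:

import math

def segment(text, probs, max_len=4):
    # best[i] holds (log-probability, segmentation) of the prefix text[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * len(text)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_len), i):
            word = text[j:i]
            if word in probs and best[j][0] > -math.inf:
                score = best[j][0] + math.log(probs[word])
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    return best[len(text)][1]

# hypothetical probabilities; a real system would estimate them from data
probs = {"发展": 0.3, "发展中": 0.1, "中": 0.05, "中国": 0.2, "国家": 0.3, "家": 0.05}
print(segment("发展中国家", probs))   # -> ['发展中', '国家']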
IR
2007
An empirical study of tokenization strategies for biomedical information retrieval
Due to the great variation of biological names in biomedical text, appropriate tokenization is an important preprocessing step for biomedical information retrieval. Despite its im...
Jing Jiang, ChengXiang Zhai
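A minimal sketch of the kind of strategy comparison at stake (illustrative strategies, not necessarily the ones evaluated in the paper): whether punctuation and letter/digit boundaries are split decides whether variants such as "IL-2", "IL2" and "IL 2" can match each other at retrieval time.

import re

def conservative(text):
    # split on whitespace only, so "il-2" stays a single token
    return text.lower().split()

def aggressive(text):
    # also break at punctuation and at letter/digit boundaries, so
    # "IL-2", "IL2" and "IL 2" all yield the tokens ["il", "2"]
    parts = re.split(r"[^a-z0-9]+|(?<=[a-z])(?=[0-9])|(?<=[0-9])(?=[a-z])",
                     text.lower())
    return [p for p in parts if p]

doc, query = "IL-2 receptor alpha", "IL2 receptor"
print(conservative(doc), conservative(query))  # the gene-name tokens differ
print(aggressive(doc), aggressive(query))      # both now contain "il" and "2"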
NLE
2008
Part-of-speech tagging of Modern Hebrew text
Words in Semitic texts often consist of a concatenation of word segments, each corresponding to a Part-of-Speech (POS) category. Semitic words may be ambiguous with regard to thei...
Roy Bar-Haim, Khalil Sima'an, Yoad Winter
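A toy illustration of that segment structure (hypothetical mini-lexicon, not the paper's model): a surface token may be a single word or a chain of prefix segments plus a word, and the readings carry different POS sequences. The classic example בצל can be the noun "onion" or the preposition ב ("in") followed by the noun צל ("shadow").

PREFIXES = {"ו": "CONJ", "ה": "DET", "ב": "PREP"}
WORDS = {"בית": "NOUN", "צל": "NOUN", "בצל": "NOUN"}

def analyses(token):
    # enumerate (segment, POS) sequences that cover the whole token
    if token in WORDS:
        yield [(token, WORDS[token])]
    for i in range(1, len(token)):
        prefix = token[:i]
        if prefix in PREFIXES:
            for rest in analyses(token[i:]):
                yield [(prefix, PREFIXES[prefix])] + rest

for analysis in analyses("בצל"):
    print(analysis)
# [('בצל', 'NOUN')]  and  [('ב', 'PREP'), ('צל', 'NOUN')]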
ACL
1998
One Tokenization per Source
We report in this paper the observation of one tokenization per source. That is, the same critical fragment in different sentences from the same source almost always realizes one a...
Jin Guo
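The observation can be checked mechanically; a toy sketch (hypothetical sources and sentences, not the paper's data) counts, per source, how many distinct segmentations a given character fragment receives:

from collections import defaultdict

def fragment_tokenization(tokens, fragment):
    # return the token sequence that exactly covers `fragment`, else None
    text = "".join(tokens)
    start = text.find(fragment)
    if start < 0:
        return None
    end, pos, covering = start + len(fragment), 0, []
    for tok in tokens:
        if pos < end and pos + len(tok) > start:
            covering.append(tok)
        pos += len(tok)
    return tuple(covering) if "".join(covering) == fragment else None

corpus = {  # hypothetical sources with already-tokenized sentences
    "source_A": [["发展中", "国家", "的", "经济"], ["发展中", "国家", "很多"]],
    "source_B": [["发展", "中国", "家"]],
}
variants = defaultdict(set)
for source, sentences in corpus.items():
    for tokens in sentences:
        seg = fragment_tokenization(tokens, "发展中国家")
        if seg:
            variants[source].add(seg)
print({s: len(v) for s, v in variants.items()})  # one tokenization per source -> all 1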
LREC
2008
A Trainable Tokenizer, solution for multilingual texts and compound expression tokenization
Tokenization is one of the first steps in almost any text processing task. It is not generally regarded as a challenging task for English monolingual systems, but it r...
Oana Frunza
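One generic way to make tokenization trainable (a sketch under assumed toy data, not the system described in the paper) is to treat every candidate split point as a classification decision, so that compound expressions the training data keeps together, such as "New York", are learned rather than hard-coded:

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def features(text, i):
    # a small window around the candidate split point text[i]
    left = text[:i].split()[-1] if text[:i].split() else ""
    right = text[i + 1:].split()[0] if text[i + 1:].split() else ""
    return {"left": left, "right": right,
            "left_cap": left[:1].isupper(), "right_cap": right[:1].isupper()}

# toy training data: compound expressions are single tokens containing spaces
train_tokens = [["He", "moved", "to", "New York", "last", "year"],
                ["She", "lives", "in", "Los Angeles", "now"],
                ["They", "met", "in", "the", "park"]]

X, y = [], []
for tokens in train_tokens:
    text = " ".join(tokens)
    boundary, pos = set(), 0
    for tok in tokens[:-1]:
        pos += len(tok)
        boundary.add(pos)      # the space that joins this token to the next
        pos += 1
    for i, ch in enumerate(text):
        if ch == " ":
            X.append(features(text, i))
            y.append(1 if i in boundary else 0)   # 1 = real token boundary

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)

test = "We flew to New Zealand today"
kept = [i for i, ch in enumerate(test) if ch == " "
        and clf.predict(vec.transform([features(test, i)]))[0] == 0]
# with luck, the only space kept inside a token is the one in "New Zealand"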