Sciweavers

COLING
2010
Joint Tokenization and Translation
As tokenization is usually ambiguous in many natural languages such as Chinese and Korean, tokenization errors may introduce translation mistakes for translation sy...
Xinyan Xiao, Yang Liu, Young-Sook Hwang, Qun Liu, ...
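The segmentation ambiguity these papers address can be sketched briefly: given a vocabulary without explicit word boundaries, one character string may split into several valid token sequences. A minimal illustration (the toy vocabulary and example string are illustrative, not taken from any of the papers listed here):

```python
def segmentations(s, vocab):
    """Return every way to split string s into words drawn from vocab."""
    if not s:
        return [[]]  # one way to segment the empty string: no tokens
    results = []
    for i in range(1, len(s) + 1):
        prefix = s[:i]
        if prefix in vocab:  # try every vocabulary word that starts s
            for rest in segmentations(s[i:], vocab):
                results.append([prefix] + rest)
    return results

# "发展中国家" (developing country) segments several ways under this toy vocabulary,
# e.g. 发展|中国|家 ("develop | China | home") vs. 发展中|国家 ("developing | country").
vocab = {"发展", "发展中", "中", "中国", "国", "国家", "家"}
print(segmentations("发展中国家", vocab))
```

A statistical system must choose among such analyses, which is why tokenization errors can propagate into translation or retrieval errors downstream.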
EMNLP
2009
Unsupervised Tokenization for Machine Translation
Training a statistical machine translation system starts with tokenizing a parallel corpus. Some languages, such as Chinese, do not incorporate spacing in their writing system, which creat...
Tagyoung Chung, Daniel Gildea
IR
2007
An empirical study of tokenization strategies for biomedical information retrieval
Due to the great variation of biological names in biomedical text, appropriate tokenization is an important preprocessing step for biomedical information retrieval. Despite its im...
Jing Jiang, ChengXiang Zhai
NLE
2008
Part-of-speech tagging of Modern Hebrew text
Words in Semitic texts often consist of a concatenation of word segments, each corresponding to a Part-of-Speech (POS) category. Semitic words may be ambiguous with regard to thei...
Roy Bar-Haim, Khalil Sima'an, Yoad Winter
ACL
1998
One Tokenization per Source
We report in this paper the observation of one tokenization per source. That is, the same critical fragment in different sentences from the same source almost always realizes one a...
Jin Guo
LREC
2008
A Trainable Tokenizer, solution for multilingual texts and compound expression tokenization
Tokenization is one of the initial steps done for almost any text processing task. It is not particularly recognized as a challenging task for English monolingual systems but it r...
Oana Frunza