A Stochastic Finite-State Word-Segmentation Algorithm for Chinese



We present a stochastic finite-state model for segmenting Chinese text into dictionary entries and productively derived words, and providing pronunciations for these words; the method incorporates a class-based model in its treatment of personal names. We also evaluate the system's performance, taking into account the fact that people often do not agree on a single segmentation.

THE PROBLEM

The initial step of any text analysis task is the tokenization of the input into words. For many writing systems, using whitespace as a delimiter for words yields reasonable results. However, for Chinese and other systems where whitespace is not used to delimit words, such trivial schemes will not work. Chinese writing is morphosyllabic (DeFrancis, 1984), meaning that each hanzi ('Chinese character') represents, nearly always, a single syllable that is usually also a single morpheme. Since in Chinese, as in English, words may be polysyllabic, and since hanzi are written with no in...
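To make the segmentation problem concrete, the following is a minimal sketch of lowest-cost dictionary segmentation, in which each word costs the negative log of its probability and dynamic programming picks the cheapest path through the text. It is a simplified stand-in for the idea of a weighted finite-state model, not the authors' system: the lexicon `TOY_LEXICON`, its probabilities, the function name `segment`, and the `max_len` parameter are all hypothetical, and the sketch handles neither productively derived words, personal names, nor pronunciations.

```python
import math

# Hypothetical toy lexicon: word -> made-up probability (not taken from the paper).
TOY_LEXICON = {
    "日": 0.02,
    "文": 0.02,
    "章": 0.01,
    "魚": 0.01,
    "日文": 0.03,
    "文章": 0.03,
    "章魚": 0.03,
}


def segment(text, lexicon, max_len=4):
    """Return (words, cost) for the lowest-cost segmentation of text.

    The cost of a word is -log(probability); the cost of a segmentation
    is the sum of its word costs. Raises ValueError if the lexicon
    cannot cover the whole input.
    """
    n = len(text)
    best = [math.inf] * (n + 1)  # best[i]: cheapest cost of segmenting text[:i]
    back = [0] * (n + 1)         # back[i]: start index of the last word in that path
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in lexicon and best[start] < math.inf:
                cost = best[start] - math.log(lexicon[word])
                if cost < best[end]:
                    best[end] = cost
                    back[end] = start

    if best[n] == math.inf:
        raise ValueError("no segmentation covers the input with this lexicon")

    # Follow the back-pointers from the end of the text to recover the words.
    words = []
    pos = n
    while pos > 0:
        words.append(text[back[pos]:pos])
        pos = back[pos]
    return words[::-1], best[n]


if __name__ == "__main__":
    text = "日文章魚"
    print(text.split())                 # whitespace splitting leaves one token
    print(segment(text, TOY_LEXICON))   # cost-based segmentation of the same text
```

With these toy weights, the two-word reading 日文 章魚 is cheaper than the competing reading 日 文章 魚, which illustrates how word costs can resolve an ambiguity that whitespace splitting cannot even detect.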

Richard Sproat, Chilin Shih, William Gale, Nancy Chang


ACL 1994 | ACL 2007 | Chinese Text | Stochastic Finite-state Model | Words Yields |


Added	02 Nov 2010
Updated	02 Nov 2010
Type	Conference
Year	1994
Where	ACL
Authors	Richard Sproat, Chilin Shih, William Gale, Nancy Chang

Sciweavers
