A Stochastic Finite-State Word-Segmentation Algorithm for Chinese

We present a stochastic finite-state model for segmenting Chinese text into dictionary entries and productively derived words, and providing pronunciations for these words; the method incorporates a class-based model in its treatment of personal names. We also evaluate the system's performance, taking into account the fact that people often do not agree on a single segmentation.

THE PROBLEM

The initial step of any text analysis task is the tokenization of the input into words. For many writing systems, using whitespace as a delimiter for words yields reasonable results. However, for Chinese and other systems where whitespace is not used to delimit words, such trivial schemes will not work. Chinese writing is morphosyllabic (DeFrancis, 1984), meaning that each hanzi, or 'Chinese character', (nearly always) represents a single syllable that is (usually) also a single morpheme. Since in Chinese, as in English, words may be polysyllabic, and since hanzi are written with no in...
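
To make the segmentation problem concrete, here is a minimal sketch of lowest-cost word segmentation over a toy dictionary. It is not the authors' finite-state model: it swaps in a plain unigram dynamic program, and the lexicon, counts, `UNKNOWN_COST`, and all function names are hypothetical values chosen for illustration only.

```python
import math

# Hypothetical toy lexicon: word -> made-up count (not from the paper).
TOY_COUNTS = {
    "日": 5, "文": 5, "章": 3, "魚": 4,
    "日文": 20, "文章": 25, "章魚": 15,
}
TOTAL = sum(TOY_COUNTS.values())
MAX_WORD_LEN = max(len(word) for word in TOY_COUNTS)
# Assumed fixed penalty for a single character missing from the lexicon.
UNKNOWN_COST = 20.0


def word_cost(word):
    """Cost of a dictionary word: negative log of its relative frequency."""
    return -math.log(TOY_COUNTS[word] / TOTAL)


def segment(text):
    """Return the segmentation of `text` with the lowest total cost."""
    n = len(text)
    best = [math.inf] * (n + 1)  # best[i]: lowest cost of segmenting text[:i]
    back = [0] * (n + 1)         # back[i]: start index of the last word in that path
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            piece = text[start:end]
            if piece in TOY_COUNTS:
                cost = word_cost(piece)
            elif len(piece) == 1:
                cost = UNKNOWN_COST  # fall back to an unknown single character
            else:
                continue
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start
    # Follow the back-pointers from the end of the text to recover the words.
    words = []
    pos = n
    while pos > 0:
        words.append(text[back[pos]:pos])
        pos = back[pos]
    return words[::-1]


print(segment("日文章魚"))
```

With these made-up counts, the string 日文章魚 can be split as 日文 + 章魚 or as 日 + 文章 + 魚, among others, and the sketch picks 日文 + 章魚 because that path has the lowest summed cost. Whitespace splitting has nothing to work with here, which is the difficulty the passage describes.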
Added 02 Nov 2010
Updated 02 Nov 2010
Type Conference
Year 1994
Where ACL
Authors Richard Sproat, Chilin Shih, William Gale, Nancy Chang