The Use of SVM for Chinese New Word Identification

We present a study of new word identification (NWI) to improve the performance of a Chinese word segmenter. In this paper the distribution and types of new words are discussed empirically. In particular, we focus on new words of two surface patterns, which account for more than 80% of the new words in our data sets: NW11 (a two-character new word) and NW21 (a bi-character word followed by a single character). NWI is defined as a binary classification problem, and a statistical learning approach based on an SVM classifier is used. Different features for NWI are explored, including the in-word probability of a character (IWP), the analogy between new words and lexicon words, an anti-word list, and frequency in documents. The experiments show that these features are useful for NWI. The F-scores of NWI we achieved are 64.4% and 54.7% for NW11 and NW21, respectively. The overall performance of the Chinese word segmenter could be improved by 24.5% in Roov (recall on out-of-vocabulary words) and 6.5% in F-score in the PK-close test of the 1st SIG...
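
The abstract names the in-word probability of a character (IWP) as a feature but does not give its formula. The sketch below assumes one plausible reading: IWP(c) is the fraction of a character's occurrences that fall inside multi-character words in a segmented corpus. The function names, the toy corpus, and the feature layout are hypothetical and are not taken from the paper. The resulting feature dictionary is the kind of input a binary classifier such as an SVM could consume.

```python
from collections import Counter


def in_word_probability(segmented_sentences):
    """Estimate IWP(c) = count(c inside multi-character words) / count(c).

    Assumption: this ratio is one plausible reading of "in-word
    probability"; the abstract itself does not define the formula.
    """
    in_word = Counter()
    total = Counter()
    for sentence in segmented_sentences:
        for word in sentence:
            for ch in word:
                total[ch] += 1
                if len(word) > 1:
                    in_word[ch] += 1
    return {ch: in_word[ch] / total[ch] for ch in total}


def nw11_features(candidate, iwp, default=0.5):
    """Build a hypothetical feature dict for a two-character candidate.

    Characters unseen in the corpus fall back to `default`, an
    arbitrary illustrative value.
    """
    first = iwp.get(candidate[0], default)
    second = iwp.get(candidate[1], default)
    return {
        "iwp_first": first,
        "iwp_second": second,
        "iwp_product": first * second,
    }


# Toy segmented corpus (hypothetical data, one list of words per sentence).
corpus = [
    ["我", "喜欢", "北京"],
    ["京", "剧", "很", "好"],
    ["北京", "很", "大"],
]

iwp = in_word_probability(corpus)
features = nw11_features("京剧", iwp)
```

In this toy corpus, "京" appears three times and twice inside a longer word, so its estimated IWP is 2/3, while "剧" only appears as a single-character token and gets 0. A real system would estimate these counts from a large segmented training corpus and combine them with the other features listed in the abstract.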

Hongqiao Li, Changning Huang, Jianfeng Gao, Xiaozhong Fan

Bi-character Word | Chinese Word Segmenter | IJCNLP 2004 | Statistical Learning Approach

» Unknown Word Extraction for Chinese Documents

» A New Prosodic Phrasing Model for Chinese TTS Systems

» Capturing Errors in Written Chinese Words

» Combining Machine Learning with Linguistic Heuristics for Chinese Word Segmentation

» Using Morphological and Syntactic Structures for Chinese Opinion Analysis

» Chinese Frame Identification using TCRF Model

» An IBMPC Environment For Chinese Corpus Analysis

» Multiclass SVM optimization using MCE training with application to topic identification

Post Info

Added	02 Jul 2010
Updated	02 Jul 2010
Type	Conference
Year	2004
Where	IJCNLP
Authors	Hongqiao Li, Changning Huang, Jianfeng Gao, Xiaozhong Fan
