Enhancing Domain Portability of Chinese Segmentation Model Using Chi-Square Statistics and Bootstrapping



Almost all Chinese language processing tasks begin with word segmentation of the input, so robust and reliable segmentation techniques are needed for those tasks to perform well. In recent years, machine learning and sequence labeling models such as Conditional Random Fields (CRFs) have often been used to segment Chinese text. Compared with traditional lexicon-driven models, machine-learned models achieve higher F-measure scores, but they depend heavily on their training material. Although they process text from the same domain as the training data effectively, they perform relatively poorly on text from new domains. In this paper, we propose to use χ² statistics when training an SVM-HMM based segmentation model to improve its ability to recall out-of-vocabulary (OOV) words, and then to use bootstrapping strategies to maintain its ability to recall in-vocabulary (IV) words. Experiments show the approach proposed in this paper enhances the do...
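The abstract does not say how the chi-square statistics are computed or how they enter the SVM-HMM model. As a rough illustration only, the sketch below scores adjacent character pairs with the standard chi-square formula for a 2x2 contingency table, which is one common way to measure how strongly two characters attract each other. The function name `chi_square_bigrams` and the toy input are assumptions made for this example, not taken from the paper.

```python
from collections import Counter


def chi_square_bigrams(text):
    """Return {(x, y): chi-square score} for adjacent character pairs.

    Hypothetical helper for illustration; not the authors' implementation.
    For each pair (x, y) the 2x2 contingency table is:
        a = count(x followed by y)      b = count(x followed by not-y)
        c = count(not-x followed by y)  d = count(not-x followed by not-y)
    and the score is n * (a*d - b*c)^2 / ((a+b) * (c+d) * (a+c) * (b+d)).
    """
    pairs = list(zip(text, text[1:]))
    n = len(pairs)
    pair_counts = Counter(pairs)
    left_counts = Counter(x for x, _ in pairs)
    right_counts = Counter(y for _, y in pairs)

    scores = {}
    for (x, y), a in pair_counts.items():
        b = left_counts[x] - a
        c = right_counts[y] - a
        d = n - a - b - c
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        # A zero denominator means a row or column of the table is empty,
        # so the statistic is undefined; 0.0 is returned by convention here.
        scores[(x, y)] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


if __name__ == "__main__":
    # Toy input: "a" is always followed by "b", so the pair scores highly.
    for pair, score in sorted(chi_square_bigrams("abab").items()):
        print(pair, score)
```

A high score suggests the two characters tend to occur together and may belong to the same word, which is plausibly useful for unseen words in a new domain. How such scores would be binned or combined into features is left open here, since the abstract gives no detail.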

Baobao Chang, Dongxu Han


EMNLP 2010 | Natural Language Processing | Segmentation Model | Texts | Word Segmentation


Post Info

Added	11 Feb 2011
Updated	11 Feb 2011
Type	Journal
Year	2010
Where	EMNLP
Authors	Baobao Chang, Dongxu Han

