Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling



In this paper, we propose a new Bayesian model for fully unsupervised word segmentation, together with an efficient blocked Gibbs sampler combined with dynamic programming for inference. Our model is a nested hierarchical Pitman-Yor language model, in which a Pitman-Yor spelling model is embedded in the word model. We confirmed that it significantly outperforms previously reported results on both phonetic transcripts and standard datasets for Chinese and Japanese word segmentation. Our model can also be viewed as a way to construct an accurate word n-gram language model directly from the characters of an arbitrary language, without any "word" indications.
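The abstract does not spell out how dynamic programming and sampling fit together. One common way to sample a whole segmentation at once is forward filtering followed by backward sampling, and the sketch below illustrates that idea only. It is not the authors' implementation: it assumes a unigram word model rather than an n-gram one, and `toy_word_prob`, `sample_segmentation`, and `max_len` are hypothetical names introduced here. The toy scoring function is unnormalized and merely stands in for a word model that falls back on a character-level spelling model.

```python
import random


def toy_word_prob(word):
    """Hypothetical, unnormalized stand-in for a word model.

    Known words get a fixed score; unknown strings get a score that
    decays with length, playing the role of a spelling-model fallback.
    """
    lexicon = {"the": 0.3, "cat": 0.2, "sat": 0.2}
    if word in lexicon:
        return lexicon[word]
    return 0.3 * (0.01 ** len(word))


def sample_segmentation(s, word_prob, max_len=4, rng=random):
    """Sample one segmentation of s in proportion to its total score."""
    n = len(s)
    # Forward filtering: alpha[t] sums the scores of all segmentations
    # of the prefix s[:t], using words of at most max_len characters.
    alpha = [0.0] * (n + 1)
    alpha[0] = 1.0
    for t in range(1, n + 1):
        for k in range(1, min(max_len, t) + 1):
            alpha[t] += word_prob(s[t - k:t]) * alpha[t - k]

    # Backward sampling: starting from the end of the string, draw the
    # length of the last word with weight word_prob(word) * alpha[start].
    words = []
    t = n
    while t > 0:
        lengths = list(range(1, min(max_len, t) + 1))
        weights = [word_prob(s[t - k:t]) * alpha[t - k] for k in lengths]
        k = rng.choices(lengths, weights=weights)[0]
        words.append(s[t - k:t])
        t -= k
    words.reverse()
    return words


if __name__ == "__main__":
    print(sample_segmentation("thecatsat", toy_word_prob))
```

In a blocked Gibbs sampler built on this idea, one would presumably remove a sentence's current words from the model, draw a new segmentation as above, and add the sampled words back, repeating over all sentences. That outer loop and the model updates are omitted here.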

Daichi Mochihashi, Takeshi Yamada, Naonori Ueda
