An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL


This paper proposes a fast and simple unsupervised word segmentation algorithm that utilizes the local predictability of adjacent character sequences while searching for a least-effort representation of the data. The model uses branching entropy to constrain the hypothesis space, so that it can efficiently obtain a solution that minimizes the length of a two-part MDL code. An evaluation on corpora in Japanese, Thai, and English, and on the "CHILDES" corpus for research in language development, shows that the algorithm achieves accuracy comparable to that of state-of-the-art methods in unsupervised word segmentation in significantly less computational time.
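The two ingredients named in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: it is a toy, assuming a single entropy threshold, a simplified two-part code (lexicon spelled out with a character unigram code, plus tokens coded with a word unigram code), and a fixed grid of candidate thresholds. All function names and parameters (`branching_entropy`, `segment`, `description_length`, `best_segmentation`, `n`, `thresholds`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def branching_entropy(corpus, n):
    """Map each n-character context to the entropy (in bits) of the next character."""
    followers = defaultdict(Counter)
    for i in range(len(corpus) - n):
        followers[corpus[i:i + n]][corpus[i + n]] += 1
    entropy = {}
    for context, counts in followers.items():
        total = sum(counts.values())
        entropy[context] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropy


def segment(corpus, n, threshold, entropy):
    """Place a boundary before position i when the preceding n-gram is hard to continue."""
    words, start = [], 0
    for i in range(n, len(corpus)):
        if entropy.get(corpus[i - n:i], 0.0) > threshold:
            words.append(corpus[start:i])
            start = i
    words.append(corpus[start:])
    return words


def description_length(words):
    """Simplified two-part code length in bits: lexicon cost plus data cost."""
    token_counts = Counter(words)
    total_tokens = len(words)
    # Data cost: each token coded with its empirical unigram probability.
    data_bits = -sum(
        c * math.log2(c / total_tokens) for c in token_counts.values()
    )
    # Model cost: each distinct word spelled once, with a terminator symbol.
    char_counts = Counter(ch for w in token_counts for ch in w + "\0")
    total_chars = sum(char_counts.values())
    model_bits = -sum(
        c * math.log2(c / total_chars) for c in char_counts.values()
    )
    return model_bits + data_bits


def best_segmentation(corpus, n=1, thresholds=(0.0, 0.5, 1.0, 1.5, 2.0)):
    """Pick the threshold whose segmentation has the shortest description."""
    entropy = branching_entropy(corpus, n)
    candidates = [segment(corpus, n, t, entropy) for t in thresholds]
    return min(candidates, key=description_length)


if __name__ == "__main__":
    text = "thedogsawthecatthecatsawthedog"
    print(best_segmentation(text, n=2))
```

The point of the sketch is the division of labour: branching entropy proposes only a handful of candidate segmentations (one per threshold), and the description length then selects among them, so the search never has to consider every possible placement of boundaries.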

Valentin Zhikov, Hiroya Takamura, Manabu Okumura


Adjacent Character Sequences | EMNLP 2010 | Natural Language Processing | Unsupervised Word Segmentation | Word Segmentation Algorithm |


Post Info

Added	11 Feb 2011
Updated	11 Feb 2011
Type	Journal
Year	2010
Where	EMNLP
Authors	Valentin Zhikov, Hiroya Takamura, Manabu Okumura
