Vietnamese Word Segmentation

Word segmentation is the first and obligatory task for every NLP application. In inflectional languages such as English, French, and Dutch, word boundaries are simply assumed to be whitespace or punctuation. In various Asian languages, including Chinese and Vietnamese, whitespace does not determine word boundaries, so one must resort to higher-level information: morphology, syntax, and even semantics and pragmatics. In this paper, we present a model that combines a WFST (Weighted Finite State Transducer) approach with a neural network. The word segmentation system is applied to Vietnamese text-to-speech and to a Vietnamese POS tagger. We evaluate it by comparing its segmentation results against a manually annotated corpus; the algorithm achieves 97% accuracy on a corpus of Vietnamese electronic textbooks.
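To make the weighted-path idea concrete, the following is a minimal sketch only, not the authors' WFST and neural network model. It assumes the input is already split into syllables (written Vietnamese separates syllables with spaces) and picks the lowest-cost way to group them into words using dynamic programming. The lexicon, its costs, `UNKNOWN_COST`, `MAX_WORD_LEN`, and the function name `segment` are all hypothetical values chosen for illustration.

```python
import math

# Hypothetical toy lexicon: word (tuple of syllables) -> cost.
# Costs play the role of path weights (e.g. negative log probabilities);
# the numbers here are invented for illustration.
LEXICON = {
    ("học",): 2.0,
    ("sinh",): 2.5,
    ("học", "sinh"): 1.0,
    ("sinh", "học"): 1.5,
    ("giỏi",): 1.0,
}
UNKNOWN_COST = 10.0  # assumed penalty for a single out-of-lexicon syllable
MAX_WORD_LEN = 3     # assumed maximum number of syllables per word


def segment(syllables):
    """Return the lowest-cost grouping of syllables into words."""
    n = len(syllables)
    best = [math.inf] * (n + 1)  # best[i]: minimal cost of the first i syllables
    back = [0] * (n + 1)         # back[i]: start index of the last word ending at i
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            word = tuple(syllables[start:end])
            if word in LEXICON:
                cost = LEXICON[word]
            elif end - start == 1:
                cost = UNKNOWN_COST  # unknown single syllable stands alone
            else:
                continue  # unknown multi-syllable spans are not candidates
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start
    # Follow the back-pointers from the end to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append("_".join(syllables[start:end]))
        end = start
    return words[::-1]


print(segment("học sinh giỏi".split()))  # ['học_sinh', 'giỏi']
```

With these toy costs, grouping "học sinh" as one word (total cost 2.0) beats treating the three syllables as separate words (total cost 5.5). A real system would derive the weights from a corpus and, as the abstract describes, bring in further information to resolve ambiguous cases.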
Dinh Dien, Hoang Kiem, Nguyen Van Toan
Added 30 Jul 2010
Updated 30 Jul 2010
Type Conference
Year 2001