COLING
2000

Automatic Corpus-Based Thai Word Extraction with the C4.5 Learning Algorithm

"Word" is difficult to define in languages that do not exhibit explicit word boundaries, such as Thai. Traditional methods of defining words for such languages depend on human judgement, which is based on unclear criteria or procedures, and have several limitations. This paper proposes an algorithm for word extraction from Thai texts that does not rely on word segmentation. We employ the C4.5 learning algorithm for this task. Several attributes, such as string length, frequency, mutual information, and entropy, are chosen for word/non-word determination. Our experiment yields high precision, about 85%, on both the training and test corpora.
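The attributes named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's exact definitions, so everything here is an assumption made for illustration: the function names, the use of neighboring-character entropy, and the mutual information computed between a string's first character and its remainder. The resulting feature dictionary is the kind of input a decision-tree learner such as C4.5 could consume; the learner itself is not shown.

```python
import math
from collections import Counter


def find_starts(text, s):
    """Return start indices of all (possibly overlapping) occurrences of s."""
    starts = []
    pos = text.find(s)
    while pos != -1:
        starts.append(pos)
        pos = text.find(s, pos + 1)
    return starts


def neighbor_entropy(text, s, side):
    """Shannon entropy (bits) of the character next to each occurrence of s.

    side is "left" or "right". High entropy suggests that s appears in many
    different contexts, which is typical of a free-standing unit.
    """
    neighbors = Counter()
    for start in find_starts(text, s):
        idx = start - 1 if side == "left" else start + len(s)
        if 0 <= idx < len(text):
            neighbors[text[idx]] += 1
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log2(c / total) for c in neighbors.values())


def split_mutual_information(text, s):
    """Pointwise mutual information between s[0] and s[1:].

    This split is a hypothetical choice for illustration only; the paper may
    define mutual information differently.
    """
    if len(s) < 2 or not text:
        return 0.0
    n = len(text)
    p_s = len(find_starts(text, s)) / n
    p_head = len(find_starts(text, s[0])) / n
    p_tail = len(find_starts(text, s[1:])) / n
    if p_s == 0 or p_head == 0 or p_tail == 0:
        return 0.0
    return math.log2(p_s / (p_head * p_tail))


def extract_features(text, s):
    """Collect the four attribute types for one candidate string."""
    return {
        "length": len(s),
        "frequency": len(find_starts(text, s)),
        "left_entropy": neighbor_entropy(text, s, "left"),
        "right_entropy": neighbor_entropy(text, s, "right"),
        "mutual_information": split_mutual_information(text, s),
    }
```

Calling `extract_features(corpus_text, candidate)` for every candidate substring would give one feature vector per candidate; paired with word/non-word labels, these vectors could serve as training data for a decision-tree classifier.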
Added 01 Nov 2010
Updated 01 Nov 2010
Type Conference
Year 2000
Where COLING
Authors Virach Sornlertlamvanich, Tanapong Potipiti, Thatsanee Charoenporn