

Urdu Word Segmentation

Word segmentation is the first required task in almost all NLP applications, because the initial phase requires tokenizing the input into words. Urdu is among the Asian languages that face a word segmentation challenge. However, unlike other Asian languages, word segmentation in Urdu involves not only space omission errors but also space insertion errors. This paper discusses how orthographic and linguistic features of Urdu trigger these two problems. It also discusses the work that has been done to tokenize input text. We employ a hybrid solution that performs n-gram ranking on top of a rule-based maximum matching heuristic. Our best technique gives an error detection of 85.8% and an overall accuracy of 95.8%. Further issues and possible future directions are also discussed.
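The hybrid idea in the abstract (a maximum matching heuristic that proposes segmentations, followed by n-gram ranking) can be illustrated with a minimal sketch. This is not the authors' implementation: it handles only the space omission case, uses a unigram model as the simplest n-gram ranker, and substitutes a toy English lexicon with made-up counts for real Urdu data. All names here (`LEXICON`, `COUNTS`, `candidate_segmentations`, `fewest_words`, `unigram_score`, `segment`) are hypothetical.

```python
import math

# Toy data, assumed purely for illustration (not from the paper).
COUNTS = {"the": 50, "table": 10, "down": 10, "there": 10,
          "theta": 1, "bled": 1, "own": 5, "tab": 1, "led": 2}
LEXICON = set(COUNTS)


def candidate_segmentations(text, lexicon):
    """Enumerate every split of `text` into lexicon words, longest match first.

    Exhaustive enumeration is only practical for short strings.
    """
    max_len = max(len(word) for word in lexicon)
    results = []

    def extend(start, words):
        if start == len(text):
            results.append(list(words))
            return
        # Try the longest possible piece first, as in maximum matching.
        for end in range(min(len(text), start + max_len), start, -1):
            piece = text[start:end]
            if piece in lexicon:
                words.append(piece)
                extend(end, words)
                words.pop()

    extend(0, [])
    return results


def fewest_words(candidates, keep=3):
    """Maximum matching heuristic: prefer segmentations with fewer words."""
    return sorted(candidates, key=len)[:keep]


def unigram_score(words, counts, total):
    """Log-probability of a segmentation under an add-one smoothed unigram model."""
    vocab = len(counts)
    return sum(math.log((counts.get(w, 0) + 1) / (total + vocab)) for w in words)


def segment(text, lexicon, counts):
    """Shortlist candidates by maximum matching, then rank them by n-gram score."""
    candidates = candidate_segmentations(text, lexicon)
    if not candidates:
        return [text]  # no lexicon-only split exists; leave the input untouched
    total = sum(counts.values())
    shortlist = fewest_words(candidates)
    return max(shortlist, key=lambda ws: unigram_score(ws, counts, total))


print(segment("thetabledownthere", LEXICON, COUNTS))
```

In this toy case the two four-word candidates tie under the word-count heuristic, and the unigram score is what separates them. A bigram or higher-order model would slot into `unigram_score` without changing the rest of the sketch.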
Nadir Durrani, Sarmad Hussain
Added 14 Feb 2011
Updated 14 Feb 2011
Type Journal
Year 2010