Sciweavers

NAACL
2010

Urdu Word Segmentation

Download www.crulp.org

Word segmentation is the first obligatory task in almost all NLP applications, where the initial phase requires tokenization of the input into words. Urdu is among the Asian languages that face the word segmentation challenge. However, unlike in other Asian languages, word segmentation in Urdu involves not only space omission errors but also space insertion errors. This paper discusses how orthographic and linguistic features of Urdu trigger these two problems. It also discusses the work that has been done to tokenize input text. We employ a hybrid solution that performs n-gram ranking on top of a rule-based maximum matching heuristic. Our best technique gives an error detection rate of 85.8% and an overall accuracy of 95.8%. Further issues and possible future directions are also discussed.
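
The hybrid idea named in the abstract (maximum matching to propose segmentations, n-gram ranking to choose among them) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the add-one smoothed bigram score, and the toy English lexicon and counts are all assumptions made for illustration, and the sketch covers only the space omission side of the problem, not space insertion.

```python
import math
from functools import lru_cache

# Toy data (hypothetical): English stands in for Urdu so the example is easy to read.
LEXICON = {"the", "table", "down", "there", "theta", "bled", "own"}
UNIGRAM_COUNTS = {"<s>": 3, "the": 3, "table": 2, "down": 1,
                  "there": 1, "theta": 1, "bled": 1, "own": 1}
BIGRAM_COUNTS = {("<s>", "the"): 3, ("the", "table"): 2,
                 ("table", "down"): 1, ("down", "there"): 1}


def candidate_segmentations(text, lexicon, max_word_len=10):
    """Return every way to split `text` into words found in `lexicon`."""

    @lru_cache(maxsize=None)
    def split_from(start):
        if start == len(text):
            return [[]]  # one valid way to segment the empty suffix
        results = []
        last_end = min(len(text), start + max_word_len)
        for end in range(start + 1, last_end + 1):
            word = text[start:end]
            if word in lexicon:
                for rest in split_from(end):
                    results.append([word] + rest)
        return results

    return split_from(0)


def maximum_matching(text, lexicon):
    """Keep only the candidates with the fewest words (longest matches overall)."""
    candidates = candidate_segmentations(text, lexicon)
    if not candidates:
        return []
    fewest = min(len(words) for words in candidates)
    return [words for words in candidates if len(words) == fewest]


def bigram_log_prob(words, unigram_counts, bigram_counts):
    """Score a word sequence with an add-one smoothed bigram model."""
    vocab_size = len(unigram_counts)
    score = 0.0
    previous = "<s>"  # sentence-start marker
    for word in words:
        numerator = bigram_counts.get((previous, word), 0) + 1
        denominator = unigram_counts.get(previous, 0) + vocab_size
        score += math.log(numerator / denominator)
        previous = word
    return score


def segment(text, lexicon, unigram_counts, bigram_counts):
    """Pick the maximum-matching candidate that the bigram model ranks highest."""
    candidates = maximum_matching(text, lexicon)
    if not candidates:
        return None  # no segmentation is possible with this lexicon
    return max(candidates,
               key=lambda words: bigram_log_prob(words, unigram_counts, bigram_counts))


best = segment("thetabledownthere", LEXICON, UNIGRAM_COUNTS, BIGRAM_COUNTS)
print(best)
```

In this toy input, maximum matching alone cannot decide between "the table down there" and "theta bled own there", because both use four words. The bigram score breaks the tie in favour of the sequence seen in the (hypothetical) counts.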

Nadir Durrani, Sarmad Hussain

Computational Linguistics | NAACL 2010 | Urdu | Word Segmentation | Word Segmentation Challenge

Related Content

» Towards Searchable Digital Urdu Libraries: A Word Spotting Based Retrieval Approach

» Online Urdu Character Recognition System

» Urdu and Hindi Translation and sharing of linguistic resources

» Hindi-to-Urdu Machine Translation through Transliteration

» Inferring Subcat Frames of Verbs in Urdu

» Edge-Based Features for Localization of Artificial Urdu Text in Video Images

» Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation a...

» Using Rejuvenation to Improve Particle Filtering for Bayesian Word Segmentation

» A Modality Lexicon and its use in Automatic Tagging

Post Info

Added	14 Feb 2011
Updated	14 Feb 2011
Type	Journal
Year	2010
Where	NAACL
Authors	Nadir Durrani, Sarmad Hussain
