Sciweavers

ACL
2003

Unsupervised Learning of Arabic Stemming Using a Parallel Corpus

13 years 6 months ago
Unsupervised Learning of Arabic Stemming Using a Parallel Corpus
This paper presents an unsupervised learning approach to building a non-English (Arabic) stemmer. The stemming model is based on statistical machine translation and it uses an English stemmer and a small (10K sentences) parallel corpus as its sole training resources. No parallel text is needed after the training phase. Monolingual, unannotated text can be used to further improve the stemmer by allowing it to adapt to a desired domain or genre. Examples and results will be given for Arabic , but the approach is applicable to any language that needs affix removal. Our resource-frugal approach results in 87.5% agreement with a state of the art, proprietary Arabic stemmer built using rules, affix lists, and human annotated text, in addition to an unsupervised component. Task-based evaluation using Arabic information retrieval indicates an improvement of 22-38% in average precision over unstemmed text, and 96% of the performance of the proprietary stemmer above.
Monica Rogati, J. Scott McCarley, Yiming Yang
Added 31 Oct 2010
Updated 31 Oct 2010
Type Conference
Year 2003
Where ACL
Authors Monica Rogati, J. Scott McCarley, Yiming Yang
Comments (0)