
CICLING
2008
Springer

A Probabilistic Model for Guessing Base Forms of New Words by Analogy

Language software applications encounter new words, e.g., acronyms, technical terminology, loan words, names, or compounds of such words. Looking at English, one might assume that they appear in base form, i.e., the lexical look-up form. However, in more highly inflecting languages like Finnish or Swahili, only 40-50 % of new words appear in base form. In order to index documents or discover translations for these languages, it would be useful to reduce new words to their base forms as well. We often have access to analyses of more frequent words, which shape our intuition for how new words will inflect. We formalize this into a probabilistic model for lemmatization of new words using analogy, i.e., guessing base forms, and test the model on English, Finnish, Swedish, and Swahili, demonstrating that we get a recall of 89-99 % with an average precision of 76-94 %, depending on language and the amount of training material.
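To make the idea of guessing base forms by analogy concrete, here is a minimal sketch in Python. It is not the model from the paper, whose details the abstract does not give. It swaps in a simple heuristic: learn suffix rewrite rules from known (inflected form, base form) pairs, then rank candidate base forms for an unseen word by the relative frequency of the rules attached to its longest known suffix. The function names, the `max_suffix` parameter, and the toy English training pairs are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def suffix_rules(word, base, max_suffix=5):
    """Yield (inflected_suffix, base_suffix) rewrite rules from one known pair.

    Hypothetical helper: rules start at the point where word and base
    diverge and are extended leftwards by up to max_suffix - 1 stem characters.
    """
    common = 0
    for a, b in zip(word, base):
        if a != b:
            break
        common += 1
    for start in range(common, max(common - max_suffix, -1), -1):
        yield word[start:], base[start:]


def train(pairs):
    """Count how often each inflected suffix maps to each base suffix."""
    counts = defaultdict(Counter)
    for word, base in pairs:
        for word_suffix, base_suffix in suffix_rules(word, base):
            counts[word_suffix][base_suffix] += 1
    return counts


def guess_base_forms(word, counts):
    """Return (candidate, score) pairs, best first.

    The score is the relative frequency of a rewrite rule among all rules
    seen for the longest suffix of `word` that occurs in the training data.
    """
    for start in range(len(word) + 1):  # try the longest suffix first
        suffix = word[start:]
        if suffix in counts:
            total = sum(counts[suffix].values())
            return [
                (word[:start] + base_suffix, n / total)
                for base_suffix, n in counts[suffix].most_common()
            ]
    return [(word, 1.0)]  # nothing matched: assume the word is a base form


# Toy training data, invented for this example (not from the paper).
model = train([
    ("walked", "walk"),
    ("talked", "talk"),
    ("cities", "city"),
    ("parties", "party"),
])

print(guess_base_forms("puppies", model))  # [('puppy', 1.0)]
```

With this toy data, the unseen word "puppies" is reduced to "puppy" by analogy with "cities" and "parties". A ranked candidate list of this kind is also the sort of output for which recall and precision figures, like those quoted in the abstract, could be computed.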
Krister Lindén
Added 12 Oct 2010
Updated 12 Oct 2010
Type Conference
Year 2008
Where CICLING
Authors Krister Lindén