Learning to Lemmatise Slovene Words



Abstract. Automatic lemmatisation is a core application for many language processing tasks. In inflectionally rich languages, such as Slovene, assigning the correct lemma to each word in a running text is not trivial: nouns and adjectives, for instance, inflect for number and case, with a complex configuration of endings and stem modifications. The problem is especially difficult for unknown words, as word forms cannot be matched against a lexicon giving the correct lemma, its part-of-speech and paradigm class. The paper discusses a machine learning approach to the automatic lemmatisation of unknown words, in particular nouns and adjectives, in Slovene texts. We decompose the problem of learning to perform lemmatisation into two subproblems: the first is to learn to perform morphosyntactic tagging, and the second is to learn to perform morphological analysis, which produces the lemma from the word form given the correct morphosyntactic tag. A statistics-based trigram tagger is use...
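The two-stage decomposition described in the abstract (tag first, then derive the lemma from the word form and its tag) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `SUFFIX_RULES`, `analyse`, `lemmatise` and `toy_tagger` are hypothetical, the tag strings are merely in the style of positional morphosyntactic tags, the suffix rules are written by hand rather than learned, and a lookup table stands in for the tagger. It is not the method or the rule set used in the paper.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical hand-written rules: for each morphosyntactic tag, a list of
# (suffix_to_strip, suffix_to_add) pairs. In the paper such a mapping is
# learned; here it is hard-coded purely for illustration.
SUFFIX_RULES: Dict[str, List[Tuple[str, str]]] = {
    "Ncfsg": [("e", "a")],    # assumed tag: feminine noun, singular genitive
    "Ncfsa": [("o", "a")],    # assumed tag: feminine noun, singular accusative
    "Agpmsg": [("ega", "")],  # assumed tag: masculine adjective, singular genitive
}


def analyse(word_form: str, tag: str) -> str:
    """Stage 2: produce a lemma from a word form, given its tag."""
    for old, new in SUFFIX_RULES.get(tag, []):
        if word_form.endswith(old):
            return word_form[: len(word_form) - len(old)] + new
    # No rule applies: fall back to the word form itself.
    return word_form


def lemmatise(tokens: List[str], tagger: Callable[[List[str]], List[str]]) -> List[str]:
    """Stage 1 (tagging) followed by stage 2 (morphological analysis)."""
    tags = tagger(tokens)
    return [analyse(word, tag) for word, tag in zip(tokens, tags)]


def toy_tagger(tokens: List[str]) -> List[str]:
    """Stand-in for a real tagger: a fixed lookup, with "X" for unknown tokens."""
    lookup = {"mize": "Ncfsg", "mizo": "Ncfsa", "lepega": "Agpmsg"}
    return [lookup.get(token, "X") for token in tokens]


if __name__ == "__main__":
    print(lemmatise(["mize", "mizo", "lepega"], toy_tagger))
```

The point of the split is that `analyse` never has to guess the grammatical category: the same ending can call for different rewrites under different tags, so the quality of the tag from the first stage directly limits the quality of the lemma from the second.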

Saso Dzeroski, Tomaz Erjavec
