Addressing morphological variation in alphabetic languages



The selection of indexing terms for representing documents is a key decision that limits how effective subsequent retrieval can be. Often stemming algorithms are used to normalize surface forms, and thereby address the problem of not finding documents that contain words related to query terms through inflectional or derivational morphology. However, rule-based stemmers are not available for every language and it is unclear which methods for coping with morphology are most effective. In this paper we investigate an assortment of techniques for representing text and compare these approaches using data sets in eighteen languages and five different writing systems. We find character n-gram tokenization to be highly effective. In half of the languages examined n-grams outperform unnormalized words by more than 25%; in highly inflective languages relative improvements over 50% are obtained. In languages with less morphological richness the choice of tokenization is not as critical ...
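As a rough illustration of the character n-gram tokenization the abstract refers to, the following Python sketch splits text into overlapping character sequences. It is not the authors' implementation: the function name `char_ngrams`, the default length `n = 4`, the lowercasing, and the choice to let n-grams span word boundaries are all assumptions made for this example.

```python
from typing import List


def char_ngrams(text: str, n: int = 4) -> List[str]:
    """Return the overlapping character n-grams of `text`.

    Hypothetical helper for illustration; n = 4 is an assumed default.
    """
    # Lowercase and collapse runs of whitespace to a single space,
    # so n-grams may span word boundaries (an assumption of this sketch).
    normalized = " ".join(text.lower().split())
    if not normalized:
        return []
    if len(normalized) < n:
        # Text shorter than n is kept as a single token.
        return [normalized]
    # Slide a window of width n across the string, one character at a time.
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


# Two inflected forms of the same stem share several n-grams,
# so a query on one form can still match documents containing the other.
shared = set(char_ngrams("juggling")) & set(char_ngrams("juggled"))
print(sorted(shared))
```

Because related surface forms overlap in their n-grams, this kind of indexing can match morphological variants without a language-specific stemmer, which is the property the abstract highlights.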

Paul McNamee, Charles K. Nicholas, James Mayfield


Character N-gram Tokenization | Character N-grams | Effective Subsequent Retrieval | Information Retrieval | SIGIR 2009


Added	28 May 2010
Updated	28 May 2010
Type	Conference
Year	2009
Where	SIGIR
Authors	Paul McNamee, Charles K. Nicholas, James Mayfield
