Accurate Stemming of Dutch for Text Classification

This paper investigates the use of stemming for classification of Dutch (email) texts. We introduce a stemmer that combines dictionary lookup (implemented efficiently as a finite state automaton) with a rule-based backup strategy, and show that it outperforms the Dutch Porter stemmer in terms of accuracy while not being substantially slower. For text classification, the most important property of a stemmer is the number of words it (correctly) reduces to the same stem; here too, the dictionary-based system outperforms Porter. However, evaluating a Bayesian text classification system with no stemming, the Porter stemmer, or the dictionary-based stemmer on an email classification task and a newspaper topic classification task does not reveal significant differences in accuracy. We conclude with an analysis of why this is the case.
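To make the idea of dictionary lookup with a rule-based backup concrete, here is a minimal Python sketch. It is not the authors' implementation: the lexicon entries, the suffix list, and the function names `stem` and `conflation_classes` are all hypothetical, a plain dict stands in for the finite state automaton, and the fallback rules are far cruder than the Dutch Porter stemmer. The second function groups words by stem, which is one simple way to inspect how many words are reduced to the same stem.

```python
from collections import defaultdict

# Toy lexicon mapping inflected forms to stems (illustrative entries only;
# the paper's dictionary is compiled into a finite state automaton instead).
LEXICON = {
    "liep": "loop",
    "gelopen": "loop",
    "loopt": "loop",
    "huizen": "huis",
    "boeken": "boek",
}

# Crude fallback suffix rules, tried in order (an assumption for this sketch,
# NOT the Dutch Porter rule set).
SUFFIXES = ("en", "e")


def stem(word: str) -> str:
    """Return the lexicon stem if known, else strip one fallback suffix."""
    w = word.lower()
    if w in LEXICON:  # step 1: dictionary lookup
        return LEXICON[w]
    for suffix in SUFFIXES:  # step 2: rule-based backup
        # Keep at least three characters so short words stay intact.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def conflation_classes(words):
    """Group words by the stem they are reduced to."""
    groups = defaultdict(set)
    for w in words:
        groups[stem(w)].add(w.lower())
    return dict(groups)
```

In this sketch, irregular forms such as "liep" can only be mapped to "loop" through the lexicon, while unseen regular forms such as "fietsen" fall through to the suffix rules; that division of labour is the motivation for combining the two strategies.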

Tanja Gaustad, Gosse Bouma

CLIN 2001 | CLIN 2004 | Dutch Porter Stemmer | Stemmer | Text Classification |


Added	31 Oct 2010
Updated	31 Oct 2010
Type	Conference
Year	2001
Where	CLIN
Authors	Tanja Gaustad, Gosse Bouma
