
COLING
2000

Text Genre Detection Using Common Word Frequencies

In this paper we present a method for detecting text genre quickly and easily, following an approach originally proposed in authorship attribution studies, which uses as style markers the frequencies of occurrence of the most frequent words in a training corpus (Burrows, 1992). In contrast to this approach, we use the frequencies of occurrence of the most frequent words of the entire written language. Using a part of the Wall Street Journal corpus as a testing ground, we show that the most frequent words of the British National Corpus, representing the most frequent words of written English, are more reliable discriminators of text genre than the most frequent words of the training corpus. Moreover, the frequencies of occurrence of the most common punctuation marks play an important role in accurate text categorization, as well as when dealing with training data of limited size.
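The style markers described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `style_vector` and both lists are hypothetical stand-ins, and a real setup would take the word list from a reference corpus such as the British National Corpus. The sketch only builds a feature vector and leaves the choice of classifier open.

```python
import re
from collections import Counter

# Hypothetical stand-ins for illustration only; a real word list would be
# drawn from a reference corpus rather than hard-coded here.
COMMON_WORDS = ["the", "of", "and", "a", "in", "to"]
PUNCTUATION = [".", ",", ":", ";", "?", "!"]


def style_vector(text):
    """Return relative frequencies of common words and punctuation marks.

    Each count is divided by the number of word tokens in the text, so
    texts of different lengths yield comparable vectors.
    """
    words = re.findall(r"[a-z']+", text.lower())
    word_counts = Counter(words)
    char_counts = Counter(text)
    n_words = max(len(words), 1)  # avoid division by zero on empty input

    word_features = [word_counts[w] / n_words for w in COMMON_WORDS]
    punct_features = [char_counts[p] / n_words for p in PUNCTUATION]
    return word_features + punct_features


if __name__ == "__main__":
    print(style_vector("The cat and the dog."))
```

Vectors of this kind, one per text, could then be passed to any standard classifier to separate genres.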
Efstathios Stamatatos, Nikos Fakotakis, George K.
Added 01 Nov 2010
Updated 01 Nov 2010
Type Conference
Year 2000
Where COLING
Authors Efstathios Stamatatos, Nikos Fakotakis, George K. Kokkinakis