Text Genre Detection Using Common Word Frequencies


In this paper we present a method for detecting text genre quickly and easily. It follows an approach originally proposed in authorship attribution studies, which uses as style markers the frequencies of occurrence of the most frequent words in a training corpus (Burrows, 1992). In contrast to this approach, we use the frequencies of occurrence of the most frequent words of the entire written language. Using part of the Wall Street Journal corpus as a testing ground, we show that the most frequent words of the British National Corpus, which represent the most frequent words of written English, are more reliable discriminators of text genre than the most frequent words of the training corpus. Moreover, the frequencies of occurrence of the most common punctuation marks play an important role both in accurate text categorization and when dealing with training data of limited size.
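The style markers described in the abstract can be illustrated with a minimal sketch. It assumes a fixed, corpus-independent word list and a fixed set of punctuation marks, and turns a text into a vector of occurrences per 1,000 word tokens. The names `COMMON_WORDS`, `PUNCTUATION`, and `style_features` are hypothetical, the ten-word list is only a stand-in (it is not the British National Corpus list the authors used), and the normalization is an assumption rather than the paper's stated procedure.

```python
import re
from collections import Counter

# Illustrative stand-in for a list of the most frequent words of written
# English. This is NOT the British National Corpus list used in the paper.
COMMON_WORDS = ["the", "of", "and", "a", "in", "to", "is", "was", "it", "for"]

# Illustrative set of common punctuation marks (also an assumption).
PUNCTUATION = [".", ",", ":", ";", "?", "!", "(", "-"]


def style_features(text):
    """Return occurrences per 1,000 word tokens for each common word,
    followed by the same rate for each punctuation mark."""
    # Crude tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        # Empty input: return an all-zero vector of the right length.
        return [0.0] * (len(COMMON_WORDS) + len(PUNCTUATION))

    word_counts = Counter(tokens)
    char_counts = Counter(text)  # single-character counts for punctuation
    scale = 1000.0 / total

    word_rates = [word_counts[w] * scale for w in COMMON_WORDS]
    mark_rates = [char_counts[p] * scale for p in PUNCTUATION]
    return word_rates + mark_rates


if __name__ == "__main__":
    sample = "The cat sat in the hat. It was a hat, of course."
    for name, value in zip(COMMON_WORDS + PUNCTUATION, style_features(sample)):
        print(f"{name!r}: {value:.1f}")
```

Because the word list is fixed in advance, every text maps to a vector of the same length regardless of the training corpus, and such vectors could be passed to any standard classifier. The abstract does not say which classifier the authors used, so that step is left out here.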

Efstathios Stamatatos, Nikos Fakotakis, George K. Kokkinakis


COLING 2000 | COLING 2008 | Frequent Words | Text Genre | Training Corpus


» Part-of-speech histograms for genre classification of text

» Good Bigrams

» Stop word detection in compressed textual images An experiment on indic script documents

» Design Compilation and Preliminary Analyses of Balanced Corpus of Contemporary Written Jap...

» Improving binary classification on text problems using differential word features

» Detecting Word Substitutions in Text

» The Automatic Extraction of Open Compounds from Text Corpora

» Robust Segmentation of Unconstrained Online Handwritten Documents

Post Info

Added	01 Nov 2010
Updated	01 Nov 2010
Type	Conference
Year	2000
Where	COLING
Authors	Efstathios Stamatatos, Nikos Fakotakis, George K. Kokkinakis
