Sciweavers

IPM
2008

Author identification: Using text sampling to handle the class imbalance problem

13 years 4 months ago
Author identification: Using text sampling to handle the class imbalance problem
Authorship analysis of electronic texts assists digital forensics and anti-terror investigation. Author identification can be seen as a single-label multi-class text categorization problem. Very often, there are extremely few training texts at least for some of the candidate authors or there is a significant variation in the text-length among the available training texts of the candidate authors. Moreover, in this task usually there is no similarity between the distribution of training and test texts over the classes, that is, a basic assumption of inductive learning does not apply. In this paper, we present methods to handle imbalanced multi-class textual datasets. The main idea is to segment the training texts into text samples according to the size of the class, thus producing a fairer classification model. Hence, minority classes can be segmented into many short samples and majority classes into less and longer samples. We explore text sampling methods in order to construct a trai...
Efstathios Stamatatos
Added 12 Dec 2010
Updated 12 Dec 2010
Type Journal
Year 2008
Where IPM
Authors Efstathios Stamatatos
Comments (0)