Semi-supervised Document Classification with a Mislabeling Error Model

Abstract. This paper investigates a new extension of the Probabilistic Latent Semantic Analysis (PLSA) model [6] for text classification where the training set is partially labeled. The proposed approach iteratively labels the unlabeled documents and estimates the probabilities of its own labeling errors. These probabilities are then taken into account when estimating the new model parameters before the next round. Our approach outperforms an earlier semi-supervised extension of PLSA, introduced in [9], which is based on the use of fake labels, while retaining that extension's simplicity and its ability to solve multiclass problems. In addition, it gives valuable information about the most uncertain and difficult classes to label. We perform experiments over the 20Newsgroups, WebKB and Reuters document collections and show the effectiveness of our approach over two other semi-supervised algorithms applied to these text classification problems.
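The iterative loop described in the abstract (label the unlabeled documents, estimate how often those labels are wrong, then re-estimate the model with the errors taken into account) can be illustrated with a rough sketch. This is not the paper's model: multinomial naive Bayes is swapped in for PLSA to keep the example short, and the mislabeling estimate is a simple average of posteriors per assigned class. All function and variable names (`train_nb`, `posterior`, `fit_with_error_model`, `confusion`) are hypothetical.

```python
import math
from collections import Counter

def train_nb(docs, soft_labels, n_classes, vocab, alpha=1.0):
    """Fit multinomial naive Bayes from docs with fractional class weights."""
    prior = [alpha] * n_classes
    word_counts = [Counter() for _ in range(n_classes)]
    for words, weights in zip(docs, soft_labels):
        for c, w in enumerate(weights):
            prior[c] += w
            for word in words:
                word_counts[c][word] += w
    total = sum(prior)
    log_prior = [math.log(p / total) for p in prior]
    log_lik = []
    for c in range(n_classes):
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik.append({w: math.log((word_counts[c][w] + alpha) / denom)
                        for w in vocab})
    return log_prior, log_lik

def posterior(model, words):
    """Return P(class | document); out-of-vocabulary words are ignored."""
    log_prior, log_lik = model
    scores = [lp + sum(ll.get(w, 0.0) for w in words)
              for lp, ll in zip(log_prior, log_lik)]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def fit_with_error_model(labeled, labels, unlabeled, n_classes, rounds=5):
    """Self-labeling loop with a simplified mislabeling estimate (assumption).

    confusion[k][h] approximates P(true class = h | assigned label = k).
    """
    vocab = {w for d in labeled + unlabeled for w in d}
    one_hot = [[1.0 if c == y else 0.0 for c in range(n_classes)]
               for y in labels]
    model = train_nb(labeled, one_hot, n_classes, vocab)
    confusion = [[1.0 if h == k else 0.0 for h in range(n_classes)]
                 for k in range(n_classes)]
    for _ in range(rounds):
        # Step 1: label the unlabeled documents with the current model.
        posts = [posterior(model, d) for d in unlabeled]
        assigned = [max(range(n_classes), key=p.__getitem__) for p in posts]
        # Step 2: estimate labeling errors per assigned class.
        confusion = []
        for k in range(n_classes):
            members = [p for p, a in zip(posts, assigned) if a == k]
            if members:
                confusion.append([sum(p[h] for p in members) / len(members)
                                  for h in range(n_classes)])
            else:
                confusion.append([1.0 if h == k else 0.0
                                  for h in range(n_classes)])
        # Step 3: re-estimate parameters, spreading each assigned label
        # over the classes according to the estimated error probabilities.
        soft = [confusion[a] for a in assigned]
        model = train_nb(labeled + unlabeled, one_hot + soft, n_classes, vocab)
    return model, confusion
```

In this sketch, `1 - confusion[k][k]` gives a crude per-class uncertainty score, loosely mirroring the abstract's remark that the error model indicates which classes are hardest to label.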

Anastasia Krithara, Massih-Reza Amini, Jean-Michel Renders, Cyril Goutte

Earlier Semi-supervised Extension | ECIR 2008 | Information Technology | Probabilistic Latent Semantic Analysis | Text Classification |

Post Info

Added	29 Oct 2010
Updated	29 Oct 2010
Type	Conference
Year	2008
Where	ECIR
Authors	Anastasia Krithara, Massih-Reza Amini, Jean-Michel Renders, Cyril Goutte
