Sciweavers

IJDLS
2010

Sampling the Web as Training Data for Text Classification

Data acquisition is a major concern in text classification. The considerable human effort that conventional methods require to build a high-quality training collection is not always available to researchers. In this paper, we look into ways of automatically collecting training data by sampling the Web with a set of given class names. The basic idea is to generate appropriate keywords and submit them as queries to search engines to acquire training data. Two methods are presented in this study: one is based on sampling the common concepts among the classes, and the other on sampling the discriminative concepts for each class. A series of experiments was carried out independently on two different datasets, and the results show that the proposed methods significantly improve classifier performance even without manually labeled training data. Our strategy for
Wei-Yen Day, Chun-Yi Chi, Ruey-Cheng Chen, Pu-Jen Cheng
Added 05 Mar 2011
Updated 05 Mar 2011
Type Journal
Year 2010
Where IJDLS
Authors Wei-Yen Day, Chun-Yi Chi, Ruey-Cheng Chen, Pu-Jen Cheng