Sciweavers

ICML
2003
IEEE

Text Bundling: Statistics Based Data-Reduction

14 years 5 months ago
Text Bundling: Statistics Based Data-Reduction
As text corpora become larger, tradeoffs between speed and accuracy become critical: slow but accurate methods may not complete in a practical amount of time. In order to make the training data a manageable size, a data reduction technique may be necessary. Subsampling, for example, speeds up a classifier by randomly removing training points. In this paper, we describe an alternate method for reducing the number of training points by combining training points such that important statistical information is retained. Our algorithm keeps the same statistics that fast, linear-time text algorithms like Rocchio and Naive Bayes use. We provide empirical results that show our data reduction technique compares favorably to three other data reduction techniques on four standard text corpora.
Lawrence Shih, Jason D. Rennie, Yu-Han Chang, Davi
Added 17 Nov 2009
Updated 17 Nov 2009
Type Conference
Year 2003
Where ICML
Authors Lawrence Shih, Jason D. Rennie, Yu-Han Chang, David R. Karger
Comments (0)