Text Bundling: Statistics Based Data-Reduction

As text corpora grow larger, the tradeoff between speed and accuracy becomes critical: slow but accurate methods may not complete in a practical amount of time. To reduce the training data to a manageable size, a data reduction technique may be necessary. Subsampling, for example, speeds up a classifier by randomly removing training points. In this paper, we describe an alternative method that reduces the number of training points by combining them such that important statistical information is retained. Our algorithm keeps the same statistics that fast, linear-time text algorithms like Rocchio and Naive Bayes use. We provide empirical results showing that our data reduction technique compares favorably to three other data reduction techniques on four standard text corpora.
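The abstract does not spell out how training points are combined, so the following is only a minimal sketch under stated assumptions: documents of the same class are grouped at random and each group is replaced by its average vector. The names `bundle`, `class_mean`, and `bundle_size`, and the toy data, are hypothetical and not taken from the paper. The sketch illustrates one statistic that can be retained, the per-class mean vector that a Rocchio-style classifier relies on.

```python
import random
from collections import defaultdict


def bundle(docs, labels, bundle_size, seed=0):
    """Average same-class documents in random groups of up to bundle_size.

    Assumed combining rule (not confirmed by the abstract): plain averaging.
    Returns (vectors, labels, counts), where counts[i] is the number of
    original documents averaged into vectors[i].
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for vec, label in zip(docs, labels):
        by_class[label].append(vec)

    out_vecs, out_labels, out_counts = [], [], []
    for label, vecs in by_class.items():
        vecs = vecs[:]  # copy so the caller's lists are not reordered
        rng.shuffle(vecs)
        for start in range(0, len(vecs), bundle_size):
            group = vecs[start:start + bundle_size]
            mean = [sum(col) / len(group) for col in zip(*group)]
            out_vecs.append(mean)
            out_labels.append(label)
            out_counts.append(len(group))
    return out_vecs, out_labels, out_counts


def class_mean(vectors, labels, target, counts=None):
    """Count-weighted mean vector of one class."""
    if counts is None:
        counts = [1] * len(vectors)
    rows = [(v, c) for v, l, c in zip(vectors, labels, counts) if l == target]
    total = sum(c for _, c in rows)
    dim = len(rows[0][0])
    return [sum(v[j] * c for v, c in rows) / total for j in range(dim)]


# Toy term-count vectors over a three-word vocabulary (made-up data).
docs = [[2, 0, 1], [0, 1, 1], [4, 1, 0], [1, 1, 1], [0, 3, 2]]
labels = ["a", "a", "a", "b", "b"]
vecs, labs, counts = bundle(docs, labels, bundle_size=2)
```

Here five documents shrink to three bundles, yet the count-weighted class means computed from the bundles match those of the original documents. Subsampling down to three points would instead discard two documents and their contribution to the means.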

Lawrence Shih, Jason D. Rennie, Yu-Han Chang, David R. Karger

Data Reduction Technique | Data Reduction Techniques | ICML 2003 | Machine Learning | Training Points |

Post Info

Added	17 Nov 2009
Updated	17 Nov 2009
Type	Conference
Year	2003
Where	ICML
Authors	Lawrence Shih, Jason D. Rennie, Yu-Han Chang, David R. Karger
