Sciweavers

KDD
2002
ACM

Generalization Methods in Bioinformatics

Protein secondary structure prediction and high-throughput drug screen data mining are two important applications in bioinformatics. The data is represented in sparse feature spaces and can be unrepresentative of future data. Supervised learners in this context will display their inherent bias toward certain solutions, generally solutions that fit the training set well. In this paper, we first describe an ensemble approach using subsampling that scales well with dataset size. A sufficient number of ensemble members using subsamples of the data can yield a more accurate classifier than a single classifier using the entire dataset. Experiments on several datasets demonstrate the effectiveness of the approach. We report results from the KDD Cup 2001 drug discovery dataset in which our approach yields a higher weighted accuracy than the winning entry. We then extend our ensemble approach to create an over-generalized classifier for prediction by reducing the individual subsample size. The ensemble s...
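The subsampling ensemble described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the nearest-centroid base learner, the majority vote, the toy two-class data, and all function and parameter names are hypothetical and are not taken from the paper, which does not specify these details in the text shown here.

```python
import random
from collections import Counter


def train_centroid(rows):
    """Fit a nearest-centroid classifier (a stand-in base learner) on (features, label) rows."""
    sums, counts = {}, Counter()
    for x, y in rows:
        counts[y] += 1
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
    # One mean vector per class label seen in this subsample.
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict_centroid(model, x):
    """Return the label whose centroid is closest to x (squared Euclidean distance)."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(sorted(model), key=lambda y: dist2(model[y]))


def train_ensemble(rows, n_members, subsample_size, seed=0):
    """Train each member on its own random subsample drawn without replacement."""
    rng = random.Random(seed)
    return [train_centroid(rng.sample(rows, subsample_size))
            for _ in range(n_members)]


def predict_ensemble(members, x):
    """Combine member predictions by simple majority vote."""
    votes = Counter(predict_centroid(m, x) for m in members)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): class 0 near the origin, class 1 near (5, 5).
rows = [((0.1 * (i % 5), 0.1 * (i % 3)), 0) for i in range(100)]
rows += [((5 + 0.1 * (i % 5), 5 + 0.1 * (i % 3)), 1) for i in range(100)]

# Each member sees only 10 of the 200 rows; 25 members vote.
members = train_ensemble(rows, n_members=25, subsample_size=10)
```

Shrinking `subsample_size` in this sketch gives each member less data to fit, which is one way to read the abstract's remark about reducing the individual subsample size; whether it matches the authors' procedure is not confirmed by the text above.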
Added 30 Nov 2009
Updated 30 Nov 2009
Type Conference
Year 2002
Where KDD
Authors Steven Eschrich, Nitesh V. Chawla, Lawrence O. Hall