Sciweavers

KDD
1998
ACM

Large Datasets Lead to Overly Complex Models: An Explanation and a Solution

This paper explores unexpected results that lie at the intersection of two common themes in the KDD community: large datasets and the goal of building compact models. Experiments with many different datasets and several model construction algorithms (including tree learning algorithms such as c4.5 with three different pruning methods, and rule learning algorithms such as c4.5rules and ripper) show that increasing the amount of data used to build a model often results in a linear increase in model size, even when that additional complexity results in no significant increase in model accuracy. Despite the promise of better parameter estimation held out by large datasets, as a practical matter, models built with large amounts of data are often needlessly complex and cumbersome. In the case of decision trees, the cause of this pathology is identified as a bias inherent in several common pruning techniques. Pruning errors made low in the tree, where there is insufficient data to make accurate pa...
Tim Oates, David Jensen
Added 06 Aug 2010
Updated 06 Aug 2010
Type Conference
Year 1998
Where KDD
Authors Tim Oates, David Jensen