Automating exploratory data analysis for efficient data mining

15 years 8 months ago


Access to large data sets for predictive data mining does not guarantee good models, even when the size of the training data is virtually unlimited. Instead, careful data preprocessing is required, including data cleansing, handling missing values, attribute representation and encoding, and generating derived attributes. In particular, selecting the most appropriate subset of attributes to include is a critical step in building an accurate and efficient model. We describe an automated approach to data exploration, preprocessing, and attribute subset selection; its goal is to simplify the KDD process and dramatically shorten the time to build a model. Our implementation finds inappropriate and suspicious attributes, performs target dependency analysis, determines optimal attribute encoding, generates new derived attributes, and provides a flexible approach to attribute selection. We present results generated by an industrial KDD environment ...
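As a rough illustration of two of the steps named in the abstract (flagging inappropriate or suspicious attributes, and target dependency analysis), the sketch below shows one way such checks might be written. It is not the authors' implementation: the function names, the thresholds, and the use of mutual information as the dependency measure are all assumptions made for this example, and it treats every attribute as categorical.

```python
from collections import Counter
from math import log2


def screen_attributes(rows, target, max_missing=0.5, max_unique_ratio=0.95):
    """Flag attributes that look unsuitable for modeling.

    rows is a list of dicts (None marks a missing value) and target is the
    name of the target column. The thresholds are illustrative assumptions,
    not values taken from the paper.
    """
    n = len(rows)
    flags = {}
    for name in rows[0]:
        if name == target:
            continue
        present = [r[name] for r in rows if r[name] is not None]
        missing_fraction = 1 - len(present) / n
        if missing_fraction > max_missing:
            flags[name] = "mostly missing"
        elif len(set(present)) <= 1:
            flags[name] = "constant"
        elif len(set(present)) / len(present) > max_unique_ratio:
            # Nearly every value is distinct, as with a record ID.
            # A continuous numeric attribute would need binning first.
            flags[name] = "identifier-like"
    return flags


def entropy(values):
    """Shannon entropy, in bits, of a sequence of hashable values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) for two categorical sequences."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def rank_by_target_dependency(rows, target, skip=()):
    """Rank attributes by mutual information with the target, highest first."""
    ys = [r[target] for r in rows]
    scores = {}
    for name in rows[0]:
        if name == target or name in skip:
            continue
        xs = [r[name] for r in rows]
        scores[name] = mutual_information(xs, ys)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A caller might first run `screen_attributes`, then pass the flagged names as `skip` to `rank_by_target_dependency` so that only the remaining attributes are ranked. A real system would also need to handle numeric attributes, encoding choices, and derived attributes, which this sketch leaves out.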

Jonathan D. Becher, Pavel Berkhin, Edmund Freeman
