Sciweavers

JSS
2008

Can k-NN imputation improve the performance of C4.5 with small software project data sets? A comparative evaluation

Missing data is a widespread problem that can affect the ability to use data to construct effective prediction systems. We investigate a common machine learning technique that can tolerate missing values, namely C4.5, to predict cost using six real-world software project databases. We analyze the predictive performance after using the k-NN missing data imputation technique to see if it is better to tolerate missing data or to try to impute missing values and then apply the C4.5 algorithm. For the investigation, we simulated 3 missingness mechanisms, 3 missing data patterns, and 5 missing data percentages. We found that k-NN imputation can improve the prediction accuracy of C4.5. At the same time, both C4.5 and k-NN are little affected by the missingness mechanism, but the missing data pattern and the missing data percentage have a strong negative impact on prediction (or imputation) accuracy, particularly if the missing data percentage exceeds 40%.
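To make the impute-then-classify idea concrete, here is a minimal Python sketch of k-NN imputation. It is not the authors' implementation: the abstract does not specify their distance measure, choice of k, or data, so the function name `knn_impute`, the unscaled Euclidean distance over shared observed features, the mean of the neighbours' values, and the toy project rows are all assumptions made for illustration.

```python
import math

def knn_impute(rows, k=2):
    """Return a copy of rows with each None replaced by the mean of that
    feature over the k nearest rows in which the feature is observed.

    Assumption: distance is Euclidean over the features observed in both
    rows, averaged over the number of shared features, with no scaling.
    """
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                # A donor must be a different row with feature j observed.
                if m == i or other[j] is None:
                    continue
                shared = [c for c in range(len(row))
                          if row[c] is not None and other[c] is not None]
                if not shared:
                    continue
                dist = math.sqrt(
                    sum((row[c] - other[c]) ** 2 for c in shared) / len(shared)
                )
                candidates.append((dist, other[j]))
            if candidates:
                candidates.sort(key=lambda pair: pair[0])
                nearest = candidates[:k]
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed

# Hypothetical projects: (size, team size, effort); the last effort is missing.
rows = [
    [10.0, 2.0, 100.0],
    [12.0, 2.0, 120.0],
    [50.0, 8.0, 600.0],
    [11.0, 2.0, None],
]
filled = knn_impute(rows, k=2)
print(filled[3])
```

In the workflow the abstract describes, the completed rows would then be passed to a C4.5-style decision tree learner, and the resulting accuracy compared with training the same learner directly on the incomplete rows.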
Added 13 Dec 2010
Updated 13 Dec 2010
Type Journal
Year 2008
Where JSS
Authors Qinbao Song, Martin J. Shepperd, Xiangru Chen, Jun Liu