Cleaning disguised missing data: a heuristic approach

In some applications, such as filling in a customer information form on the web, missing values may not be explicitly represented as such, but instead appear as potentially valid data values. Such missing values are known as disguised missing data. They can severely impair the quality of data analysis, for example by causing significant biases and misleading results in hypothesis tests, correlation analysis, and regressions. The very limited previous studies on cleaning disguised missing data use outlier mining and distribution anomaly detection. These approaches rely heavily on domain background knowledge in specific applications and may not work well when the disguise values are inliers. To tackle the problem of cleaning disguised missing data, in this paper, we first model the distribution of disguised missing data and propose the embedded unbiased sample heuristic. Then, we develop an effective and efficient method to identify the frequently used disguise values which capture the ...
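As a rough illustration of the bias described in the abstract, the sketch below shows how a disguise value that is itself a valid inlier can shift a simple statistic. It is not the paper's method: the ages, the default value 18, and the names `DISGUISE`, `true_ages`, and `recorded` are all hypothetical, and the disguise value is assumed to be known in advance, which is exactly what the paper's heuristic is meant to discover.

```python
from statistics import mean

# Hypothetical ages actually entered by respondents (assumed data).
true_ages = [23, 35, 41, 29, 52, 38]

# Assumed form default: the first dropdown option, a perfectly valid age,
# so it is an inlier and would not be flagged by simple outlier detection.
DISGUISE = 18

# Four respondents skipped the field, so the default was stored instead.
recorded = true_ages + [DISGUISE] * 4

# Treating every recorded value as genuine pulls the mean toward the default.
naive_mean = mean(recorded)

# Dropping the known disguise value recovers the mean of the genuine entries.
# (This would also drop any respondent who is truly 18, one reason the
# problem is hard in practice.)
cleaned_mean = mean([v for v in recorded if v != DISGUISE])

print(f"naive mean:   {naive_mean:.2f}")
print(f"cleaned mean: {cleaned_mean:.2f}")
```

In this toy data the naive mean is noticeably lower than the cleaned one, although every recorded value looks plausible on its own.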
Ming Hua, Jian Pei
Added 30 Nov 2009
Updated 30 Nov 2009
Type Conference
Year 2007
Where KDD