Using synthetic data safely in classification

13 years 4 months ago
Using synthetic data safely in classification
When is it safe to use synthetic data in supervised classification? Trainable classifier technologies require large representative training sets consisting of samples labeled with their true class. Acquiring such training sets is difficult and costly. One way to alleviate this problem is to enlarge training sets by generating artificial, synthetic samples. Of course this immediately raises many questions, perhaps the first being "Why should we trust artificially generated data to be an accurate representative of the real distributions?" Other questions include "When will training on synthetic data work as well as - or better than training on real data ?". We distinguish between sample space (the set of real samples), parameter space (all samples that can be generated synthetically), and finally, feature space (the set of samples in terms of finite numerical values). In this paper, we discuss a series of experiments, in which we produced synthetic data in parameter ...
Jean Nonnemaker, Henry Baird
Added 17 Feb 2011
Updated 17 Feb 2011
Type Journal
Year 2009
Where DRR
Authors Jean Nonnemaker, Henry Baird
Comments (0)