Semi-supervised approach to rapid and reliable labeling of large data sets

14 years 5 months ago

Supervised classification methods have been shown to be very effective for a large number of applications. They require a training data set whose instances are labeled to indicate the correct class assignment. In many rapidly changing fields, such as computer network traffic analysis, up-to-date labeled data sets are scarce. This is primarily a consequence of the high cost of having an expert manually label these large data sets. In this research, we propose a method in which the data set is labeled in a semi-supervised manner with user-specified guarantees about the quality of the labeling. In our scheme, we assume that for each class we have some heuristics available, each of which can identify instances of one particular class. The heuristics are assumed to perform reasonably well, but they need not cover all instances of the class, nor need they be perfectly reliable. We further assume that we have an infallible expert,...
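The setup described in the abstract (per-class heuristics plus an infallible expert) can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract is truncated, and the sketch does not implement the user-specified quality guarantees. The function name `label_instances`, the port-based heuristics, and the fallback rule (ask the expert whenever the heuristics are silent or disagree) are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Instance = Dict[str, int]              # hypothetical: a flow described by integer features
Heuristic = Callable[[Instance], bool]  # fires (True) if the instance looks like its class


def label_instances(
    instances: List[Instance],
    heuristics: Dict[str, List[Heuristic]],
    expert: Callable[[Instance], str],
) -> Tuple[List[str], int]:
    """Label each instance with heuristics where possible, else ask the expert.

    Assumed rule (not from the paper): accept a heuristic label only when
    exactly one class has a firing heuristic; otherwise defer to the expert.
    Returns the labels and the number of expert queries used.
    """
    labels: List[str] = []
    expert_queries = 0
    for x in instances:
        # Collect every class for which at least one heuristic fires.
        candidates = {cls for cls, hs in heuristics.items() if any(h(x) for h in hs)}
        if len(candidates) == 1:
            labels.append(next(iter(candidates)))
        else:
            # No heuristic fired, or heuristics for different classes conflict.
            labels.append(expert(x))
            expert_queries += 1
    return labels, expert_queries


# Hypothetical heuristics for network traffic, keyed by class name.
example_heuristics: Dict[str, List[Heuristic]] = {
    "web": [lambda x: x["port"] in (80, 443)],
    "dns": [lambda x: x["port"] == 53],
}
```

Counting expert queries makes the trade-off visible: the more instances the heuristics cover unambiguously, the less manual labeling is needed.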

György J. Simon, Vipin Kumar, Zhi-Li Zhang

Real-time Traffic