In this paper, we describe disparity, a tool that does parallel, scalable anomaly detection for clusters. Disparity uses basic statistical methods and scalable reduction operation...
With the availability of large datasets in a variety of scientific and commercial domains, data mining has emerged as an important area within the last decade. Data mining techni...
Abstract. Scientists’ ability to generate and collect massive-scale datasets is increasing. As a result, constraints in data analysis capability rather than limitations in the av...
YongChul Kwon, Dylan Nunley, Jeffrey P. Gardner, M...
All Netflix Prize algorithms proposed so far are prohibitively costly for large-scale production systems. In this paper, we describe an efficient dataflow implementation of a coll...
Srivatsava Daruru, Nena M. Marin, Matt Walker, Joy...
Given the pairwise affinity relations associated with a set of data items, the goal of a clustering algorithm is to automatically partition the data into a small number of homogen...