KDD
1998
ACM

Scaling Clustering Algorithms to Large Databases

Practical clustering algorithms require multiple data scans to achieve convergence. For large databases, these scans become prohibitively expensive. We present a scalable clustering framework applicable to a wide class of iterative clustering algorithms. We require at most one scan of the database. In this work, the framework is instantiated and numerically justified with the popular K-Means clustering algorithm. The method is based on identifying regions of the data that are compressible, regions that must be maintained in memory, and regions that are discardable. The algorithm operates within the confines of a limited memory buffer. Empirical results demonstrate that the scalable scheme outperforms a sampling-based approach. In our scheme, data resolution is preserved to the extent possible based upon the size of the allocated memory buffer and the fit of the current clustering model to the data. The framework is naturally extended to update multiple clustering models simultaneously. We empiricall...
Paul S. Bradley, Usama M. Fayyad, Cory Reina
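
The single-scan idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state how points are classified, so the fixed `discard_radius` rule below is an assumption, and the names `ClusterSummary`, `process_buffer`, `single_scan_kmeans`, `buffer_size`, and `discard_radius` are hypothetical. Points close to a centroid are discarded into per-cluster sufficient statistics (count and coordinate sums), the remaining points are kept in memory, and the "compressible" category is omitted for brevity.

```python
class ClusterSummary:
    """Sufficient statistics (count, per-dimension sum) of discarded points."""

    def __init__(self, dim):
        self.n = 0
        self.total = [0.0] * dim

    def add(self, point):
        self.n += 1
        for j, x in enumerate(point):
            self.total[j] += x


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))


def update_centroids(centroids, summaries, retained, iters=5):
    """K-Means steps over in-memory points plus summaries of discarded ones."""
    dim = len(centroids[0])
    for _ in range(iters):
        counts = [s.n for s in summaries]
        totals = [list(s.total) for s in summaries]
        for p in retained:
            i = nearest(p, centroids)
            counts[i] += 1
            for j in range(dim):
                totals[i][j] += p[j]
        # An empty cluster keeps its previous centroid.
        centroids = [
            [t / counts[i] for t in totals[i]] if counts[i] else centroids[i]
            for i in range(len(centroids))
        ]
    return centroids


def process_buffer(centroids, summaries, retained, discard_radius):
    """Refit the model, then discard points that lie close to a centroid."""
    centroids = update_centroids(centroids, summaries, retained)
    keep = []
    for p in retained:
        i = nearest(p, centroids)
        # ASSUMPTION: a fixed-radius test stands in for the paper's criterion.
        if sq_dist(p, centroids[i]) <= discard_radius ** 2:
            summaries[i].add(p)  # keep only the statistics, drop the point
        else:
            keep.append(p)       # must stay in memory
    return centroids, keep


def single_scan_kmeans(stream, init_centroids, buffer_size, discard_radius):
    """Read each point of the stream exactly once."""
    centroids = [list(c) for c in init_centroids]
    summaries = [ClusterSummary(len(centroids[0])) for _ in centroids]
    retained, fresh = [], 0
    for point in stream:
        retained.append(list(point))
        fresh += 1
        if len(retained) >= buffer_size:
            centroids, retained = process_buffer(
                centroids, summaries, retained, discard_radius)
            fresh = 0
    if fresh:  # points that arrived after the last buffer flush
        centroids, retained = process_buffer(
            centroids, summaries, retained, discard_radius)
    return centroids, summaries, retained
```

A call such as `single_scan_kmeans(points, [(1.0, 1.0), (9.0, 9.0)], buffer_size=50, discard_radius=1.0)` returns the final centroids, the per-cluster summaries, and the points still held in memory. In this simplified version the retained set can outgrow the buffer when few points qualify for discarding; a fuller treatment would also compress such points.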
Type Conference
Year 1998
Where KDD