Sciweavers

KDD
2004
ACM

Programming the K-means clustering algorithm in SQL

13 years 9 months ago
Programming the K-means clustering algorithm in SQL
Using SQL has not been considered an efficient and feasible way to implement data mining algorithms. Although this is true for many data mining, machine learning and statistical algorithms, this work shows it is feasible to get an efficient SQL implementation of the well-known K-means clustering algorithm that can work on top of a relational DBMS. The article emphasizes both correctness and performance. From a correctness point of view the article explains how to compute Euclidean distance, nearest-cluster queries and updating clustering results in SQL. From a performance point of view it is explained how to cluster large data sets defining and indexing tables to store and retrieve intermediate and final results, optimizing and avoiding joins, optimizing and simplifying clustering aggregations, and taking advantage of sufficient statistics. Experiments evaluate scalability with synthetic data sets varying size and dimensionality. The proposed K-means implementation can cluster large...
Carlos Ordonez
Added 02 Jul 2010
Updated 02 Jul 2010
Type Conference
Year 2004
Where KDD
Authors Carlos Ordonez
Comments (0)