Programming the K-means clustering algorithm in SQL

SQL has not been considered an efficient and feasible way to implement data mining algorithms. Although this is true for many data mining, machine learning and statistical algorithms, this work shows it is feasible to get an efficient SQL implementation of the well-known K-means clustering algorithm that can work on top of a relational DBMS. The article emphasizes both correctness and performance. From a correctness point of view, the article explains how to compute Euclidean distance, run nearest-cluster queries and update clustering results in SQL. From a performance point of view, it explains how to cluster large data sets by defining and indexing tables to store and retrieve intermediate and final results, optimizing and avoiding joins, optimizing and simplifying clustering aggregations, and taking advantage of sufficient statistics. Experiments evaluate scalability with synthetic data sets of varying size and dimensionality. The proposed K-means implementation can cluster large...

Carlos Ordonez

Data Mining | KDD 2004 | Large Data Sets | Relational DBMS

Post Info

Added	02 Jul 2010
Updated	02 Jul 2010
Type	Conference
Year	2004
Where	KDD
Authors	Carlos Ordonez
