Classification
Regression
Clustering
Structure Learning
k-Means Clustering
Partitions instances into k clusters by iteratively assigning points to the nearest centroid and recomputing centroids until convergence.
Some Methods for Classification and Analysis of Multivariate Observations
J.B. MacQueen · Proc. 5th Berkeley Symposium on Mathematical Statistics and Probability · 1967
Algorithm
k-Means is an unsupervised algorithm that groups instances into k clusters. It uses only numeric attributes (the class attribute is ignored). MacQueen's formulation is the standard Lloyd's algorithm:
Algorithm: Lloyd's k-Means
Lloyd's algorithm is guaranteed to converge in a finite number of steps because the number of possible assignments is finite and WCSS strictly decreases at each iteration.
Theory → Code
Five stages of MacQueen's algorithm map to the implementation:
1
Initialization — random instances as starting centroids
2
Assignment step — argmin distance to centroid
3
Update step — recompute centroids as cluster means
4
Convergence check — stop when no assignments change
5
WCSS — within-cluster sum of squares (objective function)
Theory
Complexity
Complexity
Per iteration
— assign each of n points to nearest of k centroidsTotal
— t iterations until convergenceSpace
— data + centroid storageIn practice — most datasets converge in under 20 iterations. k-Means++ initialisation (choose centroids proportional to squared distance from existing centroids) improves both quality and convergence speed but is not implemented here.
Parameters
Output
The Cluster tab reports cluster assignments for every instance, centroid values per attribute, cluster sizes, and the within-cluster sum of squares (WCSS).
machinelearning.js.org · open source · MIT · Marin's Web Site