k-Means Clustering

Partitions instances into k clusters by iteratively assigning points to the nearest centroid and recomputing centroids until convergence.

Some Methods for Classification and Analysis of Multivariate Observations

J.B. MacQueen · Proc. 5th Berkeley Symposium on Mathematical Statistics and Probability · 1967

View paper →Free — Project Euclid

Algorithm


k-Means is an unsupervised algorithm that groups instances into k clusters. It uses only numeric attributes (the class attribute is ignored). MacQueen's formulation is the standard Lloyd's algorithm:

Algorithm: Lloyd's k-Means

1. Initialise k centroids by random sampling from the dataset
2. repeat until no assignment changes (or maxIter reached):
a. Assign each instance to its nearest centroid (Euclidean distance)
b. Recompute each centroid as the mean of its assigned instances
3. Report assignments, centroid values, cluster sizes, and WCSS

Lloyd's algorithm is guaranteed to converge in a finite number of steps because the number of possible assignments is finite and WCSS strictly decreases at each iteration.

Theory → Code


Five stages of MacQueen's algorithm map to the implementation:

1

Initialization — random instances as starting centroids

// Shuffle the dataset and take the first k instances as seeds. // Simple, avoids duplicate centroids, sufficient for small-to-medium datasets. const shuffled = [...instances].sort(() => Math.random() - 0.5); let centroids = shuffled.slice(0, k).map(inst => idxs.map(i => inst[i] ?? 0));

2

Assignment step — argmin distance to centroid

const newAssign = projected.map(p => { let best = 0, bestD = Infinity; for (let ki = 0; ki < k; ki++) { const d = centroidDist(p, centroids[ki], idxs.map((_, j) => j)); if (d < bestD) { bestD = d; best = ki; } // argmin over centroids } return best; });

3

Update step — recompute centroids as cluster means

function updateCentroids(instances, assignments, k, idxs) { return Array.from({ length: k }, (_, ki) => { const members = instances.filter((_, j) => assignments[j] === ki); // Empty cluster: reassign to a random instance (avoids degenerate solutions) if (!members.length) return instances[Math.floor(Math.random() * instances.length)]; return idxs.map(i => members.reduce((s, m) => s + (m[i] ?? 0), 0) / members.length); }); }

4

Convergence check — stop when no assignments change

const changed = newAssign.some((a, i) => a !== assignments[i]); assignments = newAssign; if (!changed) break; // Lloyd's algorithm is guaranteed to converge

5

WCSS — within-cluster sum of squares (objective function)

WCSS=k=1KxCkxμk2\text{WCSS} = \sum_{k=1}^{K}\sum_{\mathbf{x}\,\in\, C_k} \|\mathbf{x} - \boldsymbol{\mu}_k\|^2
const wcss = projected.reduce((s, p, j) => s + idxs.reduce((s2, _, ii) => s2 + (p[ii] - centroids[assignments[j]][ii]) ** 2, 0), 0); // Lower WCSS → tighter clusters; only compare across models with the same k.

Theory


Theorem 1.(Convergence of Lloyd's algorithm) The k-Means algorithm terminates in a finite number of iterations.
Proof sketch.There are knk^n possible assignments of nn points to kk clusters — a finite set. The WCSS objective strictly decreases at each step: the assignment step minimises WCSS for fixed centroids (each point goes to its nearest centre); the update step (setting centroids to cluster means) minimises WCSS for fixed assignments. Since WCSS is bounded below by 0 and strictly decreases, the algorithm cannot revisit a prior assignment and must terminate.
Lemma 1.Lloyd's algorithm converges to a local minimum of WCSS, not necessarily the global minimum. The result depends on initialisation; multiple restarts with different random seeds and selecting the run with lowest WCSS is standard practice.
Lemma 2.WCSS is a monotone decreasing function of kk: WCSS(k+1)WCSS(k)\text{WCSS}(k+1) \le \text{WCSS}(k) always. Therefore WCSS alone cannot determine the optimal kk — use the elbow method or gap statistic instead.

Complexity


Complexity

Per iteration

O(nkd)O(n \cdot k \cdot d)assign each of n points to nearest of k centroids

Total

O(nkdt)O(n \cdot k \cdot d \cdot t)t iterations until convergence

Space

O((n+k)d)O((n + k) \cdot d)data + centroid storage

In practice tnt \ll n — most datasets converge in under 20 iterations. k-Means++ initialisation (choose centroids proportional to squared distance from existing centroids) improves both quality and convergence speed but is not implemented here.

Parameters


k
Number of clusters. Try several values and compare WCSS (elbow method) to choose.
Max iterations
Safety cap (default 100). Most real datasets converge in under 20 iterations.

Output


The Cluster tab reports cluster assignments for every instance, centroid values per attribute, cluster sizes, and the within-cluster sum of squares (WCSS).

machinelearning.js.org · open source · MIT · Marin's Web Site