Support Vector Machine

Finds the maximum-margin hyperplane separating classes, with a soft-margin penalty for misclassifications — trained by stochastic gradient descent on hinge loss.

Support Vector Networks

C. Cortes & V. Vapnik · Machine Learning, Vol. 20 · 1995

View paper →Springer — institutional access may be required

Algorithm


A linear SVM finds the hyperplane w·x + b = 0 that separates classes with the largest possible margin (2/‖w‖). The soft-margin variant (Cortes & Vapnik) allows misclassifications via slack variables ξᵢ, controlled by the penalty parameter C.

Primal objective (soft-margin)

minw,b  12w2+Ciξis.t.yi(wxi+b)1ξi,  ξi0\min_{\mathbf{w},\,b}\;\tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i}\xi_i \quad \text{s.t.}\quad y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i,\;\xi_i \ge 0

Hinge loss (equivalent unconstrained form)

L(w,b)=λw2+Cimax ⁣(0,  1yi(wxi+b)),λ=1CnL(\mathbf{w}, b) = \lambda\|\mathbf{w}\|^2 + C\sum_{i}\max\!\bigl(0,\;1 - y_i(\mathbf{w}\cdot\mathbf{x}_i + b)\bigr),\quad \lambda = \tfrac{1}{Cn}

SGD update per instance

if yi(wxi+b)1y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1:
w(12λη)w\mathbf{w} \leftarrow (1 - 2\lambda\eta)\,\mathbf{w} (L2 regularise only)
else:
w(12λη)w+ηCyixi\mathbf{w} \leftarrow (1 - 2\lambda\eta)\,\mathbf{w} + \eta C y_i \mathbf{x}_i (regularise + hinge)
bb+ηCyib \leftarrow b + \eta C y_i

Multi-class problems use one-vs-rest: one SVM per class. The class whose decision function w·x + b gives the highest score wins.

Theory → Code


1

Hinge loss — zero when correctly classified outside the margin

// Hinge loss contribution for instance i: // max(0, 1 − yᵢ · (w·xᵢ + b)) // = 0 if the point is on the correct side and beyond the margin // > 0 if the point is inside or on the wrong side of the margin const score = w.reduce((s, wj, j) => s + wj * X[i][j], b) * y[i]; const hingeActive = score < 1;

2

SGD with decaying learning rate per epoch

for (let epoch = 0; epoch < epochs; epoch++) { const lr = 1 / (lambda * (epoch + 1)); // decaying schedule for (let i = 0; i < n; i++) { const score = w.reduce((s, wj, j) => s + wj * X[i][j], b) * y[i]; if (score < 1) { // Hinge active: regularise + margin correction w = w.map((wj, j) => (1 - 2 * lambda * lr) * wj + lr * C * y[i] * X[i][j]); b += lr * C * y[i]; } else { // Hinge inactive: only regularise (shrink weights toward zero) w = w.map(wj => (1 - 2 * lambda * lr) * wj); } } }

3

One-vs-rest — train one linear SVM per class

const classifiers = classValues.map(cv => { const y = instances.map(r => r[classIndex] === cv ? 1 : -1); // SVM uses +1/-1 return { cv, ...trainBinary(X, y, d, C, epochs) }; });

4

Classify — argmax of the raw margin scores

export function classify(model, instance) { const x = featureIdxs.map((fi, j) => ((instance[fi] ?? 0) - mins[j]) / ranges[j]); let best = null, bestScore = -Infinity; for (const { cv, w, b } of classifiers) { const score = w.reduce((s, wj, j) => s + wj * x[j], b); // signed margin if (score > bestScore) { bestScore = score; best = cv; } } return best; }

Theory


Theorem 1.(Structural Risk Minimisation — Vapnik, 1995) The expected test error of a hard-margin SVM is bounded by:
E[error]    O ⁣(w2R2mγ2)\mathbb{E}[\text{error}] \;\le\; O\!\left(\frac{\|\mathbf{w}\|^2 R^2}{m\,\gamma^2}\right)

where γ=1/w\gamma = 1/\|\mathbf{w}\| is the margin, RR is the radius of the smallest enclosing ball, and mm is the number of training examples. Maximising the margin minimises this generalisation bound — the theoretical justification for the SVM objective.

Theorem 2.(Representer Theorem) The optimal weight vector lies in the span of the training examples: w=iαiyixi\mathbf{w}^* = \sum_i \alpha_i y_i \mathbf{x}_i where αi0\alpha_i \ge 0. Only the support vectors (instances with αi>0\alpha_i > 0) contribute — all other training points are irrelevant to the decision boundary.

Complexity


Complexity

Training (SGD)

O(nde)O(n \cdot d \cdot e)n instances, d features, e epochs

Training (QP)

O(n2d)O(n^2 \cdot d)exact dual — not used here

Query

O(cd)O(c \cdot d)one dot product per class (one-vs-rest)

Space

O(cd)O(c \cdot d)one weight vector per class

Parameters


C
Regularisation strength (default 1.0). High C → smaller margin, fewer misclassifications. Low C → wider margin, more slack.
Epochs
SGD passes over training data (default 500). More epochs → finer convergence.

Notes


This implementation is a linear SVM only — it learns a hyperplane boundary. Kernel SVMs (RBF, polynomial) can separate non-linearly separable data but require computing a kernel matrix; they are not yet implemented here.

Only numeric attributes are used as features. Feature normalisation to [0, 1] is applied automatically — this is critical for SGD convergence and for the margin to be meaningful across attributes.

machinelearning.js.org · open source · MIT · Marin's Web Site