MachineLearning.js

Finds the maximum-margin hyperplane separating classes, with a soft-margin penalty for misclassifications — trained by stochastic gradient descent on hinge loss.

Support Vector Networks

C. Cortes & V. Vapnik · Machine Learning, Vol. 20 · 1995

View paper →Springer — institutional access may be required

Algorithm

A linear SVM finds the hyperplane w·x + b = 0 that separates classes with the largest possible margin (2/‖w‖). The soft-margin variant (Cortes & Vapnik) allows misclassifications via slack variables ξᵢ, controlled by the penalty parameter C.

Primal objective (soft-margin)

\min_{\mathbf{w},\,b}\;\tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i}\xi_i \quad \text{s.t.}\quad y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i,\;\xi_i \ge 0

Hinge loss (equivalent unconstrained form)

L(\mathbf{w}, b) = \lambda\|\mathbf{w}\|^2 + C\sum_{i}\max\!\bigl(0,\;1 - y_i(\mathbf{w}\cdot\mathbf{x}_i + b)\bigr),\quad \lambda = \tfrac{1}{Cn}

SGD update per instance

y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1

\mathbf{w} \leftarrow (1 - 2\lambda\eta)\,\mathbf{w}

(L2 regularise only)

else:

\mathbf{w} \leftarrow (1 - 2\lambda\eta)\,\mathbf{w} + \eta C y_i \mathbf{x}_i

(regularise + hinge)

b \leftarrow b + \eta C y_i

Multi-class problems use one-vs-rest: one SVM per class. The class whose decision function w·x + b gives the highest score wins.

Theory → Code

Hinge loss — zero when correctly classified outside the margin

// Hinge loss contribution for instance i: // max(0, 1 − yᵢ · (w·xᵢ + b)) // = 0 if the point is on the correct side and beyond the margin // > 0 if the point is inside or on the wrong side of the margin const score = w.reduce((s, wj, j) => s + wj * X[i][j], b) * y[i]; const hingeActive = score < 1;

SGD with decaying learning rate per epoch

for (let epoch = 0; epoch < epochs; epoch++) { const lr = 1 / (lambda * (epoch + 1)); // decaying schedule for (let i = 0; i < n; i++) { const score = w.reduce((s, wj, j) => s + wj * X[i][j], b) * y[i]; if (score < 1) { // Hinge active: regularise + margin correction w = w.map((wj, j) => (1 - 2 * lambda * lr) * wj + lr * C * y[i] * X[i][j]); b += lr * C * y[i]; } else { // Hinge inactive: only regularise (shrink weights toward zero) w = w.map(wj => (1 - 2 * lambda * lr) * wj); } } }

One-vs-rest — train one linear SVM per class

const classifiers = classValues.map(cv => { const y = instances.map(r => r[classIndex] === cv ? 1 : -1); // SVM uses +1/-1 return { cv, ...trainBinary(X, y, d, C, epochs) }; });

Classify — argmax of the raw margin scores

export function classify(model, instance) { const x = featureIdxs.map((fi, j) => ((instance[fi] ?? 0) - mins[j]) / ranges[j]); let best = null, bestScore = -Infinity; for (const { cv, w, b } of classifiers) { const score = w.reduce((s, wj, j) => s + wj * x[j], b); // signed margin if (score > bestScore) { bestScore = score; best = cv; } } return best; }

Theory

Theorem 1.(Structural Risk Minimisation — Vapnik, 1995) The expected test error of a hard-margin SVM is bounded by:

\mathbb{E}[\text{error}] \;\le\; O\!\left(\frac{\|\mathbf{w}\|^2 R^2}{m\,\gamma^2}\right)

where $\gamma = 1/\|\mathbf{w}\|$ is the margin, $R$ is the radius of the smallest enclosing ball, and $m$ is the number of training examples. Maximising the margin minimises this generalisation bound — the theoretical justification for the SVM objective.

Theorem 2.(Representer Theorem) The optimal weight vector lies in the span of the training examples:

\mathbf{w}^* = \sum_i \alpha_i y_i \mathbf{x}_i

where

\alpha_i \ge 0

. Only the support vectors (instances with

\alpha_i > 0

) contribute — all other training points are irrelevant to the decision boundary.

Complexity

Training (SGD)

O(n \cdot d \cdot e)

— n instances, d features, e epochs

Training (QP)

O(n^2 \cdot d)

— exact dual — not used here

Query

O(c \cdot d)

— one dot product per class (one-vs-rest)

Space

O(c \cdot d)

— one weight vector per class

Parameters

Regularisation strength (default 1.0). High C → smaller margin, fewer misclassifications. Low C → wider margin, more slack.

Epochs

SGD passes over training data (default 500). More epochs → finer convergence.

Notes

This implementation is a linear SVM only — it learns a hyperplane boundary. Kernel SVMs (RBF, polynomial) can separate non-linearly separable data but require computing a kernel matrix; they are not yet implemented here.

Only numeric attributes are used as features. Feature normalisation to [0, 1] is applied automatically — this is critical for SGD convergence and for the margin to be meaningful across attributes.

SVMs excel in high-dimensional spaces and work well even when features outnumber instances. They tend to be sensitive to C — try a few values (0.1, 1, 10) to tune.

On this page

Original Paper Algorithm Theory → Code Theory Complexity Parameters Notes

Logistic Regression Neural Network (MLP)

machinelearning.js.org · open source · MIT · Marin's Web Site