MachineLearning.js

Linear Regression models the target y as a linear function of the feature vector x: ŷ = w·x + b. The weights w and bias b minimise mean squared error (MSE) over the training set. This implementation uses gradient descent on the MSE loss.

Model

\hat{y}_i = \mathbf{w} \cdot \mathbf{x}_i + b

Loss (Mean Squared Error)

L(\mathbf{w}, b) = \frac{1}{n}\sum_{i}(\hat{y}_i - y_i)^2

Gradient

\frac{\partial L}{\partial \mathbf{w}} = \frac{2}{n}X^\top(X\mathbf{w} - \mathbf{y}), \qquad \frac{\partial L}{\partial b} = \frac{2}{n}\sum_{i}(\hat{y}_i - y_i)

Update

\mathbf{w} \leftarrow \mathbf{w} - \eta\,\frac{\partial L}{\partial \mathbf{w}}, \qquad b \leftarrow b - \eta\,\frac{\partial L}{\partial b}

The closed-form Normal Equations solution — $\mathbf{w} = (X^\top X)^{-1} X^\top \mathbf{y}$ — is mathematically equivalent but requires a matrix inverse that is numerically unstable for ill-conditioned feature sets. Gradient descent is preferred for general use.

Theory → Code

Normalise features and target to [0,1] for stable convergence

// Per-attribute min/max over training set const mins = featureIdxs.map(i => instances.reduce((m, r) => Math.min(m, r[i] ?? m), Infinity)); const ranges = mins.map((mn, j) => maxs[j] - mn || 1); // Target normalisation const tMin = Math.min(...targets); const tRange = Math.max(...targets) - tMin || 1; const y = instances.map(r => ((r[classIndex] ?? 0) - tMin) / tRange);

Gradient descent — update weights by MSE gradient

for (let epoch = 0; epoch < epochs; epoch++) { const dw = new Array(d).fill(0); let db = 0; for (let i = 0; i < n; i++) { const err = w.reduce((s, wj, j) => s + wj * X[i][j], b) - y[i]; // ŷᵢ − yᵢ for (let j = 0; j < d; j++) dw[j] += err * X[i][j]; // ∂L/∂wⱼ db += err; // ∂L/∂b } for (let j = 0; j < d; j++) w[j] -= (learningRate / n) * dw[j]; b -= (learningRate / n) * db; }

Predict — un-normalise the output back to original scale

export function predict(model, instance) { const x = featureIdxs.map((fi, j) => ((instance[fi] ?? 0) - mins[j]) / ranges[j]); const normPred = x.reduce((s, xi, j) => s + w[j] * xi, b); return normPred * tRange + tMin; // rescale back to original target units }

Regression metrics — R², MAE, RMSE

export function metrics(preds, actuals) { const n = preds.length; const mean = actuals.reduce((s, v) => s + v, 0) / n; const sse = preds.reduce((s, p, i) => s + (p - actuals[i]) ** 2, 0); const sst = actuals.reduce((s, a) => s + (a - mean) ** 2, 0); return { r2: 1 - sse / sst, // coefficient of determination mae: preds.reduce((s, p, i) => s + Math.abs(p - actuals[i]), 0) / n, rmse: Math.sqrt(sse / n), }; }

Theory

Theorem 1.The MSE loss

L(\mathbf{w}, b)

is convex in

\mathbf{w}

and

b

. Gradient descent with step size

\eta < 2/\lambda_{\max}

(where

\lambda_{\max}

is the largest eigenvalue of

X^\top X / n

) converges to the unique global minimum.

Proof sketch.The Hessian of MSE is

\nabla^2 L = 2X^\top X / n \succeq 0

, which is positive semi-definite — so MSE is convex. The unique minimiser satisfies the normal equations

X^\top X \mathbf{w} = X^\top \mathbf{y}

. Under the step-size condition, the gradient iterates contract at rate

(1 - \eta\lambda_{\min})^t \to 0

.□

Theorem 2.(Gauss–Markov) Among all linear unbiased estimators, the OLS solution

\hat{\mathbf{w}} = (X^\top X)^{-1} X^\top \mathbf{y}

has the smallest variance — it is the Best Linear Unbiased Estimator (BLUE).

Complexity

Training (GD)

O(n \cdot d \cdot e)

— n instances, d features, e epochs

Normal Eqs.

O(n \cdot d^2 + d^3)

— matrix multiply + inversion

Query

O(d)

— single dot product

Space

O(d)

— one weight vector

Normal equations are exact but scale poorly: $O(d^3)$ matrix inversion becomes prohibitive when $d \gg 10^3$ . Gradient descent trades exactness for scalability and is preferred in practice.

Parameters

Learning rate (η)

Gradient step size (default 0.05). Lower values are safer but slower to converge.

Epochs

Number of full gradient passes (default 800). Watch for R² plateau to judge convergence.

Evaluation Metrics

R² (coefficient of determination)— proportion of variance in the target explained by the model. R²=1 is a perfect fit; R²=0 means the model does no better than predicting the mean; negative R² means it's worse.

MAE (mean absolute error) — average absolute difference between predicted and actual values, in the same units as the target.

RMSE (root mean squared error) — square root of MSE; penalises large errors more than MAE, also in target units.

Notes

Only numeric attributes are used as features. The class attribute must also be numeric (regression task). If the class is nominal, use a classifier instead.

Linear Regression is available in the Explorer's Classify tab when the dataset has a numeric class attribute. The tab automatically switches between classification and regression algorithms based on the class type.

On this page

Reference Algorithm Theory → Code Theory Complexity Parameters Evaluation Metrics Notes

Neural Network (MLP)k-Means Clustering

machinelearning.js.org · open source · MIT · Marin's Web Site