Linear Regression

Models a numeric target as a weighted sum of input features, minimising mean squared error via gradient descent.

Pattern Recognition and Machine Learning — Chapter 3

C.M. Bishop · Springer (free draft available from Microsoft Research) · 2006

Algorithm


Linear Regression models the target y as a linear function of the feature vector x: ŷ = w·x + b. The weights w and bias b minimise mean squared error (MSE) over the training set. This implementation uses gradient descent on the MSE loss.

Model

y^i=wxi+b\hat{y}_i = \mathbf{w} \cdot \mathbf{x}_i + b

Loss (Mean Squared Error)

L(w,b)=1ni(y^iyi)2L(\mathbf{w}, b) = \frac{1}{n}\sum_{i}(\hat{y}_i - y_i)^2

Gradient

Lw=2nX(Xwy),Lb=2ni(y^iyi)\frac{\partial L}{\partial \mathbf{w}} = \frac{2}{n}X^\top(X\mathbf{w} - \mathbf{y}), \qquad \frac{\partial L}{\partial b} = \frac{2}{n}\sum_{i}(\hat{y}_i - y_i)

Update

wwηLw,bbηLb\mathbf{w} \leftarrow \mathbf{w} - \eta\,\frac{\partial L}{\partial \mathbf{w}}, \qquad b \leftarrow b - \eta\,\frac{\partial L}{\partial b}

The closed-form Normal Equations solution — w=(XX)1Xy\mathbf{w} = (X^\top X)^{-1} X^\top \mathbf{y} — is mathematically equivalent but requires a matrix inverse that is numerically unstable for ill-conditioned feature sets. Gradient descent is preferred for general use.

Theory → Code


1

Normalise features and target to [0,1] for stable convergence

// Per-attribute min/max over training set const mins = featureIdxs.map(i => instances.reduce((m, r) => Math.min(m, r[i] ?? m), Infinity)); const ranges = mins.map((mn, j) => maxs[j] - mn || 1); // Target normalisation const tMin = Math.min(...targets); const tRange = Math.max(...targets) - tMin || 1; const y = instances.map(r => ((r[classIndex] ?? 0) - tMin) / tRange);

2

Gradient descent — update weights by MSE gradient

for (let epoch = 0; epoch < epochs; epoch++) { const dw = new Array(d).fill(0); let db = 0; for (let i = 0; i < n; i++) { const err = w.reduce((s, wj, j) => s + wj * X[i][j], b) - y[i]; // ŷᵢ − yᵢ for (let j = 0; j < d; j++) dw[j] += err * X[i][j]; // ∂L/∂wⱼ db += err; // ∂L/∂b } for (let j = 0; j < d; j++) w[j] -= (learningRate / n) * dw[j]; b -= (learningRate / n) * db; }

3

Predict — un-normalise the output back to original scale

export function predict(model, instance) { const x = featureIdxs.map((fi, j) => ((instance[fi] ?? 0) - mins[j]) / ranges[j]); const normPred = x.reduce((s, xi, j) => s + w[j] * xi, b); return normPred * tRange + tMin; // rescale back to original target units }

4

Regression metrics — R², MAE, RMSE

export function metrics(preds, actuals) { const n = preds.length; const mean = actuals.reduce((s, v) => s + v, 0) / n; const sse = preds.reduce((s, p, i) => s + (p - actuals[i]) ** 2, 0); const sst = actuals.reduce((s, a) => s + (a - mean) ** 2, 0); return { r2: 1 - sse / sst, // coefficient of determination mae: preds.reduce((s, p, i) => s + Math.abs(p - actuals[i]), 0) / n, rmse: Math.sqrt(sse / n), }; }

Theory


Theorem 1.The MSE loss L(w,b)L(\mathbf{w}, b) is convex in w\mathbf{w} and bb. Gradient descent with step size η<2/λmax\eta < 2/\lambda_{\max} (where λmax\lambda_{\max} is the largest eigenvalue of XX/nX^\top X / n) converges to the unique global minimum.
Proof sketch.The Hessian of MSE is 2L=2XX/n0\nabla^2 L = 2X^\top X / n \succeq 0, which is positive semi-definite — so MSE is convex. The unique minimiser satisfies the normal equations XXw=XyX^\top X \mathbf{w} = X^\top \mathbf{y}. Under the step-size condition, the gradient iterates contract at rate (1ηλmin)t0(1 - \eta\lambda_{\min})^t \to 0.
Theorem 2.(Gauss–Markov) Among all linear unbiased estimators, the OLS solution w^=(XX)1Xy\hat{\mathbf{w}} = (X^\top X)^{-1} X^\top \mathbf{y} has the smallest variance — it is the Best Linear Unbiased Estimator (BLUE).

Complexity


Complexity

Training (GD)

O(nde)O(n \cdot d \cdot e)n instances, d features, e epochs

Normal Eqs.

O(nd2+d3)O(n \cdot d^2 + d^3)matrix multiply + inversion

Query

O(d)O(d)single dot product

Space

O(d)O(d)one weight vector

Normal equations are exact but scale poorly: O(d3)O(d^3) matrix inversion becomes prohibitive when d103d \gg 10^3. Gradient descent trades exactness for scalability and is preferred in practice.

Parameters


Learning rate (η)
Gradient step size (default 0.05). Lower values are safer but slower to converge.
Epochs
Number of full gradient passes (default 800). Watch for R² plateau to judge convergence.

Evaluation Metrics


R² (coefficient of determination)— proportion of variance in the target explained by the model. R²=1 is a perfect fit; R²=0 means the model does no better than predicting the mean; negative R² means it's worse.

MAE (mean absolute error) — average absolute difference between predicted and actual values, in the same units as the target.

RMSE (root mean squared error) — square root of MSE; penalises large errors more than MAE, also in target units.

Notes


Only numeric attributes are used as features. The class attribute must also be numeric (regression task). If the class is nominal, use a classifier instead.

machinelearning.js.org · open source · MIT · Marin's Web Site