Logistic Regression

Models the probability of each class using a sigmoid (logistic) function, trained by minimising cross-entropy loss with gradient descent.

The Regression Analysis of Binary Sequences

D.R. Cox · Journal of the Royal Statistical Society, Series B, Vol. 20 · 1958

View paper →JSTOR — free with registration

Algorithm


Logistic Regression is a discriminative linear model. For binary classification it models P(y=1|x) = σ(w·x + b), where σ is the sigmoid function. The weights w and bias b are learned by minimising binary cross-entropy loss via gradient descent.

Sigmoid

σ(z)=11+ez\sigma(z) = \frac{1}{1 + e^{-z}}

Loss (binary cross-entropy), zᵢ = w·xᵢ + b

L(w,b)=1ni[yilogσ(zi)+(1yi)log(1σ(zi))]L(\mathbf{w}, b) = -\frac{1}{n}\sum_{i}\bigl[y_i\log\sigma(z_i) + (1-y_i)\log(1-\sigma(z_i))\bigr]

Gradient

Lw=1nX ⁣(σ(Xw)y),Lb=1ni(σ(zi)yi)\frac{\partial L}{\partial \mathbf{w}} = \frac{1}{n}X^\top\!\bigl(\sigma(X\mathbf{w}) - \mathbf{y}\bigr), \qquad \frac{\partial L}{\partial b} = \frac{1}{n}\sum_{i}(\sigma(z_i) - y_i)

Update

wwηLw,bbηLb\mathbf{w} \leftarrow \mathbf{w} - \eta\,\frac{\partial L}{\partial \mathbf{w}}, \qquad b \leftarrow b - \eta\,\frac{\partial L}{\partial b}

Multi-class problems are handled with one-vs-rest (OvR): one binary classifier is trained per class, and the class with the highest sigmoid score wins.

Theory → Code


1

Sigmoid function — squashes any real number into (0, 1)

function sigmoid(z) { return 1 / (1 + Math.exp(-z)); }

2

Gradient descent on cross-entropy for one binary classifier

for (let epoch = 0; epoch < epochs; epoch++) { const dw = new Array(d).fill(0); let db = 0; for (let i = 0; i < n; i++) { const err = sigmoid(w.reduce((s, wj, j) => s + wj * X[i][j], b)) - y[i]; for (let j = 0; j < d; j++) dw[j] += err * X[i][j]; // ∂L/∂w db += err; // ∂L/∂b } for (let j = 0; j < d; j++) w[j] -= (learningRate / n) * dw[j]; b -= (learningRate / n) * db; }

3

One-vs-rest — train one binary classifier per class value

const classifiers = classValues.map(cv => { const y = instances.map(r => r[classIndex] === cv ? 1 : 0); // 1 = this class, 0 = rest return { cv, ...trainBinary(X, y, d, learningRate, epochs) }; });

4

Classify — return the class with the highest sigmoid score

export function classify(model, instance) { const x = featureIdxs.map((fi, j) => ((instance[fi] ?? 0) - mins[j]) / ranges[j]); let best = null, bestScore = -Infinity; for (const { cv, w, b } of classifiers) { const score = sigmoid(w.reduce((s, wj, j) => s + wj * x[j], b)); if (score > bestScore) { bestScore = score; best = cv; } } return best; }

Theory


Lemma 1.The binary cross-entropy loss L(w,b)L(\mathbf{w}, b) is convex in (w,b)(\mathbf{w}, b) — gradient descent converges to the global minimum (assuming it exists; it may not when the data are linearly separable).
Proof sketch.The Hessian is H=1nXdiag(σi(1σi))XH = \tfrac{1}{n}X^\top \operatorname{diag}(\sigma_i(1-\sigma_i)) X. Since σ(z)(1σ(z))>0\sigma(z)(1-\sigma(z)) > 0 for all zRz \in \mathbb{R}, the diagonal matrix is positive definite, and X()XX^\top(\cdot)X is positive semi-definite. Therefore H0H \succeq 0 and LL is convex.
Lemma 2.When the training data are linearly separable, no finite weight vector minimises the cross-entropy loss — weights grow unboundedly to push σ(zi)1\sigma(z_i) \to 1 for all positive examples. Regularisation (L2 penalty) is required to recover a finite solution.

Complexity


Complexity

Training

O(ndec)O(n \cdot d \cdot e \cdot c)one binary classifier per class (one-vs-rest)

Query

O(cd)O(c \cdot d)one sigmoid score per class

Space

O(cd)O(c \cdot d)one weight vector per class

Parameters


Learning rate (η)
Step size for gradient updates (default 0.1). Too large → diverge; too small → slow convergence.
Epochs
Number of full passes over the training data (default 500).

Notes


Only numeric attributes are used as features in this implementation. Nominal attributes are currently ignored — convert them to numeric (e.g. one-hot encoding) before loading if they carry signal.

Feature normalisation is applied automatically: each numeric attribute is scaled to [0, 1] using training-set min/max. This is essential for gradient descent to converge reliably across attributes with different scales.

machinelearning.js.org · open source · MIT · Marin's Web Site