MachineLearning.js

A multilayer perceptron (MLP) with one hidden layer, trained by full-batch gradient descent with backpropagation (Rumelhart, Hinton & Williams, 1986). It learns non-linear decision boundaries by composing sigmoid activations in the hidden layer with a softmax output.

Learning representations by back-propagating errors

D.E. Rumelhart, G.E. Hinton & R.J. Williams · Nature 323:533–536 · 1986

The paper that popularised backpropagation for multi-layer networks and launched the modern deep learning era.

Architecture

The MLP implemented here has three layers: an input layer, one hidden layer, and an output layer. Each layer is fully connected to the next; there are no skip connections or recurrences.

Input encoding. Nominal attributes with exactly two values are encoded as a single binary input (0 or 1). Nominal attributes with more than two values are one-hot encoded, so a $k$-valued attribute produces $k$ inputs. Numeric attributes are min-max normalised to $[0,1]$. The class attribute is excluded from the input.
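The encoding rules above can be sketched as follows. The attribute schema (`type`, `values`, `min`, `max`) and the function name `encodeInstance` are assumptions made for illustration and are not taken from the library. Mapping the first value of a two-valued attribute to 0 is also an assumption.

```javascript
// Hypothetical attribute schema; not the library's actual data structures.
const attributes = [
  { name: "temperature", type: "numeric", min: 60, max: 90 },
  { name: "windy", type: "nominal", values: ["false", "true"] },
  { name: "outlook", type: "nominal", values: ["sunny", "overcast", "rainy"] },
];

// Encode one instance (raw values in attribute order, class attribute already removed).
function encodeInstance(instance, attrs) {
  const x = [];
  attrs.forEach((attr, i) => {
    const v = instance[i];
    if (attr.type === "numeric") {
      // Min-max normalisation to [0, 1]; a constant attribute maps to 0.
      const range = attr.max - attr.min;
      x.push(range === 0 ? 0 : (v - attr.min) / range);
    } else if (attr.values.length === 2) {
      // Two-valued nominal: one binary input (first value -> 0, second -> 1).
      x.push(attr.values.indexOf(v) === 1 ? 1 : 0);
    } else {
      // k-valued nominal: one-hot over k inputs.
      for (const value of attr.values) x.push(value === v ? 1 : 0);
    }
  });
  return x;
}

const encoded = encodeInstance([75, "true", "overcast"], attributes);
console.log(encoded); // [0.5, 1, 0, 1, 0]
```

Here three raw attributes become $d = 5$ network inputs: one for the numeric attribute, one for the binary attribute, and three for the three-valued attribute.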

Hidden layer size. When not specified, the hidden size defaults to $H = \max\!\left(2,\;\bigl\lfloor (d + C) / 2 \bigr\rfloor\right)$, where $d$ is the number of input features after encoding and $C$ is the number of classes. This common rule of thumb places the hidden width roughly midway between the input and output dimensions.
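As a quick check of the default, the formula can be written as a one-line function. The name `defaultHiddenSize` is hypothetical.

```javascript
// Default hidden-layer width: max(2, floor((d + C) / 2)).
function defaultHiddenSize(d, C) {
  return Math.max(2, Math.floor((d + C) / 2));
}

console.log(defaultHiddenSize(4, 3)); // 3 (four encoded inputs, three classes)
console.log(defaultHiddenSize(1, 2)); // 2 (the floor gives 1, so the minimum applies)
```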

Output layer. Each output neuron corresponds to one class. Raw logits are passed through softmax to produce a valid probability distribution over classes. The predicted class is the argmax of this distribution.

MLP architecture

Layer 0 (input): d neurons — encoded feature vector x ∈ ℝ^d

Layer 1 (hidden): H neurons — z₁ = W₁x + b₁, a₁ = σ(z₁)

Layer 2 (output): C neurons — z₂ = W₂a₁ + b₂, a₂ = softmax(z₂)

Parameters: W₁ ∈ ℝ^(H×d), b₁ ∈ ℝ^H, W₂ ∈ ℝ^(C×H), b₂ ∈ ℝ^C

Weights init: Glorot uniform — U[−√(6/(fan_in+fan_out)), +√(6/(fan_in+fan_out))]
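A minimal sketch of Glorot uniform initialisation for the parameter shapes listed above. The flat row-major layout mirrors the indexing used in the code excerpts on this page; the function name `glorotUniform`, the `rng` parameter, and the zero-initialised biases are assumptions, not details confirmed by the source.

```javascript
// Glorot (Xavier) uniform initialisation for a fanOut x fanIn weight matrix,
// stored row-major in a flat Float64Array. `rng` returns uniform values in [0, 1).
function glorotUniform(fanIn, fanOut, rng = Math.random) {
  const limit = Math.sqrt(6 / (fanIn + fanOut));
  const w = new Float64Array(fanIn * fanOut);
  for (let i = 0; i < w.length; i++) w[i] = (2 * rng() - 1) * limit;
  return w;
}

const d = 4, H = 3, C = 3;
const W1 = glorotUniform(d, H); // H x d
const W2 = glorotUniform(H, C); // C x H
const b1 = new Float64Array(H); // biases start at zero (an assumption)
const b2 = new Float64Array(C);
console.log(W1.length, W2.length); // 12 9
```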

Forward Pass

Given an encoded input vector $\mathbf{x} \in \mathbb{R}^d$, the network computes predictions in two matrix-vector steps.

Hidden layer pre-activation and activation

\mathbf{z}_1 = W_1\mathbf{x} + \mathbf{b}_1 \in \mathbb{R}^H, \qquad \mathbf{a}_1 = \sigma(\mathbf{z}_1) = \frac{1}{1+e^{-\mathbf{z}_1}}

Output layer pre-activation and softmax

\mathbf{z}_2 = W_2\mathbf{a}_1 + \mathbf{b}_2 \in \mathbb{R}^C, \qquad \hat{y}_c = \frac{e^{z_{2,c}}}{\sum_{j=1}^{C} e^{z_{2,j}}}
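The softmax formula can overflow if evaluated literally for large logits. The sketch below subtracts the maximum logit first, a common technique that leaves the result unchanged. Whether the library's own `softmax` does this is not stated here, so treat the helper (and the `argmax` companion) as an illustration.

```javascript
// Numerically stable softmax: subtracting the maximum logit does not change
// the result but keeps Math.exp from overflowing.
function softmax(z) {
  const m = Math.max(...z);
  const exps = z.map((v) => Math.exp(v - m));
  const sum = exps.reduce((s, v) => s + v, 0);
  return exps.map((v) => v / sum);
}

// Index of the largest probability, i.e. the predicted class.
function argmax(p) {
  let best = 0;
  for (let c = 1; c < p.length; c++) if (p[c] > p[best]) best = c;
  return best;
}

const probs = softmax([1000, 1001, 999]); // a literal exp(1000) would be Infinity
console.log(probs, argmax(probs));
```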

The predicted class is $\hat{c} = \arg\max_c \hat{y}_c$. Training minimises the cross-entropy loss summed over all training instances:

Cross-entropy loss

\mathcal{L} = -\sum_{n=1}^{N} \log \hat{y}_{c_n}

where $c_n$ is the true class index of instance $n$.
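A small worked example of the summed loss. The function name `crossEntropy` and the `eps` guard against `log(0)` are assumptions added for illustration; the guard is not part of the formula.

```javascript
// Summed cross-entropy: L = -sum_n log(yHat[n][c_n]).
// `eps` guards against log(0) and is an assumption, not part of the formula.
function crossEntropy(yHat, labels, eps = 1e-12) {
  let loss = 0;
  for (let n = 0; n < labels.length; n++) {
    loss -= Math.log(Math.max(yHat[n][labels[n]], eps));
  }
  return loss;
}

const yHat = [
  [0.5, 0.25, 0.25], // true class 0, predicted with probability 0.5
  [0.25, 0.25, 0.5], // true class 1, predicted with probability 0.25
];
console.log(crossEntropy(yHat, [0, 1])); // -(ln 0.5 + ln 0.25) = ln 8, about 2.0794
```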

Backpropagation

Gradients are computed by the chain rule, propagating the error signal from the output layer back through the hidden layer. The softmax + cross-entropy combination has a particularly clean combined derivative.

Output layer error (combined softmax + cross-entropy gradient)

\boldsymbol{\delta}_2 = \hat{\mathbf{y}} - \mathbf{e}_{c}

where $\mathbf{e}_c$ is the one-hot vector for the true class. This is the gradient of the cross-entropy loss with respect to the pre-activation $\mathbf{z}_2$.

Output weight gradients

\frac{\partial \mathcal{L}}{\partial W_2} = \boldsymbol{\delta}_2 \mathbf{a}_1^\top, \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{b}_2} = \boldsymbol{\delta}_2

Hidden layer error (backpropagated through sigmoid)

\boldsymbol{\delta}_1 = \left(W_2^\top \boldsymbol{\delta}_2\right) \odot \sigma'(\mathbf{z}_1), \qquad \sigma'(\mathbf{z}_1) = \mathbf{a}_1 \odot (1-\mathbf{a}_1)

Hidden weight gradients

\frac{\partial \mathcal{L}}{\partial W_1} = \boldsymbol{\delta}_1 \mathbf{x}^\top, \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{b}_1} = \boldsymbol{\delta}_1

Weights are updated by full-batch gradient descent after accumulating gradients over all $N$ training instances:

Parameter update

W \leftarrow W - \frac{\eta}{N} \nabla_W \mathcal{L}

where $\eta$ is the learning rate. This implementation uses $\eta = 0.05$ and runs for 200 epochs by default.

Theory → Code

Encode inputs — numeric normalisation and nominal one-hot

Forward pass — sigmoid hidden, softmax output

    // Hidden layer: a1[h] = σ(W1[h,·]·x + b1[h])
    const a1 = Array.from({ length: H }, (_, h) => {
      let z = b1[h];
      for (let j = 0; j < inputSize; j++) z += W1[h * inputSize + j] * x[j];
      return sigmoid(z);
    });

    // Output layer: a2 = softmax(W2·a1 + b2)
    const z2 = Array.from({ length: C }, (_, c) => {
      let z = b2[c];
      for (let h = 0; h < H; h++) z += W2[c * H + h] * a1[h];
      return z;
    });
    const a2 = softmax(z2); // probability distribution over C classes

Backward pass — accumulate gradients, then update

    // Per instance (inside the loop over the N training instances):
    // δ₂ = a₂ − e_c (combined softmax + cross-entropy derivative)
    const d2 = a2.map((v, c) => v - (c === ci ? 1 : 0));

    // Accumulate output weight gradients
    for (let c = 0; c < C; c++) {
      for (let h = 0; h < H; h++) dW2[c * H + h] += d2[c] * a1[h];
      db2[c] += d2[c];
    }

    // Backpropagate through sigmoid: δ₁[h] = (Σ_c W2[c,h] δ₂[c]) · a₁[h](1−a₁[h])
    const d1 = Array.from({ length: H }, (_, h) => {
      let s = 0;
      for (let c = 0; c < C; c++) s += W2[c * H + h] * d2[c];
      return s * a1[h] * (1 - a1[h]); // sigmoid derivative via the activation
    });

    // Accumulate hidden weight gradients
    for (let h = 0; h < H; h++) {
      for (let j = 0; j < inputSize; j++) dW1[h * inputSize + j] += d1[h] * x[j];
      db1[h] += d1[h];
    }

    // Once per epoch (after the loop): full-batch update with step size lr / N
    const sc = lr / N;
    for (let i = 0; i < W1.length; i++) W1[i] -= sc * dW1[i];
    for (let i = 0; i < W2.length; i++) W2[i] -= sc * dW2[i];
    // The accumulated bias gradients are applied the same way
    for (let i = 0; i < b1.length; i++) b1[i] -= sc * db1[i];
    for (let i = 0; i < b2.length; i++) b2[i] -= sc * db2[i];
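To see the forward pass, backward pass, and update working together, here is a self-contained sketch of a complete training loop. It follows the equations on this page but is not the library's code: the name `trainMLP`, the options object, the seeded `mulberry32` generator, the zero-initialised biases, and the toy OR data are all assumptions for illustration.

```javascript
// Small seeded PRNG (mulberry32) so that runs are reproducible; an assumption.
function mulberry32(seed) {
  let a = seed;
  return () => {
    let t = (a += 0x6d2b79f5);
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const sigmoid = (z) => 1 / (1 + Math.exp(-z));

function softmax(z) {
  const m = Math.max(...z);
  const e = z.map((v) => Math.exp(v - m));
  const s = e.reduce((acc, v) => acc + v, 0);
  return e.map((v) => v / s);
}

// Hypothetical trainer: X holds encoded inputs, y holds class indices.
function trainMLP(X, y, C, { H = 2, lr = 0.05, epochs = 200, seed = 1 } = {}) {
  const N = X.length;
  const d = X[0].length;
  const rng = mulberry32(seed);
  const glorot = (fanIn, fanOut) => {
    const lim = Math.sqrt(6 / (fanIn + fanOut));
    return Float64Array.from({ length: fanIn * fanOut }, () => (2 * rng() - 1) * lim);
  };
  const W1 = glorot(d, H), b1 = new Float64Array(H);
  const W2 = glorot(H, C), b2 = new Float64Array(C);
  const losses = [];

  for (let epoch = 0; epoch < epochs; epoch++) {
    const dW1 = new Float64Array(W1.length), db1 = new Float64Array(H);
    const dW2 = new Float64Array(W2.length), db2 = new Float64Array(C);
    let loss = 0;
    for (let n = 0; n < N; n++) {
      const x = X[n];
      // Forward pass.
      const a1 = Array.from({ length: H }, (_, h) => {
        let z = b1[h];
        for (let j = 0; j < d; j++) z += W1[h * d + j] * x[j];
        return sigmoid(z);
      });
      const z2 = Array.from({ length: C }, (_, c) => {
        let z = b2[c];
        for (let h = 0; h < H; h++) z += W2[c * H + h] * a1[h];
        return z;
      });
      const a2 = softmax(z2);
      loss -= Math.log(a2[y[n]]);
      // Backward pass: accumulate this instance's gradients.
      const d2 = a2.map((v, c) => v - (c === y[n] ? 1 : 0));
      for (let c = 0; c < C; c++) {
        for (let h = 0; h < H; h++) dW2[c * H + h] += d2[c] * a1[h];
        db2[c] += d2[c];
      }
      for (let h = 0; h < H; h++) {
        let s = 0;
        for (let c = 0; c < C; c++) s += W2[c * H + h] * d2[c];
        const d1 = s * a1[h] * (1 - a1[h]);
        for (let j = 0; j < d; j++) dW1[h * d + j] += d1 * x[j];
        db1[h] += d1;
      }
    }
    losses.push(loss); // loss measured before this epoch's update
    // Full-batch update with step size lr / N.
    const sc = lr / N;
    const step = (w, g) => { for (let i = 0; i < w.length; i++) w[i] -= sc * g[i]; };
    step(W1, dW1); step(b1, db1); step(W2, dW2); step(b2, db2);
  }
  return { W1, b1, W2, b2, losses };
}

// Toy data (logical OR), chosen only for illustration.
const model = trainMLP([[0, 0], [0, 1], [1, 0], [1, 1]], [0, 1, 1, 1], 2);
console.log(model.losses[0], model.losses[model.losses.length - 1]);
```

With the default learning rate the summed loss falls slowly over the 200 epochs, which is expected for full-batch updates scaled by `lr / N`.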

Theory

Theorem 1 (Universal Approximation; Cybenko 1989, Hornik 1991). A feedforward network with a single hidden layer containing a finite number of neurons with a continuous, bounded, non-constant activation function can approximate any continuous function on a compact subset of $\mathbb{R}^d$ to arbitrary precision.

Universal approximation guarantees existence but not learnability: gradient descent may converge to a poor local minimum, and the required number of hidden neurons may be exponentially large. In practice, adding depth (more layers) is often more parameter-efficient than adding width for complex functions.

Theorem 2. The combined gradient of the cross-entropy loss with respect to the softmax pre-activations is $\boldsymbol{\delta}_2 = \hat{\mathbf{y}} - \mathbf{e}_c$. This follows because the Jacobian of softmax cancels cleanly with the derivative of the log-loss:

\frac{\partial \mathcal{L}}{\partial z_{2,c}} = \hat{y}_c - \mathbf{1}[c = c_{\text{true}}]

so the output gradient equals the prediction error: large when the network is wrong, near zero when it is confident and correct.
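The identity in Theorem 2 can be checked numerically by comparing the analytic gradient with central finite differences of the per-instance loss. The logits, the true class, and the step size `h` below are arbitrary illustrative choices.

```javascript
function softmax(z) {
  const m = Math.max(...z);
  const e = z.map((v) => Math.exp(v - m));
  const s = e.reduce((acc, v) => acc + v, 0);
  return e.map((v) => v / s);
}

// Per-instance loss as a function of the logits: L(z) = -log softmax(z)[c].
const loss = (z, c) => -Math.log(softmax(z)[c]);

const z = [0.3, -1.2, 2.0];
const trueClass = 0;

// Analytic gradient: yHat - e_c.
const analytic = softmax(z).map((p, c) => p - (c === trueClass ? 1 : 0));

// Central finite differences: (L(z + h e_k) - L(z - h e_k)) / (2h).
const h = 1e-6;
const numeric = z.map((_, k) => {
  const zp = z.slice();
  const zm = z.slice();
  zp[k] += h;
  zm[k] -= h;
  return (loss(zp, trueClass) - loss(zm, trueClass)) / (2 * h);
});

console.log(analytic, numeric);
```

The two vectors should agree to several decimal places, and the components of the analytic gradient sum to zero because both $\hat{\mathbf{y}}$ and $\mathbf{e}_c$ sum to one.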

Glorot initialisation. Initialising weights uniformly in $\left[-\sqrt{6/(d_{\text{in}}+d_{\text{out}})},\;+\sqrt{6/(d_{\text{in}}+d_{\text{out}})}\right]$ keeps the variance of activations and gradients approximately constant across layers at the start of training, which helps avoid both vanishing and exploding gradients.

Complexity

Forward pass: $O(N \cdot (dH + HC))$, for $N$ instances, $d$ inputs, $H$ hidden neurons, and $C$ classes

Backward pass: $O(N \cdot (dH + HC))$, the same order as the forward pass (two matrix-vector products)

Per epoch: $O(N \cdot H \cdot (d + C))$, one full-batch pass over all instances

Total training: $O(E \cdot N \cdot H \cdot (d + C))$, with $E = 200$ epochs by default; scales linearly with the number of instances

Inference: $O(dH + HC)$, a single forward pass per instance
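To make the orders of growth concrete, the sketch below counts multiply-accumulate operations for the forward pass only. The dataset sizes (150 instances, 4 inputs, 3 classes) are assumed for illustration, and `forwardMACs` is a hypothetical helper.

```javascript
// Multiply-accumulate count for one forward pass over one instance: dH + HC.
const forwardMACs = (d, H, C) => d * H + H * C;

// Illustrative sizes (assumed): 150 instances, 4 inputs, 3 classes, default H and E.
const N = 150, d = 4, C = 3, E = 200;
const H = Math.max(2, Math.floor((d + C) / 2)); // 3
const perInstance = forwardMACs(d, H, C);       // 4*3 + 3*3 = 21
const perEpoch = N * perInstance;               // 3150
const total = E * perEpoch;                     // 630000
console.log({ H, perInstance, perEpoch, total });
```

The backward pass is of the same order, so the full training cost is a small constant multiple of this figure.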

Notes

Full-batch vs mini-batch. This implementation uses full-batch gradient descent — gradients are accumulated over all N instances before each weight update. This is stable but slow on large datasets. Mini-batch SGD (batches of 32–256) is standard in practice and adds implicit regularisation through gradient noise.
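For comparison, a mini-batch variant would first split the instance indices into shuffled batches and then apply one update per batch. The helper below is only a sketch of that partitioning step; `miniBatches` is a hypothetical name and this is not something the implementation described here does.

```javascript
// Split the indices 0..N-1 into shuffled mini-batches (Fisher-Yates shuffle).
function miniBatches(N, batchSize, rng = Math.random) {
  const idx = Array.from({ length: N }, (_, i) => i);
  for (let i = N - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    [idx[i], idx[j]] = [idx[j], idx[i]];
  }
  const batches = [];
  for (let s = 0; s < N; s += batchSize) batches.push(idx.slice(s, s + batchSize));
  return batches;
}

const batches = miniBatches(10, 4);
console.log(batches.map((b) => b.length)); // [4, 4, 2]
```

Each batch would then accumulate gradients over its own instances and scale the step by the batch length rather than by $N$.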

No regularisation. There is no weight decay, dropout, or early stopping. On small datasets (like iris or contact-lenses) the network may overfit when used in training-set evaluation mode. Cross-validation gives a fairer accuracy estimate.

Sigmoid saturates. For large $|z|$ the sigmoid gradient approaches zero, slowing learning. The implementation clamps the sigmoid input to $[-500, 500]$ to avoid Math.exp overflow, but deep saturation is still possible without batch normalisation or ReLU activations.
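The saturation effect is easy to see numerically. The sketch below uses a sigmoid with the clamp described above and evaluates its derivative at a few inputs; the helper names are illustrative and the library's own function may differ in detail.

```javascript
// Sigmoid with the input clamped to [-500, 500], as described above.
function sigmoid(z) {
  const clamped = Math.max(-500, Math.min(500, z));
  return 1 / (1 + Math.exp(-clamped));
}

// Derivative expressed through the activation a = sigmoid(z): a * (1 - a).
const sigmoidGrad = (z) => {
  const a = sigmoid(z);
  return a * (1 - a);
};

for (const z of [0, 2, 5, 10]) {
  console.log(z, sigmoidGrad(z).toExponential(2));
}
// The gradient peaks at 0.25 for z = 0 and decays roughly like exp(-|z|).
```

By $z = 10$ the gradient is already below $10^{-4}$, so a hidden unit driven that far receives almost no learning signal.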

Network visualisation. After running Neural Network (MLP) in the Explorer, switch to the Visualize tab and select Network to see the learned weight diagram. Edge colour encodes sign (red = positive, blue = negative) and opacity encodes magnitude.

hiddenSize

Number of hidden neurons H. Defaults to ⌊(d+C)/2⌋, minimum 2.

epochs

Training iterations over the full dataset. Default: 200.

Learning rate η for gradient descent. Default: 0.05.


machinelearning.js.org · open source · MIT · Marin's Web Site