Neural Network (MLP)

A Multilayer Perceptron with one hidden layer, trained by full-batch gradient descent with backpropagation (Rumelhart, Hinton & Williams, 1986). It learns non-linear decision boundaries by composing sigmoid activations in the hidden layer with a softmax output.

Learning representations by back-propagating errors

D.E. Rumelhart, G.E. Hinton & R.J. Williams · Nature 323:533–536 · 1986

The paper that popularised backpropagation for multi-layer networks and launched the modern deep learning era

Architecture


The MLP implemented here has three layers: an input layer, one hidden layer, and an output layer. Each layer is fully connected to the next; there are no skip connections or recurrences.

Input encoding. Nominal attributes with exactly two values are encoded as a single binary input (0 or 1). Nominal attributes with more than two values are one-hot encoded: a $k$-valued attribute produces $k$ inputs. Numeric attributes are min-max normalised to $[0, 1]$. The class attribute is excluded from the input.

Hidden layer size. When not specified, the hidden size defaults to $H = \max\!\left(2,\;\bigl\lfloor (d + C) / 2 \bigr\rfloor\right)$, where $d$ is the number of input features after encoding and $C$ is the number of classes. This is a common rule of thumb that sits between the input and output dimensions.
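As a quick check of the rule, here is a minimal sketch. The function name `defaultHiddenSize` is hypothetical and not taken from the implementation.

```javascript
// Hypothetical helper: default hidden-layer size H = max(2, floor((d + C) / 2)).
function defaultHiddenSize(d, C) {
  return Math.max(2, Math.floor((d + C) / 2));
}

console.log(defaultHiddenSize(4, 3)); // 3 (e.g. 4 encoded inputs, 3 classes)
console.log(defaultHiddenSize(1, 2)); // 2 (floor(1.5) = 1, raised to the minimum of 2)
```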

Output layer. Each output neuron corresponds to one class. Raw logits are passed through softmax to produce a valid probability distribution over classes. The predicted class is the argmax of this distribution.

MLP architecture

Layer 0 (input): d neurons — encoded feature vector x ∈ ℝ^d
Layer 1 (hidden): H neurons — z₁ = W₁x + b₁, a₁ = σ(z₁)
Layer 2 (output): C neurons — z₂ = W₂a₁ + b₂, a₂ = softmax(z₂)
 
Parameters: W₁ ∈ ℝ^(H×d), b₁ ∈ ℝ^H, W₂ ∈ ℝ^(C×H), b₂ ∈ ℝ^C
Weights init: Glorot uniform — U[−√(6/(fan_in+fan_out)), +√(6/(fan_in+fan_out))]
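A minimal sketch of the initialisation above, assuming weights are stored as flat row-major arrays. The helper name `glorotUniform`, the injectable `rng` parameter, the example sizes, and the zero-initialised biases are assumptions for illustration, not details confirmed by the source.

```javascript
// Sketch of Glorot uniform initialisation for a flat, row-major weight array.
// fanIn and fanOut are the layer's input and output sizes; rng is injectable for testing.
function glorotUniform(fanIn, fanOut, rng = Math.random) {
  const limit = Math.sqrt(6 / (fanIn + fanOut));
  const w = new Float64Array(fanIn * fanOut);
  for (let i = 0; i < w.length; i++) w[i] = (rng() * 2 - 1) * limit;
  return w;
}

const d = 4, H = 3, C = 3;       // example sizes, chosen for illustration
const W1 = glorotUniform(d, H);  // H×d weights
const W2 = glorotUniform(H, C);  // C×H weights
const b1 = new Float64Array(H);  // zero biases: an assumption, the source does not say
const b2 = new Float64Array(C);
```

Passing a fixed `rng` makes the draw reproducible, which is convenient when comparing training runs.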

Forward Pass


Given an encoded input vector $\mathbf{x} \in \mathbb{R}^d$, the network computes predictions in two matrix-vector steps.

Hidden layer pre-activation and activation

$$\mathbf{z}_1 = W_1\mathbf{x} + \mathbf{b}_1 \in \mathbb{R}^H, \qquad \mathbf{a}_1 = \sigma(\mathbf{z}_1) = \frac{1}{1+e^{-\mathbf{z}_1}}$$

Output layer pre-activation and softmax

$$\mathbf{z}_2 = W_2\mathbf{a}_1 + \mathbf{b}_2 \in \mathbb{R}^C, \qquad \hat{y}_c = \frac{e^{z_{2,c}}}{\sum_{j=1}^{C} e^{z_{2,j}}}$$
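The softmax step can be sketched as follows. Subtracting the largest logit before exponentiating is a standard numerical-stability trick and an assumption here; the source does not say whether the implementation does this. `argmax` is a hypothetical helper name.

```javascript
// Softmax with the maximum logit subtracted first. The shift leaves the result
// unchanged mathematically but keeps Math.exp from overflowing.
function softmax(z) {
  const m = Math.max(...z);
  const e = z.map(v => Math.exp(v - m));
  const s = e.reduce((a, b) => a + b, 0);
  return e.map(v => v / s);
}

// Index of the largest probability (the first one wins on ties).
function argmax(p) {
  let best = 0;
  for (let i = 1; i < p.length; i++) if (p[i] > p[best]) best = i;
  return best;
}

const probs = softmax([2.0, 1.0, 0.1]);
console.log(probs);         // three probabilities that sum to 1
console.log(argmax(probs)); // 0
```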

The predicted class is $\hat{c} = \arg\max_c \hat{y}_c$. Training minimises the cross-entropy loss summed over all training instances:

Cross-entropy loss

$$\mathcal{L} = -\sum_{n=1}^{N} \log \hat{y}_{c_n}$$

where $c_n$ is the true class index of instance $n$.
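A minimal sketch of the summed loss, assuming the per-instance softmax outputs are already available. The function name `crossEntropy` and the `eps` guard against `log(0)` are assumptions for illustration and are not part of the formula.

```javascript
// Summed cross-entropy: L = -Σ_n log ŷ_{c_n}.
// probs[n] is the softmax output for instance n; labels[n] is its true class index.
// eps guards against log(0); it is an assumption, not part of the formula.
function crossEntropy(probs, labels, eps = 1e-12) {
  let loss = 0;
  for (let n = 0; n < probs.length; n++) {
    loss -= Math.log(Math.max(probs[n][labels[n]], eps));
  }
  return loss;
}

const totalLoss = crossEntropy([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]], [0, 1]);
console.log(totalLoss.toFixed(4)); // -(ln 0.7 + ln 0.8) ≈ 0.5798
```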

Backpropagation


Gradients are computed by the chain rule, propagating the error signal from the output layer back through the hidden layer. The softmax + cross-entropy combination has a particularly clean combined derivative.

Output layer error (combined softmax + cross-entropy gradient)

$$\boldsymbol{\delta}_2 = \hat{\mathbf{y}} - \mathbf{e}_{c}$$

where $\mathbf{e}_c$ is the one-hot vector for the true class. This is the gradient of the cross-entropy loss with respect to the pre-activation $\mathbf{z}_2$.

Output weight gradients

$$\frac{\partial \mathcal{L}}{\partial W_2} = \boldsymbol{\delta}_2 \mathbf{a}_1^\top, \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{b}_2} = \boldsymbol{\delta}_2$$

Hidden layer error (backpropagated through sigmoid)

$$\boldsymbol{\delta}_1 = \left(W_2^\top \boldsymbol{\delta}_2\right) \odot \sigma'(\mathbf{z}_1), \qquad \sigma'(\mathbf{z}_1) = \mathbf{a}_1 \odot (1-\mathbf{a}_1)$$

Hidden weight gradients

$$\frac{\partial \mathcal{L}}{\partial W_1} = \boldsymbol{\delta}_1 \mathbf{x}^\top, \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{b}_1} = \boldsymbol{\delta}_1$$

Weights are updated by full-batch gradient descent after accumulating gradients over all $N$ training instances:

Parameter update

$$W \leftarrow W - \frac{\eta}{N} \nabla_W \mathcal{L}$$

where $\eta$ is the learning rate. This implementation uses $\eta = 0.05$ and runs for 200 epochs by default.

Theory → Code


1

Encode inputs — numeric normalisation and nominal one-hot

    function encodeInstance(instance, encoders) {
      const x = [];
      for (let i = 0; i < encoders.length; i++) {
        const enc = encoders[i];
        if (!enc) continue; // skip class attribute
        const v = instance[i];
        if (enc.type === 'numeric') {
          // min-max normalise; missing values fall back to min, constant columns map to 0.5
          x.push(enc.range > 0 ? ((v ?? enc.min) - enc.min) / enc.range : 0.5);
        } else if (enc.type === 'binary') {
          x.push(enc.vals.indexOf(v) === 1 ? 1 : 0);
        } else {
          // one-hot for k > 2
          const idx = enc.vals.indexOf(v);
          enc.vals.forEach((_, j) => x.push(j === idx ? 1 : 0));
        }
      }
      return x;
    }
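The shape of the `encoders` array can be read off the function: one entry per attribute, with a falsy entry for the class attribute. The sketch below shows one way such an array might be built. The `buildEncoders` helper, the `attributes` schema, and the `'onehot'` type name are all hypothetical; the source only shows how the encoders are consumed.

```javascript
// Hypothetical builder for the `encoders` array consumed by encodeInstance.
// `attributes` is an assumed schema: { type: 'numeric' } or { type: 'nominal', values: [...] }.
function buildEncoders(attributes, instances, classIndex) {
  return attributes.map((attr, i) => {
    if (i === classIndex) return null; // class attribute: no encoder
    if (attr.type === 'numeric') {
      const col = instances.map(r => r[i]).filter(v => v != null);
      const min = Math.min(...col);
      const max = Math.max(...col);
      return { type: 'numeric', min, range: max - min };
    }
    return attr.values.length === 2
      ? { type: 'binary', vals: attr.values }
      : { type: 'onehot', vals: attr.values }; // any other type name takes the one-hot branch
  });
}

const attributes = [
  { type: 'numeric' },
  { type: 'nominal', values: ['red', 'green', 'blue'] },
  { type: 'nominal', values: ['no', 'yes'] }, // class attribute
];
const rows = [[1.0, 'red', 'no'], [3.0, 'blue', 'yes']];
const encoders = buildEncoders(attributes, rows, 2);
console.log(encoders[0]); // { type: 'numeric', min: 1, range: 2 }
```

With these encoders, `encodeInstance([2.0, 'green', 'no'], encoders)` would produce `[0.5, 0, 1, 0]`: one normalised numeric input followed by a three-way one-hot block.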

2

Forward pass — sigmoid hidden, softmax output

    // Hidden layer: a1[h] = σ(W1[h,·]·x + b1[h])
    const a1 = Array.from({ length: H }, (_, h) => {
      let z = b1[h];
      for (let j = 0; j < inputSize; j++) z += W1[h * inputSize + j] * x[j];
      return sigmoid(z);
    });

    // Output layer: a2 = softmax(W2·a1 + b2)
    const z2 = Array.from({ length: C }, (_, c) => {
      let z = b2[c];
      for (let h = 0; h < H; h++) z += W2[c * H + h] * a1[h];
      return z;
    });
    const a2 = softmax(z2); // probability distribution over C classes

3

Backward pass — accumulate gradients, then update

    // Per instance (ci is the true class index):
    // δ₂ = a₂ − e_c (combined softmax + cross-entropy derivative)
    const d2 = a2.map((v, c) => v - (c === ci ? 1 : 0));

    // Accumulate output weight gradients
    for (let c = 0; c < C; c++) {
      for (let h = 0; h < H; h++) dW2[c * H + h] += d2[c] * a1[h];
      db2[c] += d2[c];
    }

    // Backpropagate through sigmoid: δ₁[h] = (Σ_c W2[c,h] δ₂[c]) · a₁[h](1−a₁[h])
    const d1 = Array.from({ length: H }, (_, h) => {
      let s = 0;
      for (let c = 0; c < C; c++) s += W2[c * H + h] * d2[c];
      return s * a1[h] * (1 - a1[h]); // sigmoidD
    });
    for (let h = 0; h < H; h++) {
      for (let j = 0; j < inputSize; j++) dW1[h * inputSize + j] += d1[h] * x[j];
      db1[h] += d1[h];
    }

    // Full-batch weight update after all N instances
    const sc = lr / N;
    for (let i = 0; i < W1.length; i++) W1[i] -= sc * dW1[i];
    for (let i = 0; i < W2.length; i++) W2[i] -= sc * dW2[i];
    // Biases take the same step using the accumulated db1 and db2
    for (let h = 0; h < H; h++) b1[h] -= sc * db1[h];
    for (let c = 0; c < C; c++) b2[c] -= sc * db2[c];

Theory


Theorem 1. (Universal Approximation, Cybenko 1989; Hornik 1991) A feedforward network with a single hidden layer containing a finite number of neurons with a continuous, bounded, non-constant activation function can approximate any continuous function on a compact subset of $\mathbb{R}^d$ to arbitrary precision.

Universal approximation guarantees existence but not learnability: gradient descent may converge to a poor local minimum, and the required number of hidden neurons may be exponentially large. In practice, adding depth (more layers) is often more parameter-efficient than adding width for complex functions.

Theorem 2. The combined gradient of the cross-entropy loss with respect to the softmax pre-activations is $\boldsymbol{\delta}_2 = \hat{\mathbf{y}} - \mathbf{e}_c$. This follows because the Jacobian of softmax cancels cleanly with the derivative of the log-loss:
$$\frac{\partial \mathcal{L}}{\partial z_{2,c}} = \hat{y}_c - \mathbf{1}[c = c_{\text{true}}]$$
making the output gradient proportional to the prediction error: large when the network is wrong, near zero when it is confident and correct.
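One way to sanity-check Theorem 2 numerically is to compare the analytic gradient with central finite differences. The sketch below is illustrative only: `instanceLoss` and `numericGrad` are hypothetical helper names, the logits are arbitrary, and `softmax` is defined locally so the block is self-contained.

```javascript
function softmax(z) {
  const m = Math.max(...z);
  const e = z.map(v => Math.exp(v - m));
  const s = e.reduce((a, b) => a + b, 0);
  return e.map(v => v / s);
}

// Loss for a single instance with true class c: -log softmax(z)[c].
function instanceLoss(z, c) {
  return -Math.log(softmax(z)[c]);
}

// Central finite differences approximate ∂L/∂z_k for each k.
function numericGrad(z, c, h = 1e-6) {
  return z.map((_, k) => {
    const zp = z.slice(); zp[k] += h;
    const zm = z.slice(); zm[k] -= h;
    return (instanceLoss(zp, c) - instanceLoss(zm, c)) / (2 * h);
  });
}

const logits = [0.5, -1.2, 2.0]; // arbitrary example logits
const trueClass = 2;
const analytic = softmax(logits).map((p, k) => p - (k === trueClass ? 1 : 0)); // ŷ − e_c
const numeric = numericGrad(logits, trueClass);
const maxDiff = Math.max(...analytic.map((g, k) => Math.abs(g - numeric[k])));
console.log(maxDiff < 1e-6); // true
```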

Glorot initialisation. Initialising weights uniformly in $\left[-\sqrt{6/(d_{\text{in}}+d_{\text{out}})},\;+\sqrt{6/(d_{\text{in}}+d_{\text{out}})}\right]$ keeps the variance of activations and gradients approximately constant across layers at the start of training, which helps avoid both vanishing and exploding gradients.

Complexity


Forward pass: $O(N \cdot (dH + HC))$ (N instances, d inputs, H hidden, C classes)
Backward pass: $O(N \cdot (dH + HC))$ (same order as forward; two matrix-vector products)
Per epoch: $O(N \cdot H \cdot (d + C))$ (full batch over all instances)
Total training: $O(E \cdot N \cdot H \cdot (d + C))$ (E = 200 epochs by default; scales linearly with data)
Inference: $O(dH + HC)$ (single forward pass per instance)
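To put rough numbers on these bounds, the sketch below counts multiply-adds. Both function names are hypothetical, and the factor of 2 for training is only an order-of-magnitude assumption based on the backward pass being the same order as the forward pass.

```javascript
// Rough multiply-add counts implied by the bounds above (constants are indicative only).
function forwardMultiplyAdds(d, H, C) {
  return d * H + H * C; // W1·x plus W2·a1
}

function trainingMultiplyAdds(N, d, H, C, E = 200) {
  // Forward and backward are the same order, so count roughly two passes per instance.
  return 2 * E * N * forwardMultiplyAdds(d, H, C);
}

console.log(forwardMultiplyAdds(4, 3, 3));       // 21
console.log(trainingMultiplyAdds(150, 4, 3, 3)); // 1260000
```

For a dataset of 150 instances with 4 inputs, 3 hidden neurons, and 3 classes, the default 200 epochs come to roughly a million multiply-adds.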

Notes


Full-batch vs mini-batch. This implementation uses full-batch gradient descent — gradients are accumulated over all N instances before each weight update. This is stable but slow on large datasets. Mini-batch SGD (batches of 32–256) is standard in practice and adds implicit regularisation through gradient noise.
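For comparison, a mini-batch variant would shuffle the instance indices each epoch and update the weights once per batch. This is not what the implementation does; the `miniBatches` helper below is a hypothetical sketch of the index handling only.

```javascript
// Fisher-Yates shuffle of indices 0..N-1, then slice into consecutive batches.
// rng is injectable so the split can be made reproducible.
function miniBatches(N, batchSize, rng = Math.random) {
  const idx = Array.from({ length: N }, (_, i) => i);
  for (let i = N - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    [idx[i], idx[j]] = [idx[j], idx[i]];
  }
  const batches = [];
  for (let s = 0; s < N; s += batchSize) batches.push(idx.slice(s, s + batchSize));
  return batches;
}

const batches = miniBatches(10, 4);
console.log(batches.map(b => b.length)); // [ 4, 4, 2 ]
```

Each batch would then accumulate gradients over its own instances and scale the step by the batch length rather than by N.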

No regularisation. There is no weight decay, dropout, or early stopping. On small datasets (like iris or contact-lenses) the network may overfit when used in training-set evaluation mode. Cross-validation gives a fairer accuracy estimate.

Sigmoid saturates. For large $|z|$ the sigmoid gradient approaches zero, slowing learning. The implementation clamps the sigmoid input to $[-500, 500]$ to avoid Math.exp overflow, but deep saturation is still possible without batch normalisation or ReLU activations.
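A minimal sketch of a clamped sigmoid and its derivative shows the saturation directly. The function bodies are assumptions consistent with the description above, not the implementation's actual source.

```javascript
// Sigmoid with its input clamped to [-500, 500], as described above.
function sigmoid(z) {
  const c = Math.max(-500, Math.min(500, z));
  return 1 / (1 + Math.exp(-c));
}

// Derivative expressed through the activation: σ'(z) = a(1 − a), where a = σ(z).
function sigmoidD(a) {
  return a * (1 - a);
}

for (const z of [0, 2, 10, 40]) {
  const a = sigmoid(z);
  console.log(z, a, sigmoidD(a)); // the derivative shrinks rapidly as |z| grows
}
```

At z = 0 the derivative is at its maximum of 0.25; by z = 10 it has already dropped below 1e-4.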

Network visualisation. After running Neural Network (MLP) in the Explorer, switch to the Visualize tab and select Network to see the learned weight diagram. Edge colour encodes sign (red = positive, blue = negative) and opacity encodes magnitude.

hiddenSize
Number of hidden neurons H. Defaults to ⌊(d+C)/2⌋, minimum 2.
epochs
Training iterations over the full dataset. Default: 200.
lr
Learning rate η for gradient descent. Default: 0.05.

machinelearning.js.org · open source · MIT · Marin's Web Site