A multilayer perceptron (MLP) with one hidden layer, trained by full-batch gradient descent with backpropagation (Rumelhart, Hinton & Williams, 1986). It learns non-linear decision boundaries by composing sigmoid activations in the hidden layer with a softmax output.
Learning representations by back-propagating errors
The paper that popularised backpropagation for multi-layer networks and launched the modern deep learning era.
Architecture
The MLP implemented here has three layers: an input layer, one hidden layer, and an output layer. Each layer is fully connected to the next; there are no skip connections or recurrences.
Input encoding. Nominal attributes with exactly two values are encoded as a single binary input (0 or 1). Nominal attributes with more than two values are one-hot encoded — a k-valued attribute produces k inputs. Numeric attributes are min-max normalised to [0,1]. The class attribute is excluded from the input.
Hidden layer size. When not specified, the hidden size defaults to H=max(2,⌊(d+C)/2⌋), where d is the number of input features after encoding and C is the number of classes. This is a common rule of thumb that sits between the input and output dimensions.
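The default rule above can be written as a one-line helper. This is a minimal sketch; the function name `defaultHiddenSize` is hypothetical and not taken from the implementation.

```javascript
// Hypothetical helper: default hidden-layer size H = max(2, floor((d + C) / 2)).
function defaultHiddenSize(d, C) {
  return Math.max(2, Math.floor((d + C) / 2));
}

// d = 4 encoded inputs and C = 3 classes: floor(3.5) = 3
const irisLike = defaultHiddenSize(4, 3);
// d = 1 and C = 2: floor(1.5) = 1, raised to the minimum of 2
const tiny = defaultHiddenSize(1, 2);
console.log(irisLike, tiny);
```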
Output layer. Each output neuron corresponds to one class. Raw logits are passed through softmax to produce a valid probability distribution over classes. The predicted class is the argmax of this distribution.
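As an illustration of the output step, the sketch below turns raw logits into probabilities and picks the argmax. The `softmax` helper used by the implementation is not shown in this page, so the version here is an assumption; subtracting the maximum logit before exponentiating is a common way to avoid overflow and does not change the result.

```javascript
// Assumed softmax: shift by the max logit for numerical stability.
function softmax(z) {
  const m = Math.max(...z);
  const exps = z.map(v => Math.exp(v - m));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

// Index of the largest probability (first one wins on ties).
function argmax(p) {
  let best = 0;
  for (let i = 1; i < p.length; i++) if (p[i] > p[best]) best = i;
  return best;
}

const probs = softmax([1, 2, 3]);   // a valid distribution over 3 classes
const predicted = argmax(probs);    // class with the largest logit
console.log(probs, predicted);
```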
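MLP architecture
Layer 0 (input): d neurons, the encoded feature vector $x \in \mathbb{R}^d$
Layer 1 (hidden): H neurons, sigmoid activation
Layer 2 (output): C neurons, softmax over classes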
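Given an encoded input vector $x \in \mathbb{R}^d$, the network computes its prediction in two matrix-vector steps.
Hidden layer pre-activation and activation
$$z_1 = W_1 x + b_1 \in \mathbb{R}^H, \qquad a_1 = \sigma(z_1) = \frac{1}{1 + e^{-z_1}}$$
Output layer pre-activation and softmax
$$z_2 = W_2 a_1 + b_2 \in \mathbb{R}^C, \qquad \hat{y}_c = \frac{e^{z_{2,c}}}{\sum_{j=1}^{C} e^{z_{2,j}}}$$
The predicted class is $\hat{c} = \arg\max_c \hat{y}_c$. Training minimises the cross-entropy loss summed over all training instances:
Cross-entropy loss
$$L = -\sum_{n=1}^{N} \log \hat{y}_{c_n}$$
where $c_n$ is the true class index of instance $n$ and $\hat{y}$ is evaluated on that instance.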
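The summed cross-entropy loss can be computed directly from the softmax outputs. The sketch below assumes `probs[n]` holds the predicted distribution for instance n and `labels[n]` its true class index; the function name and the small `eps` guard against log(0) are assumptions, not part of the implementation described here.

```javascript
// Summed cross-entropy: L = -sum_n log(probs[n][labels[n]]).
// eps is an assumed safeguard so that a probability of exactly 0 does not give -Infinity.
function crossEntropy(probs, labels, eps = 1e-12) {
  let loss = 0;
  for (let n = 0; n < probs.length; n++) {
    loss -= Math.log(Math.max(probs[n][labels[n]], eps));
  }
  return loss;
}

// Two instances: true-class probabilities 0.5 and 0.75.
const loss = crossEntropy([[0.5, 0.5], [0.25, 0.75]], [0, 1]);
console.log(loss); // equals log(2) + log(4/3)
```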
Backpropagation
Gradients are computed by the chain rule, propagating the error signal from the output layer back through the hidden layer. The softmax + cross-entropy combination has a particularly clean combined derivative.
Output layer error
$$\delta_2 = \hat{y} - e_c$$
where $e_c$ is the one-hot vector for the true class. This is the gradient of the cross-entropy loss with respect to the pre-activation $z_2$.
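One way to convince yourself of this combined derivative is a finite-difference check: perturb each logit slightly and compare the change in loss with the prediction minus the one-hot target. The sketch below is illustrative only; the logits, the class index and the helper names are made up for the example.

```javascript
// Assumed softmax with max-shift for stability.
function softmax(z) {
  const m = Math.max(...z);
  const exps = z.map(v => Math.exp(v - m));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

// Cross-entropy of a single instance as a function of the logits.
const lossAt = (z, c) => -Math.log(softmax(z)[c]);

const z = [0.2, -1.0, 0.7]; // example logits
const c = 2;                // example true class

// Analytic gradient: prediction minus one-hot target.
const analytic = softmax(z).map((p, k) => p - (k === c ? 1 : 0));

// Numeric gradient by central differences.
const h = 1e-6;
const numeric = z.map((_, k) => {
  const zp = z.slice(); zp[k] += h;
  const zm = z.slice(); zm[k] -= h;
  return (lossAt(zp, c) - lossAt(zm, c)) / (2 * h);
});

const maxDiff = Math.max(...analytic.map((a, k) => Math.abs(a - numeric[k])));
console.log(maxDiff); // should be tiny
```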
Output weight gradients
$$\frac{\partial L}{\partial W_2} = \delta_2 a_1^\top, \qquad \frac{\partial L}{\partial b_2} = \delta_2$$
Hidden layer error (backpropagated through sigmoid)
$$\delta_1 = (W_2^\top \delta_2) \odot \sigma'(z_1), \qquad \sigma'(z_1) = a_1 \odot (1 - a_1)$$
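The hidden-layer error relies on the sigmoid derivative being expressible from the activation alone, as a(1 − a). A quick numerical check of that identity, with an arbitrary example input, might look like this:

```javascript
const sigmoid = z => 1 / (1 + Math.exp(-z));

const z = 0.3;               // arbitrary pre-activation
const a = sigmoid(z);        // activation
const analytic = a * (1 - a); // derivative written in terms of the activation

// Central-difference estimate of d sigmoid / dz at the same point.
const h = 1e-6;
const numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h);

console.log(analytic, numeric);
```

Because the derivative only needs the stored activation, the backward pass does not have to keep the pre-activations around.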
Hidden weight gradients
$$\frac{\partial L}{\partial W_1} = \delta_1 x^\top, \qquad \frac{\partial L}{\partial b_1} = \delta_1$$
Weights are updated by full-batch gradient descent after accumulating gradients over all $N$ training instances:
Parameter update
$$W \leftarrow W - \frac{\eta}{N} \nabla_W L$$
where $\eta$ is the learning rate. This implementation uses $\eta = 0.05$ and runs for 200 epochs by default.
Theory → Code
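1. Encode inputs: numeric normalisation and nominal one-hot
// Turn one raw instance into the numeric input vector x.
// encoders[i] describes attribute i; a falsy entry marks the class attribute.
function encodeInstance(instance, encoders) {
  const x = [];
  for (let i = 0; i < encoders.length; i++) {
    const enc = encoders[i];
    if (!enc) continue; // skip class attribute
    const v = instance[i];
    if (enc.type === 'numeric') {
      // Min-max normalise to [0, 1]; a missing value falls back to the minimum,
      // and a constant attribute (range 0) maps to 0.5.
      x.push(enc.range > 0 ? ((v ?? enc.min) - enc.min) / enc.range : 0.5);
    } else if (enc.type === 'binary') {
      // Two-valued nominal: 1 for the second value, 0 otherwise.
      x.push(enc.vals.indexOf(v) === 1 ? 1 : 0);
    } else { // one-hot for k > 2
      const idx = enc.vals.indexOf(v);
      enc.vals.forEach((_, j) => x.push(j === idx ? 1 : 0));
    }
  }
  return x;
}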
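The `encoders` array consumed by `encodeInstance` is not shown on this page. The sketch below is one plausible way to build it; the function name `buildEncoders`, the attribute descriptor shape (`type`, `values`) and the `'onehot'` type label are all assumptions made for illustration. Only the fields that `encodeInstance` reads (`type`, `min`, `range`, `vals`) follow the code above.

```javascript
// Hypothetical builder for the encoders array (attribute descriptor shape is assumed).
function buildEncoders(attributes, instances, classIndex) {
  return attributes.map((attr, i) => {
    if (i === classIndex) return null; // class attribute gets no encoder
    if (attr.type === 'numeric') {
      // Ignore missing values when computing the observed range.
      const vals = instances.map(r => r[i]).filter(v => v != null);
      const min = Math.min(...vals);
      const max = Math.max(...vals);
      return { type: 'numeric', min, range: max - min };
    }
    // Two-valued nominals become one binary input, larger ones are one-hot encoded.
    const type = attr.values.length === 2 ? 'binary' : 'onehot';
    return { type, vals: attr.values };
  });
}

const attributes = [
  { type: 'numeric' },
  { type: 'nominal', values: ['no', 'yes'] },
  { type: 'nominal', values: ['red', 'green', 'blue'] },
  { type: 'nominal', values: ['a', 'b'] }, // class attribute
];
const instances = [
  [1, 'no', 'red', 'a'],
  [3, 'yes', 'blue', 'b'],
  [null, 'yes', 'green', 'a'],
];
const encoders = buildEncoders(attributes, instances, 3);

// Number of network inputs d after encoding: 1 numeric + 1 binary + 3 one-hot.
const inputSize = encoders.reduce(
  (n, e) => n + (!e ? 0 : e.type === 'onehot' ? e.vals.length : 1), 0);
console.log(inputSize);
```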
2. Forward pass: sigmoid hidden, softmax output
// Hidden layer: a1[h] = σ(W1[h,·]·x + b1[h])
// W1 is stored row-major as a flat array of length H * inputSize.
const a1 = Array.from({ length: H }, (_, h) => {
  let z = b1[h];
  for (let j = 0; j < inputSize; j++) z += W1[h * inputSize + j] * x[j];
  return sigmoid(z);
});
// Output layer: a2 = softmax(W2·a1 + b2)
// W2 is stored row-major as a flat array of length C * H.
const z2 = Array.from({ length: C }, (_, c) => {
  let z = b2[c];
  for (let h = 0; h < H; h++) z += W2[c * H + h] * a1[h];
  return z;
});
const a2 = softmax(z2); // probability distribution over C classes
3. Backward pass: accumulate gradients, then update
// Per instance (ci is the true class index of the current instance):
// δ₂ = a₂ − e_c (combined softmax + cross-entropy derivative)
const d2 = a2.map((v, c) => v - (c === ci ? 1 : 0));
// Accumulate output weight and bias gradients
for (let c = 0; c < C; c++) {
  for (let h = 0; h < H; h++) dW2[c * H + h] += d2[c] * a1[h];
  db2[c] += d2[c];
}
// Backpropagate through sigmoid: δ₁[h] = (Σ_c W2[c,h] δ₂[c]) · a₁[h](1−a₁[h])
const d1 = Array.from({ length: H }, (_, h) => {
  let s = 0;
  for (let c = 0; c < C; c++) s += W2[c * H + h] * d2[c];
  return s * a1[h] * (1 - a1[h]); // sigmoid derivative from the activation
});
// Accumulate hidden weight and bias gradients
for (let h = 0; h < H; h++) {
  for (let j = 0; j < inputSize; j++) dW1[h * inputSize + j] += d1[h] * x[j];
  db1[h] += d1[h];
}
// After all N instances: full-batch update of weights and biases
const sc = lr / N;
for (let i = 0; i < W1.length; i++) W1[i] -= sc * dW1[i];
for (let i = 0; i < W2.length; i++) W2[i] -= sc * dW2[i];
for (let h = 0; h < H; h++) b1[h] -= sc * db1[h];
for (let c = 0; c < C; c++) b2[c] -= sc * db2[c];
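To see the three steps working together, the following self-contained sketch trains a 2-2-2 network on the four XOR points for 200 epochs at a learning rate of 0.05 and compares the loss before and after. It is not the implementation's actual training loop: the seeded generator, the uniform initialisation in [−0.5, 0.5), the zero biases and all function names are assumptions chosen to keep the example short and repeatable. It reports only the loss, not accuracy.

```javascript
// Seeded linear congruential generator so the run is repeatable (an assumption).
function makeRng(seed) {
  let s = seed >>> 0;
  return () => {
    s = (Math.imul(s, 1664525) + 1013904223) >>> 0;
    return s / 4294967296;
  };
}
const sigmoid = z => 1 / (1 + Math.exp(-z));
function softmax(z) {
  const m = Math.max(...z);
  const e = z.map(v => Math.exp(v - m));
  const sum = e.reduce((a, b) => a + b, 0);
  return e.map(v => v / sum);
}
// d-H-C network with small random weights and zero biases (assumed initialisation).
function createNet(d, H, C, rng) {
  const init = n => Array.from({ length: n }, () => rng() - 0.5);
  return { d, H, C, W1: init(H * d), b1: new Array(H).fill(0),
           W2: init(C * H), b2: new Array(C).fill(0) };
}
function forward(net, x) {
  const { d, H, C, W1, b1, W2, b2 } = net;
  const a1 = Array.from({ length: H }, (_, h) => {
    let z = b1[h];
    for (let j = 0; j < d; j++) z += W1[h * d + j] * x[j];
    return sigmoid(z);
  });
  const z2 = Array.from({ length: C }, (_, c) => {
    let z = b2[c];
    for (let h = 0; h < H; h++) z += W2[c * H + h] * a1[h];
    return z;
  });
  return { a1, a2: softmax(z2) };
}
// Summed cross-entropy over the whole dataset.
function totalLoss(net, X, y) {
  let loss = 0;
  X.forEach((x, n) => { loss -= Math.log(forward(net, x).a2[y[n]]); });
  return loss;
}
// One full-batch epoch: accumulate gradients over all instances, then update once.
function trainEpoch(net, X, y, lr) {
  const { d, H, C, W1, b1, W2, b2 } = net;
  const dW1 = new Array(W1.length).fill(0), db1 = new Array(H).fill(0);
  const dW2 = new Array(W2.length).fill(0), db2 = new Array(C).fill(0);
  X.forEach((x, n) => {
    const { a1, a2 } = forward(net, x);
    const d2 = a2.map((v, c) => v - (c === y[n] ? 1 : 0));
    const d1 = a1.map((a, h) => {
      let s = 0;
      for (let c = 0; c < C; c++) s += W2[c * H + h] * d2[c];
      return s * a * (1 - a);
    });
    for (let c = 0; c < C; c++) {
      for (let h = 0; h < H; h++) dW2[c * H + h] += d2[c] * a1[h];
      db2[c] += d2[c];
    }
    for (let h = 0; h < H; h++) {
      for (let j = 0; j < d; j++) dW1[h * d + j] += d1[h] * x[j];
      db1[h] += d1[h];
    }
  });
  const sc = lr / X.length;
  for (let i = 0; i < W1.length; i++) W1[i] -= sc * dW1[i];
  for (let i = 0; i < W2.length; i++) W2[i] -= sc * dW2[i];
  for (let h = 0; h < H; h++) b1[h] -= sc * db1[h];
  for (let c = 0; c < C; c++) b2[c] -= sc * db2[c];
}

const X = [[0, 0], [0, 1], [1, 0], [1, 1]];
const y = [0, 1, 1, 0];
const net = createNet(2, 2, 2, makeRng(42));
const lossBefore = totalLoss(net, X, y);
for (let epoch = 0; epoch < 200; epoch++) trainEpoch(net, X, y, 0.05);
const lossAfter = totalLoss(net, X, y);
console.log(lossBefore, lossAfter);
```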
Theory
Theorem 1 (Universal Approximation; Cybenko 1989, Hornik 1991). A feedforward network with a single hidden layer containing a finite number of neurons with a continuous, bounded, non-constant activation function can approximate any continuous function on a compact subset of $\mathbb{R}^d$ to arbitrary precision, provided the hidden layer is large enough.
Universal approximation guarantees existence but not learnability: gradient descent may converge to a poor local minimum, and the required number of hidden neurons may be exponentially large. In practice, adding depth (more layers) is often more parameter-efficient than adding width for complex functions.
Theorem 2. The combined gradient of the cross-entropy loss with respect to the softmax pre-activations is $\delta_2 = \hat{y} - e_c$. This follows because the Jacobian of softmax cancels cleanly with the derivative of the log-loss:
$$\frac{\partial L}{\partial z_{2,c}} = \hat{y}_c - \mathbb{1}[c = c_{\text{true}}]$$
making the output gradient proportional to the prediction error: large when the network is wrong, near zero when it is confident and correct.
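For readers who want the intermediate steps, here is a short derivation for a single instance, writing $t$ for the true class index (a shorthand introduced here for $c_{\text{true}}$):

```latex
\begin{align*}
L &= -\log \hat{y}_t,
\qquad
\hat{y}_k = \frac{e^{z_{2,k}}}{\sum_{j=1}^{C} e^{z_{2,j}}} \\[4pt]
\frac{\partial \hat{y}_t}{\partial z_{2,c}}
  &= \hat{y}_t \left( \mathbb{1}[c = t] - \hat{y}_c \right)
  && \text{(softmax Jacobian)} \\[4pt]
\frac{\partial L}{\partial z_{2,c}}
  &= -\frac{1}{\hat{y}_t} \cdot \frac{\partial \hat{y}_t}{\partial z_{2,c}}
   = -\left( \mathbb{1}[c = t] - \hat{y}_c \right)
   = \hat{y}_c - \mathbb{1}[c = t]
\end{align*}
```

The factor $1/\hat{y}_t$ from the logarithm cancels the leading $\hat{y}_t$ of the Jacobian, which is why no division appears in the final gradient.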
Glorot initialisation. Initialising weights uniformly in $\left[-\sqrt{6/(d_{\text{in}} + d_{\text{out}})},\ +\sqrt{6/(d_{\text{in}} + d_{\text{out}})}\right]$ keeps the variance of activations and gradients approximately constant across layers at the start of training, which helps avoid both vanishing and exploding gradients.
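A minimal sketch of Glorot uniform initialisation for the flat, row-major weight arrays used in the code above, with the usual bound sqrt(6 / (d_in + d_out)). The function name and the optional `rng` parameter are assumptions; whether the implementation initialises its weights exactly this way is not shown here.

```javascript
// Hypothetical helper: draw fanIn * fanOut weights uniformly from [-limit, +limit),
// where limit = sqrt(6 / (fanIn + fanOut)).
// rng defaults to Math.random; pass a seeded generator for repeatable runs.
function glorotUniform(fanIn, fanOut, rng = Math.random) {
  const limit = Math.sqrt(6 / (fanIn + fanOut));
  return Array.from({ length: fanIn * fanOut }, () => (rng() * 2 - 1) * limit);
}

const d = 4, H = 3, C = 3;
const W1 = glorotUniform(d, H); // hidden weights, H rows of d inputs
const W2 = glorotUniform(H, C); // output weights, C rows of H inputs
const limit1 = Math.sqrt(6 / (d + H));
console.log(W1.length, W2.length, limit1);
```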
Complexity
Forward pass: $O(N \cdot (dH + HC))$, for N instances, d inputs, H hidden neurons and C classes
Backward pass: $O(N \cdot (dH + HC))$, the same order as the forward pass (two matrix-vector products)
Per epoch: $O(N \cdot H \cdot (d + C))$, one full batch over all instances
Total training: $O(E \cdot N \cdot H \cdot (d + C))$, with E = 200 epochs by default; scales linearly with the number of instances
Inference: $O(dH + HC)$, a single forward pass per instance
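To make these orders of growth concrete, the sketch below counts parameters and forward-pass multiply-adds for given sizes. The function and field names are hypothetical, and the example sizes (4 inputs, 3 hidden neurons, 3 classes, 150 instances) are chosen only for illustration.

```javascript
// Hypothetical cost estimate for a d-H-C network trained for E epochs on N instances.
function costEstimate(d, H, C, N, E) {
  const macsPerInstance = d * H + H * C;          // multiply-adds in one forward pass
  const parameters = (d + 1) * H + (H + 1) * C;   // weights plus biases
  return {
    parameters,
    macsPerInstance,
    forwardMacsPerEpoch: N * macsPerInstance,
    forwardMacsTotal: E * N * macsPerInstance,
  };
}

const est = costEstimate(4, 3, 3, 150, 200);
console.log(est);
```

The backward pass is of the same order as the forward pass, so total training work is a small constant multiple of `forwardMacsTotal`.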
Notes
Full-batch vs mini-batch. This implementation uses full-batch gradient descent — gradients are accumulated over all N instances before each weight update. This is stable but slow on large datasets. Mini-batch SGD (batches of 32–256) is standard in practice and adds implicit regularisation through gradient noise.
No regularisation. There is no weight decay, dropout, or early stopping. On small datasets (like iris or contact-lenses) the network may overfit when used in training-set evaluation mode. Cross-validation gives a fairer accuracy estimate.
Sigmoid saturates. For large $|z|$ the sigmoid gradient approaches zero, slowing learning. The implementation clamps the sigmoid input to $[-500, 500]$ to avoid Math.exp overflow, but deep saturation is still possible without batch normalisation or ReLU activations.
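The clamping described in this note might look like the sketch below. The implementation's own `sigmoid` is not shown on this page, so treat this as an assumed equivalent. The last line illustrates saturation: at z = 10 the gradient a(1 − a) is already below 1e-4.

```javascript
// Assumed clamped sigmoid: restrict the input to [-500, 500] before exponentiating.
function sigmoid(z) {
  const clamped = Math.max(-500, Math.min(500, z));
  return 1 / (1 + Math.exp(-clamped));
}

const mid = sigmoid(0);       // exactly 0.5
const high = sigmoid(1000);   // clamped to 500, rounds to 1 in double precision
const low = sigmoid(-1000);   // clamped to -500, tiny but still positive
const gradAt10 = sigmoid(10) * (1 - sigmoid(10)); // saturated gradient
console.log(mid, high, low, gradAt10);
```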
Network visualisation. After running Neural Network (MLP) in the Explorer, switch to the Visualize tab and select Network to see the learned weight diagram. Edge colour encodes sign (red = positive, blue = negative) and opacity encodes magnitude.
hiddenSize: Number of hidden neurons H. Defaults to ⌊(d+C)/2⌋, minimum 2.
epochs: Training iterations over the full dataset. Default: 200.
lr: Learning rate η for gradient descent. Default: 0.05.