Decision Tree (ID3)

Recursively splits the training data on the attribute that gives the highest information gain, building a tree that can be traversed at classify time.

Induction of Decision Trees

J.R. Quinlan · Machine Learning, Vol. 1 · 1986

View paper →Springer — institutional access may be required

Algorithm


ID3 (Iterative Dichotomiser 3) builds a decision tree top-down by greedily selecting, at each node, the attribute that maximises information gain — the reduction in entropy when the data is partitioned by that attribute.

Algorithm: ID3 BuildTree

BuildTree(SS, AA):
if all instances in SS share one class → return Leaf(class)
if A=A = \emptyset or max depth → return Leaf(majority class in SS)
aargmaxaAIG(S,a)a^* \leftarrow \arg\max_{a \in A}\, IG(S, a)
create node splitting on aa^*
for each value vv of aa^*:
Sv{xS:x[a]=v}S_v \leftarrow \{\,\mathbf{x} \in S : \mathbf{x}[a^*] = v\,\}
attach BuildTree(SvS_v, A{a}A \setminus \{a^*\}) for branch vv

Information Gain

IG(S,a)=H(S)vSvSH(Sv)IG(S,\, a) = H(S) - \sum_{v} \frac{|S_v|}{|S|} \cdot H(S_v)

Shannon Entropy

H(S)=cpclog2pcH(S) = -\sum_{c} p_c \log_2 p_c

Numeric attributes are handled with binary splits — the algorithm tries every midpoint between adjacent sorted values and picks the threshold with the highest information gain.

Theory → Code


Four implementation steps map to Quinlan's paper:

1

Entropy — measure impurity of a set of instances

function entropy(instances, classIndex) { const counts = {}; for (const inst of instances) { const c = inst[classIndex]; if (c !== null) counts[c] = (counts[c] || 0) + 1; } const n = Object.values(counts).reduce((s, v) => s + v, 0); return Object.values(counts).reduce((s, v) => { const p = v / n; return s - p * Math.log2(p); // H = -Σ p log₂p }, 0); }

2

Best split — find the threshold/value with maximum information gain

// Numeric: try every midpoint between adjacent sorted values for (let i = 0; i < sorted.length - 1; i++) { const t = (sorted[i][attrIdx] + sorted[i+1][attrIdx]) / 2; const gain = parentH - (left.length / n) * entropy(left, classIndex) - (right.length / n) * entropy(right, classIndex); if (gain > bestGain) { bestGain = gain; bestThreshold = t; } }

3

Recursive tree building — greedy top-down splitting

function buildTree(instances, attributes, classIndex, usedNominal, maxDepth, depth) { // Stop: pure node, depth limit, or no gain available const pure = new Set(instances.map(i => i[classIndex]).filter(v => v !== null)); if (pure.size <= 1 || depth >= maxDepth || instances.length <= 1) return { leaf: true, value: majorityClass(instances, classIndex) }; // Pick attribute with highest information gain let bestAttr = -1, bestGain = 0; for (let i = 0; i < attributes.length; i++) { const { gain } = bestSplit(instances, i, attributes[i], classIndex); if (gain > bestGain) { bestGain = gain; bestAttr = i; } } if (bestAttr === -1) return { leaf: true, value: majorityClass(instances, classIndex) }; // ... recurse into children }

4

Classify — traverse the tree by following branches

export function classify(model, instance) { let node = model.tree; while (!node.leaf) { const val = instance[node.attrIdx]; if (node.type === 'numeric') { node = (val !== null && val <= node.threshold) ? node.children.lte : node.children.gt; } else { node = node.children[val] ?? { leaf: true, value: node.fallback }; } } return node.value; }

Theory


Lemma 1.Information gain is non-negative: IG(S,a)0IG(S, a) \ge 0 for all attributes aa.
Proof sketch.Entropy HH is concave on probability distributions. By Jensen's inequality applied to the concave function HH:
H(S)    vSvSH(Sv)H(S) \;\ge\; \sum_{v} \frac{|S_v|}{|S|} H(S_v)
Therefore IG(S,a)=H(S)vSvSH(Sv)0IG(S, a) = H(S) - \sum_v \frac{|S_v|}{|S|}H(S_v) \ge 0.
Lemma 2.ID3 is guaranteed to reduce training error at each split, but provides no bound on generalisation error. The tree will overfit if grown without depth or leaf-size limits.

Complexity


Complexity

Training

O(ndlognh)O(n \cdot d \cdot \log n \cdot h)n instances, d attributes, h max depth

Query

O(h)O(h)single root-to-leaf traversal

Space

O(n)O(n)tree nodes proportional to training set

At each node, scoring all dd attributes over the remaining nn instances costs O(nd)O(n \cdot d); with O(logn)O(\log n) nodes per level and depth hh, total training is O(ndhlogn)O(n \cdot d \cdot h \cdot \log n). Numeric attributes add a O(nlogn)O(n \log n) sort per attribute per node.

Parameters


Max depth
Maximum tree depth (default 20). Shallower trees are more interpretable but may underfit.

Notes


Overfitting:ID3 grows trees until leaves are pure, which often overfits noisy training data. The max-depth limit is the only regularisation here. C4.5 (Quinlan's successor) adds post-pruning to address this.

Bias toward high-cardinality attributes: Information gain favours attributes with many values. C4.5 corrects this with Gain Ratio instead of raw gain.

machinelearning.js.org · open source · MIT · Marin's Web Site