K2 Algorithm

Greedy Bayesian network structure learning from discrete data — Cooper & Herskovits (1992). Given a node ordering, K2 greedily finds the parent set for each node that maximises the Cooper-Herskovits score.

A Bayesian Method for the Induction of Probabilistic Networks from Data

G.F. Cooper & E. Herskovits · Machine Learning 9(4):309–347 · 1992

View paper →The paper that introduced the K2 algorithm and the Cooper-Herskovits scoring function

Algorithm


A Bayesian network is a DAG where each node represents a discrete random variable and each directed edge encodes a direct probabilistic dependency. Structure learning asks: given a dataset, which DAG best explains it?

The naive approach — evaluate all possible DAGs — is NP-hard. K2 makes it tractable with two constraints: (1) a fixed node ordering that restricts which nodes can be parents (earlier nodes only), and (2) a greedy hill-climbing search that adds one parent at a time, stopping when the score no longer improves.

K2

Input: dataset D, node ordering σ = (X₁, …, X_n), max parents u
Output: parent sets π_i for each node X_i
 
for i = 1 to n do
π_i ← ∅; P_old ← f(X_i, ∅)
while |π_i| < u and ∃ improvement do
z* ← argmax_z f(X_i, π_i ∪ z) where z ∈ {σ[1..i-1]} \ π_i
if f(X_i, π_i ∪ z*) > P_old
P_old ← f(X_i, π_i ∪ z*); π_i ← π_i ∪ z*
else break
return {π_i}

Cooper-Herskovits Score


The score f(Xi,πi)f(X_i, \pi_i) is the marginal likelihood of the data given the structure, integrated over all possible CPT parameters with a uniform Dirichlet prior. For node XiX_i with rir_i values, parent set πi\pi_i, and qi=jπjq_i = \prod_j |\pi_j| parent configurations:

Cooper-Herskovits scoring function

f(Xi,πi)=j=1qi(ri1)!(Nij+ri1)!k=1riαijk!f(X_i, \pi_i) = \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} \alpha_{ijk}!

where Nij=kαijkN_{ij} = \sum_k \alpha_{ijk} is the count of instances with the jj-th parent configuration, and αijk\alpha_{ijk} is the count where additionally Xi=vkX_i = v_k.

In log-space (avoiding factorial overflow):

Log score (used in practice)

logf(Xi,πi)=j=1qi ⁣[logΓ(ri)logΓ(Nij+ri)+k=1rilogΓ(αijk+1)]\log f(X_i,\pi_i) = \sum_{j=1}^{q_i}\!\Bigl[\log\Gamma(r_i) - \log\Gamma(N_{ij}+r_i) + \sum_{k=1}^{r_i}\log\Gamma(\alpha_{ijk}+1)\Bigr]

Adding a parent either increases or decreases this score. K2 only adds a candidate parent when the score strictly improves — that is the hill-climbing stopping condition.

Theory → Code


1

Build a frequency table over the joint (node, parent config)

// For each training instance, compute a string key for the parent config // and increment the count for the observed node value. for (const row of instances) { const key = parentIdxs.map(pi => row[pi]).join('\0'); table[key] ??= {}; table[key][row[xi]] = (table[key][row[xi]] ?? 0) + 1; }

2

Evaluate the Cooper-Herskovits log score

for (const config of allConfigs(parentVals)) { const counts = table[config.join('\0')] ?? {}; const Nij = xiVals.reduce((s, v) => s + (counts[v] ?? 0), 0); // lgamma(r) − lgamma(Nij + r): normalisation term logScore += lgamma(r) - lgamma(Nij + r); // Σ_k lgamma(α_ijk + 1): observed-count terms for (const v of xiVals) logScore += lgamma((counts[v] ?? 0) + 1); }

3

Greedy parent search — add the best-scoring candidate

let parents = [], P_old = chScore(xi, [], ds); while (parents.length < maxParents) { let best = -Infinity, bestZ = -1; // Only look at nodes earlier in the ordering (the K2 restriction) for (const z of ordering.slice(0, pos)) { if (parents.includes(z)) continue; const s = chScore(xi, [...parents, z], ds); if (s > best) { best = s; bestZ = z; } } if (bestZ < 0 || best <= P_old) break; // no improvement — stop P_old = best; parents.push(bestZ); }

Theory


Theorem 1.(Cooper & Herskovits, §3) Under a uniform Dirichlet parameter prior and a Dirichlet-Markov condition (parameter independence across nodes and parent configurations), the marginal likelihood of data D given structure BSB_S factors as a product of per-node terms:
P(DBS)=i=1nf(Xi,πi)P(D \mid B_S) = \prod_{i=1}^{n} f(X_i, \pi_i)
making the K2 objective decomposable — scoring each node independently is globally valid.
Theorem 2.The greedy hill-climbing in K2 is not globally optimal. The ordering constraint removes all cycles but does not guarantee that the highest-scoring DAG consistent with the ordering is found — adding the locally best parent at each step may block a higher-scoring structure achievable by a different addition sequence. The search is polynomial in O(n2u)O(n^2 \cdot u) rather than the NP-hard full structure enumeration.

Complexity


Complexity

Score eval

O(n2π)O(n \cdot 2^{|\pi|})count table over instances per (node, parent set) pair

Per-node search

O(n2u)O(n^2 \cdot u)up to u parents, scanning n candidates each step

Total

O(n3u)O(n^3 \cdot u)outer loop over n nodes

CPT size

O(r2u)O(r \cdot 2^u)exponential in parents — motivates the u cap

Notes


Ordering matters. K2 requires a topological ordering of the true network as input — if the ordering is wrong, K2 cannot recover the true structure even with infinite data. In practice, orderings are derived from domain knowledge or approximated with exhaustive search over orderings (which re-introduces NP-hardness).

maxParents cap. The Cooper-Herskovits score prefers more parents when data is limited (overfitting), so a cap prevents degenerate structures. Typical values are 2–4 depending on dataset size.

K2 and the research arc. K2 was validated on the ALARM network — 37 nodes and 46 edges representing an anaesthesia monitoring system. This remains the canonical benchmark for BN structure learning algorithms.

Node indices in causal order. Earlier nodes can only be parents, not children.
Maximum parents per node. Caps CPT size and prevents overfitting on small datasets.

machinelearning.js.org · open source · MIT · Marin's Web Site