Greedy Bayesian network structure learning from discrete data — Cooper & Herskovits (1992). Given a node ordering, K2 greedily finds the parent set for each node that maximises the Cooper-Herskovits score.
A Bayesian Method for the Induction of Probabilistic Networks from Data
G.F. Cooper & E. Herskovits · Machine Learning 9(4):309–347 · 1992
View paper →The paper that introduced the K2 algorithm and the Cooper-Herskovits scoring function
Algorithm
A Bayesian network is a DAG where each node represents a discrete random variable and each directed edge encodes a direct probabilistic dependency. Structure learning asks: given a dataset, which DAG best explains it?
The naive approach — evaluate all possible DAGs — is NP-hard. K2 makes it tractable with two constraints: (1) a fixed node ordering that restricts which nodes can be parents (earlier nodes only), and (2) a greedy hill-climbing search that adds one parent at a time, stopping when the score no longer improves.
K2
Input: dataset D, node ordering σ = (X₁, …, X_n), max parents u
Output: parent sets π_i for each node X_i
for i = 1 to n do
π_i ← ∅; P_old ← f(X_i, ∅)
while |π_i| < u and ∃ improvement do
z* ← argmax_z f(X_i, π_i ∪ z) where z ∈ {σ[1..i-1]} \ π_i
if f(X_i, π_i ∪ z*) > P_old
P_old ← f(X_i, π_i ∪ z*); π_i ← π_i ∪ z*
else break
return {π_i}
Cooper-Herskovits Score
The score f(Xi,πi) is the marginal likelihood of the data given the structure, integrated over all possible CPT parameters with a uniform Dirichlet prior. For node Xi with ri values, parent set πi, and qi=∏j∣πj∣ parent configurations:
Adding a parent either increases or decreases this score. K2 only adds a candidate parent when the score strictly improves — that is the hill-climbing stopping condition.
Theory → Code
1
Build a frequency table over the joint (node, parent config)
// For each training instance, compute a string key for the parent config
// and increment the count for the observed node value.
for (const row of instances) {
const key = parentIdxs.map(pi => row[pi]).join('\0');
table[key] ??= {};
table[key][row[xi]] = (table[key][row[xi]] ?? 0) + 1;
}
2
Evaluate the Cooper-Herskovits log score
for (const config of allConfigs(parentVals)) {
const counts = table[config.join('\0')] ?? {};
const Nij = xiVals.reduce((s, v) => s + (counts[v] ?? 0), 0);
// lgamma(r) − lgamma(Nij + r): normalisation term
logScore += lgamma(r) - lgamma(Nij + r);
// Σ_k lgamma(α_ijk + 1): observed-count terms
for (const v of xiVals) logScore += lgamma((counts[v] ?? 0) + 1);
}
3
Greedy parent search — add the best-scoring candidate
let parents = [], P_old = chScore(xi, [], ds);
while (parents.length < maxParents) {
let best = -Infinity, bestZ = -1;
// Only look at nodes earlier in the ordering (the K2 restriction)
for (const z of ordering.slice(0, pos)) {
if (parents.includes(z)) continue;
const s = chScore(xi, [...parents, z], ds);
if (s > best) { best = s; bestZ = z; }
}
if (bestZ < 0 || best <= P_old) break; // no improvement — stop
P_old = best;
parents.push(bestZ);
}
Theory
Theorem 1.(Cooper & Herskovits, §3) Under a uniform Dirichlet parameter prior and a Dirichlet-Markov condition (parameter independence across nodes and parent configurations), the marginal likelihood of data D given structure BS factors as a product of per-node terms:
P(D∣BS)=i=1∏nf(Xi,πi)
making the K2 objective decomposable — scoring each node independently is globally valid.
Theorem 2.The greedy hill-climbing in K2 is not globally optimal. The ordering constraint removes all cycles but does not guarantee that the highest-scoring DAG consistent with the ordering is found — adding the locally best parent at each step may block a higher-scoring structure achievable by a different addition sequence. The search is polynomial in O(n2⋅u) rather than the NP-hard full structure enumeration.
Complexity
Complexity
Score eval
O(n⋅2∣π∣)— count table over instances per (node, parent set) pair
Per-node search
O(n2⋅u)— up to u parents, scanning n candidates each step
Total
O(n3⋅u)— outer loop over n nodes
CPT size
O(r⋅2u)— exponential in parents — motivates the u cap
Notes
Ordering matters. K2 requires a topological ordering of the true network as input — if the ordering is wrong, K2 cannot recover the true structure even with infinite data. In practice, orderings are derived from domain knowledge or approximated with exhaustive search over orderings (which re-introduces NP-hardness).
maxParents cap. The Cooper-Herskovits score prefers more parents when data is limited (overfitting), so a cap prevents degenerate structures. Typical values are 2–4 depending on dataset size.
Try K2 on the eczema dataset — it has a known 7-node structure (GeneticRisk + IrritantProducts → BrokenSkinBarrier; GeneticRisk + DustMiteExposure + HighSugarDiet → Th2Dysregulation; both → EczemaFlare). Load it in the Explorer and compare the learned structure to the true DAG in the BN Builder.
K2 and the research arc. K2 was validated on the ALARM network — 37 nodes and 46 edges representing an anaesthesia monitoring system. This remains the canonical benchmark for BN structure learning algorithms.
Node indices in causal order. Earlier nodes can only be parents, not children.
Maximum parents per node. Caps CPT size and prevents overfitting on small datasets.