MachineLearning.js

Greedy Bayesian network structure learning from discrete data — Cooper & Herskovits (1992). Given a node ordering, K2 greedily finds the parent set for each node that maximises the Cooper-Herskovits score.

A Bayesian Method for the Induction of Probabilistic Networks from Data

G.F. Cooper & E. Herskovits · Machine Learning 9(4):309–347 · 1992

View paper →The paper that introduced the K2 algorithm and the Cooper-Herskovits scoring function

Algorithm

A Bayesian network is a DAG where each node represents a discrete random variable and each directed edge encodes a direct probabilistic dependency. Structure learning asks: given a dataset, which DAG best explains it?

The naive approach — evaluate all possible DAGs — is NP-hard. K2 makes it tractable with two constraints: (1) a fixed node ordering that restricts which nodes can be parents (earlier nodes only), and (2) a greedy hill-climbing search that adds one parent at a time, stopping when the score no longer improves.

Input: dataset D, node ordering σ = (X₁, …, X_n), max parents u

Output: parent sets π_i for each node X_i

for i = 1 to n do

π_i ← ∅; P_old ← f(X_i, ∅)

while |π_i| < u and ∃ improvement do

z* ← argmax_z f(X_i, π_i ∪ z) where z ∈ {σ[1..i-1]} \ π_i

if f(X_i, π_i ∪ z*) > P_old

P_old ← f(X_i, π_i ∪ z*); π_i ← π_i ∪ z*

else break

return {π_i}

Cooper-Herskovits Score

The score $f(X_i, \pi_i)$ is the marginal likelihood of the data given the structure, integrated over all possible CPT parameters with a uniform Dirichlet prior. For node $X_i$ with $r_i$ values, parent set $\pi_i$ , and $q_i = \prod_j |\pi_j|$ parent configurations:

Cooper-Herskovits scoring function

f(X_i, \pi_i) = \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} \alpha_{ijk}!

where $N_{ij} = \sum_k \alpha_{ijk}$ is the count of instances with the $j$ -th parent configuration, and $\alpha_{ijk}$ is the count where additionally $X_i = v_k$ .

In log-space (avoiding factorial overflow):

Log score (used in practice)

\log f(X_i,\pi_i) = \sum_{j=1}^{q_i}\!\Bigl[\log\Gamma(r_i) - \log\Gamma(N_{ij}+r_i) + \sum_{k=1}^{r_i}\log\Gamma(\alpha_{ijk}+1)\Bigr]

Adding a parent either increases or decreases this score. K2 only adds a candidate parent when the score strictly improves — that is the hill-climbing stopping condition.

Theory → Code

Build a frequency table over the joint (node, parent config)

// For each training instance, compute a string key for the parent config // and increment the count for the observed node value. for (const row of instances) { const key = parentIdxs.map(pi => row[pi]).join('\0'); table[key] ??= {}; table[key][row[xi]] = (table[key][row[xi]] ?? 0) + 1; }

Evaluate the Cooper-Herskovits log score

for (const config of allConfigs(parentVals)) { const counts = table[config.join('\0')] ?? {}; const Nij = xiVals.reduce((s, v) => s + (counts[v] ?? 0), 0); // lgamma(r) − lgamma(Nij + r): normalisation term logScore += lgamma(r) - lgamma(Nij + r); // Σ_k lgamma(α_ijk + 1): observed-count terms for (const v of xiVals) logScore += lgamma((counts[v] ?? 0) + 1); }

Greedy parent search — add the best-scoring candidate

Theory

Theorem 1.(Cooper & Herskovits, §3) Under a uniform Dirichlet parameter prior and a Dirichlet-Markov condition (parameter independence across nodes and parent configurations), the marginal likelihood of data D given structure

B_S

factors as a product of per-node terms:

P(D \mid B_S) = \prod_{i=1}^{n} f(X_i, \pi_i)

making the K2 objective decomposable — scoring each node independently is globally valid.

Theorem 2.The greedy hill-climbing in K2 is not globally optimal. The ordering constraint removes all cycles but does not guarantee that the highest-scoring DAG consistent with the ordering is found — adding the locally best parent at each step may block a higher-scoring structure achievable by a different addition sequence. The search is polynomial in

O(n^2 \cdot u)

rather than the NP-hard full structure enumeration.

Complexity

Score eval

O(n \cdot 2^{|\pi|})

— count table over instances per (node, parent set) pair

Per-node search

O(n^2 \cdot u)

— up to u parents, scanning n candidates each step

Total

O(n^3 \cdot u)

— outer loop over n nodes

CPT size

O(r \cdot 2^u)

— exponential in parents — motivates the u cap

Notes

Ordering matters. K2 requires a topological ordering of the true network as input — if the ordering is wrong, K2 cannot recover the true structure even with infinite data. In practice, orderings are derived from domain knowledge or approximated with exhaustive search over orderings (which re-introduces NP-hardness).

maxParents cap. The Cooper-Herskovits score prefers more parents when data is limited (overfitting), so a cap prevents degenerate structures. Typical values are 2–4 depending on dataset size.

Try K2 on the eczema dataset — it has a known 7-node structure (GeneticRisk + IrritantProducts → BrokenSkinBarrier; GeneticRisk + DustMiteExposure + HighSugarDiet → Th2Dysregulation; both → EczemaFlare). Load it in the Explorer and compare the learned structure to the true DAG in the BN Builder.

K2 and the research arc. K2 was validated on the ALARM network — 37 nodes and 46 edges representing an anaesthesia monitoring system. This remains the canonical benchmark for BN structure learning algorithms.

Node indices in causal order. Earlier nodes can only be parents, not children.

Maximum parents per node. Caps CPT size and prevents overfitting on small datasets.

On this page

Original Paper Algorithm Cooper-Herskovits Score Theory → Code Theory Complexity Notes

k-Means Clustering ARFF Format

machinelearning.js.org · open source · MIT · Marin's Web Site