Naïve Bayes

Applies Bayes' theorem under conditional independence: Gaussian for numeric attributes, Laplace-smoothed counts for nominal.

Estimating Continuous Distributions in Bayesian Classifiers

G.H. John & P. Langley · UAI-95 — 11th Conference on Uncertainty in Artificial Intelligence · 1995

View paper →Free PDF on arXiv

Algorithm


Naïve Bayes applies Bayes' theorem with the assumption that all attributes are conditionally independent given the class. Despite this rarely being true in practice, it often performs surprisingly well — especially on text and high-dimensional data.

Bayes Rule (conditional independence)

P(cx)P(c)iP(xic)P(c \mid \mathbf{x}) \propto P(c)\prod_{i} P(x_i \mid c)

Numeric — Gaussian likelihood (John & Langley §3)

P(xic)=12πσ2exp ⁣((xμ)22σ2)P(x_i \mid c) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

Nominal — Laplace-smoothed counts

P(xi=vc)=#(v,c)+1#(c)+ViP(x_i = v \mid c) = \frac{\#(v,\,c) + 1}{\#(c) + |V_i|}

John & Langley's key contribution is extending the classic discrete Naïve Bayes model to handle continuous attributes with Gaussian density estimation, making it practical for real-world datasets with mixed attribute types.

Theory → Code


Four steps map John & Langley's formulation to the implementation:

1

Compute class priors P(class) from training frequencies

for (const cv of classValues) { const subset = instances.filter(i => i[classIndex] === cv); priors[cv] = subset.length / instances.length; // MLE prior }

2

Gaussian likelihood for numeric attributes (John & Langley §3)

// Estimate μ and σ² per (class, attribute) pair const mean = vals.reduce((s, v) => s + v, 0) / (vals.length || 1); const variance = vals.reduce((s, v) => s + (v - mean) ** 2, 0) / (vals.length || 1) || 1e-9; // floor prevents div-by-zero likelihoods[cv][i] = { mean, variance }; // At classify time — log-form avoids floating-point underflow: function gaussianLog(x, mean, variance) { return -0.5 * Math.log(2 * Math.PI * variance) - ((x - mean) ** 2) / (2 * variance); }

3

Laplace-smoothed counts for nominal attributes

// Seed every nominal value with count=1 before observing any data. // Prevents P(unseen value | class) = 0 from wiping out the posterior. const counts = {}; for (const v of attrVals) counts[v] = 1; // Laplace prior for (const v of vals) counts[v] = (counts[v] || 1) + 1; likelihoods[cv][i] = { counts, total: vals.length + attrVals.length }; // At classify time: logP += Math.log((lk.counts[instance[i]] || 1) / lk.total);

4

Classify in log-probability space — prevents floating-point underflow

let logP = Math.log(priors[cv]); // log P(class) for (let i = 0; i < attributes.length; i++) { if (attributes[i].type === 'numeric') logP += gaussianLog(instance[i], lk.mean, lk.variance); else logP += Math.log((lk.counts[instance[i]] || 1) / lk.total); } if (logP > bestScore) { bestScore = logP; best = cv; } // argmax

Theory


Theorem 1.(Optimality under independence) When the class-conditional distributions are truly independent, Naïve Bayes achieves Bayes-optimal classification. Formally, if P(xc)=iP(xic)P(\mathbf{x} \mid c) = \prod_i P(x_i \mid c) exactly, then the MAP classifier argmaxcP(cx)\arg\max_c P(c \mid \mathbf{x}) is the Bayes classifier.
Theorem 2.(Domingos & Pazzani, 1997) Even when the independence assumption is violated, Naïve Bayes achieves the Bayes-optimal decision boundary under milder conditions than those required for correct probability estimates — it needs only the correct ranking of class posteriors, not their calibrated values.

Complexity


Complexity

Training

O(nd)O(n \cdot d)single pass: compute priors, means, variances, counts

Query

O(cd)O(c \cdot d)score each of c classes over d attributes

Space

O(cd)O(c \cdot d)one Gaussian or count table per (class, attribute) pair

Notes


Laplace smoothing prevents zero-probability issues for nominal values not seen during training for a given class. Without it, a single unseen value drives the entire posterior to zero regardless of all other evidence.

The variance floor (1e-9) prevents division by zero in the Gaussian formula when all training instances in a class share exactly the same value for a numeric attribute.

machinelearning.js.org · open source · MIT · Marin's Web Site