Sample Datasets

Classic benchmarks from the UCI Machine Learning Repository, used throughout the Witten & Frank book.

Datasets marked "in Explorer" are available directly from the Explorer's Sample dropdown, with no download needed. Others can be downloaded as ARFF and loaded via Open file or drag-and-drop.

ARFF files are served locally from this site or from Fordham University's archive. UCI links go to the original repository pages with full documentation.
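For readers new to the format, an ARFF file is a plain-text header of @attribute declarations followed by comma-separated rows under @data. The sketch below is a minimal reader, not the loader this site or Weka uses. It assumes only numeric and nominal attributes, no quoted names or values, and no sparse rows. The function name parse_arff and the two sample rows are made up for illustration.

```python
def parse_arff(text):
    """Parse a simple ARFF string into (relation, attributes, rows).

    Assumptions (illustrative only): no quoted names or values,
    no sparse-format rows, and '?' marks a missing value.
    """
    relation = None
    attributes = []  # (name, kind); kind is "numeric" or a list of nominal labels
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = []
            for (name, kind), value in zip(attributes, values):
                if value == "?":
                    row.append(None)
                elif kind == "numeric":
                    row.append(float(value))
                else:
                    row.append(value)
            rows.append(row)
            continue
        lower = line.lower()
        if lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                kind = [label.strip() for label in spec.strip("{}").split(",")]
            else:
                kind = "numeric"  # simplification: string and date types are not handled
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


# Made-up sample in the style of the iris file; the second row shows a missing value.
SAMPLE = """\
% comment lines start with a percent sign
@relation iris-sample
@attribute sepallength numeric
@attribute petallength numeric
@attribute class {Iris-setosa,Iris-versicolor,Iris-virginica}
@data
5.1,1.4,Iris-setosa
7.0,?,Iris-versicolor
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation, len(attributes), rows)
```

Real ARFF files can also contain quoted values, string and date attributes, and sparse rows, so a full-featured reader is the safer choice for anything beyond the small datasets listed here.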

Book Examples


These datasets appear throughout Data Mining: Practical Machine Learning Tools and Techniques as running examples.

Iris

150 instances
4 attrs
Classification
in Explorer

The most famous dataset in machine learning. Measurements of sepal and petal length and width for three iris species (setosa, versicolor, virginica), 50 instances each. Introduced by R.A. Fisher in 1936.

Weather (Nominal)

14 instances
4 attrs
Classification
in Explorer

The canonical toy dataset from the Weka book. Predicts whether conditions are suitable to play golf based on outlook, temperature, humidity, and wind. Used throughout the book to illustrate decision trees and Naïve Bayes.
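Because the dataset has only 14 rows, the decision-tree calculation can be reproduced by hand. The sketch below computes the information gain of each attribute, which is one common criterion for choosing the root of a tree. The rows are the standard nominal weather data; the helper names entropy and info_gain are illustrative.

```python
from collections import Counter
from math import log2

ATTRS = ["outlook", "temperature", "humidity", "windy"]

# The 14 instances of the nominal weather data; the last column is the class (play).
ROWS = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def info_gain(rows, index):
    """Reduction in class entropy from splitting on the attribute at `index`."""
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in set(r[index] for r in rows):
        subset = [r[-1] for r in rows if r[index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


gains = {name: info_gain(ROWS, i) for i, name in enumerate(ATTRS)}
for name, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} {gain:.3f}")
```

Outlook has the highest gain (about 0.247 bits), followed by humidity, windy, and temperature, which is why trees built on this data usually split on outlook first.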

Contact Lenses

24 instances
4 attrs
Classification
in Explorer

All possible combinations of four nominal attributes for recommending soft, hard, or no contact lenses. Small but complete — no missing values.
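The count of 24 follows from the attribute value counts: 3 × 2 × 2 × 2. A short sketch that enumerates the combinations, assuming the attribute names and values of the standard contact-lenses ARFF file:

```python
from itertools import product

# Attribute names and values as declared in the standard contact-lenses ARFF file.
ATTRIBUTES = {
    "age": ["young", "pre-presbyopic", "presbyopic"],
    "spectacle-prescrip": ["myope", "hypermetrope"],
    "astigmatism": ["no", "yes"],
    "tear-prod-rate": ["reduced", "normal"],
}

# Every combination of attribute values appears exactly once in the dataset.
combinations = list(product(*ATTRIBUTES.values()))
print(len(combinations))  # 3 * 2 * 2 * 2 = 24
print(combinations[0])
```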

Labor Relations

57 instances
16 attrs
Classification
in Explorer

Final settlements from Canadian labor negotiations (1987–1988). Mixed numeric and nominal attributes covering wages, hours, pension, and leave. Classifies contracts as acceptable or not.

UCI Classics


Widely used benchmarks from the UCI Machine Learning Repository.

Pima Indians Diabetes

768 instances
8 attrs
Classification
in Explorer

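Medical records collected by the National Institute of Diabetes and Digestive and Kidney Diseases for women of Pima Indian heritage aged 21 and over. Attributes include plasma glucose concentration, blood pressure, BMI, insulin level, and diabetes pedigree function. The class indicates whether the patient tested positive for diabetes.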

Congressional Voting Records

435 instances
16 attrs
Classification
in Explorer

1984 U.S. House of Representatives voting records on 16 key issues. Each instance is a member of Congress classified as Democrat or Republican.

Glass Identification

214 instances
9 attrs
Classification
in Explorer

Refractive index and chemical composition (oxide content) for 214 glass samples, from a study motivated by forensic investigation, where glass left at a crime scene can serve as evidence. Six glass types including float/non-float window glass, containers, tableware, and headlamps.

Ionosphere

351 instances
34 attrs
Classification
in Explorer

Radar returns from a phased array of 16 HF antennas targeting free electrons in the ionosphere. Returns classified as "good" (evidence of structure) or "bad" (pass-throughs). 34 continuous attributes.

Image Segmentation

210 instances
19 attrs
Classification
in Explorer

Each instance is a 3×3-pixel region drawn from seven outdoor images. 19 continuous attributes describe spectral and geometric properties. Seven classes: brickface, sky, foliage, cement, window, path, grass.

Adult (Census Income)

48,842 instances
14 attrs
Classification

Extracted from the 1994 U.S. Census Bureau database. Predicts whether annual income exceeds $50,000 based on age, education, occupation, marital status, and hours worked. Widely used for bias and fairness research.

German Credit (credit-g)

1,000 instances
20 attrs
Classification

Credit risk classification for applicants at a German bank. Mixed attributes cover credit history, loan purpose, employment, and savings. The dataset comes with a cost matrix: misclassifying a bad risk as good is five times as costly as misclassifying a good risk as bad.
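With unequal costs, plain error counts can mislead: two classifiers with the same number of mistakes may differ widely in cost. The sketch below scores predictions using the 5:1 ratio from the description above. The labels are made up for illustration and are not drawn from the dataset; the function name total_cost is hypothetical.

```python
# Cost of each (actual, predicted) pair. Correct predictions cost nothing.
COST = {
    ("good", "good"): 0,
    ("good", "bad"): 1,  # good applicant classified as bad
    ("bad", "good"): 5,  # bad applicant classified as good
    ("bad", "bad"): 0,
}


def total_cost(actual, predicted):
    """Sum the misclassification cost over paired actual/predicted labels."""
    return sum(COST[(a, p)] for a, p in zip(actual, predicted))


# Made-up labels for illustration only.
actual = ["good", "good", "bad", "bad", "good", "bad"]
predicted = ["good", "bad", "good", "bad", "good", "good"]

errors = sum(a != p for a, p in zip(actual, predicted))
print(errors, total_cost(actual, predicted))  # 3 errors, total cost 11
```

Here three errors cost 11 units because two of them are bad risks accepted as good.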

Hypothyroid

3,772 instances
30 attrs
Classification

Patient records for diagnosing thyroid disorders. Seven continuous attributes (age, TSH, T3, TT4, T4U, FTI, TBG) and 23 nominal attributes. Four classes: negative, compensated, primary, and secondary hypothyroid.

Bayesian Networks


Synthetic datasets generated from known Bayesian Network structures. Use these with the K2 algorithm to learn structure from data and compare the learned DAG against the true network in the BN Builder.

Eczema / Atopic Dermatitis

500 instances
7 attrs
Classification
in Explorer

Synthetic dataset forward-sampled from a 7-node Bayesian Network with a known causal structure: GeneticRisk + IrritantProducts → BrokenSkinBarrier; GeneticRisk + DustMiteExposure + HighSugarDiet → Th2Dysregulation; BrokenSkinBarrier + Th2Dysregulation → EczemaFlare. Designed for K2 structure learning: the true DAG is recoverable. About 25% of instances are in the positive (flare) class.
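Forward sampling means visiting the nodes in topological order and drawing each one conditioned on its already-sampled parents. The sketch below does this for the DAG described above. The structure is taken from that description, but every probability (BASE, BOOST, the 0.95 cap) is a hypothetical placeholder, not the parameters used to generate the actual dataset, so the flare rate it produces will not necessarily match the figure quoted here.

```python
import random

# Parents of each node, listed in a topological order of the DAG described above.
PARENTS = {
    "GeneticRisk": [],
    "IrritantProducts": [],
    "DustMiteExposure": [],
    "HighSugarDiet": [],
    "BrokenSkinBarrier": ["GeneticRisk", "IrritantProducts"],
    "Th2Dysregulation": ["GeneticRisk", "DustMiteExposure", "HighSugarDiet"],
    "EczemaFlare": ["BrokenSkinBarrier", "Th2Dysregulation"],
}

# HYPOTHETICAL parameters: placeholder numbers, not the site's actual CPTs.
BASE = {
    "GeneticRisk": 0.30,
    "IrritantProducts": 0.30,
    "DustMiteExposure": 0.30,
    "HighSugarDiet": 0.30,
    "BrokenSkinBarrier": 0.05,
    "Th2Dysregulation": 0.05,
    "EczemaFlare": 0.05,
}
BOOST = 0.25  # made-up increase in probability per active parent


def p_true(node, sample):
    """Probability that `node` is true given its parents' sampled values."""
    active = sum(sample[parent] for parent in PARENTS[node])
    return min(0.95, BASE[node] + BOOST * active)


def forward_sample(rng):
    """Draw one instance by sampling each node after all of its parents."""
    sample = {}
    for node in PARENTS:  # dict insertion order is the topological order
        sample[node] = rng.random() < p_true(node, sample)
    return sample


rng = random.Random(0)  # fixed seed so the run is repeatable
data = [forward_sample(rng) for _ in range(500)]
flare_rate = sum(row["EczemaFlare"] for row in data) / len(data)
print(len(data), round(flare_rate, 3))
```

Because the generating structure is known, a structure learner such as K2 can be run on data like this and its output compared edge by edge against PARENTS.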

machinelearning.js.org · open source · MIT · Marin's Web Site