MachineLearning.js

Datasets marked "in Explorer" are available directly from the Explorer's Sample dropdown, with no download needed. The others can be downloaded as ARFF and loaded via Open file or drag-and-drop.

ARFF files are served locally from this site or from Fordham University's archive. UCI links go to the original repository pages with full documentation.
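For readers who have not worked with ARFF before, the following is a minimal sketch of the format and a tiny reader in Python. The sample text is an illustrative fragment in the style of the nominal weather data, and `parse_arff` is a hypothetical helper written for this page, not part of MachineLearning.js. It assumes unquoted, comma-separated values and ignores sparse rows, quoted names, and date attributes.

```python
# Minimal ARFF reader: a sketch, not a full implementation of the format.
# Assumes unquoted, comma-separated values and no sparse rows.

SAMPLE = """\
% Illustrative fragment: the first three rows of the nominal weather data
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
"""


def parse_arff(text):
    """Return (relation, attributes, rows) from basic ARFF text."""
    relation = None
    attributes = []  # (name, spec): spec is a list of labels or a type string
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
            continue
        lower = line.lower()
        if lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                # Nominal attribute: turn "{a, b}" into ["a", "b"]
                spec = [label.strip() for label in spec.strip("{}").split(",")]
            attributes.append((name, spec))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(SAMPLE)
print(relation, len(attributes), len(rows))  # weather 5 3
```

By convention the last attribute (`play` here) is treated as the class, although ARFF itself does not mark which attribute is the class.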

Book Examples

These datasets appear throughout Data Mining: Practical Machine Learning Tools and Techniques as running examples.

Iris

150 instances

4 attrs

Classification

in Explorer

The most famous dataset in machine learning. Measurements of sepal and petal length and width for three iris species (setosa, versicolor, virginica), 50 instances each. Introduced by R.A. Fisher in 1936.

↓ ARFF UCI Repository →

Weather (Nominal)

14 instances

4 attrs

Classification

in Explorer

The canonical toy dataset from the Weka book. Predicts whether conditions are suitable to play golf based on outlook, temperature, humidity, and wind. Used throughout the book to illustrate decision trees and Naïve Bayes.

↓ ARFF
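Because the dataset is small enough to check by hand, it is a convenient way to see how a decision tree learner picks its first split. The sketch below computes information gain for each attribute, the criterion used by ID3-style learners. The 14 rows are the standard nominal weather data; the wind attribute is named `windy` as in the ARFF file, and the helper names `entropy` and `info_gain` are chosen here for illustration.

```python
from collections import Counter
from math import log2

ATTRS = ["outlook", "temperature", "humidity", "windy"]

# The 14 instances; the last column is the class (play).
ROWS = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def info_gain(rows, col):
    """Class entropy minus the weighted entropy after splitting on column col."""
    groups = {}
    for row in rows:
        groups.setdefault(row[col], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - remainder


gains = {name: info_gain(ROWS, i) for i, name in enumerate(ATTRS)}
best = max(gains, key=gains.get)
print(best, round(gains[best], 3))  # outlook 0.247
```

Outlook wins because its "overcast" branch is pure (all four instances are "yes"), which is why it ends up at the root of the tree usually shown for this dataset.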

Contact Lenses

24 instances

4 attrs

Classification

in Explorer

All possible combinations of four nominal attributes for recommending soft, hard, or no contact lenses. Small but complete — no missing values.

↓ ARFF UCI Repository →
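The instance count follows from the attribute domains: one attribute has three values and the other three have two each, so there are 3 × 2 × 2 × 2 = 24 combinations. A short sketch that enumerates them, assuming the attribute names and values used in the standard contact-lenses ARFF file:

```python
from itertools import product

# Attribute names and values as assumed from the standard contact-lenses file.
ATTRIBUTES = {
    "age": ["young", "pre-presbyopic", "presbyopic"],
    "spectacle-prescrip": ["myope", "hypermetrope"],
    "astigmatism": ["no", "yes"],
    "tear-prod-rate": ["reduced", "normal"],
}

# Every combination of one value per attribute.
combinations = list(product(*ATTRIBUTES.values()))
print(len(combinations))  # 24
```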

Labor Relations

57 instances

16 attrs

Classification

in Explorer

Final settlements from Canadian labor negotiations (1987–1988). Mixed numeric and nominal attributes covering wages, hours, pension, and leave. Classifies contracts as acceptable or not.

↓ ARFF UCI Repository →

UCI Classics

Widely used benchmarks from the UCI Machine Learning Repository.

Pima Indians Diabetes

768 instances

8 attrs

Classification

in Explorer

Medical records from the National Institute of Diabetes and Digestive and Kidney Diseases for Pima Indian women aged 21 and over. Attributes include glucose concentration, blood pressure, BMI, insulin level, and diabetes pedigree function.

↓ ARFF UCI Repository →

Congressional Voting Records

435 instances

16 attrs

Classification

in Explorer

1984 U.S. House of Representatives voting records on 16 key issues. Each instance is a member of Congress classified as Democrat or Republican.

↓ ARFF UCI Repository →

Glass Identification

214 instances

9 attrs

Classification

in Explorer

Refractive index and chemical composition measurements for 214 glass samples, collected to support forensic identification of glass found at crime scenes. Six glass types including float/non-float window glass, containers, tableware, and headlamps.

↓ ARFF UCI Repository →

Ionosphere

351 instances

34 attrs

Classification

in Explorer

Radar returns from a phased array of 16 HF antennas targeting free electrons in the ionosphere. Returns are classified as "good" (evidence of structure in the ionosphere) or "bad" (the signal passes through without such evidence). 34 continuous attributes.

↓ ARFF UCI Repository →

Image Segmentation

210 instances

19 attrs

Classification

in Explorer

Each instance is a 3×3-pixel region drawn from seven outdoor images. 19 continuous attributes describe spectral and geometric properties. Seven classes: brickface, sky, foliage, cement, window, path, grass.

↓ ARFF UCI Repository →

Adult (Census Income)

48,842 instances

14 attrs

Classification

Extracted from the 1994 U.S. Census Bureau database. Predicts whether annual income exceeds $50,000 based on age, education, occupation, marital status, and hours worked. Widely used for bias and fairness research.

↓ ARFF UCI Repository →

German Credit (credit-g)

1,000 instances

20 attrs

Classification

Credit risk classification for applicants at a German bank. Mixed attributes cover credit history, loan purpose, employment, and savings. Misclassifying a bad risk as good is five times more costly than misclassifying a good risk as bad.

↓ ARFF UCI Repository →
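The asymmetric cost means accuracy alone can be misleading on this dataset. The sketch below scores predictions with a cost matrix instead. The cost of 5 for accepting a bad risk comes from the description above; the unit cost of 1 for the opposite error is the value that ratio implies, and the two lists of outcomes are hypothetical, made up only to show two classifiers with equal accuracy but different cost.

```python
# Cost of each (actual, predicted) pair; correct predictions cost nothing.
COST = {
    ("good", "good"): 0,
    ("good", "bad"): 1,  # turning away a good applicant
    ("bad", "good"): 5,  # granting credit to a bad risk
    ("bad", "bad"): 0,
}


def total_cost(pairs):
    """Sum the misclassification cost over (actual, predicted) pairs."""
    return sum(COST[pair] for pair in pairs)


def accuracy(pairs):
    """Fraction of pairs where the prediction matches the actual class."""
    return sum(actual == predicted for actual, predicted in pairs) / len(pairs)


# HYPOTHETICAL outcomes for two classifiers, ten applicants each.
cautious = [("good", "good")] * 6 + [("good", "bad")] * 2 + [("bad", "bad")] * 2
lenient = [("good", "good")] * 7 + [("bad", "bad")] * 1 + [("bad", "good")] * 2

print(accuracy(cautious), total_cost(cautious))  # 0.8 2
print(accuracy(lenient), total_cost(lenient))    # 0.8 10
```

Both classifiers are right 8 times out of 10, but the lenient one is five times as expensive because its errors fall on the costly side.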

Hypothyroid

3,772 instances

30 attrs

Classification

Patient records for diagnosing thyroid disorders. Seven continuous attributes (age, TSH, T3, TT4, T4U, FTI, TBG) and 23 nominal attributes. Four classes: negative, compensated, primary, and secondary hypothyroid.

↓ ARFF UCI Repository →

Bayesian Networks

Synthetic datasets generated from known Bayesian Network structures. Use these with the K2 algorithm to learn structure from data and compare the learned DAG against the true network in the BN Builder.

Eczema / Atopic Dermatitis

500 instances

7 attrs

Classification

in Explorer

Synthetic dataset forward-sampled from a 7-node Bayesian Network with a known causal structure: GeneticRisk + IrritantProducts → BrokenSkinBarrier; GeneticRisk + DustMiteExposure + HighSugarDiet → Th2Dysregulation; BrokenSkinBarrier + Th2Dysregulation → EczemaFlare. Designed for K2 structure learning: the true DAG is recoverable. About 25% of instances are in the positive (flare) class.

↓ ARFF
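Forward sampling means visiting the nodes in topological order and drawing each one from its conditional distribution given the parent values already drawn. The sketch below does this for the structure described above. The edges come from this page, but the probabilities do not: the conditional probability tables used to generate the real dataset are not given here, so `BASE_RATE`, `LEAK`, and `PER_PARENT` are placeholder assumptions and the output will not reproduce the dataset's class rate.

```python
import random

# Parents of each node, listed in a topological order (parents before children).
PARENTS = {
    "GeneticRisk": [],
    "IrritantProducts": [],
    "DustMiteExposure": [],
    "HighSugarDiet": [],
    "BrokenSkinBarrier": ["GeneticRisk", "IrritantProducts"],
    "Th2Dysregulation": ["GeneticRisk", "DustMiteExposure", "HighSugarDiet"],
    "EczemaFlare": ["BrokenSkinBarrier", "Th2Dysregulation"],
}

# HYPOTHETICAL parameters, not the ones behind the real dataset.
BASE_RATE = 0.3   # P(true) for nodes without parents
LEAK = 0.05       # P(true) for a child when no parent is true
PER_PARENT = 0.3  # added to P(true) for each parent that is true


def sample_instance(rng):
    """Draw one instance by sampling each node after its parents."""
    values = {}
    for node, parents in PARENTS.items():  # dicts preserve insertion order
        if parents:
            p = LEAK + PER_PARENT * sum(values[q] for q in parents)
        else:
            p = BASE_RATE
        values[node] = rng.random() < p
    return values


def sample_dataset(n, seed=0):
    """Draw n independent instances with a seeded generator."""
    rng = random.Random(seed)
    return [sample_instance(rng) for _ in range(n)]


data = sample_dataset(500)
print(len(data), len(data[0]))  # 500 7
```

A structure learner such as K2 sees only the sampled rows, so comparing its output with `PARENTS` shows how much of the true graph it recovered.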


machinelearning.js.org · open source · MIT · Marin's Web Site