ARFF Format

The Attribute-Relation File Format — Weka's native dataset format, supported directly by the Explorer.

Introduction


ARFF was developed by the Machine Learning Project at the University of Waikato for use with the Weka machine learning software. It is a human-readable ASCII text format that describes instances sharing a fixed set of attributes.

An ARFF file has two sections: the Header (relation name and attribute declarations) and the Data section (one instance per line, comma-separated).
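The two-section layout can be illustrated with a minimal Python sketch. It is not the tool's own loader: the function name `split_arff` and the sample text are assumptions made for illustration. It skips blank lines and % comments and treats everything after the @data marker as instances.

```python
def split_arff(text):
    """Split ARFF text into (header_lines, data_lines).

    Blank lines and % comment lines are skipped. The @data marker is
    matched case-insensitively and is not included in either list.
    """
    header, data = [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue
        if not in_data and line.lower() == "@data":
            in_data = True
        elif in_data:
            data.append(line)
        else:
            header.append(line)
    return header, data


# Hypothetical two-attribute sample, shortened from the Iris layout.
sample = """\
% tiny example
@relation iris
@attribute sepallength numeric
@attribute class {Iris-setosa,Iris-versicolor}
@data
5.1,Iris-setosa
7.0,Iris-versicolor
"""

header, data = split_arff(sample)
print(len(header), "header lines,", len(data), "instances")
# prints: 3 header lines, 2 instances
```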

Structure


    % Comments start with %
    % This is the classic Iris dataset
    @relation iris
    @attribute sepallength numeric
    @attribute sepalwidth numeric
    @attribute petallength numeric
    @attribute petalwidth numeric
    @attribute class {Iris-setosa,Iris-versicolor,Iris-virginica}
    @data
    5.1,3.5,1.4,0.2,Iris-setosa
    4.9,3.0,1.4,0.2,Iris-setosa
    7.0,3.2,4.7,1.4,Iris-versicolor
    ...

The @relation line names the dataset. Each @attribute line declares one attribute by name and type. The class attribute is conventionally last.
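As a rough sketch of how the header might be read, the Python below pulls the relation name and the (name, type) pairs out of the header lines. The names `parse_header` and `ATTRIBUTE_RE` are hypothetical, and the regular expression handles only simple quoted or unquoted attribute names; it is not a complete ARFF parser.

```python
import re

# Matches "@attribute <name> <type>"; the name may be single-quoted,
# double-quoted, or a bare token without whitespace.
ATTRIBUTE_RE = re.compile(
    r"""@attribute\s+(?:'([^']*)'|"([^"]*)"|(\S+))\s+(.+)""",
    re.IGNORECASE,
)


def parse_header(lines):
    """Return (relation_name, [(attribute_name, type_declaration), ...])."""
    relation = None
    attributes = []
    for line in lines:
        line = line.strip()
        if line.lower().startswith("@relation"):
            parts = line.split(None, 1)
            if len(parts) > 1:
                relation = parts[1].strip().strip("'\"")
            continue
        match = ATTRIBUTE_RE.match(line)
        if match:
            # Exactly one of the three name groups is set.
            name = next(g for g in match.groups()[:3] if g is not None)
            attributes.append((name, match.group(4).strip()))
    return relation, attributes


iris_header = [
    "@relation iris",
    "@attribute sepallength numeric",
    "@attribute sepalwidth numeric",
    "@attribute petallength numeric",
    "@attribute petalwidth numeric",
    "@attribute class {Iris-setosa,Iris-versicolor,Iris-virginica}",
]

relation, attributes = parse_header(iris_header)
print(relation)            # prints: iris
print(attributes[-1][0])   # prints: class
```

Keeping the declarations in file order matters because each comma-separated value in the data section is matched to an attribute by position.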

Attribute Types


numeric

Continuous real-valued attributes. May also be declared as 'real' or 'integer'.

{v1,v2,...}

Nominal (categorical) attributes. The declaration lists every permitted value in braces. The class attribute is typically declared this way.

string

Free-form string values. Treated as nominal in this tool.

date

Date/time values with an optional Java SimpleDateFormat pattern. Parsed as numeric timestamps.

For the full specification see the official Weka ARFF documentation.
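The four type declarations above could be recognised along the following lines. This is a minimal sketch, not the tool's implementation: `classify_type` and `unquote` are hypothetical helper names, and nominal values are split naively on commas, so quoted values that themselves contain commas are not handled.

```python
def unquote(text):
    """Remove one pair of matching surrounding quotes, if present."""
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "'\"":
        return text[1:-1]
    return text


def classify_type(decl):
    """Return (kind, detail) for an ARFF attribute type declaration.

    detail is the list of values for nominal attributes, the format
    pattern (or None) for date attributes, and None otherwise.
    """
    decl = decl.strip()
    lowered = decl.lower()
    if lowered in ("numeric", "real", "integer"):
        return "numeric", None
    if decl.startswith("{") and decl.endswith("}"):
        # Naive split: assumes no value contains a comma.
        values = [unquote(v.strip()) for v in decl[1:-1].split(",")]
        return "nominal", values
    if lowered == "string":
        return "string", None
    if lowered == "date" or lowered.startswith("date "):
        pattern = unquote(decl[4:].strip()) or None
        return "date", pattern
    raise ValueError(f"unrecognised attribute type: {decl!r}")


print(classify_type("real"))
# prints: ('numeric', None)
print(classify_type("{Iris-setosa,Iris-versicolor,Iris-virginica}"))
# prints: ('nominal', ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])
print(classify_type('date "yyyy-MM-dd"'))
# prints: ('date', 'yyyy-MM-dd')
```

The date pattern is returned unchanged. Turning a Java SimpleDateFormat pattern into timestamps is left out of the sketch because the pattern syntax does not map one-to-one onto Python's `strptime` directives.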

Missing Values


Missing values are represented by a question mark ? in the data section. The Preprocess tab reports missing-value counts per attribute. During classification and clustering, missing values receive a maximum-distance penalty.

    % Instance with a missing value in the second attribute:
    5.1,?,1.4,0.2,Iris-setosa
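A per-attribute missing-value count, of the kind the Preprocess tab reports, might be computed as in the sketch below. The helper names `parse_row` and `missing_counts` are assumptions. Only the first data row comes from the example above; the other two are made up for illustration. The maximum-distance penalty is not modelled here.

```python
def parse_row(line):
    """Split one data line on commas, mapping '?' to None."""
    return [None if v.strip() == "?" else v.strip() for v in line.split(",")]


def missing_counts(rows, n_attributes):
    """Count missing (None) values in each attribute column."""
    counts = [0] * n_attributes
    for row in rows:
        for i, value in enumerate(row[:n_attributes]):
            if value is None:
                counts[i] += 1
    return counts


lines = [
    "5.1,?,1.4,0.2,Iris-setosa",   # from the example above
    "4.9,3.0,?,0.2,Iris-setosa",   # illustrative row, not real data
    "?,?,1.3,0.2,Iris-setosa",     # illustrative row, not real data
]
rows = [parse_row(line) for line in lines]

print(rows[0])
# prints: ['5.1', None, '1.4', '0.2', 'Iris-setosa']
print(missing_counts(rows, 5))
# prints: [1, 2, 1, 0, 0]
```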
