Welcome to the UC Irvine Machine Learning Repository

We currently maintain 682 datasets as a service to the machine learning community. Here, you can donate and find datasets used by millions of people all around the world!

Popular Datasets

Iris

A small classic dataset from Fisher, 1936. One of the earliest known datasets used for evaluating classification methods.

Heart Disease

Four databases: Cleveland, Hungary, Switzerland, and VA Long Beach.

Wine Quality

Two datasets are included, related to red and white vinho verde wine samples, from the north of Portugal. The goal is to model wine quality based on physicochemical tests (see [Cortez et al., 2009], http://www3.dsi.uminho.pt/pcortez/wine/).

Breast Cancer Wisconsin (Diagnostic)

Diagnostic Wisconsin Breast Cancer Database.

Adult

Predict whether an individual's annual income exceeds $50K/yr based on census data. Also known as the "Census Income" dataset.

Bank Marketing

The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y).

New Datasets

High-Resolution Load Dataset from Smart Meters Across Various Cities in Morocco

The dataset includes detailed measurements of electricity consumption from several areas of Laâyoune, Boujdour, Marrakech and Foum Eloued. It presents both domestic and industrial consumption profiles, recorded every 10 minutes for Laâyoune, Boujdour and Foum Eloued and every 30 minutes for Marrakech. The data are organized in several tables showing usage trends by city and area, which allows analysis of daily variations in demand, seasonal fluctuations and peak loads. All data are in amperes (A), except for Marrakech, which is in kilowatts (kW).

Gallstone

The clinical dataset was collected from the Internal Medicine Outpatient Clinic of Ankara VM Medical Park Hospital and includes data from 319 individuals (June 2022–June 2023), 161 of whom were diagnosed with gallstone disease. It contains 38 features, including demographic, bioimpedance, and laboratory data, and was ethically approved by the Ankara City Hospital Ethics Committee (E2-23-4632). Demographic variables are age, sex, height, weight, and BMI. Bioimpedance data includes total, extracellular, and intracellular water, muscle and fat mass, protein, visceral fat area, and hepatic fat. Laboratory features are glucose, total cholesterol, HDL, LDL, triglycerides, AST, ALT, ALP, creatinine, GFR, CRP, hemoglobin, and vitamin D. The dataset is complete, with no missing values, and balanced in terms of disease status, eliminating the need for additional preprocessing. It provides a strong foundation for machine learning-based gallstone prediction using non-imaging features.
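As a quick check of the balance claim, the class shares can be computed from the two counts quoted in the description (319 individuals, 161 diagnosed). This is a minimal sketch: the function name `class_shares` is hypothetical and is not part of the dataset or the repository.

```python
# Counts taken from the dataset description above.
total_individuals = 319
gallstone_cases = 161


def class_shares(positives: int, total: int) -> tuple[float, float]:
    """Return (positive share, negative share) for a binary label.

    Hypothetical helper, used only for illustration.
    """
    positive_share = positives / total
    return positive_share, 1.0 - positive_share


positive_share, negative_share = class_shares(gallstone_cases, total_individuals)
print(f"gallstone: {positive_share:.1%}, no gallstone: {negative_share:.1%}")
# prints "gallstone: 50.5%, no gallstone: 49.5%"
```

The two classes differ by only three individuals (161 versus 158), which is consistent with the description of the dataset as balanced.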

BEED: Bangalore EEG Epilepsy Dataset

The Bangalore EEG Epilepsy Dataset (BEED) is a comprehensive EEG collection for epileptic seizure detection and classification. Recorded at a neurological research centre in Bangalore, India, it features high-fidelity EEG signals captured using the standard 10-20 electrode system at a 256 Hz sampling rate. BEED contains 16,000 segments of 20-second EEG recordings evenly distributed across four categories: Healthy Subjects (0), Generalized Seizures (1), Focal Seizures (2), and Seizure Events (3), where seizure activity occurs with events like eye blinking, nail biting, or staring. Each category includes data from 20 adult subjects (ages 21-55) with equal gender representation. The dataset comprises 16 EEG channels (X1-X16) corresponding to different brain regions, with a binary label (y) indicating seizure presence (1) or absence (0). BEED supports machine learning in seizure detection, epilepsy analysis, and EEG research with its balanced, high-resolution data.

RecGym: Gym Workouts Recognition Dataset with IMU and Capacitive Sensor

The RecGym dataset is a collection of gym workouts recorded with IMU and capacitive sensors, designed for research and development in recommendation systems and fitness applications. The data set records ten volunteers' gym sessions with a sensing unit composed of an IMU sensor (columns A_x, A_y, A_z, G_x, G_y, G_z) and a body capacitance sensor (column C_1). The sensing units were worn at three positions: on the wrist, in the pocket, and on the calf, with a sampling rate of 20 Hz. The data set contains the motion signals of twelve activities: eleven workouts (Adductor, ArmCurl, BenchPress, LegCurl, LegPress, Riding, RopeSkipping, Running, Squat, StairsClimber, Walking) and a "Null" activity for when the volunteer is idle between workouts. Each participant performed the above-listed workouts in five sessions over five days (each session lasts around one hour). Altogether, fifty sessions of normalized gym workout data are presented in this data set.
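Signals sampled at a fixed rate like this are often cut into overlapping windows before classification. The sketch below shows one way to do that with the 20 Hz rate and the seven column names given in the description. The window length (2 s), the hop (1 s), the helper name `sliding_windows` and the all-zero placeholder rows are assumptions made for illustration and are not taken from the dataset.

```python
SAMPLING_RATE_HZ = 20  # stated in the description
CHANNELS = ["A_x", "A_y", "A_z", "G_x", "G_y", "G_z", "C_1"]  # column names from the description


def sliding_windows(samples, window_seconds=2.0, hop_seconds=1.0,
                    rate_hz=SAMPLING_RATE_HZ):
    """Split a list of sample rows into fixed-length, overlapping windows.

    Hypothetical helper: window and hop lengths are illustrative defaults.
    Trailing rows that do not fill a whole window are dropped.
    """
    window = int(window_seconds * rate_hz)
    hop = int(hop_seconds * rate_hz)
    last_start = len(samples) - window
    return [samples[start:start + window]
            for start in range(0, last_start + 1, hop)]


# Placeholder data: 10 seconds of all-zero rows, one value per channel.
rows = [[0.0] * len(CHANNELS) for _ in range(10 * SAMPLING_RATE_HZ)]
windows = sliding_windows(rows)
print(len(windows), len(windows[0]), len(windows[0][0]))
# prints "9 40 7"
```

With these assumed settings, each window holds 40 rows of 7 channel values, and consecutive windows overlap by half.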

Inflation Research Abstracts Classification

This data set contains abstracts of scientific papers on inflation from the economics literature. The task is to classify them according to whether they include machine learning methodologies.

Drug Induced Autoimmunity Prediction

This dataset comprises molecular descriptors generated using RDKit, specifically curated for the study of drug-induced autoimmunity through ensemble machine learning approaches. It is divided into a training set and a testing set, containing numerical features that represent molecular properties and structural characteristics of drugs. The dataset supports predictive modeling tasks aimed at identifying potential autoimmune risks associated with drug candidates. These molecular descriptors include physicochemical properties, providing a comprehensive foundation for machine learning analysis. The dataset facilitates the development of interpretable models for drug toxicity prediction, contributing to advancements in computational toxicology and drug safety assessment.
