Welcome to the UC Irvine Machine Learning Repository

We currently maintain 674 datasets as a service to the machine learning community. Here, you can donate and find datasets used by millions of people all around the world!

Popular Datasets

Iris

A small classic dataset from Fisher, 1936. One of the earliest known datasets used for evaluating classification methods.

Heart Disease

Four databases: Cleveland, Hungary, Switzerland, and the VA Long Beach.

Wine Quality

Two datasets are included, related to red and white vinho verde wine samples, from the north of Portugal. The goal is to model wine quality based on physicochemical tests (see [Cortez et al., 2009], http://www3.dsi.uminho.pt/pcortez/wine/).

Adult

Predict whether an individual's annual income exceeds $50K/yr based on census data. Also known as the "Census Income" dataset.

Breast Cancer Wisconsin (Diagnostic)

Diagnostic Wisconsin Breast Cancer Database.

Bank Marketing

The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y).

New Datasets

Gas sensor array low-concentration

This dataset contains 6 gas responses collected by a sensor array consisting of 10 metal oxide semiconductor sensors, with gas concentrations at the ppb level (below the minimum detection limit of the sensors).

Twitter Geospatial Data

Seven days of geo-tagged Tweet data from the United States with exact GPS location and timestamp.

CAN-MIRGU

A Comprehensive CAN Bus Attack Dataset from Moving Vehicles for Intrusion Detection System Evaluation. This dataset includes CAN bus attacks collected from a modern automobile equipped with autonomous driving capabilities, operating in real-world driving scenarios. The attacks are physically verified, to support the comparison and validation of in-vehicle network Intrusion Detection Systems.

Assessing Mathematics Learning in Higher Education

MathE is a mathematical platform developed under the MathE project (mathe.pixel-online.org). The dataset has 9546 answers to questions on the mathematical topics taught in higher education. The file has eight features: Student ID, Student Country, Question ID, Type of answer (correct or incorrect), Question level (basic or advanced), Math Topic, Math Subtopic, and Question Keywords. The question level was assigned by the professor who submitted the question. The data was collected from February 2019 until December 2023.

Turkish Crowdfunding Startups

This dataset contains data on crowdfunding campaigns in Turkey, collected in 2022. It covers more than 1500 projects on 6 different platforms and includes characteristics such as project descriptions, targeted and raised funds, campaign durations, and number of backers. The dataset is particularly useful for training natural language processing (NLP) and machine learning models, and for studying the characteristics of successful crowdfunding campaigns. It is intended as a resource for entrepreneurs, investors, and researchers who want to understand and analyze the crowdfunding ecosystem in Turkey.

Synthetic Circle Data Set

This dataset comprises 10000 two-dimensional points arranged into 100 circles, each containing 100 points. It was designed to evaluate clustering algorithms, such as k-means, by providing a clear and structured clustering challenge.
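
As a rough illustration of the shape described above, the sketch below generates 100 circles of 100 points each and runs a plain k-means (Lloyd's algorithm) on a small subset. It does not reproduce the repository's actual file: the grid layout of circle centers, the radius, the spacing, and the function names make_circles and kmeans are all assumptions made for this example.

```python
import math
import random


def make_circles(n_circles=100, points_per_circle=100,
                 radius=1.0, spacing=5.0, seed=0):
    """Return (points, labels) for points placed on circle outlines.

    Assumption: circle centers sit on a square grid, `spacing` apart.
    The real dataset's layout may differ.
    """
    rng = random.Random(seed)
    side = math.ceil(math.sqrt(n_circles))
    points, labels = [], []
    for c in range(n_circles):
        cx, cy = (c % side) * spacing, (c // side) * spacing
        for _ in range(points_per_circle):
            theta = rng.uniform(0.0, 2.0 * math.pi)
            points.append((cx + radius * math.cos(theta),
                           cy + radius * math.sin(theta)))
            labels.append(c)
    return points, labels


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        for i, (x, y) in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda j: (x - centroids[j][0]) ** 2
                + (y - centroids[j][1]) ** 2,
            )
        # Update step: move each centroid to the mean of its points.
        sums = [[0.0, 0.0, 0] for _ in range(k)]
        for (x, y), a in zip(points, assign):
            sums[a][0] += x
            sums[a][1] += y
            sums[a][2] += 1
        centroids = [
            (sx / n, sy / n) if n else centroids[j]  # keep empty clusters in place
            for j, (sx, sy, n) in enumerate(sums)
        ]
    return centroids, assign


# Full-size data with the same shape as the description: 100 x 100 points.
points, labels = make_circles()

# Small subset for a quick clustering run (pure Python is slow at full size).
small_points, small_labels = make_circles(n_circles=4, points_per_circle=50)
centroids, assign = kmeans(small_points, k=4)
```

Comparing assign with small_labels gives a quick sense of how well k-means recovers the circles under these assumed settings; the result depends on the random initialization.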
