Welcome to the UC Irvine Machine Learning Repository
We currently maintain 674 datasets as a service to the machine learning community. Here, you can donate and find datasets used by millions of people all around the world!
Popular Datasets
Iris
A small classic dataset from Fisher, 1936. One of the earliest known datasets used for evaluating classification methods.
Heart Disease
4 databases: Cleveland, Hungary, Switzerland, and the VA Long Beach
Wine Quality
Two datasets are included, related to red and white vinho verde wine samples, from the north of Portugal. The goal is to model wine quality based on physicochemical tests (see [Cortez et al., 2009], http://www3.dsi.uminho.pt/pcortez/wine/).
Adult
Predict whether annual income of an individual exceeds $50K/yr based on census data. Also known as "Census Income" dataset.
Breast Cancer Wisconsin (Diagnostic)
Diagnostic Wisconsin Breast Cancer Database.
Bank Marketing
The data is related with direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict if the client will subscribe a term deposit (variable y).
New Datasets
Gas sensor array low-concentration
This dataset contains 6 gas responses collected by a sensor array consisting of 10 metal oxide semiconductor sensors, with gas concentrations at the ppb level (below the minimum detection limit of the sensors)
Twitter Geospatial Data
Seven days of geo-tagged Tweet data from the United States with exact GPS location and timestamp.
CAN-MIRGU
A Comprehensive CAN Bus Attack Dataset from Moving Vehicles for Intrusion Detection System Evaluation This dataset includes CAN bus attacks collected from a modern automobile equipped with autonomous driving capabilities, operating in real-world driving scenarios. The dataset encompasses physically verified attacks to enhance the comparison and validation of in-vehicle network Intrusion Detection Systems.
Assessing Mathematics Learning in Higher Education
MathE is a mathematical platform developed under the MathE project (mathe.pixel-online.org). The dataset has 9546 answers to questions in the Mathematical topics taught in higher education. The file has eight features, named: Student ID, Student Country, Question ID, Type of answer (correct or incorrect), Question level (basic or advanced), Math Topic, Math Subtopic, and Question Keywords. The question level was associated with the professor who submitted the question. The data was obtained from February 2019 until December 2023.
Turkish Crowdfunding Startups
This dataset contains data on crowdfunding campaigns in Turkey. The dataset includes various characteristics such as crowdfunding projects, project descriptions, targeted and raised funds, campaign durations, and number of backers. Collected in 2022, this dataset provides a valuable resource for researchers who want to understand and analyze the crowdfunding ecosystem in Turkey. In total, there are data from more than 1500 projects on 6 different platforms. The dataset is particularly useful for training natural language processing (NLP) and machine learning models. This dataset is an important reference point for studies on the characteristics of successful crowdfunding campaigns and provides comprehensive information for entrepreneurs, investors and researchers in Turkey.
Synthetic Circle Data Set
This dataset comprises 10000 two-dimensional points arranged into 100 circles, each containing 100 points. It was designed to evaluate clustering algorithms, such as k-means, by providing a clear and structured clustering challenge.