Welcome to the UC Irvine Machine Learning Repository
We currently maintain 670 datasets as a service to the machine learning community. Here, you can donate and find datasets used by millions of people all around the world!
Popular Datasets
Iris
A small classic dataset from Fisher, 1936. One of the earliest known datasets used for evaluating classification methods.
Heart Disease
Four databases: Cleveland, Hungary, Switzerland, and VA Long Beach.
Wine Quality
Two datasets are included, related to red and white vinho verde wine samples, from the north of Portugal. The goal is to model wine quality based on physicochemical tests (see [Cortez et al., 2009], http://www3.dsi.uminho.pt/pcortez/wine/).
Adult
Predict whether an individual's annual income exceeds $50K/yr based on census data. Also known as the "Census Income" dataset.
Breast Cancer Wisconsin (Diagnostic)
Diagnostic Wisconsin Breast Cancer Database.
Bank Marketing
The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y).
New Datasets
Dataset for Assessing Mathematics Learning in Higher Education
MathE is a mathematical platform developed under the MathE project (mathe.pixel-online.org). The dataset has 9546 answers to questions on mathematical topics taught in higher education. The file has eight features: Student ID, Student Country, Question ID, Type of answer (correct or incorrect), Question level (basic or advanced), Math Topic, Math Subtopic, and Question Keywords. The question level was assigned by the professor who submitted the question. The data was collected from February 2019 to December 2023.
Turkish Crowdfunding Startups
This dataset describes crowdfunding campaigns in Turkey. Its fields include project descriptions, targeted and raised funds, campaign durations, and number of backers. Collected in 2022, it covers more than 1500 projects on 6 different platforms. The dataset is particularly useful for training natural language processing (NLP) and machine learning models, and it serves as a reference for entrepreneurs, investors, and researchers studying the characteristics of successful crowdfunding campaigns in Turkey.
Synthetic Circle Data Set
This dataset comprises 10000 two-dimensional points arranged into 100 circles, each containing 100 points. It was designed to evaluate clustering algorithms, such as k-means, by providing a clear and structured clustering challenge.
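As a rough illustration of the shape described above (100 circles of 100 points each), the following sketch generates a comparable point set. It is not the procedure used to build the published dataset: the grid layout, the `radius` and `spacing` values, the choice to place points on each circle's circumference, and the function name `make_circles` are all assumptions made for this example.

```python
import math
import random


def make_circles(n_circles=100, points_per_circle=100,
                 radius=1.0, spacing=5.0, seed=0):
    """Return (points, labels) for circles laid out on a square grid.

    All parameters are hypothetical; the real dataset's geometry is
    not specified in its description.
    """
    rng = random.Random(seed)
    side = math.ceil(math.sqrt(n_circles))  # circles per grid row
    points, labels = [], []
    for c in range(n_circles):
        # Center of circle c on the grid.
        cx = (c % side) * spacing
        cy = (c // side) * spacing
        for _ in range(points_per_circle):
            # Sample a point uniformly on the circumference.
            angle = rng.uniform(0.0, 2.0 * math.pi)
            points.append((cx + radius * math.cos(angle),
                           cy + radius * math.sin(angle)))
            labels.append(c)
    return points, labels


points, labels = make_circles()
```

The `labels` list records which circle each point came from, so the output of a clustering algorithm such as k-means with 100 clusters can be compared against it.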
Micro Gas Turbine Electrical Energy Prediction
This dataset consists of measurements of electrical power corresponding to an input control signal over time, collected from a 3-kilowatt commercial micro gas turbine.
Printed Circuit Board Processed Image
This CSV dataset, originally used for test-pad coordinate retrieval from PCB images, has potential applications in classification (e.g., grey test pad detection), anomaly detection (e.g., fake test pads), and clustering for grey test pad discovery. The dataset includes X and Y, representing pixel positions, and R, G, B values determining pixel color (min-max normalized from 0-255). A 'Grey' field indicates approximately grey pixels. The dataset was originally used for a 2-stage discovery of a high number of test pad clusters (>100) in a dataset presented in: Swee Chuan Tan and Schumann Tong Wei Kit, "Fast retrievals of test-pad coordinates from photo images of printed circuit boards," 2016 International Conference on Advanced Mechatronic Systems (ICAMechS), 2016, pp. 464-467, https://api.semanticscholar.org/CorpusID:38544897. This dataset contains more pixels than the one in the paper because a different extraction method was used.
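The min-max normalization of color channels and the 'Grey' flag mentioned above might look like the sketch below. It assumes the normalization maps the 0-255 range onto [0, 1], and it assumes "approximately grey" means the three channels are nearly equal. The `tolerance` value and both function names are hypothetical and are not taken from the dataset or the paper.

```python
def minmax_normalize(value, lo=0, hi=255):
    """Scale a channel value from [lo, hi] to [0, 1]."""
    return (value - lo) / (hi - lo)


def is_approx_grey(r, g, b, tolerance=0.05):
    """Treat a pixel as grey when its normalized channels are close.

    The tolerance is an illustrative assumption, not the dataset's rule.
    """
    return max(r, g, b) - min(r, g, b) <= tolerance


# Example: a raw 8-bit pixel whose channels are almost equal.
pixel = (128, 130, 127)
r, g, b = (minmax_normalize(c) for c in pixel)
grey = is_approx_grey(r, g, b)
```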
PhiUSIIL Phishing URL (Website)
The PhiUSIIL Phishing URL Dataset is a substantial dataset comprising 134,850 legitimate and 100,945 phishing URLs. Most of the URLs analyzed while constructing the dataset were recent at the time of collection. Features are extracted from the URL and from the source code of the webpage. Some features, such as CharContinuationRate, URLTitleMatchScore, URLCharProb, and TLDLegitimateProb, are derived from existing features.