Welcome to the UC Irvine Machine Learning Repository

We currently maintain 689 datasets as a service to the machine learning community. Here, you can donate and find datasets used by millions of people all around the world!

Popular Datasets

Iris

A small classic dataset from Fisher, 1936. One of the earliest known datasets used for evaluating classification methods.

Classification

150 Instances

4 Features

Heart Disease

Four databases: Cleveland, Hungary, Switzerland, and VA Long Beach.

Classification

303 Instances

13 Features

Wine Quality

Two datasets are included, covering red and white vinho verde wine samples from the north of Portugal. The goal is to model wine quality based on physicochemical tests (see [Cortez et al., 2009], http://www3.dsi.uminho.pt/pcortez/wine/).

Classification, Regression

4.9K Instances

12 Features

Bank Marketing

The data relates to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y).

Classification

45.21K Instances

17 Features

Breast Cancer Wisconsin (Diagnostic)

Diagnostic Wisconsin Breast Cancer Database.

Classification

569 Instances

30 Features

Adult

Predict whether an individual's annual income exceeds $50K/yr based on census data. Also known as the "Census Income" dataset.

Classification

48.84K Instances

14 Features

New Datasets

Amazon Product and Google Locations Reviews

This is a preprocessed dataset derived from [Google Local Reviews](https://mcauleylab.ucsd.edu/public_datasets/gdrive/googlelocal/) and [Amazon Reviews](https://amazon-reviews-2023.github.io/). It contains time series of hourly review counts for various categories.

Regression

3.06M Instances

2 Features

HLS-CMDS: Heart and Lung Sounds Dataset Recorded from a Clinical Manikin using Digital Stethoscope

This dataset contains 535 recordings of heart and lung sounds captured from a clinical manikin using a digital stethoscope. It comprises 50 individual heart sounds, 50 individual lung sounds, and 145 mixed sounds; for each mixed sound, the corresponding source heart sound (145 recordings) and source lung sound (145 recordings) were also recorded. The recordings cover different anatomical chest locations and include both normal and abnormal sounds. Each recording has been filtered to highlight specific sound types, making the dataset valuable for artificial intelligence (AI) research and applications.

Classification, Regression, Clustering, Other

535 Instances

0 Features
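
The recording counts in the HLS-CMDS description can be checked with a little arithmetic, since the individual and mixed sounds alone do not add up to 535. The sketch below is only an illustration: the dictionary keys are labels chosen here, not field names from the dataset.

```python
# Recording counts as stated in the HLS-CMDS description.
# The key names are illustrative labels, not fields from the dataset itself.
recording_counts = {
    "individual_heart": 50,
    "individual_lung": 50,
    "mixed": 145,
    "source_heart_for_mixed": 145,  # one source heart sound per mixed sound
    "source_lung_for_mixed": 145,   # one source lung sound per mixed sound
}

total_recordings = sum(recording_counts.values())
print(total_recordings)  # prints 535
```

The total matches the 535 instances listed for the dataset once the source recordings for each mixed sound are included.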

COREVQA

Recently, many benchmarks and datasets have been developed to evaluate Vision-Language Models (VLMs) using visual question answering (VQA) pairs, and models have shown significant accuracy improvements. However, these benchmarks rarely test the model's ability to accurately complete visual entailment, for instance, accepting or refuting a hypothesis based on the image. To address this, we propose COREVQA (Crowd Observations and Reasoning Entailment), a benchmark of 5608 image and synthetically generated true/false statement pairs, with images derived from the CrowdHuman dataset, to provoke visual entailment reasoning on challenging crowded images. Our results show that even the top-performing VLMs achieve accuracy below 80%, with other models performing substantially worse (39.98%-69.95%). This significant performance gap reveals key limitations in VLMs’ ability to reason over certain types of image–question pairs in crowded scenes.

Other

5.61K Instances

0 Features

Amaranthus Viridis leaves

The Amaranthus Viridis crop was grown at London South Bank University, and several machine learning models were used to evaluate the resulting leaf image dataset. A Convolutional Neural Network (CNN) was used to determine the percentage of predicted Amaranthus leaves that match the original images from a hydroponic smart farm. The CNN achieved higher accuracy than the K-Nearest Neighbour, Support Vector Classifier, and Decision Tree models.

Classification

12 Instances

2 Features

Paddy Dataset

Agriculture occupies a third of Earth's surface and is vital for food production. Rice, grown from paddy seeds, feeds nearly half the global population. To meet rising food demands, this study aims to enhance rice production using Machine Learning (ML) to predict factors affecting paddy growth. A Hybrid ML Model with Combined Wrapper Feature Selection (HMLCWFS) was developed to address challenges like overfitting and computational costs. Five Feature Selection (FS) methods—Backward Elimination, Stepwise Forward Selection, Feature Importance, Exhaustive FS, and Gradient Boosting—were applied. Selected features were merged using Poincaré’s formula to form a refined dataset. ML models such as Decision Tree, Random Forest, SVM, KNN, and Naive Bayes were trained and tested. The model not only forecasts yield but also recommends paddy varieties based on farmers' preferences. Results show that combined FS techniques effectively identify key factors for improving paddy productivity.

Classification, Regression, Clustering

2.79K Instances

45 Features

PGCB Hourly Generation Dataset (Bangladesh)

This dataset, published by the Power Grid Company of Bangladesh (PGCB), provides hourly records of electricity generation, demand, and load shedding across the national grid. It includes breakdowns by generation source, enabling advanced analysis of grid operations, forecasting, and generation mix optimization. PGCB is the sole national transmission operator in Bangladesh, overseeing a complex mixed-technology power grid. With an installed capacity of ~25,700 MW and full transmission coverage, it plays a key role in ensuring grid stability and managing system reliability.

Regression

92.65K Instances

14 Features
