Browse Datasets

Amazon Product and Google Locations Reviews

This is a preprocessed dataset derived from [Google Local Reviews](https://mcauleylab.ucsd.edu/public_datasets/gdrive/googlelocal/) and [Amazon Reviews](https://amazon-reviews-2023.github.io/) that contains time series data of counts of reviews from various categories per hour.

Regression

Time-Series

3.06M Instances

2 Features

HLS-CMDS: Heart and Lung Sounds Dataset Recorded from a Clinical Manikin using Digital Stethoscope

This dataset contains 535 recordings of heart and lung sounds captured with a digital stethoscope from a clinical manikin, including both individual and mixed recordings: 50 heart sounds, 50 lung sounds, and 145 mixed sounds. For each mixed sound, the corresponding source heart sound (145 recordings) and source lung sound (145 recordings) were also recorded. It includes recordings from different anatomical chest locations, with normal and abnormal sounds. Each recording has been filtered to highlight specific sound types, making it valuable for artificial intelligence (AI) research and applications.

Classification, Regression, Clustering, Other

Tabular

535 Instances

0 Features

COREVQA

Recently, many benchmarks and datasets have been developed to evaluate Vision-Language Models (VLMs) using visual question answering (VQA) pairs, and models have shown significant accuracy improvements. However, these benchmarks rarely test the model's ability to accurately complete visual entailment, for instance, accepting or refuting a hypothesis based on the image. To address this, we propose COREVQA (Crowd Observations and Reasoning Entailment), a benchmark of 5608 image and synthetically generated true/false statement pairs, with images derived from the CrowdHuman dataset, to provoke visual entailment reasoning on challenging crowded images. Our results show that even the top-performing VLMs achieve accuracy below 80%, with other models performing substantially worse (39.98%-69.95%). This significant performance gap reveals key limitations in VLMs’ ability to reason over certain types of image–question pairs in crowded scenes.

Other

Tabular, Text, Image

5.61K Instances

0 Features

Amaranthus Viridis leaves

The Amaranthus Viridis crop was grown at London South Bank University, and several machine learning models were used to evaluate the resulting leaf image dataset. A Convolutional Neural Network (CNN) model was used to determine the percentage of the predicted Amaranthus leaves that match the original images from a hydroponic smart farm. The CNN achieved higher accuracy than the K-Nearest Neighbour, Support Vector Classifier and Decision Tree models.

Classification

Image

12 Instances

2 Features

Paddy Dataset

Agriculture occupies a third of Earth's surface and is vital for food production. Rice, grown from paddy seeds, feeds nearly half the global population. To meet rising food demands, this study aims to enhance rice production using Machine Learning (ML) to predict factors affecting paddy growth. A Hybrid ML Model with Combined Wrapper Feature Selection (HMLCWFS) was developed to address challenges like overfitting and computational costs. Five Feature Selection (FS) methods—Backward Elimination, Stepwise Forward Selection, Feature Importance, Exhaustive FS, and Gradient Boosting—were applied. Selected features were merged using Poincaré’s formula to form a refined dataset. ML models such as Decision Tree, Random Forest, SVM, KNN, and Naive Bayes were trained and tested. The model not only forecasts yield but also recommends paddy varieties based on farmers' preferences. Results show that combined FS techniques effectively identify key factors for improving paddy productivity.

Classification, Regression, Clustering

Tabular

2.79K Instances

45 Features

PGCB Hourly Generation Dataset (Bangladesh)

This dataset, published by the Power Grid Company of Bangladesh (PGCB), provides hourly records of electricity generation, demand, and load shedding across the national grid. It includes breakdowns by generation source, enabling advanced analysis of grid operations, forecasting, and generation mix optimization. PGCB is the sole national transmission operator in Bangladesh, overseeing a complex mixed-technology power grid. With an installed capacity of ~25,700 MW and full transmission coverage, it plays a key role in ensuring grid stability and managing system reliability.

Regression

Time-Series

92.65K Instances

14 Features

Neurofibromatosis Type 1; Clinical Symptoms of Familial and Sporadic Cases

A national NF1 database with 331 probands with tumors (167 sporadic and 142 familial cases) was evaluated.

Classification

Tabular

331 Instances

21 Features

High-Resolution Load Dataset from Smart Meters Across Various Cities in Morocco

The dataset includes detailed measurements of electricity consumption from several areas of Laâyoune, Boujdour, Marrakech and Foum Eloued. It presents both domestic and industrial consumption profiles, recorded every 10 minutes for Laâyoune, Boujdour and Foum Eloued and every 30 minutes for Marrakech. The data are organized in several tables showing usage trends by city and area, which allows analysis of daily variations in demand, seasonal fluctuations and peak loads. All data is in amperes (A), except for Marrakech, which is in kilowatts (kW).

Classification, Regression, Clustering, Other

Time-Series

5 Instances

5 Features

Gallstone

The clinical dataset was collected from the Internal Medicine Outpatient Clinic of Ankara VM Medical Park Hospital and includes data from 319 individuals (June 2022–June 2023), 161 of whom were diagnosed with gallstone disease. It contains 38 features, including demographic, bioimpedance, and laboratory data, and was ethically approved by the Ankara City Hospital Ethics Committee (E2-23-4632). Demographic variables are age, sex, height, weight, and BMI. Bioimpedance data includes total, extracellular, and intracellular water, muscle and fat mass, protein, visceral fat area, and hepatic fat. Laboratory features are glucose, total cholesterol, HDL, LDL, triglycerides, AST, ALT, ALP, creatinine, GFR, CRP, hemoglobin, and vitamin D. The dataset is complete, with no missing values, and balanced in terms of disease status, eliminating the need for additional preprocessing. It provides a strong foundation for machine learning-based gallstone prediction using non-imaging features.

Classification

Tabular

320 Instances

37 Features

BEED: Bangalore EEG Epilepsy Dataset

The Bangalore EEG Epilepsy Dataset (BEED) is a comprehensive EEG collection for epileptic seizure detection and classification. Recorded at a neurological research centre in Bangalore, India, it features high-fidelity EEG signals captured using the standard 10-20 electrode system at a 256 Hz sampling rate. BEED contains 16,000 segments of 20-second EEG recordings evenly distributed across four categories: Healthy Subjects (0), Generalized Seizures (1), Focal Seizures (2), and Seizure Events (3), where seizure activity occurs with events like eye blinking, nail biting, or staring. Each category includes data from 20 adult subjects (ages 21-55) with equal gender representation. The dataset comprises 16 EEG channels (X1-X16) corresponding to different brain regions, with a binary label (y) indicating seizure presence (1) or absence (0). BEED supports machine learning in seizure detection, epilepsy analysis, and EEG research with its balanced, high-resolution data.

Classification

Tabular, Multivariate

8K Instances

17 Features
