Browse Datasets
HLS-CMDS: Heart and Lung Sounds Dataset Recorded from a Clinical Manikin using Digital Stethoscope
This dataset contains 535 recordings of heart and lung sounds captured from a clinical manikin with a digital stethoscope: 50 heart sounds, 50 lung sounds, and 145 mixed sounds. For each mixed sound, the corresponding source heart sound (145 recordings) and source lung sound (145 recordings) were also recorded. The recordings cover different anatomical chest locations and include both normal and abnormal sounds. Each recording has been filtered to highlight specific sound types, making the dataset valuable for artificial intelligence (AI) research and applications.
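As a quick consistency check, the category counts quoted in the description can be summed to confirm the stated total. This is only a sketch; the dictionary keys are illustrative labels, not names used by the dataset itself.

```python
# Recording counts as stated in the dataset description.
# The key names are hypothetical labels chosen for this sketch.
counts = {
    "heart_only": 50,
    "lung_only": 50,
    "mixed": 145,
    "mixed_source_heart": 145,
    "mixed_source_lung": 145,
}

total = sum(counts.values())
print(total)  # 535
```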
COREVQA
Recently, many benchmarks and datasets have been developed to evaluate Vision-Language Models (VLMs) using visual question answering (VQA) pairs, and models have shown significant accuracy improvements. However, these benchmarks rarely test the model's ability to accurately complete visual entailment, for instance, accepting or refuting a hypothesis based on the image. To address this, we propose COREVQA (Crowd Observations and Reasoning Entailment), a benchmark of 5608 image and synthetically generated true/false statement pairs, with images derived from the CrowdHuman dataset, to provoke visual entailment reasoning on challenging crowded images. Our results show that even the top-performing VLMs achieve accuracy below 80%, with other models performing substantially worse (39.98%-69.95%). This significant performance gap reveals key limitations in VLMs’ ability to reason over certain types of image–question pairs in crowded scenes.
Amaranthus Viridis leaves
The Amaranthus Viridis crop was grown at London South Bank University, and several machine learning models were evaluated on the resulting leaf image dataset. A Convolutional Neural Network (CNN) was used to determine the percentage of predicted Amaranthus leaves that match the original images from a hydroponic smart farm. The CNN achieved higher accuracy than the K-Nearest Neighbour, Support Vector Classifier, and Decision Tree models.
Paddy Dataset
Agriculture occupies a third of Earth's surface and is vital for food production. Rice, grown from paddy seeds, feeds nearly half the global population. To meet rising food demands, this study aims to enhance rice production using Machine Learning (ML) to predict factors affecting paddy growth. A Hybrid ML Model with Combined Wrapper Feature Selection (HMLCWFS) was developed to address challenges like overfitting and computational costs. Five Feature Selection (FS) methods—Backward Elimination, Stepwise Forward Selection, Feature Importance, Exhaustive FS, and Gradient Boosting—were applied. Selected features were merged using Poincaré’s formula to form a refined dataset. ML models such as Decision Tree, Random Forest, SVM, KNN, and Naive Bayes were trained and tested. The model not only forecasts yield but also recommends paddy varieties based on farmers' preferences. Results show that combined FS techniques effectively identify key factors for improving paddy productivity.
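The description does not say exactly how Poincaré's formula was applied. Poincaré's formula is the inclusion-exclusion principle, so one plausible reading is that the feature subsets chosen by each method are merged as a set union, with the formula giving the size of that union. The sketch below assumes that reading; the method keys and feature names are hypothetical and are not taken from the dataset.

```python
from itertools import combinations

# Hypothetical feature subsets, one per feature-selection method.
selected = {
    "backward_elimination": {"rainfall", "soil_type", "seed_rate"},
    "stepwise_forward": {"rainfall", "fertilizer", "seed_rate"},
    "feature_importance": {"rainfall", "soil_type", "variety"},
}


def union_size_inclusion_exclusion(sets):
    """Size of the union of `sets` via Poincaré's (inclusion-exclusion) formula."""
    sets = list(sets)
    total = 0
    for k in range(1, len(sets) + 1):
        # Odd-sized intersections are added, even-sized ones subtracted.
        sign = 1 if k % 2 == 1 else -1
        for group in combinations(sets, k):
            total += sign * len(set.intersection(*group))
    return total


# The merged feature set is simply the union of all selections.
merged = set().union(*selected.values())

print(sorted(merged))
print(union_size_inclusion_exclusion(selected.values()) == len(merged))  # True
```

Under this assumption, a feature picked by any one method survives into the refined dataset; a stricter merge (for example, keeping only features chosen by a majority of methods) would be a different design choice.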
PGCB Hourly Generation Dataset (Bangladesh)
This dataset, published by the Power Grid Company of Bangladesh (PGCB), provides hourly records of electricity generation, demand, and load shedding across the national grid. It includes breakdowns by generation source, enabling advanced analysis of grid operations, forecasting, and generation mix optimization. PGCB is the sole national transmission operator in Bangladesh, overseeing a complex mixed-technology power grid. With an installed capacity of ~25,700 MW and full transmission coverage, it plays a key role in ensuring grid stability and managing system reliability.
Neurofibromatosis Type 1; Clinical Symptoms of Familial and Sporadic Cases
A national NF1 database with 331 probands with tumors (167 sporadic and 142 familial cases) was evaluated.
High-Resolution Load Dataset from Smart Meters Across Various Cities in Morocco
The dataset includes detailed measurements of electricity consumption from several areas of Laâyoune, Boujdour, Marrakech and Foum Eloued. It presents both domestic and industrial consumption profiles, recorded every 10 minutes for Laâyoune, Boujdour and Foum Eloued and every 30 minutes for Marrakech. The data are organized in several tables showing usage trends by city and area, which allows analysis of daily variations in demand, seasonal fluctuations and peak loads. All data are in amperes (A), except for Marrakech, which is in kilowatts (kW).
Gallstone
The clinical dataset was collected from the Internal Medicine Outpatient Clinic of Ankara VM Medical Park Hospital and includes data from 319 individuals (June 2022–June 2023), 161 of whom were diagnosed with gallstone disease. It contains 38 features, including demographic, bioimpedance, and laboratory data, and was ethically approved by the Ankara City Hospital Ethics Committee (E2-23-4632). Demographic variables are age, sex, height, weight, and BMI. Bioimpedance data includes total, extracellular, and intracellular water, muscle and fat mass, protein, visceral fat area, and hepatic fat. Laboratory features are glucose, total cholesterol, HDL, LDL, triglycerides, AST, ALT, ALP, creatinine, GFR, CRP, hemoglobin, and vitamin D. The dataset is complete, with no missing values, and balanced in terms of disease status, eliminating the need for additional preprocessing. It provides a strong foundation for machine learning-based gallstone prediction using non-imaging features.
BEED: Bangalore EEG Epilepsy Dataset
The Bangalore EEG Epilepsy Dataset (BEED) is a comprehensive EEG collection for epileptic seizure detection and classification. Recorded at a neurological research centre in Bangalore, India, it features high-fidelity EEG signals captured using the standard 10-20 electrode system at a 256 Hz sampling rate. BEED contains 16,000 segments of 20-second EEG recordings evenly distributed across four categories: Healthy Subjects (0), Generalized Seizures (1), Focal Seizures (2), and Seizure Events (3), where seizure activity occurs with events like eye blinking, nail biting, or staring. Each category includes data from 20 adult subjects (ages 21-55) with equal gender representation. The dataset comprises 16 EEG channels (X1-X16) corresponding to different brain regions, with a binary label (y) indicating seizure presence (1) or absence (0). BEED supports machine learning in seizure detection, epilepsy analysis, and EEG research with its balanced, high-resolution data.
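Taking the stated figures at face value, the sampling rate and segment length imply the following sizes. This is a back-of-the-envelope sketch that assumes every segment spans the full 20 seconds at 256 Hz; the constant names are illustrative.

```python
# Figures quoted in the description; constant names are illustrative.
SAMPLING_RATE_HZ = 256
SEGMENT_SECONDS = 20
N_SEGMENTS = 16_000
N_CATEGORIES = 4

# Samples per channel in one segment.
samples_per_segment = SAMPLING_RATE_HZ * SEGMENT_SECONDS

# Segments per category, given the stated even distribution.
segments_per_category = N_SEGMENTS // N_CATEGORIES

# Total recording time covered by all segments, in hours.
total_hours = N_SEGMENTS * SEGMENT_SECONDS / 3600

print(samples_per_segment)     # 5120
print(segments_per_category)   # 4000
print(round(total_hours, 1))   # 88.9
```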
RecGym: Gym Workouts Recognition Dataset with IMU and Capacitive Sensor
The RecGym dataset is a collection of gym workouts recorded with IMU and capacitive sensors, designed for research and development in recommendation systems and fitness applications. The dataset records ten volunteers' gym sessions with a sensing unit composed of an IMU sensor (columns A_x, A_y, A_z, G_x, G_y, G_z) and a body capacitance sensor (column C_1). The sensing units were worn at three positions: on the wrist, in the pocket, and on the calf, with a sampling rate of 20 Hz. The dataset contains the motion signals of twelve activities: eleven workouts (Adductor, ArmCurl, BenchPress, LegCurl, LegPress, Riding, RopeSkipping, Running, Squat, StairsClimber, Walking) and a "Null" activity recorded while the volunteer idles between workouts. Each participant performed the workouts listed above in five sessions over five days, each session lasting around one hour. Altogether, fifty sessions of normalized gym workout data are included in this dataset.
