Center for Machine Learning and Intelligent Systems


35 Data Sets


1. Activity recognition using wearable physiological measurements: This dataset contains features from Electrocardiogram (ECG), Thoracic Electrical Bioimpedance (TEB) and the Electrodermal Activity (EDA) for activity recognition.

2. Activity recognition with healthy older people using a batteryless wearable sensor: Sequential motion data from 14 healthy older people aged 66 to 86 years old using a batteryless, wearable sensor on top of their clothing for the recognition of activities in clinical environments.

3. Anuran Calls (MFCCs): Acoustic features extracted from syllables of anuran (frog) calls, including the family, the genus, and the species labels (multilabel).

4. Bar Crawl: Detecting Heavy Drinking: Accelerometer and transdermal alcohol content data from a college bar crawl. Used to predict heavy drinking episodes via mobile data.

5. Bar Crawl: Detecting Heavy Drinking: Accelerometer and transdermal alcohol content data from a college bar crawl. Used to predict heavy drinking episodes via mobile data.

6. Cardiotocography: The dataset consists of measurements of fetal heart rate (FHR) and uterine contraction (UC) features on cardiotocograms classified by expert obstetricians.

7. chipseq: ChIP-seq experiments characterize protein modifications or binding at specific genomic locations in specific samples. The machine learning problem in these data is structured binary classification.

8. Cuff-Less Blood Pressure Estimation: This data set provides preprocessed and cleaned vital signals which can be used in designing algorithms for cuff-less estimation of blood pressure.

9. Diabetes 130-US hospitals for years 1999-2008: This data has been prepared to analyze factors related to readmission as well as other outcomes pertaining to patients with diabetes.

10. Diabetic Retinopathy Debrecen Data Set: This dataset contains features extracted from the Messidor image set to predict whether an image contains signs of diabetic retinopathy or not.

11. Dorothea: DOROTHEA is a drug discovery dataset. Chemical compounds represented by structural molecular features must be classified as active (binding to thrombin) or inactive. This is one of 5 datasets of the NIPS 2003 feature selection challenge.

12. Drug Review Dataset: The dataset provides patient reviews on specific drugs along with related conditions and a 10-star patient rating reflecting overall patient satisfaction.

13. EEG Eye State: The data set consists of 14 EEG values and a value indicating the eye state.

14. EEG Steady-State Visual Evoked Potential Signals: This database consists of 30 subjects performing Brain Computer Interface for Steady State Visual Evoked Potentials (BCI-SSVEP).

15. EMG data for gestures: These are files of raw EMG data recorded by a MYO Thalmic bracelet.

16. Epileptic Seizure Recognition: This dataset is a pre-processed and re-structured/reshaped version of a very commonly used dataset featuring epileptic seizure detection.

17. Estimation of obesity levels based on eating habits and physical condition: This dataset includes data for the estimation of obesity levels in individuals from the countries of Mexico, Peru and Colombia, based on their eating habits and physical condition.

18. Hepatitis C Virus (HCV) for Egyptian patients: Egyptian patients who underwent treatment dosages for HCV for about 18 months. Discretization should be applied based on expert recommendations; an attached file shows how.

19. KASANDR: KASANDR is a novel, publicly available collection for recommendation systems that records the behavior of customers of the European leader in e-Commerce advertising, Kelkoo.

20. KEGG Metabolic Reaction Network (Undirected): KEGG Metabolic pathways modeled as an undirected reaction network. A variety of graphical features is presented.

21. KEGG Metabolic Relation Network (Directed): KEGG Metabolic pathways modeled as a directed relation network. A variety of graphical features is presented.

22. Localization Data for Person Activity: Data contains recordings of five people performing different activities. Each person wore four sensors (tags) while performing the same scenario five times.

23. Mice Protein Expression: Expression levels of 77 proteins measured in the cerebral cortex of 8 classes of control and Down syndrome mice exposed to context fear conditioning, a task used to assess associative learning.

24. One-hundred plant species leaves data set: Sixteen leaf samples for each of one hundred plant species. For each sample, a shape descriptor, fine-scale margin and texture histogram are given.

25. p53 Mutants: The goal is to model mutant p53 transcriptional activity (active vs inactive) based on data extracted from biophysical simulations.

26. Parkinson Speech Dataset with Multiple Types of Sound Recordings: The training data belongs to 20 Parkinson's Disease (PD) patients and 20 healthy subjects. From all subjects, multiple types of sound recordings (26) are taken.

27. Parkinsons Telemonitoring: Oxford Parkinson's Disease Telemonitoring Dataset

28. Physicochemical Properties of Protein Tertiary Structure: This is a data set of Physicochemical Properties of Protein Tertiary Structure. The data set is taken from CASP 5-9. There are 45730 decoys, with size varying from 0 to 21 angstroms.

29. QSAR fish bioconcentration factor (BCF): Experimental bioconcentration factor (BCF) for 1056 molecules and binary fingerprints (extended connectivity) to be used for QSAR modeling.

30. Reuters RCV1 RCV2 Multilingual, Multiview Text Categorization Test collection: This test collection contains feature characteristics of documents originally written in five different languages and their translations, over a common set of 6 categories.

31. sEMG for Basic Hand movements: The sEMG for Basic Hand movements includes 2 databases of surface electromyographic signals of 6 hand movements using Delsys' EMG System. Healthy subjects conducted six daily life grasps.

32. Simulated Falls and Daily Living Activities Data Set: 20 falls and 16 daily living activities were performed by 17 volunteers with 5 repetitions (3,060 instances) while wearing 6 sensors that were attached to their head, chest, waist, wrist, thigh and ankle.

33. Smartphone-Based Recognition of Human Activities and Postural Transitions: Activity recognition data set built from the recordings of 30 subjects performing basic activities and postural transitions while carrying a waist-mounted smartphone with embedded inertial sensors.

34. Tamilnadu Electricity Board Hourly Readings: This data can be effectively produced the result to fewer parameter of the Load profile can be reduced in the Database

35. Yeast: Predicting the Cellular Localization Sites of Proteins
