Center for Machine Learning and Intelligent Systems


52 Data Sets


1. Cloud: Little Documentation

2. Parkinson Speech Dataset with Multiple Types of Sound Recordings: The training data belongs to 20 Parkinson's Disease (PD) patients and 20 healthy subjects. From all subjects, multiple types of sound recordings (26) are taken.

3. STUDENT ALCOHOL CONSUMPTION: The result also provides the correlation between alcohol usage and the social, gender and study time attributes for each student.

4. QSAR biodegradation: Data set containing values for 41 attributes (molecular descriptors) used to classify 1055 chemicals into 2 classes (readily biodegradable and not readily biodegradable).

5. Mice Protein Expression: Expression levels of 77 proteins measured in the cerebral cortex of 8 classes of control and Down syndrome mice exposed to context fear conditioning, a task used to assess associative learning.

6. Cardiotocography: The dataset consists of measurements of fetal heart rate (FHR) and uterine contraction (UC) features on cardiotocograms classified by expert obstetricians.

7. Image Segmentation: Image data described by high-level numeric-valued attributes, 7 classes

8. Statlog (Image Segmentation): This dataset is an image segmentation database similar to a database already present in the repository (Image segmentation database) but in a slightly different form.

9. Ozone Level Detection: Two ground ozone level data sets are included in this collection. One is the eight hour peak set, the other is the one hour peak set. The data were collected from 1998 to 2004 in the Houston, Galveston and Brazoria area.

10. Australian Sign Language signs (High Quality): This data consists of samples of Auslan (Australian Sign Language) signs. 27 examples of each of 95 Auslan signs were captured from a native signer using high-quality position trackers.

11. seismic-bumps: The data describe the problem of forecasting high energy (higher than 10^4 J) seismic bumps in a coal mine. Data come from two longwalls located in a Polish coal mine.

12. SkillCraft1 Master Table Dataset: This data was used in Thompson et al. (2013). A list of possible game actions is discussed in Thompson, Blair, Chen, & Henrey (2013).

13. SML2010: This dataset is collected from a monitor system mounted in a domotic house. It corresponds to approximately 40 days of monitoring data.

14. Spambase: Classifying Email as Spam or Non-Spam

15. Wine Quality: Two datasets are included, related to red and white vinho verde wine samples, from the north of Portugal. The goal is to model wine quality based on physicochemical tests (see [Cortez et al., 2009]).

16. Waveform Database Generator (Version 1): CART book's waveform domains

17. Waveform Database Generator (Version 2): CART book's waveform domains

18. Wall-Following Robot Navigation Data: The data were collected as the SCITOS G5 robot navigates through the room following the wall in a clockwise direction, for 4 rounds, using 24 ultrasound sensors arranged circularly around its 'waist'.

19. Page Blocks Classification: The problem consists of classifying all the blocks of the page layout of a document that has been detected by a segmentation process.

20. Optical Recognition of Handwritten Digits: Two versions of this database are available; see folder

21. Parkinsons Telemonitoring: Oxford Parkinson's Disease Telemonitoring Dataset

22. First-order theorem proving: Given a theorem, predict which of five heuristics will give the fastest proof when used by a first-order prover. A sixth prediction declines to attempt a proof, should the theorem be too difficult.

23. Statlog (Landsat Satellite): Multi-spectral values of pixels in 3x3 neighbourhoods in a satellite image, and the classification associated with the central pixel in each neighbourhood

24. Air Quality: Contains the responses of a gas multisensor device deployed in the field in an Italian city. Hourly response averages are recorded along with gas concentration references from a certified analyzer.

25. Gesture Phase Segmentation: The dataset is composed of features extracted from 7 videos of people gesticulating, with the aim of studying Gesture Phase Segmentation. It contains 50 attributes divided into two files for each video.

26. Polish companies bankruptcy data: The dataset is about bankruptcy prediction of Polish companies. The bankrupt companies were analyzed in the period 2000-2012, while the still operating companies were evaluated from 2007 to 2013.

27. Pen-Based Recognition of Handwritten Digits: Digit database of 250 samples from 44 writers

28. Condition Based Maintenance of Naval Propulsion Plants: Data have been generated from a sophisticated simulator of a Gas Turbine (GT) mounted on a frigate characterized by a COmbined Diesel eLectric And Gas (CODLAG) propulsion plant type.

29. EEG Eye State: The data set consists of 14 EEG values and a value indicating the eye state.

30. MAGIC Gamma Telescope: Data are Monte Carlo (MC) generated to simulate registration of high energy gamma particles in an atmospheric Cherenkov telescope

31. Letter Recognition: Database of character image features; try to identify the letter

32. Grammatical Facial Expressions: This dataset supports the development of models that make it possible to interpret Grammatical Facial Expressions from Brazilian Sign Language (Libras).

33. default of credit card clients: This research aimed at the case of customers’ default payments in Taiwan and compares the predictive accuracy of probability of default among six data mining methods.

34. Online News Popularity: This dataset summarizes a heterogeneous set of features about articles published by Mashable in a period of two years. The goal is to predict the number of shares in social networks (popularity).

35. UJIIndoorLoc-Mag: The UJIIndoorLoc-Mag is an indoor localization database to test Indoor Positioning Systems that rely on variations in the Earth's magnetic field.

36. Facebook Comment Volume Dataset: Instances in this dataset contain features extracted from Facebook posts. The task associated with the data is to predict how many comments the post will receive.

37. Bank Marketing: The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict if the client will subscribe to a term deposit (variable y).

38. KEGG Metabolic Relation Network (Directed): KEGG Metabolic pathways modeled as directed relation network. Variety of graphical features presented.

39. Dataset for Sensorless Drive Diagnosis: Features are extracted from motor current. The motor has intact and defective components. This results in 11 different classes with different conditions.

40. KEGG Metabolic Reaction Network (Undirected): KEGG Metabolic pathways modeled as un-directed reaction network. Variety of graphical features presented.

41. Corel Image Features: This dataset contains image features extracted from a Corel image collection. Four sets of features are available based on the color histogram, color histogram layout, color moments, and co-occurrence

42. Diabetes 130-US hospitals for years 1999-2008: This data has been prepared to analyze factors related to readmission as well as other outcomes pertaining to patients with diabetes.

43. MiniBooNE particle identification: This dataset is taken from the MiniBooNE experiment and is used to distinguish electron neutrinos (signal) from muon neutrinos (background).

44. Buzz in social media: This data set contains examples of buzz events from two different social networks: Twitter, and Tom's Hardware, a forum network focusing on new technology with more conservative dynamics.

45. Online Video Characteristics and Transcoding Time Dataset: The dataset contains a million randomly sampled video instances listing 10 fundamental video characteristics along with the YouTube video ID.

46. Educational Process Mining (EPM): A Learning Analytics Data Set: Educational Process Mining data set is built from the recordings of 115 subjects' activities through a logging application while learning with an educational simulator.

47. YearPredictionMSD: Prediction of the release year of a song from audio features. Songs are mostly western, commercial tracks ranging from 1922 to 2011, with a peak in the 2000s.

48. Gas sensors for home activity monitoring: 100 recordings of a sensor array under different conditions in a home setting: background, wine and banana presentations. The array includes 8 MOX gas sensors, and humidity and temperature sensors.

49. PAMAP2 Physical Activity Monitoring: The PAMAP2 Physical Activity Monitoring dataset contains data of 18 different physical activities, performed by 9 subjects wearing 3 inertial measurement units and a heart rate monitor.

50. Record Linkage Comparison Patterns: Element-wise comparison of records with personal data from a record linkage setting. The task is to decide from a comparison pattern whether the underlying records belong to one person.

51. HEPMASS: The search for exotic particles requires sorting through a large number of collisions to find the events of interest. This data set challenges one to detect a new particle of unknown mass.

52. Heterogeneity Activity Recognition: The Heterogeneity Human Activity Recognition (HHAR) dataset from Smartphones and Smartwatches is a dataset devised to benchmark human activity recognition algorithms (classification, automatic data segmentation, sensor fusion, feature extraction, etc.) in real-world contexts; specifically, the dataset is gathered with a variety of different device models and use-scenarios, in order to reflect sensing heterogeneities to be expected in real deployments.
