Center for Machine Learning and Intelligent Systems



23 Data Sets


1. UJI Pen Characters: Data consists of written characters in a UNIPEN-like format

2. Hill-Valley: Each record represents 100 points on a two-dimensional graph. When plotted in order (from 1 through 100) as the Y co-ordinate, the points will create either a Hill (a bump in the terrain) or a Valley (a dip in the terrain).

3. Ozone Level Detection: Two ground ozone level data sets are included in this collection. One is the eight-hour peak set, the other is the one-hour peak set. The data were collected from 1998 to 2004 in the Houston, Galveston and Brazoria area.

4. UJI Pen Characters (Version 2): A pen-based database with more than 11k isolated handwritten characters

5. Libras Movement: The data set contains 15 classes of 24 instances each. Each class refers to a hand movement type in LIBRAS (Portuguese name 'Língua BRAsileira de Sinais', the official Brazilian sign language).

6. Wall-Following Robot Navigation Data: The data were collected as the SCITOS G5 robot navigates through the room following the wall in a clockwise direction, for 4 rounds, using 24 ultrasound sensors arranged circularly around its 'waist'.

7. Localization Data for Person Activity: Data contains recordings of five people performing different activities. Each person wore four sensors (tags) while performing the same scenario five times.

8. Online Handwritten Assamese Characters Dataset: This is a dataset of 8235 online handwritten Assamese characters. The “online” process involves capturing data as text is written on a digitizing tablet with an electronic pen.

9. QtyT40I10D100K: Since there is no numerical sequential data stream available in standard data sets, this data set is generated from the original T40I10D100K data set

10. Wearable Computing: Classification of Body Postures and Movements (PUC-Rio): A dataset with 5 classes (sitting-down, standing-up, standing, walking, and sitting) collected over 8 hours of activities of 4 healthy subjects. We also established a baseline performance index.

11. 3D Road Network (North Jutland, Denmark): 3D road network with highly accurate elevation information (±20 cm) from Denmark, used in eco-routing and fuel/CO2-estimation routing algorithms.

12. EEG Eye State: The data set consists of 14 EEG values and a value indicating the eye state.

13. Predict keywords activities in a online social media: The data from Twitter were collected over 360 consecutive days by querying 1497 English keywords sampled from Wikipedia. This dataset is proposed in a learning-to-rank setting.

14. SML2010: This dataset was collected from a monitoring system mounted in a domotic house. It corresponds to approximately 40 days of monitoring data.

15. User Identification From Walking Activity: The dataset collects data from an Android smartphone positioned in the chest pocket from 22 participants walking in the wild over a predefined path.

16. Activity Recognition from Single Chest-Mounted Accelerometer: The dataset collects data from a wearable accelerometer mounted on the chest. The dataset is intended for Activity Recognition research purposes.

17. Gesture Phase Segmentation: The dataset is composed of features extracted from 7 videos of people gesticulating, with the aim of studying Gesture Phase Segmentation. It contains 50 attributes divided into two files for each video.

18. Grammatical Facial Expressions: This dataset supports the development of models that make it possible to interpret Grammatical Facial Expressions from Brazilian Sign Language (Libras).

19. microblogPCU: MicroblogPCU data is crawled from the Sina Weibo microblog. This data can be used to study machine learning methods as well as for social network research.

20. Machine Learning based ZZAlpha Ltd. Stock Recommendations 2012-2014: The data here are the ZZAlpha® machine learning recommendations made for various US-traded stock portfolios on the morning of each day during the 3-year period Jan 1, 2012 - Dec 31, 2014.

21. Taxi Service Trajectory - Prediction Challenge, ECML PKDD 2015: An accurate dataset describing trajectories performed by all the 442 taxis running in the city of Porto, in Portugal.

22. UJIIndoorLoc-Mag: The UJIIndoorLoc-Mag is an indoor localization database to test Indoor Positioning Systems that rely on variations in Earth's magnetic field.

23. Educational Process Mining (EPM): A Learning Analytics Data Set: Educational Process Mining data set is built from the recordings of 115 subjects' activities through a logging application while learning with an educational simulator.
