Center for Machine Learning and Intelligent Systems


Browse Through:

Default Task

Classification (15)
Regression (9)
Clustering (2)
Other (1)

Attribute Type

Categorical (1)
Numerical (22)
Mixed (0)

Data Type

Multivariate (24)
Univariate (0)
Sequential (1)
Time-Series (7)
Text (1)
Domain-Theory (1)
Other (3)

Area

Life Sciences (30)
Physical Sciences (24)
CS / Engineering (82)
Social Sciences (13)
Business (9)
Game (6)
Other (26)

# Attributes

Less than 10 (5)
10 to 100 (15)
Greater than 100 (4)

# Instances

Less than 100 (2)
100 to 1000 (15)
Greater than 1000 (24)

Format Type

Matrix (19)
Non-Matrix (5)

24 Data Sets


1. Airfoil Self-Noise: NASA data set, obtained from a series of aerodynamic and acoustic tests of two- and three-dimensional airfoil blade sections conducted in an anechoic wind tunnel.

2. Amazon Commerce reviews set: The dataset is used for authorship identification in online Writeprint, which is a new research field of pattern recognition.

3. Beijing PM2.5 Data: This hourly data set contains the PM2.5 data from the US Embassy in Beijing. Meteorological data from Beijing Capital International Airport are also included.

4. Cloud: Little Documentation

5. Concrete Compressive Strength: Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients.

6. Crowdsourced Mapping: Crowdsourced data from OpenStreetMap is used to automate the classification of satellite images into different land cover classes (impervious, farm, forest, grass, orchard, water).

7. Electrical Grid Stability Simulated Data: The local stability analysis of the 4-node star system (with the electricity producer at the center) implementing the Decentral Smart Grid Control concept.

8. Greenhouse Gas Observing Network: Design an observing network to monitor emissions of a greenhouse gas (GHG) in California given time series of synthetic observations and tracers from weather model simulations.

9. HEPMASS: The search for exotic particles requires sorting through a large number of collisions to find the events of interest. This data set challenges one to detect a new particle of unknown mass.

10. HTRU2: Pulsar candidates collected during the HTRU survey. Pulsars are a type of star of considerable scientific interest. Candidates must be classified into pulsar and non-pulsar classes to aid discovery.

11. Individual household electric power consumption: Measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 years. Different electrical quantities and some sub-metering values are available.

12. MAGIC Gamma Telescope: Data are Monte Carlo (MC) generated to simulate the registration of high-energy gamma particles in an atmospheric Cherenkov telescope.

13. MiniBooNE particle identification: This dataset is taken from the MiniBooNE experiment and is used to distinguish electron neutrinos (signal) from muon neutrinos (background).

14. Musk (Version 2): The goal is to learn to predict whether new molecules will be musks or non-musks.

15. Ozone Level Detection: Two ground ozone level data sets are included in this collection. One is the eight-hour peak set, the other is the one-hour peak set. The data were collected from 1998 to 2004 in the Houston, Galveston and Brazoria area.

16. PM2.5 Data of Five Chinese Cities: This hourly data set contains the PM2.5 data for Beijing, Shanghai, Guangzhou, Chengdu and Shenyang. Meteorological data for each city are also included.

17. Solar Flare: Each class attribute counts the number of solar flares of a certain class that occur in a 24-hour period.

18. Statlog (Landsat Satellite): Multi-spectral values of pixels in 3x3 neighbourhoods in a satellite image, and the classification associated with the central pixel in each neighbourhood.

19. Statlog (Shuttle): The shuttle dataset contains 9 attributes, all of which are numerical. Approximately 80% of the data belongs to class 1.

20. Steel Plates Faults: A dataset of steel plates’ faults, classified into 7 different types. The goal was to train machine learning for automatic pattern recognition.

21. Superconductivty Data: Two files contain data on 21263 superconductors and their relevant features.

22. Waveform Database Generator (Version 1): CART book's waveform domains

23. Waveform Database Generator (Version 2): CART book's waveform domains

24. Weight Lifting Exercises monitored with Inertial Measurement Units: Six young, healthy subjects were asked to perform 5 variations of the biceps curl weight lifting exercise. One of the variations is the one predicted by the health professional.
