Center for Machine Learning and Intelligent Systems



34 Data Sets


1. Airfoil Self-Noise: NASA data set, obtained from a series of aerodynamic and acoustic tests of two- and three-dimensional airfoil blade sections conducted in an anechoic wind tunnel.

2. Amazon Commerce reviews set: The dataset is used for authorship identification in online Writeprint, a new research field of pattern recognition.

3. Annealing: Steel annealing data

4. Challenger USA Space Shuttle O-Ring: Task: predict the number of O-rings that experience thermal distress on a flight at 31 degrees F given data on the previous 23 shuttle flights

5. Climate Model Simulation Crashes: Given Latin hypercube samples of 18 climate model input parameter values, predict climate model simulation crashes and determine the parameter value combinations that cause the failures.

6. Cloud: Little Documentation

7. Coil 1999 Competition Data: This data set is from the 1999 Computational Intelligence and Learning (COIL) competition. The data contains measurements of river chemical concentrations and algae densities.

8. Concrete Compressive Strength: Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients.

9. Connectionist Bench (Sonar, Mines vs. Rocks): The task is to train a network to discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock.

10. Cylinder Bands: Used in decision tree induction for mitigating process delays known as "cylinder bands" in rotogravure printing

11. Forest Fires: This is a difficult regression task, where the aim is to predict the burned area of forest fires, in the northeast region of Portugal, by using meteorological and other data (see details at:

12. Glass Identification: From USA Forensic Science Service; 6 types of glass, defined in terms of their oxide content (e.g., Na, Fe, K)

13. Greenhouse Gas Observing Network: Design an observing network to monitor emissions of a greenhouse gas (GHG) in California given time series of synthetic observations and tracers from weather model simulations.

14. Individual household electric power consumption: Measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 years. Different electrical quantities and some sub-metering values are available.

15. Ionosphere: Classification of radar returns from the ionosphere

16. Low Resolution Spectrometer: From IRAS data -- NASA Ames Research Center

17. MAGIC Gamma Telescope: Data are Monte Carlo (MC) generated to simulate registration of high-energy gamma particles in an atmospheric Cherenkov telescope

18. MiniBooNE particle identification: This dataset is taken from the MiniBooNE experiment and is used to distinguish electron neutrinos (signal) from muon neutrinos (background).

19. Musk (Version 1): The goal is to learn to predict whether new molecules will be musks or non-musks

20. Musk (Version 2): The goal is to learn to predict whether new molecules will be musks or non-musks

21. Ozone Level Detection: Two ground ozone level data sets are included in this collection. One is the eight-hour peak set; the other is the one-hour peak set. The data were collected from 1998 to 2004 in the Houston, Galveston, and Brazoria area.

22. Robot Execution Failures: This dataset contains force and torque measurements on a robot after failure detection. Each failure is characterized by 15 force/torque samples collected at regular time intervals

23. Shuttle Landing Control: Tiny database; all nominal values

24. Solar Flare: Each class attribute counts the number of solar flares of a certain class that occur in a 24 hour period

25. Statlog (Landsat Satellite): Multi-spectral values of pixels in 3x3 neighbourhoods in a satellite image, and the classification associated with the central pixel in each neighbourhood

26. Statlog (Shuttle): The shuttle dataset contains 9 attributes all of which are numerical. Approximately 80% of the data belongs to class 1

27. Steel Plates Faults: A dataset of steel plates’ faults, classified into 7 different types. The goal was to train machine learning for automatic pattern recognition.

28. Urban Land Cover: Classification of urban land cover using high resolution aerial imagery. Intended to assist sustainable urban planning efforts.

29. Water Treatment Plant: Multiple classes predict plant state

30. Waveform Database Generator (Version 1): CART book's waveform domains

31. Waveform Database Generator (Version 2): CART book's waveform domains

32. Weight Lifting Exercises monitored with Inertial Measurement Units: Six young healthy subjects were asked to perform 5 variations of the biceps curl weight lifting exercise. One of the variations is the one predicted by the health professional.

33. Wine: Using chemical analysis determine the origin of wines

34. Yacht Hydrodynamics: Delft data set, used to predict the hydrodynamic performance of sailing yachts from dimensions and velocity.
