1. Dexter: DEXTER is a text classification problem in a bag-of-words representation. This is a two-class classification problem with sparse continuous input variables. This dataset is one of the five datasets of the NIPS 2003 feature selection challenge.
2. Madelon: MADELON is an artificial dataset, which was part of the NIPS 2003 feature selection challenge. This is a two-class classification problem with continuous input variables. The difficulty is that the problem is multivariate and highly non-linear.
3. Hill-Valley: Each record represents 100 points on a two-dimensional graph. When plotted in order (from 1 through 100) as the Y coordinate, the points will create either a Hill (a “bump” in the terrain) or a Valley (a “dip” in the terrain).
4. Libras Movement: The data set contains 15 classes of 24 instances each. Each class refers to a hand movement type in LIBRAS (Portuguese name 'LÍngua BRAsileira de Sinais', the official Brazilian sign language).
5. Trains: 2 data formats (structured, one-instance-per-line)
6. Flags: From Collins Gem Guide to Flags, 1986
7. Meta-data: Meta-Data was used in order to give advice about which classification method is appropriate for a particular dataset (taken from results of Statlog project).
8. Australian Sign Language signs (High Quality): This data consists of samples of Auslan (Australian Sign Language) signs. 27 examples of each of 95 Auslan signs were captured from a native signer using high-quality position trackers.
9. Image Segmentation: Image data described by high-level numeric-valued attributes, 7 classes
10. Statlog (Image Segmentation): This dataset is an image segmentation database similar to a database already present in the repository (Image segmentation database) but in a slightly different form.
11. Statlog (Vehicle Silhouettes): 3D objects within a 2D image by application of an ensemble of shape feature extractors to the 2D silhouettes of the objects.
12. University: Data in original (LISP-readable) form
13. Australian Sign Language signs: This data consists of samples of Auslan (Australian Sign Language) signs. Examples of 95 signs were collected from five signers with a total of 6650 sign samples.
14. Pittsburgh Bridges: Bridges database that has original and numeric-discretized datasets
15. Spoken Arabic Digit: This dataset contains time series of mel-frequency cepstrum coefficients (MFCCs) corresponding to spoken Arabic digits. Includes data from 44 male and 44 female native Arabic speakers.
16. Japanese Vowels: This dataset records 640 time series of 12 LPC cepstrum coefficients taken from nine male speakers.
17. Record Linkage Comparison Patterns: Element-wise comparison of records with personal data from a record linkage setting. The task is to decide from a comparison pattern whether the underlying records belong to one person.
18. Connectionist Bench (Vowel Recognition - Deterding Data): Speaker-independent recognition of the eleven steady-state vowels of British English using a specified training set of LPC-derived log area ratios.
19. MONK's Problems: A set of three artificial domains over the same attribute space; used to test a wide range of induction algorithms
20. Car Evaluation: Derived from a simple hierarchical decision model, this database may be useful for testing constructive induction and structure discovery methods.
21. Teaching Assistant Evaluation: The data consist of evaluations of teaching performance; scores are "low", "medium", or "high"
22. Reuters-21578 Text Categorization Collection: This is a collection of documents that appeared on Reuters newswire in 1987. The documents were assembled and indexed with categories.
23. Lenses: Database for fitting contact lenses
24. Badges: Badges labeled with a "+" or "-" as a function of a person's name
25. CMU Face Images: This data consists of 640 black and white face images of people taken with varying pose (straight, left, right, up), expression (neutral, happy, sad, angry), eyes (wearing sunglasses or not), and size.
26. Synthetic Control Chart Time Series: This data consists of synthetically generated control charts.
27. AutoUniv: AutoUniv is an advanced data generator for classification tasks. The aim is to reflect the nuances and heterogeneity of real data. Data can be generated in .csv, ARFF or C4.5 formats.
28. Legal Case Reports: A textual corpus of 4000 legal cases for automatic summarization and citation analysis. For each document we collect catchphrases, citation sentences, citation catchphrases and citation classes.