
Statlog Project Data Set

Below are papers that cite this data set, with context shown. Papers were automatically harvested and associated with this data set, in collaboration with Rexa.info.

Return to Statlog Project data set page.


Gavin Brown. Diversity in Neural Network Ensembles. The University of Birmingham. 2004.

from the UCI repository (699 patterns), and the Heart disease dataset from Statlog (270 patterns). An ensemble consisting of two networks, each with five hidden nodes, was trained using NC. We use 5-fold cross-validation, and 40 trials from uniform random weights in
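As an illustration of the protocol described in this excerpt, here is a minimal sketch of 5-fold cross-validation over a two-network ensemble with five hidden nodes per network. It assumes scikit-learn and the OpenML copy of the Statlog heart data published under the name "heart-statlog", and it averages member predictions rather than using the negative-correlation (NC) training studied in the thesis.

    # Minimal sketch (not the thesis code): 5-fold CV of a two-network ensemble.
    # Assumes scikit-learn and the OpenML dataset named "heart-statlog" (270 patterns).
    import numpy as np
    from sklearn.datasets import fetch_openml
    from sklearn.model_selection import StratifiedKFold
    from sklearn.neural_network import MLPClassifier
    from sklearn.preprocessing import StandardScaler

    X, y = fetch_openml("heart-statlog", version=1, return_X_y=True, as_frame=False)

    accuracies = []
    for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
        scaler = StandardScaler().fit(X[train_idx])
        X_tr, X_te = scaler.transform(X[train_idx]), scaler.transform(X[test_idx])
        # Two networks, five hidden nodes each, started from different random weights;
        # plain averaging of member outputs stands in for the NC objective.
        nets = [MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000, random_state=seed)
                .fit(X_tr, y[train_idx]) for seed in (1, 2)]
        proba = np.mean([net.predict_proba(X_te) for net in nets], axis=0)
        pred = nets[0].classes_[np.argmax(proba, axis=1)]
        accuracies.append(np.mean(pred == y[test_idx]))

    print(f"mean 5-fold accuracy: {np.mean(accuracies):.3f}")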


Jeroen Eggermont and Joost N. Kok and Walter A. Kosters. Genetic Programming for data classification: partitioning the search space. SAC. 2004.

Dataset / records / attributes / classes: Australian credit (statlog) 690, 14, 2; German credit (statlog) 1000, 23, 2; Pima Indians diabetes 768, 8, 2; Heart disease (statlog) 270, 13, 2; Ionosphere 351, 34, 2. The total data set is divided into n parts. Each part is chosen once as the test set while the other n-1 parts form the training set (n-fold cross-validation). We will mention the results of C4.5 as reported by Freund and Schapire [4], in order to compare


Wei-Chun Kao and Kai-Min Chung and Lucas Assun and Chih-Jen Lin. Decomposition Methods for Linear Support Vector Machines. Neural Computation, 16. 2004.

D. J. Spiegelhalter, and C. C. Taylor (1994). Machine Learning, Neural and Statistical Classification. Englewood Cliffs, N.J.: Prentice Hall. Data available at http://www.ncc.up.pt/liacc/ML statlog datasets.html. Osuna, E., R. Freund, and F. Girosi (1997). Training support vector machines: An application to face detection. In Proceedings of CVPR'97, New York, NY, pp. 130--136. IEEE. Platt, J. C.


Xiaoli Z. Fern and Carla Brodley. Cluster Ensembles for High Dimensional Clustering: An Empirical Study. Journal of Machine Learning Research. 2004.

(6 letters only) UCI ML archive mfeat Handwritten digits represented by Fourier coefficients (Blake and Merz, 1998) satimage StatLog Satellite image data set (training set) segmentation Image segmentation data In contrast, HBGF allows the similarity of instances and the similarity of clusters to be considered simultaneously in producing the final


Zoubin Ghahramani and Hyun-Chul Kim. Bayesian Classifier Combination. Gatsby Computational Neuroscience Unit University College London. 2003.

and using different component classifiers. We used Satellite and DNA data sets from the Statlog project([8]) and the UCI digit data set ([1]) 3 . Our goal was not to obtain the best classifier performance---for this we would have paid very careful attention to the component


Bart Hamers and J. A. K. Suykens and Bart De Moor. Coupled Transductive Ensemble Learning of Kernel Models. 2003.

donated by Quinlan and is one of the Credit Approval Databases which were used in the Statlog project. There are 690 observations in this data set with six numerical and eight categorical attributes. The optimal hyperparameters for the Gaussian RBF kernel are (γ, σ²) = (9.03e2, 12.15). All the ensembles were based on 10 submodels. Similar to all the


Ramesh Natarajan and Edwin P D Pednault. Segmented Regression Estimators for Massive Data Sets. SDM. 2002.

Problems, SIAM, Philadelphia (1996). [4] C. Blake, E. Keogh and C. Merz, UCI repository of machine learning databases. (http://www.ics.uci.edu/~mlearn). [5] P. Brazdil and J. Gama, statlog project datasets, http://www.nccp.up.pt/liacc/ML/statlog. [6] L. Breiman, J. Friedman, R. Olshen and C. Stone, Classification and Regression Trees, Wadsworth, Belmont CA (1984). [7] G. H. Golub and C. F. Van Loan,


Jun Wang and Bin Yu and Les Gasser. Concept Tree Based Clustering Visualization with Shaded Similarity Matrices. ICDM. 2002.

similarity. has a scalability limitation. One solution is to use sampling and ensemble approaches. Using small sample sizes such as 100 or 200, we have tested the sampling approach on some Statlog datasets, including the Shuttle dataset which contains 43,500 instances [6]. The results are promising. 6. Summary This paper proposes a new approach for getting better interpretations for clustering


Avelino J. Gonzalez and Lawrence B. Holder and Diane J. Cook. Graph-Based Concept Learning. FLAIRS Conference. 2001.

Voting Records Database available from the UCI machine learning repository (Keogh et al. 1998). The diabetes domain is the Pima Indians Diabetes Database, and the credit domain is the German Credit Dataset from the Statlog Project Databases (Keogh et al. 1998). The Tic-Tac-Toe domain consists of 958 exhaustively generated examples. Positive examples are those where "X" starts moving and wins the game


Jochen Garcke and Michael Griebel and Michael Thess. Data Mining with Sparse Grids. Computing, 67. 2001.

for B l is sufficiently limited. The operations of the matrices C l and G l on the vectors are then computed on the fly when needed in the conjugate gradient iteration. 3.4.1 Shuttle Data The shuttle data set comes from the StatLog Project [52]. It consists of 43,500 observations in the training set and 14,500 in the testing set and has 9 attributes and 7 classes in the original version. To


Haixun Wang and Carlo Zaniolo. CMP: A Fast Decision Tree Classifier Using Multivariate Predictions. ICDE. 2000.

(Letter, Satimage, Segment and Shuttle) in the table are from the STATLOG project[6], and the two large datasets (Function 2 and Function 7) are synthetic datasets described in [5]. In these test cases there were at most N = 2 alive intervals: i) the one whose left boundary (or right boundary, depending on


Edgar Acuna and Alex Rojas. Ensembles of classifiers based on Kernel density estimators. Department of Mathematics University of Puerto Rico. 2000.

All of them have been analyzed already in a combining setup and nine of these datasets were analyzed in the Statlog Project (Michie et al. 1994). B) To make an analysis of the bias-variance decomposition for the misclassification error when classifiers based on kernel density


Guido Lindner and Rudi Studer. AST: Support for Algorithm Selection with a CBR Approach. PKDD. 1999.

the MLT project with its Consultant system [Consortium, 1993] as well as the Statlog project [Michie et al., 1994]), aiming at comparing the performance of a fixed set of algorithms on several data sets. In the Statlog project 23 algorithms were evaluated on 21 data sets. A similar perspective on model selection can be found in [Kohavi et al., 1997], where these ideas form the background


Ljupco Todorovski and Saso Dzeroski. Experiments in Meta-level Learning with ILP. PKDD. 1999.

used in the experiments are public domain and the experiments can be repeated. This was not the case with the StatLog dataset repository where more than half of the datasets used are not publicly available. Another improvement is the use of a unified methodology for measuring the error rate of different classification


Art B. Owen. Tubular neighbors for regression and classification. Stanford University. 1999.

neighbors and slightly worse (three more errors) than global logistic regression. Cross-validation did not identify the best performing method on the test cases. This diabetes data is one of the data sets in the statlog study (Michie et al. 1995). In the statlog study, the best method was global logistic regression, for which they report an accuracy of 77.73%. This differs from the logistic


Cesar Guerra-Salcedo and L. Darrell Whitley. Genetic Approach to Feature Selection for Ensemble Creation. GECCO. 1999.

Dataset / Features / Classes / Train Size / Test Size: LandSat 36, 6, 4435, 2000; DNA 180, 3, 2000, 1186; Segment 19, 7, 210, 2100. 3 EXPERIMENTAL SETUP A series of experiments were carried out using publicly available datasets provided by the Statlog project and by the UCI machine learning repository [C. Blake and Merz, 1998]. Table 1 shows the datasets employed for this research. 3.1 ENSEMBLE RELATED SETUPS Our main


Khaled A. Alsabti and Sanjay Ranka and Vineet Singh. CLOUDS: A Decision Tree Classifier for Large Datasets. KDD. 1998.

The first four datasets are taken from the STATLOG project, which has been a widely used benchmark in classification. The "Abalone," "Waveform," and "Isolet" datasets can be found in [13]. The "Synth1" and "Synth2"


Robert E. Schapire and Yoav Freund and Peter Bartlett and Wee Sun Lee. Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods. The Annals of Statistics, to appear. AT&T Labs. 1998.

Each stimulus was converted into 16 primitive numerical attributes (statistical moments and edge counts) which were then scaled to fit into a range of integer values from 0 through 15." The satimage dataset is the statlog version of a satellite image dataset. According to the documentation, "This database consists of the multi-spectral values of pixels in 3 x 3 neighborhoods in a satellite image, and


Igor Kononenko and Edvard Simec and Marko Robnik-Sikonja. Overcoming the Myopia of Inductive Learning Algorithms with RELIEFF. Appl. Intell, 7. 1997.

(LYMP), and diagnosis in rheumatology (RHEU). HEPA: prognostics of survival for patients suffering from hepatitis. The data was provided by Gail Gong from Carnegie-Mellon University. Data sets obtained from the StatLog database [18]: diagnosis of diabetes (DIAB) and diagnosis of heart diseases (HEART). For the DIAB data set, Ragavan & Rendell [27] report 78.8% classification accuracy with


Oya Ekin and Peter L. Hammer and Alexander Kogan and Pawel Winter. Distance-Based Classification Methods. RUTCOR Research Report, Rutgers Center for Operations Research, Rutgers University. 1996.

653 instances with 15 attributes each. Carter and Catlett [3] reported an 85.5% correct prediction rate, when using 71% of all 690 instances as the training set. 4.6 German Credit Statlog This data set contains data used to evaluate credit applications in Germany. It has 1000 instances. We used a version of this data set that was produced by Strathclyde University. In this version each case is


Georgios Paliouras and David S. Brée. The Effect of Numeric Features on the Scalability of Inductive Learning Programs. ECML. 1995.

was acquired from the UCI Repository [13] and its original donor was D.J. Slate. Its author has used it as an application domain for Holland-style genetic classifier systems [7]. More recently the data set has also been used in the StatLog project [11]. The data set contains 20,000 instances, of which roughly 16,000 have been used for learning in this experiment. Each instance corresponds to an


Ron Kohavi. The Power of Decision Tables. ECML. 1995.

with continuous features, we chose the rest of the StatLog datasets except shuttle, which was too big, and all the datasets used by Holte (1993). 4.1 Methodology We now define the exact settings used in the algorithms. The estimated accuracy for each node was


Ron Kohavi and George H. John and Richard Long and David Manley and Karl Pfleger. MLC++: A Machine Learning Library in C. ICTAI. 1994.

but they are not an integrated environment, and are not very efficient. StatLog [14] is an ESPRIT project studying the behavior of over twenty algorithms (mostly in the MLToolbox), on over twenty datasets. StatLog is an instance of a good experimental study, but does not provide the tools to aid researchers in performing similar studies. Wray Buntine has recently suggested a unified approach to some


Włodzisław Duch. Control and Cybernetics. Department of Computer Methods, Nicholas Copernicus University.

and thus also belong to the SBM. All of these methods may be useful in control problems. A review of many approaches to classification and comparison of performance of 20 methods on 20 real world datasets has been done within the StatLog European Community project (Michie et al. 1994). More recently the accuracy of 24 neural-based, pattern recognition and statistical classification systems has been


Włodzisław Duch and Rudy Setiono and Jacek M. Zurada. Computational intelligence methods for rule-based data understanding.

diabetes. Eight attributes describe age, number of times pregnant, body mass index, plasma glucose concentration, diastolic blood pressure, diabetes pedigree function, and other medical tests. This dataset was used in the Statlog project [89], with the best 10-fold cross-validation accuracy around 77.7% obtained by logistic discriminant analysis. Our estimation of variance on cross-validation


Włodzisław Duch and Rafał Adamczak. Statistical methods for construction of neural networks. Department of Computer Methods, Nicholas Copernicus University.

cases give results that are comparable or better than those found by neural networks. A review of different approaches to classification and comparison of performance of 20 methods on 22 real world datasets has been done within the StatLog European Community project [3]. The algorithms that appeared most frequently as the top five were all of statistical nature, including four discriminant approaches:


Chih-Wei Hsu and Cheng-Ru Lin. A Comparison of Methods for Multi-class Support Vector Machines. Department of Computer Science and Information Engineering National Taiwan University.

iris, wine, glass, and vowel. Those problems had already been tested in [27]. From the Statlog collection we choose all multi-class datasets: vehicle, segment, dna, satimage, letter, and shuttle. Note that except for the dna problem, we scale all training data to be in [-1, 1]. Then test data are adjusted using the same linear transformation.
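The scaling step mentioned here is the standard recipe of fitting the transformation on the training set only and reusing it unchanged on the test set. A minimal sketch, assuming scikit-learn and placeholder arrays rather than the actual Statlog files:

    # Minimal sketch: scale training features to [-1, 1] and apply the identical
    # linear map to the test set. X_train / X_test are placeholder arrays.
    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(100, 36))  # e.g., a satimage-like feature matrix
    X_test = rng.normal(size=(40, 36))

    scaler = MinMaxScaler(feature_range=(-1, 1)).fit(X_train)  # per-feature min/max from training data only
    X_train_scaled = scaler.transform(X_train)                 # training values now lie in [-1, 1]
    X_test_scaled = scaler.transform(X_test)                   # same map; test values may fall slightly outside

    print(X_train_scaled.min(), X_train_scaled.max())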


Alexander K. Seewald. Towards Understanding Stacking: Studies of a General Ensemble Learning Scheme. Doctoral dissertation, submitted for the degree of Doctor of Technical Sciences.

features which uniquely characterize the dataset. These were inspired by the StatLOG project (Brazdil, Gama & Henery, 1994). Space restrictions prevent us from giving exact formulas for each case, but a reference implementation is available from


Włodzisław Duch. Support Vector Neural Training.

has been re-analyzed with a number of methods available in the Ghostminer package [11]. Many other results for this dataset may be found in the Statlog book [13]. Best results (Table I) were achieved with the k-Nearest Neighbors classifier with small k (automatic selection using cross-validation tests found optimal k=3),


Alexander K. Seewald. Meta-Learning for Stacked Classification. Austrian Research Institute for Artificial Intelligence.

features which uniquely characterize the dataset. These were inspired by the StatLOG project (Brazdil, Gama & Henery, 1994) and reimplemented in WEKA. # Inst, the number of examples. # log(Inst) which is the natural logarithm of Inst. # Classes,


Włodzisław Duch and Karol Grudziński. Meta-learning: searching in the model space. Department of Computer Methods, Nicholas Copernicus University.

on which they work well. A review of many approaches to classification and comparison of performance of 20 methods on 20 real world datasets has been done within the StatLog European Community project [2]. The accuracy of 24 neural-based, pattern recognition and statistical classification systems has been compared on 11 large datasets


Kuan-ming Lin and Chih-Jen Lin. A Study on Reduced Support Vector Machines. Department of Computer Science and Information Engineering National Taiwan University.

votes. The implementation of all methods mentioned above is available upon request. V. EXPERIMENTS In this section we conduct experiments on some commonly used problems. We choose large multiclass datasets from the Statlog collection: dna, satimage, letter, and shuttle [16]. We also consider mnist [9], an important benchmark for handwritten digit recognition. The problem ijcnn1 is from the first


Je Scott and Mahesan Niranjan and Richard W. Prager. Realisable Classifiers: Improving Operating Performance on Variable Cost Problems. Cambridge University Department of Engineering.

the convex hull over the Test data ROC curves. The ROC curves for System 1 and System 2 using the Unseen data are plotted for comparison with the MRROC. 3.3 LandSat data A LandSat image segmentation dataset, originally used in the Statlog project, was obtained from the UCI repository [13, 12]. The data consisted of multi-spectral values of pixels in 3 x 3 neighbourhoods in a satellite image. A


Yishay Mansour. Pessimistic decision tree pruning based on tree size. Computer Science Dept. Tel-Aviv University.

statlog Comparative testing and evaluation of statistical and logical learning algorithms for large-scale applications in classification, prediction and control. ftp://ftp.ncc.up.pt/pub/statlog/datasets (see also: Machine Learning, Neural and Statistical Classification, ed. Michie, Spiegelhalter and Taylor). [VC71] V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative


Guido Lindner and Rudi Studer. Algorithm Selection Support for Classification. DaimlerChrysler AG, Research & Technology FT3/KL.

[Sleeman et al., 1995]. Such an approach is very difficult to maintain: each time a new algorithm has to be included one has to recompute all the rules. The Statlog project tried to describe data sets for a meta learning step to generate rules that specify in which case which algorithm is (possibly) applicable. The generated rules use hard boundaries within their condition part. However, instead


Ron Kohavi and George H. John. Automatic Parameter Selection by Minimizing Estimated Error. Computer Science Dept. Stanford University.

being studied. We report experiments with this method on 33 datasets selected from the UCI and StatLog collections using C4.5 as the basic induction algorithm. At a 90% confidence level, our method improves the performance of C4.5 on nine domains, degrades


Iñaki Inza and Pedro Larrañaga and Ramon Etxeberria and Basilio Sierra. Feature Subset Selection by Bayesian networks based optimization. Dept. of Computer Science and Artificial Intelligence, University of the Basque Country.

come from the UCI repository [66]. Image dataset comes from the Statlog project [83]. LED24 (Breiman et al. [15]) is a well known artificial dataset with 7 equally relevant and 17 irrelevant binary features. We designed another artificial domain,


Ron Kohavi and George John and Richard Long and David Manley and Karl Pfleger. Appears in Tools with AI '94. Computer Science Department Stanford University.

but they are not an integrated environment, and are not very efficient. StatLog [19] is an ESPRIT project studying the behavior of over twenty algorithms (mostly in the MLToolbox), on over twenty datasets. StatLog is an instance of a good experimental study, but does not provide the tools to aid researchers in performing similar studies. Wray Buntine has recently suggested a unified approach to some


H.-T. Lin and C.-J. Lin. A Study on Sigmoid Kernels for SVM and the Training of non-PSD Kernels by SMO-type Methods. Department of Computer Science and Information Engineering National Taiwan University.

D. J. Spiegelhalter, and C. C. Taylor (1994). Machine Learning, Neural and Statistical Classification. Englewood Cliffs, N.J.: Prentice Hall. Data available at http://www.ncc.up.pt/liacc/ML statlog datasets.html. Nash, S. G. and A. Sofer (1996). Linear and Nonlinear Programming. McGraw-Hill. Osuna, E., R. Freund, and F. Girosi (1997). Training support vector machines: An application to face detection.


Jun Wang. Classification Visualization with Shaded Similarity Matrix. Bei Yu Les Gasser Graduate School of Library and Information Science University of Illinois at Urbana-Champaign.

explored in the future. The purpose of this section is to see if it is effective to use simple random sampling with a very small sample size. To this end, we test the ensemble classifier on 5 Statlog data sets: Satimage, Segment, Shuttle, Australian, and DNA. For data description, please see Table 3. The reason to use these 5 Statlog data sets is that Ankerst used them as a benchmark in his PBC system


Rong-En Fan and P.-H. Chen and C.-J. Lin. Working Set Selection Using the Second Order Information for Training SVM. Department of Computer Science and Information Engineering National Taiwan University.

D. J. Spiegelhalter, and C. C. Taylor. Machine Learning, Neural and Statistical Classification. Prentice Hall, Englewood Cliffs, N.J., 1994. Data available at http://www.ncc.up.pt/liacc/ML statlog datasets.html. E. Osuna, R. Freund, and F. Girosi. Training support vector machines: An application to face detection. In Proceedings of CVPR'97, pages 130--136, New York, NY, 1997. IEEE. Laura Palagi and


Włodzisław Duch and Karol Grudziński. Search and global minimization in similarity-based methods. Department of Computer Methods, Nicholas Copernicus University.

significant improvements of results are obtained with a reduced feature set or with weighted features. Other tests that we have performed indicate that for more than half of the Statlog datasets [1] feature selection and weighting makes k-NN results better than those of any other classifiers used in this project. The same feature selection and weighting methods may be used to improve


Włodzisław Duch. Committees of Undemocratic Competent Models. School of Computer Engineering, Nanyang Technological University.

k=7, Euclidean 94.9 95.3 (best single CUC model); Dipol92 98.3 95.2 (Statlog); Alloc80 93.7 94.3 (Statlog); Quadratic DA 100 94.1 (Statlog); LDA 96.6 94.1 (Statlog). TABLE II: Comparison of results on the Letter dataset; results are from the Statlog book or our own calculations. System / Train % / Test % / Remarks: CUC committee 98.5 96.5; Majority committee 95.8 95.4; kNN, k=5, Euclidean 94.8 95.4 (best single CUC model)


Krzysztof Grabczewski and Włodzisław Duch. The Separability of Split Value Criterion. Department of Computer Methods, Nicolaus Copernicus University.

analyzed in the Statlog project [6]. Results of the C4.5 decision tree are already significantly worse. 5.4 Statlog Australian credit data This dataset contains 690 cases classified in 2 classes (+ and -). Data vectors are described by 14 attributes (6 continuous and 8 discrete). In Table 4 a comparison of 10-fold cross-validation results for


Cesar Guerra-Salcedo and Darrell Whitley. Feature Selection Mechanisms for Ensemble Creation: A Genetic Search Perspective. Department of Computer Science, Colorado State University.

1993) and a table-based classifier called Euclidean Decision Tables (EDT) (Guerra-Salcedo & Whitley 1998). Setups and Results A series of experiments were carried out using publicly available datasets provided by the Statlog project, the UCI machine learning repository (C. Blake & Merz 1998), and by Richard Bankert of the Naval Research Laboratory. Table 1 shows the datasets employed for this


Elena Smirnova and Ida G. Sprinkhuizen-Kuyper and I. Nalbantis. Unanimous Voting using Support Vector Machines. IKAT, Universiteit Maastricht; ERIM, Universiteit Rotterdam.

the hypothesis space H contains the target hyperplane, the hyperplane is consistent with the training data; i.e., it belongs to the version space [7, 11]. Thus, the unanimous-voting classification ... Data Set / Parameters / Cvssvm / Avssvm / Asvm / I: Heart Statlog (P, E=2.0, C=1730) 56.3%, 100%, 73.0%, 0.42; Heart-Statlog (RBF, G=0.2, C=2182) 40.7%, 100%, 73.7%, 0.24; Hepatitis (P, E=1.4, C=11.7) 80.0%, 100%, 80.0%, 0.72


Ron Kohavi and Barry G. Becker and Dan Sommerfield. Improving Simple Bayes. Data Mining and Visualization Group Silicon Graphics, Inc.

was large or artificial, indicating that a single test set would yield accurate estimates, we used a training-set/test-set split as defined in the source for the dataset (e.g., Statlog defined the splits for DNA, letter, satimage; CART defined the training size for waveform and led24) or a 2/3, 1/3 split, and ran the inducer once; otherwise, we performed 10-fold
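The evaluation protocol sketched in this excerpt (use the predefined train/test split once when the dataset ships with one, otherwise estimate accuracy by 10-fold cross-validation) can be written down compactly. A rough illustration, assuming scikit-learn, with GaussianNB as a generic stand-in for the paper's simple Bayes inducer:

    # Rough illustration of the split-vs-cross-validation protocol described above.
    # GaussianNB is a stand-in inducer, not the paper's implementation.
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.naive_bayes import GaussianNB

    def evaluate(X_train, y_train, X_test=None, y_test=None):
        clf = GaussianNB()
        if X_test is not None:
            # Dataset comes with an official split (e.g., Statlog DNA, letter, satimage):
            # train once and report test-set accuracy.
            return clf.fit(X_train, y_train).score(X_test, y_test)
        # Otherwise estimate accuracy by 10-fold cross-validation.
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
        return cross_val_score(clf, X_train, y_train, cv=cv).mean()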


