Statlog (Shuttle) Data Set
Below are papers that cite this data set, with context shown.
Papers were automatically harvested and associated with this data set, in collaboration with Rexa.info.
Ira Cohen and Fabio Gagliardi Cozman and Nicu Sebe and Marcelo Cesar Cirelo and Thomas S. Huang. Semisupervised Learning of Classifiers: Theory, Algorithms, and Their Application to Human-Computer Interaction. IEEE Trans. Pattern Anal. Mach. Intell., 26. 2004.
EMTAN can sometimes improve performance over TAN with just labeled data (Shuttle). With the Chess dataset, discarding the unlabeled data and using only TAN seems the best approach. We have compared two likelihood based structure learning methods (K2 and MCMC) on the same datasets as well [34], showing
Jun Wang and Bei Yu and Les Gasser. Concept Tree Based Clustering Visualization with Shaded Similarity Matrices. ICDM. 2002.
similarity. has a scalability limitation. One solution is to use sampling and ensemble approaches. Using small sample sizes such as 100 or 200, we have tested the sampling approach on some Statlog datasets, including the Shuttle dataset, which contains 43,500 instances [6]. The results are promising. 6. Summary This paper proposes a new approach for getting better interpretations for clustering
Richard Nock. Inducing Interpretable Voting Classifiers without Trading Accuracy for Simplicity: Theoretical Results, Approximation Algorithms, and Experiments. J. Artif. Intell. Res. (JAIR), 17. 2002.
on which we ran C4.5, WIDC(p) finds smaller formulas, and still beats C4.5's accuracy on 9 of them. A quantitative comparison of l DC against the number of nodes of the DTs shows that on 4 datasets out of the 13 (Pole, Shuttle, TicTacToe, Australian), the DCs are more than 6 times smaller, while they only incur a loss in accuracy for 2 of them, and limited to 1.8%. For this latter problem
Grigorios Tsoumakas and Ioannis P. Vlahavas. Effective Stacking of Distributed Classifiers. ECAI. 2002.
the complexity of the learning problem at the combination phase is related to the product C × L, where L is the size of the meta-level training set. The variation present in the Letter and Shuttle data sets is due to the fact that the calculation of averages also involves time. This however is very small compared to the time needed for global learning.
Stephen D. Bay. Multivariate Discretization for Set Mining. Knowl. Inf. Syst., 3. 2001.
data (i.e. it is a satellite image). It contains multispectral values for a 3×3 pixel neighborhood and the soil type (e.g. red soil, cotton crop, grey soil, etc.). Shuttle: This is a classification dataset that deals with the positioning of radiators in the Space Shuttle. UCI Admissions Data: This dataset represents all undergraduate student applications to UCI for the years 1993-1999. There are
Jochen Garcke and Michael Griebel and Michael Thess. Data Mining with Sparse Grids. Computing, 67. 2001.
for B l is sufficiently limited. The operations of the matrices C l and G l on the vectors are then computed on the fly when needed in the conjugate gradient iteration. 3.4.1 Shuttle Data The shuttle data set comes from the StatLog Project [52]. It consists of 43 500 observations in the training set and 14 500 data in the testing set and has 9 attributes and 7 classes in the original version. To
Haixun Wang and Carlo Zaniolo. CMP: A Fast Decision Tree Classifier Using Multivariate Predictions. ICDE. 2000.
might fall. Figure 2 shows a hypothetical gini curve, and three alive intervals (shaded areas in the figure). Our experiments of the estimation method are summarized in Table 1. The first four small datasets (Letter, Satimage, Segment and Shuttle in the table) are from the STATLOG project [6], and the two large datasets (Function 2 and Function 7) are synthetic datasets described in [5]. In these test
Khaled A. Alsabti and Sanjay Ranka and Vineet Singh. CLOUDS: A Decision Tree Classifier for Large Datasets. KDD. 1998.
lower than most of the other points along the splitting attribute as well as other attributes. Figure 1 gives the value of the gini index along each of the nine numeric attributes of the Shuttle dataset. We show that the above properties can be used to develop an I/O and computationally efficient method for estimating the split at every internal node. Experimental results on real and synthetic
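The gini curve mentioned in this excerpt can be illustrated with a minimal sketch. It evaluates the weighted gini index at every candidate threshold of one numeric attribute by exhaustive scanning; this is not the CLOUDS estimation method itself, which the excerpt describes only in outline. The function names and the toy data are hypothetical.

```python
from collections import Counter


def gini(counts, total):
    """Gini impurity of a class-count table holding `total` examples."""
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def gini_along_attribute(values, labels):
    """Return (threshold, weighted gini) for each split between distinct values."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    left = Counter()                                # classes at or below the split
    right = Counter(label for _, label in pairs)    # classes above the split
    curve = []
    for i in range(n - 1):
        value, label = pairs[i]
        left[label] += 1
        right[label] -= 1
        next_value = pairs[i + 1][0]
        if next_value == value:
            continue  # no threshold can separate equal attribute values
        n_left = i + 1
        n_right = n - n_left
        weighted = (n_left / n) * gini(left, n_left) + (n_right / n) * gini(right, n_right)
        curve.append(((value + next_value) / 2, weighted))
    return curve


# Toy data (hypothetical): one numeric attribute and two classes.
curve = gini_along_attribute([1, 2, 3, 4], ["a", "a", "b", "b"])
best_threshold, best_gini = min(curve, key=lambda t: t[1])
```

The lowest point of the returned curve is the split an exhaustive decision-tree learner would choose for that attribute.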
Nir Friedman and Moisés Goldszmidt. Discretizing Continuous Attributes While Learning Bayesian Networks. ICML. 1996.
from the Irvine repository [15]. We estimated the accuracy of the learned classifiers using 5-fold cross-validation, except for the "shuttle-small" and "waveform-21" datasets where we used the hold-out method. We report the mean of the prediction accuracies over all cross-validation folds. We also report the standard deviation of the accuracies found in each fold. These
Ron Kohavi. Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid. KDD. 1996.
the accuracy of medical diagnosis from 98% to 99% may cut costs by half because the number of errors is halved. Figure 5 shows the ratio of errors (where error is 100% - accuracy). The shuttle dataset, which is the largest dataset tested, has only 0.04% absolute difference between NBTree and C4.5, but the error decreases from 0.05% to 0.01%, which is a huge relative improvement. The number of
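The arithmetic behind this excerpt can be checked with a short sketch. It uses only the error rates quoted above (0.05% and 0.01%) and the medical-diagnosis example (98% to 99% accuracy); the function names are hypothetical.

```python
def error_from_accuracy(accuracy_pct):
    """Error in percent, defined as 100% minus accuracy."""
    return 100.0 - accuracy_pct


def relative_error_reduction(old_error_pct, new_error_pct):
    """Fraction of the old error that is removed."""
    return (old_error_pct - new_error_pct) / old_error_pct


# Shuttle figures quoted in the excerpt.
old_error, new_error = 0.05, 0.01
absolute_difference = old_error - new_error                          # about 0.04 percentage points
error_ratio = new_error / old_error                                  # about 0.2
shuttle_reduction = relative_error_reduction(old_error, new_error)   # about 0.8

# Medical-diagnosis example: 98% -> 99% accuracy halves the error.
diagnosis_reduction = relative_error_reduction(
    error_from_accuracy(98.0), error_from_accuracy(99.0)
)
```

A 0.04-point absolute gain therefore corresponds to removing roughly four fifths of the errors, which is why the excerpt calls the relative improvement huge.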
Pedro Domingos. Linear-Time Rule Induction. KDD. 1996.
level of C4.5RULES's. In this paper we present CWS, a new algorithm with guaranteed O(e) complexity, and verify that it outperforms C4.5RULES and CN2 in time, accuracy and output size on two large datasets. For example, on NASA's space shuttle database, running time is reduced from over a month (for C4.5RULES) to a few hours, with a slight gain in accuracy. CWS is based on interleaving the induction
Ron Kohavi. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. IJCAI. 1995.
vehicle, the generalization accuracy of the Naive-Bayes algorithm deteriorated by more than 4% as more instances were given. A similar phenomenon was observed on the shuttle dataset. Such a phenomenon was predicted by Schaffer and Wolpert (Schaffer 1994, Wolpert 1994b), but we were surprised that it was observed on two real-world datasets. To see how well an accuracy estimation
Ron Kohavi. The Power of Decision Tables. ECML. 1995.
with continuous features, we chose the rest of the StatLog datasets except shuttle which was too big, and all the datasets used by Holte (1993). 4.1 Methodology We now define the exact settings used in the algorithms. The estimated accuracy for each node was
Adil M. Bagirov and Julien Ugon. An algorithm for computation of piecewise linear function separating two sets. CIAO, School of Information Technology and Mathematical Sciences, The University of Ballarat.
accuracy (a mc in Tables 2 and 3) as described above. First accuracy is an indication of separation quality and the second one is an indication of multiclass classification quality. 5.2 Datasets The datasets used are the Shuttle control, the Letter recognition, the Landsat satellite image, the Pen-based recognition of handwritten and the Page blocks classification databases. Table 1
Ron Kohavi and George H. John. Automatic Parameter Selection by Minimizing Estimated Error. Computer Science Dept. Stanford University.
represent all of the available StatLog datasets except the Shuttle database (which was too large), all of the UCI datasets used by Holte (1993), all of the Monks datasets (Thrun et al. 1991), and Corral which is an artificial dataset presented
Włodzisław Duch and Rudy Setiono and Jacek M. Zurada. Computational intelligence methods for rule-based data understanding.
Indeed, visual inspection of the data using multidimensional scaling would show many paired vectors, including 26 identical pairs of vectors in the training data. G. NASA Shuttle The Shuttle dataset from NASA contains nine continuous numerical attributes related to the positions of radiators in the Space Shuttle [118]. There are 43500 training vectors and 14500 test vectors, divided into seven
Chris Giannella and Bassem Sayrafi. An Information Theoretic Histogram for Single Dimensional Selectivity Estimation. Department of Computer Science, Indiana University Bloomington.
from the "Esprit Project 5170 StatLog" archive (shuttle heading): www.liacc.up.pt/ML/. It represents data concerning the operation of the NASA space shuttle. We use attribute two. The remaining datasets were obtained from the UCI KDD archive [7]. The forestcov4 and forestcov9 datasets were found under the "Forest CoverType" heading, covtype.data file, attributes four and nine, respectively. The
Christophe Giraud-Carrier and Tony Martinez. A DYNAMIC INCREMENTAL NETWORK THAT LEARNS BY DISCRIMINATION. AA.
non-Boolean inputs have their values translated into their equivalent binary form. Only one form of self-deletion, namely, complete discriminant deletion [10], is implemented in the simulations. The dataset shuttle-exp consists of the complete set of 278 instances resulting from expanding the 15 rules of the shuttle-landing-control (shuttle-lc) dataset. 6 Reported results for hepatitis and shuttle-exp
Włodzisław Duch and Rafał Adamczak and Krzysztof Grąbczewski. Optimization of Logical Rules Derived by Neural Procedures. Department of Computer Methods, Nicholas Copernicus University.
77.2 4/4 78.0 Diabetes 2/2 75.0 Hepatitis 3/5 88.4 Heart (Cleveland) 4/3 85.5 Hypothyroid 3/5 99.4 Iris 3/1 95.3 3/2 98.0 Mushrooms 2/1 98.5 3/2 99.4 4/4 99.9 5/6 100.0 A. NASA Shuttle The Shuttle dataset from NASA contains 9 continuous numerical attributes related to the positions of radiators in the Space Shuttle. There are 43500 training vectors and 14500 test vectors, divided into 7 classes in a
Chih-Wei Hsu and Chih-Jen Lin. A Comparison of Methods for Multiclass Support Vector Machines. Department of Computer Science and Information Engineering, National Taiwan University.
, 2^3, 2^2, ..., 2^-10] and C = [2^12, 2^11, 2^10, ..., 2^-2]. Therefore, for each problem we try 15 × 15 = 225 combinations. We use two criteria to estimate the generalized accuracy. For datasets dna, satimage, letter, and shuttle where both training and testing sets are available, for each pair of (C, γ), the validation performance is measured by training on 70% of the training set and
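The parameter grid described in this excerpt can be sketched as follows. The excerpt is truncated, so two things here are assumptions: that the kernel-parameter list starts at 2^4 (which is what 15 values ending at 2^-10 implies), and that the 30% of the training set not used for fitting serves as the validation part. The names are hypothetical and no SVM library is called.

```python
import itertools
import random

# Assumed exponent ranges: 15 values each, as implied by "15 x 15 = 225".
gammas = [2.0 ** k for k in range(4, -11, -1)]   # 2^4, 2^3, ..., 2^-10 (start is an assumption)
costs = [2.0 ** k for k in range(12, -3, -1)]    # 2^12, 2^11, ..., 2^-2

# Every (C, gamma) pair that would be tried for one problem.
grid = list(itertools.product(costs, gammas))


def split_70_30(n_rows, seed=0):
    """Shuffle row indices and return (fit_indices, validation_indices)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = (7 * n_rows) // 10  # integer arithmetic avoids float rounding
    return indices[:cut], indices[cut:]


# Example with the Shuttle training-set size quoted elsewhere on this page.
fit_idx, val_idx = split_70_30(43500)
```

Each pair in `grid` would be scored on the validation indices, and the best-scoring pair would then be used to train on the full training set.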
Jeffrey P. Bradford and Clayton Kunz and Ron Kohavi and Clifford Brunk and Carla Brodley. Pruning Decision Trees with Misclassification Costs. Appears in ECML-98 as a research note. School of Electrical Engineering.
classification based on census bureau data), breast cancer diagnosis, chess, crx (credit), german (credit), pima diabetes, road (dirt), satellite images, shuttle and vehicle. In choosing the datasets, we decided on the following desiderata: 1. Datasets should be two-class to make the evaluation easier. This desideratum was hard to satisfy and we resorted to converting several multiclass
Jun Wang and Bei Yu and Les Gasser. Classification Visualization with Shaded Similarity Matrix. Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign.
explored in the future. The purpose of this section is to see if it is effective to use simple random sampling with very small sample size. To this end, we test the ensemble classifier on 5 Statlog data sets: Satimage, Segment, Shuttle, Australian, and DNA. For data description, please see Table 3. The reason to use these 5 Statlog data sets is that Ankerst used them as benchmarks in his PBC system
Krzysztof Grąbczewski and Włodzisław Duch. THE SEPARABILITY OF SPLIT VALUE CRITERION. Department of Computer Methods, Nicolaus Copernicus University.
[6]; kNN: -, 99.56 [6]; RBF: 98.40, 98.60 [6]; MLP+BP: 95.50, 96.57 [6]; Logistic discrimination: 96.07, 96.17 [6]; Linear discrimination: 95.02, 95.17 [6]. Table 3: Comparison of results for the NASA Shuttle dataset. Results for different systems are compared in the table ??. SSV results are much better than those obtained from the MLP or RBF networks (as reported in the StatLog project [?]) and comparable with
Mohammed Waleed Kadous and Claude Sammut. Temporal Classification: Extending the Classification Paradigm to Multivariate Time Series. School of Computer Science and Engineering, The University of New South Wales.
analysis technique (i.e. a technique that allows the system to cope with the problem that patterns occur at different temporal scales) and applies them to space shuttle data as well as an artificial dataset. Mannila et al [MTV95] have also been looking at temporal classification problems, in particular applying it to network traffic analysis. In their model, streams are a sequence of time-labelled
