Center for Machine Learning and Intelligent Systems

Pima Indians Diabetes Data Set

Below are papers that cite this data set, with context shown. Papers were automatically harvested and associated with this data set, in collaboration with Rexa.info.



Jeroen Eggermont and Joost N. Kok and Walter A. Kosters. Genetic Programming for data classification: partitioning the search space. SAC. 2004.

The results of our refined gp algorithm using the gain ratio criterion are again worse than those of our clustering and other refined gp algorithms. The Pima Indians diabetes Data Set On the Pima Indians diabetes data set (see Table 5) the refined gp algorithms using the gain criterion are again better than those using the gain ratio criterion. If we compare the results of our


Eibe Frank and Mark Hall. Visualizing Class Probability Estimators. PKDD. 2003.

(although they are not explicitly represented in the classifier). To provide a more realistic example, Figure 8 shows four visualizations for pairs of attributes from the pima indians diabetes dataset [1]. This dataset has eight attributes and 768 instances (500 belonging to class tested_negative [decision tree figure: tests on plas (threshold 127), mass (threshold 26.4) and age, with tested_negative leaves, one labeled (132.0/3.0)]


Michael L. Raymer and Travis E. Doom and Leslie A. Kuhn and William F. Punch. Knowledge discovery in medical and biological datasets using a hybrid Bayes classifier/evolutionary algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 33. 2003.

with the nonlinear discriminant function and the knn classifier. In all cases the nonlinear discriminant classifier is significantly faster than the EC/knn; in the case of the Pima Indian diabetes data set the difference is nearly tenfold. B. Classification of Medical Data Two additional data sets, also selected from the UCI repository, were employed by [40, 41] in a comparative study of


Marina Skurichina and Ludmila Kuncheva and Robert P W Duin. Bagging and Boosting for the Nearest Mean Classifier: Effects of Sample Size on Diversity and Accuracy. Multiple Classifier Systems. 2002.

are taken from the UCI Repository [22]. They are the 8-dimensional pima diabetes data set, the 34-dimensional ionosphere data set and the 60-dimensional sonar data set. Training sets are chosen randomly and the remaining data are used for testing. All experiments are repeated 50 times on


Ilya Blayvas and Ron Kimmel. Multiresolution Approximation for Classification. CS Dept. Technion. 2002.

D). 3 Experimental Results The proposed method was implemented in VC++ 6.0 and run on `IBM PC 300 PL' with 600MHZ Pentium III processor and 256MB RAM. It was tested on the Pima Indians Diabetes dataset [10], and a large artificial dataset generated with the DatGen program [11]. The results were compared to the Smooth SVM [12] and Sparse Grids [3]. 3.1 Pima Indians The Pima Indians Diabetes


Tao Jiang and Art B. Owen. Quasi-regression for visualization and interpretation of black box functions. Department of Statistics Stanford University. 2002.

is a determination of whether a given woman is diabetic. There are 7 predictors, including medical measurements and personal history. All of the women are Pima Indians. We used the version of this data set found in Ripley (1996). There are [number illegible] complete cases for training and [number illegible] for a test set. The number of pregnancies was replaced by [illegible transform of the] number of pregnancies. Then it and the other


Peter Sykacek and Stephen J. Roberts. Adaptive Classification by Variational Kalman Filtering. NIPS. 2002.

which were both used as training and independent test sets respectively. We also use the pima diabetes data set from [16] 3 . Table 1 compares the generalization accuracies (in fractions) obtained with the variational Kalman filter with generalization accuracies obtained with sequential variational inference.


Jochen Garcke and Michael Griebel and Michael Thess. Data Mining with Sparse Grids. Computing, 67. 2001.

more than 96 % of the computation time is spent for the matrix assembly. Again, the execution times scale linearly with the number of data points. 3.3 8-dimensional problem The Pima Indians Diabetes data set from Irvine Machine Learning Database Repository consists of 768 instances with 8 features plus a class label which splits the data into 2 sets with 500 instances and 268 instances respectively, see


Robert Burbidge and Matthew Trotter and Bernard F. Buxton and Sean B. Holden. STAR - Sparsity through Automated Rejection. IWANN (1). 2001.

has 270 examples in 13 dimensions. The Pima Indians diabetes data set has 768 examples in eight dimensions. These last two data sets have a high degree of overlap which leads to a dense model for the standard SVM as many training errors contribute to the solution. The


Simon Tong and Daphne Koller. Restricted Bayes Optimal Classifiers. AAAI/IAAI. 2000.

exact training error. We investigated whether using a non-zero value of σ would achieve a similar effect to that of the soft margin error function. 1 We used the Pima Indian Diabetes UC Irvine data set (Blake, Keogh, & Merz 1998) and a synthetic data set. The Pima data set has eight features, with 576 training instances of which 198 are labeled as positive. The synthetic data were generated from


Stavros J. Perantonis and Vassilis Virvilis. Input Feature Extraction for Multilayered Perceptrons Using Supervised Principal Component Analysis. Neural Processing Letters, 10. 1999.

the basis of 6 attributes originating from blood test results and daily alcohol consumption figures. The set comprises 345 patterns with 6 features for each pattern. 3. the Pima Indians Diabetes data set [15]. It comprises 768 patterns taken from patients who may show signs of diabetes. Each sample is described by 8 attributes. 4. The "Sonar Targets" dataset [16]. The task is to distinguish between


Huan Liu and Rudy Setiono. Feature Transformation and Multivariate Decision Tree Induction. Discovery Science. 1998.

those of OC1's, in which two of the OC1's trees are smaller. In 9 cases, trees by BMDT are significantly different from those of CART's, in which only one of CART's trees is smaller. An example: The dataset is Pima diabetes In Table 3, it is seen that C4.5 creates a UDT with average tree size of 122.4 nodes, BMDT builds an MDT with average tree size of 3 nodes. That means the MDT has one root and two


Thomas G. Dietterich. Approximate Statistical Test For Comparing Supervised Classification Learning Algorithms. Neural Computation, 10. 1998.

measured on the 10,000 calibration examples) matched the average performance of C4.5 to within 0.1%. For the Pima Indians Diabetes data set, we drew 1000 data sets of size 300 from the 768 available examples. For each of these data sets, the remaining 468 examples were retained for calibration. Each of the 1000 data sets of size 300 was
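
The resampling described in this excerpt can be sketched as follows. This is a minimal illustration, not the paper's code: the function name, the seed parameter and the use of index lists are assumptions, and sampling is assumed to be without replacement, since the remaining 468 examples are kept for calibration.

```python
import random


def draw_training_and_calibration(n_total=768, n_train=300, seed=None):
    """Split example indices into a random training set and a calibration set.

    Hypothetical helper: draws n_train distinct indices for training and
    keeps every other index for calibration.
    """
    rng = random.Random(seed)
    indices = list(range(n_total))
    train = rng.sample(indices, n_train)  # sampling without replacement
    train_lookup = set(train)
    calibration = [i for i in indices if i not in train_lookup]
    return train, calibration


# Repeating the draw gives the collection of data sets described in the
# excerpt (1000 there; 3 here to keep the example small).
draws = [draw_training_and_calibration(seed=s) for s in range(3)]
```

Each draw yields 300 training indices and 468 calibration indices that do not overlap.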


Kristin P. Bennett and Erin J. Bredensteiner. A Parametric Optimization Method for Machine Learning. INFORMS Journal on Computing, 9. 1997.

is available via anonymous ftp from the UCI Repository Of Machine Learning Databases [MA92]. Pima Indians Diabetes Database The Pima Diabetes dataset consists of 768 female patients who are at least 21 years of age and are of Pima Indian heritage. The 8 numeric attributes describe physical features of each patient. This dataset is also available


Jennifer A. Blue and Kristin P. Bennett. Hybrid Extreme Point Tabu Search. Department of Mathematical Sciences Rensselaer Polytechnic Institute. 1996.

are: the BUPA Liver Disease dataset (Liver); the PIMA Indians Diabetes dataset (Diabetes), the Wisconsin Breast Cancer Database (Cancer) [23], and the Cleveland Heart Disease Database (Heart) [9]. We used 5-fold cross validation. Each


Peter D. Turney. Cost-Sensitive Classification: Empirical Evaluation of a Hybrid Genetic Decision Tree Induction Algorithm. CoRR, csAI/9503102. 1995.

sum of the absolute values of the differences. The difference between two values was defined to be 1 if one or both of the two values was missing. A.4 Pima Indians Diabetes The Pima Indians Diabetes dataset was donated by Vincent Sigillito. 22 The data were collected by the National Institute of Diabetes and Digestive and Kidney Diseases. Table 21 shows the test costs for the Pima Indians Diabetes
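
The missing-value rule quoted at the start of this excerpt can be illustrated with a short sketch. The function name, the use of None to mark a missing value and the assumption that attributes are already on comparable scales are illustrative choices, not details taken from the paper.

```python
def attribute_distance(case_a, case_b):
    """Sum of absolute attribute differences between two cases.

    Hypothetical helper: a pair contributes 1 when one or both values are
    missing (represented here as None), otherwise the absolute difference.
    """
    total = 0.0
    for u, v in zip(case_a, case_b):
        if u is None or v is None:
            total += 1.0  # rule from the excerpt: missing counts as 1
        else:
            total += abs(u - v)
    return total
```

For example, attribute_distance([0.25], [0.75]) gives 0.5, while any pair involving a missing value adds exactly 1 to the total.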


Ilya Blayvas and Ron Kimmel. Efficient Classification via Multiresolution Training Set Approximation. CS Dept. Technion.

D). 3 Experimental Results The proposed method was implemented in VC++ 6.0 and run on `IBM PC 300 PL' with 600MHZ Pentium III processor and 256MB RAM. It was tested on the Pima Indians Diabetes dataset [10], and a large artificial dataset generated with the DatGen program [11]. The results were compared to the Smooth SVM [12] and Sparse Grids [3]. Figure 7: Partition of 2D feature space for a


Matthias Scherf and W. Brauer. Feature Selection by Means of a Feature Weighting Approach. GSF - National Research Center for Environment and Health.

of feature {2}. The application of the RBF-DDA led to an approximately equal result, i.e. 97% classification accuracy with 152 RBF nodes. Pima Indians Diabetes Data Base The Pima Indians Diabetes data set contains 768 instances with 8 real-valued features. The underlying task is to decide whether an at least 21 year old female of Pima Indian heritage shows signs of diabetes according to World Health


Rudy Setiono and Huan Liu. Neural-Network Feature Selector. Department of Information Systems and Computer Science National University of Singapore.

or republican. We selected 197 patterns for training randomly, 21 patterns for cross-validation, and 217 patterns for testing. Schlimmer [18] reported getting an accuracy rate of 90%-95% on this dataset. 3. Pima Indians Diabetes Dataset. The dataset consists of 768 samples taken from patients who may show signs of diabetes. 15 Each sample is described by 8 attributes, 1 attribute has discrete


Christopher P. Diehl and Gert Cauwenberghs. SVM Incremental Learning, Adaptation and Optimization. Applied Physics Laboratory Johns Hopkins University.

that h = 0 when the final perturbation is complete. IV. EXPERIMENTAL RESULTS In order to assess the benefits offered by the incremental framework, we conducted two experiments using the Pima Indians dataset from the UCI machine learning repository [1]. Using an RBF kernel $K(x, y) = \exp\left(-\|x - y\|^2 / \sigma^2\right)$, we first fixed the kernel width and varied (increased or decreased) the regularization parameter
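
An RBF kernel of the kind mentioned in this excerpt can be sketched in a few lines. This assumes the common Gaussian form exp(-||x - y||^2 / sigma^2); the exact scaling of the width used in the paper, the function name and the default sigma are assumptions made for illustration.

```python
import math


def rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel between two equal-length numeric vectors.

    Assumed form: exp(-||x - y||^2 / sigma^2), where sigma is the kernel
    width that the excerpt describes as being held fixed.
    """
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / sigma ** 2)
```

The kernel equals 1 for identical inputs and decays toward 0 as the inputs move apart; a larger sigma slows that decay.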


Włodzisław Duch and Rudy Setiono and Jacek M. Zurada. Computational intelligence methods for rule-based data understanding.

vector. Prototype-based rules are little known so far and computationally more costly to find, but certainly for some data, they may be simple and accurate. D. Diabetes The "Pima Indian Diabetes" dataset [118] is also frequently used as benchmark data [89], [135], [134], [136]. All patients were females, at least 21 years old, of Pima Indian heritage. 768 cases have been collected, 500 (65.1%)


Michalis K. Titsias and Aristidis Likas. Shared Kernel Models for Class Conditional Density Estimation.

and two from the UCI repository [13] Pima Indians and Ionosphere data sets). To assess the performance of the models for each problem we have selected the five-fold cross-validation method. For each problem the original set was divided into five independent parts
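
The five-fold cross-validation mentioned in this excerpt can be sketched as follows. The shuffling step, the seed and the way leftover examples are spread across folds are assumptions; the excerpt only states that the original set was divided into five independent parts.

```python
import random


def five_fold_splits(n_examples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Hypothetical helper: indices are shuffled once, dealt into n_folds
    disjoint parts, and each part serves as the test set exactly once.
    """
    rng = random.Random(seed)
    indices = list(range(n_examples))
    rng.shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test
```

With 768 examples, every example appears in exactly one test fold and in the training set of the other four rounds.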


Lawrence O. Hall and Nitesh V. Chawla and Kevin W. Bowyer. Combining Decision Trees Learned in Parallel. Department of Computer Science and Engineering, ENB 118 University of South Florida.

The Iris data (Fisher 1936; Merz & Murphy ) which has 4 continuous valued attributes and classifies 150 examples as one of 3 classes of Iris plant. The second is the Pima Indians Diabetes data set (Merz & Murphy ) which has 8 numeric attributes and classifies 768 examples into one of 2 classes. We have done an experiment simulating a parallel 2-processor implementation for both data sets and


Charles Campbell and Nello Cristianini. Simple Learning Algorithms for Training Support Vector Machines. Dept. of Engineering Mathematics.

Service [17]. As examples of the improvements with generalisation ability which can be achieved with a soft margin we will also describe experiments with the ionosphere and Pima Indians diabetes datasets from the UCI Repository [4]. Though we have successfully used other kernels with KA we will only describe experiments using Gaussian kernels in this section. We will predominantly use the KA


Liping Wei and Russ B. Altman. An Automated System for Generating Comparative Disease Profiles and Making Diagnoses. Section on Medical Informatics Stanford University School of Medicine, MSOB X215.

diagnosis. This finding is consistent with the study by Ohno-Machado. 9 We realize that the difference may not be statistically significant and that further studies are needed. diabetes The diabetes data set consisted of 768 female patients of Pima Indian heritage who were at least 21 years old. Eight attributes were collected for each patient. A class variable was also documented, 1 as "having


Chotirat Ann and Dimitrios Gunopulos. Scaling up the Naive Bayesian Classifier: Using Decision Trees for Feature Selection. Computer Science Department University of California.

instances, 22 attributes, 2 classes. Attributes selected by SBC = 6. [Plot: Pima Indians Diabetes, Accuracy (%) versus Training Data (%) for NBC, SBC and C4.5] Figure 7. Pima-Indians dataset. 768 instances, 8 attributes, 2 classes. Attributes selected by SBC = 5. [Plot: Promoter Gene Sequences, Accuracy (%) versus Training Data (%) for NBC, SBC and C4.5]


Federico Divina and Elena Marchiori. Knowledge-Based Evolutionary Search for Inductive Concept Learning. Vrije Universiteit of Amsterdam.

training examples. It can be seen that in most of the cases, the EWUS selection operator leads to a population characterized by a higher diversity. Only in two cases (the breast and the pima indians datasets) the population evolved with the use of the standard US operators has a higher diversity. However, also in these two cases, the diversity of the two populations are comparable. The WUS selection


Michael Lindenbaum and Shaul Markovitch and Dmitry Rusakov. Selective Sampling Using Random Field Modelling.

Among them there were three natural datasets: Pima Indians Diabetes dataset, Ionosphere dataset and Image Segmentation dataset, one synthetic dataset: Letters dataset and three artificial problems: Two-Spirals problem, Two-Gaussians problem


Federico Divina and Elena Marchiori. Handling Continuous Attributes in an Evolutionary Inductive Learner. Department of Computer Science Vrije Universiteit.

by the results of the t-test, summarized in Table 7. Using 1% confidence level we get that ECL-LSDc is never outperformed, while it is significantly better than the other methods on the Pima Indians dataset, better than ECL-GSD on the Breast dataset, and better than ECL-LUD on the Ionosphere dataset, together with ECL-LSDf and ECL-GSD. If we increase the confidence level to 5% then we get that ECL-LUD


Ilya Blayvas and Ron Kimmel. INVITED PAPER Special Issue on Multiresolution Analysis Machine Learning via Multiresolution Approximation.

problems with huge training sets seems to be important family of problems, it was hard to find such training sets in public databases. Our method was tested on the Pima Indians Diabetes dataset [4], a large artificial dataset generated with the DatGen program [14] and the Forest Cover Type data set. The results were compared to [3], [8], [9], [11], [12]. 3.1 Pima Indians Dataset This is an


Andrew Watkins and Jon Timmis and Lois C. Boggess. Artificial Immune Recognition System (AIRS): An Immune-Inspired Supervised Learning Algorithm. (abw5,jt6@kent.ac.uk) Computing Laboratory, University of Kent.

where classification accuracy of 98% was achieved using a k-value of 3. This seemed to bode well, and further experiments were undertaken using the Fisher Iris data set, Pima diabetes data, Ionosphere data and the Sonar data set, all obtained from the repository at the University of California at Irvine [4]. Table II shows the performance of AIRS on these data sets

