Center for Machine Learning and Intelligent Systems



Diabetes Data Set

Below are papers that cite this data set, with context shown. Papers were automatically harvested and associated with this data set, in collaboration with Rexa.info.



Prem Melville and Raymond J. Mooney. Diverse ensembles for active learning. ICML. 2004.

In particular, we used a sample size of two for the primary dataset, and three for breast-w, soybean, diabetes, vowel and credit-g. The primary aim of active learning is to reduce the amount of training data needed to induce an accurate model. To evaluate this, we


Jeroen Eggermont and Joost N. Kok and Walter A. Kosters. Genetic Programming for data classification: partitioning the search space. SAC. 2004.

The results of our refined gp algorithm using the gain ratio criterion are again worse than those of our clustering and other refined gp algorithms. The Pima Indians diabetes Data Set On the Pima Indians diabetes data set (see Table 5) the refined gp algorithms using the gain criterion are again better than those using the gain ratio criterion. If we compare the results of our


Zhi-Hua Zhou and Yuan Jiang. NeC4.5: Neural Ensemble Based C4.5. IEEE Trans. Knowl. Data Eng, 16. 2004.

ensemble. Moreover, Table III shows that the generalization ability of NeC4.5 with µ = 0% is still better than that of C4.5. In detail, pairwise two-tailed t-tests indicate that there are seven data sets (cleveland, diabetes, ionosphere, liver, sonar, waveform21, and waveform40) where NeC4.5 with µ = 0% is significantly more accurate than C4.5, while there is no significant difference on the


Zhihua Zhang and James T. Kwok and Dit-Yan Yeung. Parametric Distance Metric Learning with Label Information. IJCAI. 2003.

(Numbers in bold indicate the better results).

data set      Euclidean metric   learned metric
diabetes      459/638            480/638
soybean       37/37              37/37
wine          85/118             117/118
WBC           412/469            446/469
ionosphere    168/251            221/251
iris          107/120            110/120

without changing the clustering


Michael L. Raymer and Travis E. Doom and Leslie A. Kuhn and William F. Punch. Knowledge discovery in medical and biological datasets using a hybrid Bayes classifier/evolutionary algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 33. 2003.

with the nonlinear discriminant function and the knn classifier. In all cases the nonlinear discriminant classifier is significantly faster than the EC/knn---in the case of the Pima Indian diabetes data set the difference is nearly tenfold. B. Classification of Medical Data Two additional data sets, also selected from the UCI repository, were employed by [40, 41] in a comparative study of


Eibe Frank and Mark Hall. Visualizing Class Probability Estimators. PKDD. 2003.

(although they are not explicitly represented in the classifier). To provide a more realistic example Figure 8 shows four visualizations for pairs of attributes from the pima-indians diabetes dataset [1]. This dataset has eight attributes and 768 instances (500 belonging to class tested_negative [decision tree figure residue: splits on plas at 127 and mass at 26.4, then age; leaf tested_negative (132.0/3.0)]


Krzysztof Krawiec. Genetic Programming-based Construction of Features for Machine Learning and Knowledge Discovery Tasks. Institute of Computing Science, Poznan University of Technology. 2002.

in favor of feature construction is usually statistically relevant. Note also that positive results have been obtained for both real-world problems (Crx, Diabetes and Glass) as well as artificial datasets, which were intentionally designed to test the usefulness of feature construction methods [34]. Although the increases in accuracy of classification are not always impressive, the feature


Ilya Blayvas and Ron Kimmel. Multiresolution Approximation for Classification. CS Dept. Technion. 2002.

3 Experimental Results The proposed method was implemented in VC++ 6.0 and run on 'IBM PC 300 PL' with 600 MHz Pentium III processor and 256 MB RAM. It was tested on the Pima Indians Diabetes dataset [10], and a large artificial dataset generated with the DatGen program [11]. The results were compared to the Smooth SVM [12] and Sparse Grids [3]. 3.1 Pima Indians The Pima Indians Diabetes


Peter Sykacek and Stephen J. Roberts. Adaptive Classification by Variational Kalman Filtering. NIPS. 2002.

which were both used as training and independent test sets respectively. We also use the pima diabetes data set from [16]. Table 1 compares the generalization accuracies (in fractions) obtained with the variational Kalman filter with generalization accuracies obtained with sequential variational inference.


Kristin P. Bennett and Ayhan Demiriz and Richard Maclin. Exploiting unlabeled data in ensemble methods. KDD. 2002.

for the experiments are breast-cancer-wisconsin, pima-indians diabetes and letter-recognition drawn from the UCI Machine Learning repository [3]. The number of units in the hidden layer for the datasets was 5 for the breast-cancer and diabetes datasets and 40 in the letter-recognition dataset. The number of training epochs was set to 20 for breast-cancer, 30 for diabetes, and 30 for letter


Marina Skurichina and Ludmila Kuncheva and Robert P W Duin. Bagging and Boosting for the Nearest Mean Classifier: Effects of Sample Size on Diversity and Accuracy. Multiple Classifier Systems. 2002.

are taken from the UCI Repository [22]. They are the 8-dimensional pima diabetes data set, the 34-dimensional ionosphere data set and the 60-dimensional sonar data set. Training sets are chosen randomly and the remaining data are used for testing. All experiments are repeated 50 times on


Robert Burbidge and Matthew Trotter and Bernard F. Buxton and Sean B. Holden. STAR - Sparsity through Automated Rejection. IWANN (1). 2001.

has 270 examples in 13 dimensions. The Pima Indians diabetes data set has 768 examples in eight dimensions. These last two data sets have a high degree of overlap which leads to a dense model for the standard SVM as many training errors contribute to the solution. The


Jochen Garcke and Michael Griebel and Michael Thess. Data Mining with Sparse Grids. Computing, 67. 2001.

more than 96 % of the computation time is spent for the matrix assembly. Again, the execution times scale linearly with the number of data points. 3.3 8-dimensional problem The Pima Indians Diabetes data set from Irvine Machine Learning Database Repository consists of 768 instances with 8 features plus a class label which splits the data into 2 sets with 500 instances and 268 instances respectively, see


Peter L. Hammer and Alexander Kogan and Bruno Simeone and Sándor Szedmák. RUTCOR Research Report. Rutgers Center for Operations Research, Rutgers University. 2001.

are obtained by lexicographically

[Figure 1: Cost of Classification Inaccuracy for # = 0. Mean cost per dataset (Credit, Breast Cancer, Boston Housing, Diabetes, Heart Disease, Oil, Voting) for LAD, StrongSpanned, StrongPrime and Prime.]

[Figure 2: Cost of Classification Inaccuracy for # = 0.5. Same datasets: Credit, Breast Cancer, Boston Housing, Diabetes, Heart Disease, Oil, Voting.]


Chris Drummond and Robert C. Holte. Exploiting the Cost (In)sensitivity of Decision Tree Splitting Criteria. ICML. 2000.

for DKM but is very dependent on the ratio for accuracy. Figure 7 shows the range of points generated by the middle eight of the twelve ratios using an unpruned decision tree on the diabetes data set. The limits of this range are indicated by the numbers. The dashed line is accuracy, points are well spread out across ROC space. For DKM the spread is much narrower, consistent with a low


Mark A. Hall. Correlation-based Feature Selection for Discrete and Numeric Class Machine Learning. ICML. 2000.

in accuracy. From Figure 1 it can be seen that ReliefF selects fewer attributes with a threshold of 0.01 than with a threshold of 0, but CFS selects significantly fewer attributes than both on all data sets except diabetes [figure axes: number of features vs. dataset] Figure 1. Average number of features selected by ReliefF with threshold 0 (left), ReliefF with threshold


Endre Boros and Peter Hammer and Toshihide Ibaraki and Alexander Kogan and Eddy Mayoraz and Ilya B. Muchnik. An Implementation of Logical Analysis of Data. IEEE Trans. Knowl. Data Eng, 12. 2000.

consistent. According to [12], correct predictions rates reported in the literature about this dataset range from 84% to 95.6%. Diabetes This dataset, compiled by the National Institute of Diabetes and Digestive and Kidney Diseases, was contributed to the repository by V. Sigillito. The dataset


Simon Tong and Daphne Koller. Restricted Bayes Optimal Classifiers. AAAI/IAAI. 2000.

exact training error. We investigated whether using a non-zero value of σ would achieve a similar effect to that of the soft margin error function. We used the "Pima Indian Diabetes" UC Irvine data set (Blake, Keogh, & Merz 1998) and a synthetic data set. The Pima data set has eight features, with 576 training instances of which 198 are labeled as positive. The synthetic data were generated from


Marina Skurichina and Robert P W Duin. Boosting in Linear Discriminant Analysis. Multiple Classifier Systems. 2000.

(Data II) with 225 and 126 objects belonging to the first and the second data class, respectively. The second is the 8-dimensional diabetes data set (Data III) consisting of 500 and 268 objects from the first and the second data class, respectively. These two data sets were also used in [8], when studying bagging and boosting for decision trees.


Iñaki Inza and Pedro Larrañaga and Basilio Sierra and Ramon Etxeberria and Jose Antonio Lozano and José Manuel Peña. Representing the behaviour of supervised classification learning algorithms by Bayesian networks. Pattern Recognition Letters, 20. 1999.

treatment is done for unknown values, exploiting each algorithm its own characteristics. PEBLS and HOODG algorithms are not able to handle unknown values: thus, they are only used in the four datasets without unknown values (diabetes, heart, liver and lymphography). For each database and algorithm, a classification model is induced using the specified training set: when run with fixed default


Kai Ming Ting and Ian H. Witten. Issues in Stacked Generalization. J. Artif. Intell. Res. (JAIR), 10. 1999.

(Waveform, Soybean and Breast Cancer), and is better than arcing but worse than bagging in the Diabetes dataset. Note that stacking performs very poorly on Glass and Ionosphere, two small real-world datasets. This is not surprising, because cross-validation inevitably produces poor estimates for small


Stavros J. Perantonis and Vassilis Virvilis. Input Feature Extraction for Multilayered Perceptrons Using Supervised Principal Component Analysis. Neural Processing Letters, 10. 1999.

the basis of 6 attributes originating from blood test results and daily alcohol consumption figures. The set comprises 345 patterns with 6 features for each pattern. 3. The "Pima Indians Diabetes" data set [15]. It comprises 768 patterns taken from patients who may show signs of diabetes. Each sample is described by 8 attributes. 4. The "Sonar Targets" dataset [16]. The task is to distinguish between


Art B. Owen. Tubular neighbors for regression and classification. Stanford University. 1999.

10 runs of 13-fold cross-validation using 60 inputs and may not compare directly to the one run of 13-fold cross-validation used here on the first 10 principal components. 7.5 Diabetes data This data set is from the Irvine repository. The response variable is a determination of whether a given woman is diabetic. There are 8 predictors, [table residue: Model, k, CV-13 Acc.]


Wojciech Kwedlo and Marek Kretowski. Discovery of Decision Rules from Databases: An Evolutionary Approach. PKDD. 1998.

experimental results merely entitle us to conclude, that the classification accuracy of the current version of EDRL is comparable to that of C4.5. A real improvement was observed only for diabetes dataset. However we believe that the performance of our system can be further improved. Several directions of future research exist. Currently all the continuous-valued features are discretized globally [4]


Thomas G. Dietterich. Approximate Statistical Test For Comparing Supervised Classification Learning Algorithms. Neural Computation, 10. 1998.

measured on the 10,000 calibration examples) matched the average performance of C4.5 to within 0.1%. For the Pima Indians Diabetes data set, we drew 1000 data sets of size 300 from the 768 available examples. For each of these data sets, the remaining 468 examples were retained for calibration. Each of the 1000 data sets of size 300 was
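
The resampling scheme in this excerpt (1000 training sets of size 300, each leaving 768 - 300 = 468 examples for calibration) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `draw_trials`, the fixed seed, and the use of integer indices in place of real records are hypothetical and do not come from the cited paper.

```python
import random


def draw_trials(n_examples=768, train_size=300, n_trials=1000, seed=0):
    """Yield (train_idx, calib_idx) pairs: a random sample and its held-out remainder.

    The seed is an assumption added here for repeatability only.
    """
    rng = random.Random(seed)
    all_idx = range(n_examples)
    for _ in range(n_trials):
        # Sample without replacement, so the remainder has n_examples - train_size items.
        train = rng.sample(all_idx, train_size)
        train_set = set(train)
        calib = [i for i in all_idx if i not in train_set]
        yield train, calib


trials = list(draw_trials())
```

Each element of `trials` pairs 300 training indices with the 468 indices left over for calibration.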


Huan Liu and Rudy Setiono. Feature Transformation and Multivariate Decision Tree Induction. Discovery Science. 1998.

those of OC1's, in which two of the OC1's trees are smaller. In 9 cases, trees by BMDT are significantly different from those of CART's, in which only one of CART's trees is smaller. An example: The dataset is Pima diabetes. In Table 3, it is seen that C4.5 creates a UDT with average tree size of 122.4 nodes, BMDT builds an MDT with average tree size of 3 nodes. That means the MDT has one root and two


Prototype Selection for Composite Nearest Neighbor Classifiers. Department of Computer Science University of Massachusetts. 1997.

Fitness--Feature Selection. 4.10 Relationships between component accuracy and diversity for the Monks-2, Breast Cancer Ljubljana, Diabetes and Iris Plants data sets for the four boosting algorithms. "c" represents the Coarse Reclassification algorithm; "d", Deliberate Misclassification; "f", Composite Fitness; and "s", Composite Fitness--Feature Selection.


Jan C. Bioch and D. Meer and Rob Potharst. Bivariate Decision Trees. PKDD. 1997.

with the standard error. From these table we can conclude

name              cases   attr   classes
glass             214     9      6
diabetes (pima)   768     8      2
breast cancer     699     9      2
heart             270     13     2
wave              300     21     3

Table 1: Summary of the Datasets

method   glass      diabetes   cancer     heart      wave
BIT1     65.3±1.1   74.3±0.7   95.4±0.3   78.5±0.3   76.1±1.3
         6.2±2.1    5.2±2.5    2.8±0.2    4.1±0.5    5.0±1.6
BIT2


Kristin P. Bennett and Erin J. Bredensteiner. A Parametric Optimization Method for Machine Learning. INFORMS Journal on Computing, 9. 1997.

is available via anonymous ftp from the UCI Repository Of Machine Learning Databases [MA92]. Pima Indians Diabetes Database The Pima Diabetes dataset consists of 768 female patients who are at least 21 years of age and are of Pima Indian heritage. The 8 numeric attributes describe physical features of each patient. This dataset is also available


Jennifer A. Blue and Kristin P. Bennett. Hybrid Extreme Point Tabu Search. Department of Mathematical Sciences Rensselaer Polytechnic Institute. 1996.

are: the BUPA Liver Disease dataset (Liver); the PIMA Indians Diabetes dataset (Diabetes), the Wisconsin Breast Cancer Database (Cancer) [23], and the Cleveland Heart Disease Database (Heart) [9]. We used 5-fold cross validation. Each


Peter D. Turney. Cost-Sensitive Classification: Empirical Evaluation of a Hybrid Genetic Decision Tree Induction Algorithm. CoRR, csAI/9503102. 1995.

sum of the absolute values of the differences. The difference between two values was defined to be 1 if one or both of the two values was missing. A.4 Pima Indians Diabetes The Pima Indians Diabetes dataset was donated by Vincent Sigillito. The data were collected by the National Institute of Diabetes and Digestive and Kidney Diseases. Table 21 shows the test costs for the Pima Indians Diabetes


Rong-En Fan and P. -H Chen and C. -J Lin. Working Set Selection Using the Second Order Information for Training SVM. Department of Computer Science and Information Engineering National Taiwan University.

Data statistics are in Tables 1 and 3. Problems german.numer and australian are from the Statlog collection (Michie et al., 1994). We select space_ga and cadata from StatLib (http://lib.stat.cmu.edu/datasets). The data sets image, diabetes, covtype, breast-cancer, and abalone are from the UCI machine learning repository (Blake and Merz, 1998). Problems a1a and a9a are compiled in (Platt, 1998) from the


Alexander K. Seewald. Dissertation: Towards Understanding Stacking, Studies of a General Ensemble Learning Scheme. Submitted for the purpose of obtaining the academic degree of Doctor of Technical Sciences.

credit-g Compressed glyph visualization for dataset diabetes Compressed glyph visualization for dataset glass Compressed glyph visualization for dataset heart-c Compressed glyph visualization for dataset heart-h Compressed glyph visualization for


Lawrence O. Hall and Nitesh V. Chawla and Kevin W. Bowyer. Combining Decision Trees Learned in Parallel. Department of Computer Science and Engineering, ENB 118 University of South Florida.

The Iris data (Fisher 1936; Merz & Murphy ) which has 4 continuous valued attributes and classifies 150 examples as one of 3 classes of Iris plant. The second is the Pima Indians Diabetes data set (Merz & Murphy ) which has 8 numeric attributes and classifies 768 examples into one of 2 classes. We have done an experiment simulating a parallel 2-processor implementation for both data sets and


Ahmed Hussain Khan and Intensive Care. Multiplier-Free Feedforward Networks. 174.

forward-pass capability. It differs from the conventional model in restricting its synapses to the set {-1, 0, 1} while allowing unrestricted offsets. Simulation results on the 'onset of diabetes' data set and a handwritten numeral recognition database indicate that the new network, despite having strong constraints on its synapses, has a generalization performance similar to that of its conventional


Andrew Watkins and Jon Timmis and Lois C. Boggess. Artificial Immune Recognition System (AIRS): An Immune-Inspired Supervised Learning Algorithm. (abw5,jt6@kent.ac.uk) Computing Laboratory, University of Kent.

where classification accuracy of 98% was achieved using a k-value of 3. This seemed to bode well, and further experiments were undertaken using the Fisher Iris data set, Pima diabetes data, Ionosphere data and the Sonar data set, all obtained from the repository at the University of California at Irvine [4]. Table II shows the performance of AIRS on these data sets


Stefan Rüping. A Simple Method For Estimating Conditional Probabilities For SVMs. CS Department, AI Unit Dortmund University.

including 7 data sets from the UCI Repository [9] (covtype, diabetes, digits, digits, ionosphere, liver, mushroom, promoters) and 4 other real-world data sets: a business cycle analysis problem (business), an analysis


Adil M. Bagirov and John Yearwood. A new nonsmooth optimization algorithm for clustering. Centre for Informatics and Applied Optimization, School of Information Technology and Mathematical Sciences, University of Ballarat.

      13   768/768   274.73
1.0   13   389/768   115.44
1.5   14   283/768   113.64
2.0   13   215/768    66.59
4.0   11    94/768    18.45
6.0    5    52/768     2.31
8.0    5    38/768     2.02

The results presented in Table 1 show that for the diabetes data set we can take c ∈ [0, 4]. Further decrease of c leads to sharp changes in the cluster structure of the data set. We can see that there are differences in the number of clusters when c ∈ [0, 4]. But for


Adil M. Bagirov and Alex Rubinov and A. N. Soukhojak and John Yearwood. Unsupervised and supervised data classification via nonsmooth and global optimization. School of Information Technology and Mathematical Sciences, The University of Ballarat.

local optimization (Discrete gradient method, see Section 4). For testing the efficiency of the combination of k-means and Discrete Gradient method, we use four well-known medium-size test datasets: Australian credit dataset, Diabetes dataset, Liver disorder dataset and Vehicle dataset. The description of these datasets can be found in Appendix. We studied these datasets, using different


Rudy Setiono and Huan Liu. Neural-Network Feature Selector. Department of Information Systems and Computer Science National University of Singapore.

or republican. We selected 197 patterns for training randomly, 21 patterns for cross-validation, and 217 patterns for testing. Schlimmer [18] reported getting an accuracy rate of 90%-95% on this dataset. 3. Pima Indians Diabetes Dataset. The dataset consists of 768 samples taken from patients who may show signs of diabetes. Each sample is described by 8 attributes, 1 attribute has discrete


Charles Campbell and Nello Cristianini. Simple Learning Algorithms for Training Support Vector Machines. Dept. of Engineering Mathematics.

Service [17]. As examples of the improvements with generalisation ability which can be achieved with a soft margin we will also describe experiments with the ionosphere and Pima Indians diabetes datasets from the UCI Repository [4]. Though we have successfully used other kernels with KA we will only describe experiments using Gaussian kernels in this section. We will predominantly use the KA


Michael Lindenbaum and Shaul Markovitch and Dmitry Rusakov. Selective Sampling Using Random Field Modelling.

Among them there were three natural datasets: Pima Indians Diabetes dataset, Ionosphere dataset and Image Segmentation dataset, one synthetic dataset: Letters dataset and three artificial problems: Two-Spirals problem, Two-Gaussians problem


Prem Melville and Raymond J. Mooney. Proceedings of the 21st International Conference on Machine Learning. Department of Computer Sciences.

In particular, we used a sample size of two for the primary dataset, and three for breast-w, soybean, diabetes, vowel and credit-g. The primary aim of active learning is to reduce the amount of training data needed to induce an accurate model. To evaluate this, we


François Poulet. Cooperation between automatic algorithms, interactive algorithms and visualization tools for Visual Data Mining. ESIEA Recherche.

hyperplane and the accuracy of the algorithm. We visualize the intersection of this hyperplane with the 2D scatter plot matrices, i.e. a line in each matrix (as shown in Figure 7 with the diabetes data set, from the UCI repository). As we can see on figure 7, the resulting lines do not necessarily separate the two classes, the hyperplane does separate the data (the accuracy of the incremental SVM is


Włodzisław Duch and Rudy Setiono and Jacek M. Zurada. Computational intelligence methods for rule-based data understanding.

vector. Prototype-based rules are little known so far and computationally more costly to find, but certainly for some data, they may be simple and accurate. D. Diabetes The "Pima Indian Diabetes" dataset [118] is also frequently used as a benchmark data [89], [135], [134], [136]. All patients were females, at least 21 years old, of Pima Indian heritage. 768 cases have been collected, 500 (65.1%)


Liping Wei and Russ B. Altman. An Automated System for Generating Comparative Disease Profiles and Making Diagnoses. Section on Medical Informatics Stanford University School of Medicine, MSOB X215.

profile instead of using all attributes in the original clinical data. The results remain the same. RESULTS We evaluated the system by applying it to heart disease, diabetes and breast cancer. All data sets were obtained from the UCI Repository of Machine Learning databases and domain theories. 7 Heart Disease Four clinical data sets were used. These sets consists of patients who had been referred for


Ilya Blayvas and Ron Kimmel. Machine Learning via Multiresolution Approximation. Invited paper, Special Issue on Multiresolution Analysis.

problems with huge training sets seems to be important family of problems, it was hard to find such training sets in public databases. Our method was tested on the Pima Indians Diabetes dataset [4], a large artificial dataset generated with the DatGen program [14] and the Forest Cover Type data set. The results were compared to [3], [8], [9], [11], [12]. 3.1 Pima Indians Dataset This is an


YongSeog Kim and W. Nick Street and Filippo Menczer. Optimal Ensemble Construction via Meta-Evolutionary Ensembles. Business Information Systems, Utah State University.

and slightly better performance in the other data sets: diabetes, votes-84, and hypo. Compared to the traditional ensembles (Bagging and Boosting), MEE also shows superior performance. In comparison to Bagging, MEE demonstrates significantly better


Krzysztof Grąbczewski and Włodzisław Duch. THE SEPARABILITY OF SPLIT VALUE CRITERION. Department of Computer Methods, Nicolaus Copernicus University.

in form of logical rules.

Method     Accuracy %   Reference
Logdisc    77.7         Statlog
SSV Tree   74.8         this paper
CART       74.5         Statlog
C4.5       73.0         Statlog
Default    65.1

Table 1: Crossvalidation results for diabetes dataset

Method     Accuracy %   Reference
3-NN       96.7         Karol Grudziński (our group)
MLP+BP     96.0         Sigillito [7]
C4.5       94.9         Hamilton [8]
FSM        92.8         Rafał Adamczak (our group) [9]
SSV Tree   92.0         this paper
DB-CART    91.3


Ilya Blayvas and Ron Kimmel. Efficient Classification via Multiresolution Training Set Approximation. CS Dept. Technion.

3 Experimental Results The proposed method was implemented in VC++ 6.0 and run on 'IBM PC 300 PL' with 600 MHz Pentium III processor and 256 MB RAM. It was tested on the Pima Indians Diabetes dataset [10], and a large artificial dataset generated with the DatGen program [11]. The results were compared to the Smooth SVM [12] and Sparse Grids [3]. Figure 7: Partition of 2D feature space for a


Hussein A. Abbass. Pareto Neuro-Evolution: Constructing Ensemble of Neural Networks Using Multi-objective Optimization. Artificial Life and Adaptive Robotics (A.L.A.R.) Lab, School of Information Technology and Electrical Engineering, Australian Defence Force Academy.

size 25, the learning rate for BP 0.003, the number of hidden units is set to 5, and the number of epochs BP was applied to an individual is set to 5 for each subset in case of Prob1. The diabetes dataset has 768 patterns; 500 belonging to the first class and 268 to the second. It contains 8 attributes. The classification problem is difficult as the class value is a binarized form of another


Matthias Scherf and W. Brauer. Feature Selection by Means of a Feature Weighting Approach. GSF - National Research Center for Environment and Health.

of feature f2g. The application of the RBF-DDA led to an approximately equal result, i.e. 97% classification accuracy with 152 RBF nodes. Pima Indians Diabetes Data Base The Pima Indians Diabetes data set contains 768 instances with 8 real valued features. The underlying task is to decide, whether an at least 21 year old female of Pima Indian heritage shows signs of diabetes according to World Health


Lena Kallin. Receiver operating characteristic (ROC) analysis: Evaluating discriminance effects among decision support systems.

function, and age (years). Our data consists of 375 non-Diabetes and 201 Diabetes cases used in the learning phase, and, respectively, 125 non-Diabetes and 67 Diabetes cases in the testing phase. A data set where all missing data are set to 0.5 will be used, see [Eklund and Kallin Westin, 2002] for details about the data set and its missing data. The first data set is special in the sense that the test
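
As a quick arithmetic check on the split quoted in this excerpt, the counts can be added up as in the sketch below. The dictionary keys and variable names are illustrative assumptions. The majority-class figure is computed here from the quoted counts and is not a result reported in the excerpt itself.

```python
# Counts quoted in the excerpt above (learning phase and testing phase).
train = {"non_diabetes": 375, "diabetes": 201}
test = {"non_diabetes": 125, "diabetes": 67}

n_train = sum(train.values())
n_test = sum(test.values())

# Per-class totals across both phases.
class_totals = {label: train[label] + test[label] for label in train}

# Accuracy of always predicting the larger class, as a fraction of all cases.
majority_baseline = max(class_totals.values()) / (n_train + n_test)

print(n_train, n_test, class_totals, round(100 * majority_baseline, 1))
```

The totals come to 576 learning cases and 192 testing cases, 768 in all, split 500 to 268 by class, which agrees with the instance counts quoted in several other excerpts on this page.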

