Breast Cancer Data Set

Below are papers that cite this data set, with context shown. Papers were automatically harvested and associated with this data set, in collaboration with Rexa.info.

Return to Breast Cancer data set page.


Igor Fischer and Jan Poland. Amplifying the Block Matrix Structure for Spectral Clustering. Telecommunications Lab. 2005.

are common benchmark sets with real-world data (Murphy & Aha, 1994): the iris, the wine and the breast cancer data set. Both our methods perform very well on iris and breast cancer. However, the wine data set is too sparse for the context-dependent method: only 178 points in 13 dimensions, giving the conductivity too


Kaizhu Huang and Haiqin Yang and Irwin King and Michael R. Lyu and Laiwan Chan. Biased Minimax Probability Machine for Medical Diagnosis. AMAI. 2004.

Then we apply it to two real-world medical diagnosis datasets, the breast cancer dataset and the heart disease dataset. 4.1. A Synthetic Dataset A two-variable synthetic dataset is generated by the two-dimensional gamma distribution. Two classes of data are


Qingping Tao. Making Efficient Learning Algorithms with Exponentially Many Features. Ph.D. Dissertation, The Graduate College, University of Nebraska. 2004.

(T_0 = n^2 and T_s = 10n^2). M - Metropolis, G - Gibbs, MG - Metropolized Gibbs, PT - Parallel Tempering, BF - Brute Force. Data sets: iris, car, breast cancer, voting, auto, annealing, with n = 4, 6, 9, 16, 25, 38 respectively. [The remainder of the excerpt is a results table whose figures did not survive extraction.]


Saher Esmeir and Shaul Markovitch. Lookahead-based algorithms for anytime induction of decision trees. ICML. 2004.

LSID3 produces classifiers of higher accuracy than C4.5. For the more difficult concepts, the advantage of LSID3 is substantial. However, for some datasets, such as Breast Cancer and Monks-3, C4.5 produces trees that are both smaller and more accurate. These results confirm our expectations: the problems addressed by LSID3 and C4.5 are different. While


Gavin Brown. Diversity in Neural Network Ensembles. The University of Birmingham. 2004.

critical to consider values for the strength parameter outside the originally specified range. Table 5.3 shows the classification error rates of two empirical tests, on the Wisconsin breast cancer dataset from the UCI repository (699 patterns), and the Heart disease dataset from Statlog (270 patterns). An ensemble consisting of two networks, each with five hidden nodes, was trained using NC. We use


Kristin P. Bennett and Ayhan Demiriz and Richard Maclin. Exploiting unlabeled data in ensemble methods. KDD. 2002.

experiments we used simple multilayer perceptrons with a single layer of hidden units. The networks were trained using backpropagation with a learning rate of 0.15 and a momentum value of 0.90. The datasets for the experiments are breast cancer wisconsin, pima-indians diabetes, and letter-recognition drawn from the UCI Machine Learning repository [3]. The number of units in the hidden layer for the
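As a rough illustration of the setup this excerpt describes (and not the authors' code), the sketch below trains a single-hidden-layer perceptron with the stated learning rate of 0.15 and momentum of 0.90 on the UCI breast-cancer-wisconsin data; the file name, column layout, hidden-layer size, and train/test split are assumptions.

```python
# Hedged sketch: single-hidden-layer MLP trained by backpropagation with
# learning rate 0.15 and momentum 0.90, as in the excerpt above.
# The CSV path, column names, and hidden-layer size are assumptions.
import pandas as pd
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

cols = ["id"] + [f"a{i}" for i in range(1, 10)] + ["class"]
df = pd.read_csv("breast-cancer-wisconsin.data", names=cols, na_values="?").dropna()
X, y = df[cols[1:-1]].values, df["class"].values

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
mlp = MLPClassifier(hidden_layer_sizes=(10,), solver="sgd",
                    learning_rate_init=0.15, momentum=0.9,
                    max_iter=500, random_state=0)
mlp.fit(X_tr, y_tr)
print("test accuracy:", mlp.score(X_te, y_te))
```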


Baback Moghaddam and Gregory Shakhnarovich. Boosted Dyadic Kernel Discriminants. NIPS. 2002.

the number of support vectors for the SVM, and #k.ev. the number of kernel evaluations required by a boosted hypercuts classifier. Means and standard deviations in 30 trials are reported for each data set. WBC, WPBC and WDBC are the Wisconsin Breast Cancer, Prognosis and Diagnosis data sets, respectively. In each experiment, the data set was randomly partitioned into training, validation and test sets of


András Antos and Balázs Kégl and Tamás Linder and Gábor Lugosi. Data-dependent margin-based generalization bounds for classification. Journal of Machine Learning Research, 3. 2002.

attributes were binary coded in a 1-out-of-n fashion. Data points with missing attributes were removed. Each attribute was normalized to have zero mean and 1/√d standard deviation. The four data sets were the Wisconsin breast cancer (n = 683, d = 9), the ionosphere (n = 351, d = 34), the Japanese credit screening (n = 653, d = 42), and the tic-tac-toe endgame (n = 958, d = 27) database.
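A minimal sketch of the preprocessing this excerpt describes, assuming a generic data frame; the function and column names are illustrative only.

```python
# Hedged sketch of the preprocessing described above: drop rows with missing
# values, one-hot ("1-out-of-n") encode nominal columns, then scale each
# attribute to zero mean and standard deviation 1/sqrt(d).
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame) -> np.ndarray:
    df = df.dropna()                               # remove points with missing attributes
    X = pd.get_dummies(df).to_numpy(dtype=float)   # 1-out-of-n coding of nominal columns
    d = X.shape[1]
    X = X - X.mean(axis=0)                         # zero mean per attribute
    std = X.std(axis=0)
    std[std == 0] = 1.0                            # guard against constant columns
    return X / std / np.sqrt(d)                    # each attribute now has std 1/sqrt(d)

# Tiny illustrative frame:
demo = pd.DataFrame({"size": [1.0, 2.0, None, 4.0], "color": ["r", "g", "b", "r"]})
print(preprocess(demo).shape)
```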


Michael G. Madden. Evaluation of the Performance of the Markov Blanket Bayesian Classifier Algorithm. CoRR, csLG/0211003. 2002.

and all four are equally good on the Breast Cancer dataset. (Naïve / TAN / K2 / MBBC) Chess: 87.63±1.61 / 91.68±1.09 / 94.03±0.87 / 97.03±0.54; WBCD: 97.81±0.51 / 97.47±0.68 / 97.17±1.05 / 97.30±1.01; LED-24: 73.28±0.70 / 73.18±0.63 / 73.14±0.73 / 73.14±0.73; DNA: 94.80±0.44


Yongmei Wang and Ian H. Witten. Modeling for Optimal Probability Prediction. ICML. 2002.

(Heart Disease, Cleveland), German (Statlog Project, German Credit), Ionosphere, Pima (Pima Indian Diabetes), Spambase, and WDBC (Wisconsin Breast Cancer). The eighth is the Crab dataset from Agresti (1996). Some of the datasets were modified slightly: some attributes and instances were deleted to eliminate missing values, multi-class problems were transformed into binary ones, a


Remco R. Bouckaert. Accuracy bounds for ensembles under 0-1 loss. Xtal Mountain Information Technology & Computer Science Department, University of Waikato. 2002.

of 100 cases were generated and the cardinality of the variables was varied from 2 to 12, 3 Weka can be obtained from http://www.cs.waikato.ac.nz/ml/ 4 The following datasets were used: autos, balance-scale, breast cancer, breast-w, horse-colic, credit-rating, german-credit, pima-diabetes, glass, heart-c, heart-h, heart-statlog, hepatitis, iris, labor, lymphography,


Krzysztof Grąbczewski and Włodzisław Duch. Heterogeneous Forests of Decision Trees. ICANN. 2002.

< 1.10531) then primary hypothyroid 2. if TSH ≥ 6.05 ∧ FTI ≥ 64.72 ∧ on_thyroxine = 0 ∧ thyroid_surgery = 0 ∧ TT4 < 150.5 then compensated hypothyroid 3. else healthy. The Wisconsin breast cancer dataset contains 699 instances, with 458 benign (65.5%) and 241 (34.5%) malignant cases. Each instance is described by 9 attributes with integer values in the range 1-10 and a binary class label. For 16


Hussein A. Abbass. An evolutionary artificial neural networks approach for breast cancer diagnosis. Artificial Intelligence in Medicine, 25. 2002.

well, compared to the previous studies. In another study, Setiono [26] used his rule extraction from ANNs algorithm [28, 29] to extract useful rules that can predict breast cancer from the Wisconsin dataset. He needed first to train an ANN using BP and achieved an accuracy level on the test data of approximately 94%. After applying his rule extraction technique, the accuracy of the extracted rule set


Fei Sha and Lawrence K. Saul and Daniel D. Lee. Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines. NIPS. 2002.

Kernel: Polynomial (k=4, k=6) and Radial (σ=0.3, σ=1.0, σ=3.0). Sonar: 9.6%, 9.6%, 7.6%, 6.7%, 10.6%; breast cancer: 5.1%, 3.6%, 4.4%, 4.4%, 4.4%. Table 1: Misclassification error rates on the sonar and breast cancer data sets after 512 iterations of the multiplicative updates. 3.1 Multiplicative updates The loss function in eq. (6) is a special case of eq. (1) with A_ij = y_i y_j K(x_i, x_j) and b_i = -1. Thus, the
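As a small illustration of the quantities named in this excerpt, the sketch below builds the matrix A_ij = y_i y_j K(x_i, x_j) and the vector b_i = -1; the RBF kernel, its gamma parameter, and the toy data are assumptions, not the cited paper's setup.

```python
# Hedged sketch: constructing A_ij = y_i * y_j * K(x_i, x_j) and b_i = -1
# as stated in the excerpt, with an illustrative RBF kernel and toy data.
import numpy as np

def rbf_kernel(X, gamma=0.5):
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared pairwise distances
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))                          # toy inputs
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)     # toy labels

K = rbf_kernel(X)
A = np.outer(y, y) * K                               # A_ij = y_i y_j K(x_i, x_j)
b = -np.ones(len(y))                                 # b_i = -1
print(A.shape, b)
```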


W. Nick Street and Yoo-Hyon Kim. A streaming ensemble algorithm (SEA) for large-scale classification. KDD. 2001.

contains 44,848 instances of which 29.3% are in the "over 50K" class. SEER breast cancer: The breast cancer data set from the Surveillance, Epidemiology, and End Results (SEER) program [6] of the National Institutes of Health contains follow-up data on over 44,000 breast cancer patients. The cases were filtered to


Nikunj C. Oza and Stuart J. Russell. Experimental comparisons of online and batch versions of bagging and boosting. KDD. 2001.

We tested bagging and boosting with decision trees only on some of the smaller datasets (Promoters, Balance, Breast Cancer, Car Evaluation) because the lossless decision tree algorithm is too expensive with larger datasets in online mode. Bagging and online bagging perform comparably


Bernhard Pfahringer and Geoffrey Holmes and Richard Kirkby. Optimizing the Induction of Alternating Decision Trees. PAKDD. 2001.

UCI Datasets (Instances / Missing values (%) / Numeric attributes / Nominal attributes): breast cancer 699, 0.2, 9, 0; cleveland 303, 0.2, 6, 7; credit 690, 0.6, 6, 9; diabetes 768, 0.0, 8, 0; hepatitis 155, 5.4, 6, 13; hypothyroid 3772, 5.4, 7, 22; ionosphere 351, 0.0, 34, 0; kr-vs-kp 3196, 0.0, 0, 36; labor 57, 33.6,


Robert Burbidge and Matthew Trotter and Bernard F. Buxton and Sean B. Holden. STAR - Sparsity through Automated Rejection. IWANN (1). 2001.

available from the UCI Machine Learning Data Repository [11], are as follows. The breast cancer Wisconsin data set has 699 examples in nine dimensions and is 'noise-free'; one feature has 16 missing values, which are replaced with the feature mean. The ionosphere data set has 351 examples in 33 dimensions and is
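A short sketch of the mean imputation mentioned above; the file path, column names, and use of scikit-learn's imputer are assumptions chosen for illustration.

```python
# Hedged sketch: replace the missing values of the breast-cancer-wisconsin
# feature with that feature's mean, as described in the excerpt.
import pandas as pd
from sklearn.impute import SimpleImputer

cols = ["id"] + [f"a{i}" for i in range(1, 10)] + ["class"]
df = pd.read_csv("breast-cancer-wisconsin.data", names=cols, na_values="?")

imputer = SimpleImputer(strategy="mean")             # fill NaNs with the column mean
df[cols[1:-1]] = imputer.fit_transform(df[cols[1:-1]])
print(df.isna().sum().sum(), "missing values remain")
```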


Bernhard Pfahringer and Geoffrey Holmes and Gabi Schmidberger. Wrapping Boosters against Noise. Australian Joint Conference on Artificial Intelligence. 2001.

induction: 7 Table 4. Predictive error, no noise. The best entry in each line is set in boldface, a prefix star marks values that are significantly different from the value in the first column. Dataset ADTree Bagging Wrapping Wensemble BREAST CANCER 31.59 * 28.57 * 26.02 * 25.32 BREAST-W 3.83 * 3.49 * 4.32 3.63 CLEVE 21.78 * 17.15 * 17.28 * 16.03 CREDIT-A 15.10 * 13.22 15.68 15.07 CREDIT-G 25.50 *


Justin Bradley and Kristin P. Bennett and Ayhan Demiriz. Constrained K-Means Clustering. Microsoft Research, One Microsoft Way; Dept. of Mathematical Sciences; Dept. of Decision Sciences and Eng. Sys. 2000.

the Johns Hopkins Ionosphere dataset and the Wisconsin Diagnostic Breast Cancer dataset (WDBC) [7]. The Ionosphere dataset contains 351 data points in R^33 and values along each dimension


Yuh-Jeng Lee. Smooth Support Vector Machines. Preliminary Thesis Proposal Computer Sciences Department University of Wisconsin. 2000.

A medical application is also proposed here. A linear support vector machine (SVM) is used to extract 6 features from a total of 31 features in a dataset of 253 breast cancer patients. Five features are nuclear features obtained during a non-invasive diagnostic procedure while one feature, tumor size, is obtained during surgery. The linear SVM


Petri Kontkanen and Petri Myllymäki and Tomi Silander and Henry Tirri and Peter Grünwald. On predictive distributions and Bayesian networks. Department of Computer Science, Stanford University. 2000.

3 we plot the performance of the methods, averaged over 100 independent test runs performed as described above, as a function of the number of the data vectors used for training in the Breast cancer dataset case. From this picture we see that in the logscore sense, the evidence-based EVU and EVJ approaches perform surprisingly well even in


Kristin P. Bennett and Ayhan Demiriz and John Shawe-Taylor. A Column Generation Algorithm For Boosting. ICML. 2000.

LPBoost has a well defined stopping criterion that is reached in a few iterations. It uses few weak learners. There are only 81 possible stumps on the Breast Cancer dataset (9 attributes having 9 possible values), so clearly AdaBoost may require the same tree to be generated multiple times. LPBoost generates a weak learner only once and can alter the weight on that


Matthew Mullin and Rahul Sukthankar. Complete Cross-Validation for Nearest Neighbor Classifiers. ICML. 2000.

and Abalone-3 are two- and three-class versions of the problem, where the adjacent classes were grouped so that data was divided evenly. Abalone-3 was introduced in (Waugh, 1995). In the Breast Cancer dataset, the ID field was omitted, as was a field containing missing values. Since the aim of these experiments was not to improve classification accuracy but rather to compare estimation variance and


Lorne Mason and Peter L. Bartlett and Jonathan Baxter. Improved Generalization Through Explicit Optimization of Margins. Machine Learning, 38. 2000.

chosen as the final solution. In some cases the training sets were reduced in size to make overfitting more likely (so that complexity regularization with DOOM could have an effect). In three of the datasets (Credit Application, Wisconsin Breast Cancer and Pima Indians Diabetes), AdaBoost gained no advantage from using more than a single classifier. In these datasets, the number of classifiers was


Endre Boros and Peter Hammer and Toshihide Ibaraki and Alexander Kogan and Eddy Mayoraz and Ilya B. Muchnik. An Implementation of Logical Analysis of Data. IEEE Trans. Knowl. Data Eng, 12. 2000.

the housing value is above or below the median. Using training sets of 80% of the observations, [16] reports correct prediction rates ranging from 82% to 83.2%. Breast Cancer (Wisconsin). The dataset, compiled by O. Mangasarian and K.P. Bennett, is widely used in the machine learning community for comparing learning algorithms. It is, however, difficult to use it for rigorous comparisons since


P. S. Bradley and K. P. Bennett and A. Demiriz. Constrained K-Means Clustering. Microsoft Research, One Microsoft Way; Dept. of Mathematical Sciences; Dept. of Decision Sciences and Eng. Sys. 2000.

the Johns Hopkins Ionosphere dataset and the Wisconsin Diagnostic Breast Cancer dataset (WDBC) [7]. The Ionosphere dataset contains 351 data points in R^33 and values along each dimension


Sally A. Goldman and Yan Zhou. Enhancing Supervised Learning with Unlabeled Data. ICML. 2000.

just the initial labeled data (i.e. round 0). Our co-training procedure helped both algorithms to improve their performance. Figure 2 shows the results from one of our runs using the breast cancer data set (error vs. number of co-training rounds). In this data set ID3 had the better performance. Again (as we generally see), both hypotheses were improved by the co-training.


David M J Tax and Robert P W Duin. Support vector domain description. Pattern Recognition Letters, 20. 1999.

be understood by looking at the distribution of the data, where class 2 is between classes 1 and 3. Only the instability method is able to reject objects from the second class. In the breast cancer data set, the second class is clearly easier to distinguish than the first class. Looking at the origin of the data, this means that by describing the benign class, the malignant class can be rejected quite


Kai Ming Ting and Ian H. Witten. Issues in Stacked Generalization. J. Artif. Intell. Res. (JAIR), 10. 1999.

are given in Table 10, and indicate that the three methods are very competitive. Stacking performs better than both arcing and bagging in three datasets (Waveform, Soybean and Breast Cancer), and is better than arcing but worse than bagging in the Diabetes dataset. Note that stacking performs very poorly on Glass and Ionosphere, two small


Ismail Taha and Joydeep Ghosh. Symbolic Interpretation of Artificial Neural Networks. IEEE Trans. Knowl. Data Eng, 11. 1999.

and universal approach. A rule evaluation technique that orders extracted rules based on three performance measures is then proposed. The three techniques are applied to the iris and breast cancer data sets. The extracted rules are evaluated qualitatively and quantitatively, and compared with those obtained by other approaches. Index Terms: rule extraction, hybrid systems, knowledge refinement, neural


Lorne Mason and Jonathan Baxter and Peter L. Bartlett and Marcus Frean. Boosting Algorithms as Gradient Descent. NIPS. 1999.

Figure 2: Margin distributions for AdaBoost and DOOM II with 0% and 15% label noise for the breast cancer and splice data sets. Given that AdaBoost suffers from overfitting and minimizes an exponential cost function of the margins, this cost function certainly does not relate to test error. How does the value of our proposed


Iñaki Inza and Pedro Larrañaga and Basilio Sierra and Ramon Etxeberria and Jose Antonio Lozano and José Manuel Peña. Representing the behaviour of supervised classification learning algorithms by Bayesian networks. Pattern Recognition Letters, 20. 1999.

1,055 cases, a sufficient amount to obtain a 'not-overfitted' Bayesian network. Figure 1 summarizes the explained process. As an example, the induced simplified Bayesian network for Breast cancer dataset can be seen in Figure 2. 3.4 Concepts for interpreting the joint behaviour Once the Bayesian networks are induced, our aim is to extract assertions on the joint behaviour of Machine Learning


David W. Opitz and Richard Maclin. Popular Ensemble Methods: An Empirical Study. J. Artif. Intell. Res. (JAIR), 11. 1999.

ensemble. Also shown (results column 3) is the "best" result produced from all of the single network results run using all of the training data. Data Set: Single (Err, SD, Best) / Bagging (Err, SD) / Arcing (Err, SD) / Boosting (Err, SD). breast cancer w: 5.0, 0.7, 4.0 / 3.7, 0.5 / 3.5, 0.6 / 3.5, 0.3. credit-a: 14.9, 0.8, 14.2 / 13.4, 0.5 / 14.0, 0.9 / 13.7, 0.5. credit-g: 29.6, 1.0, 28.7 / 25.2, 0.7 / 25.9, 1.0 / 26.7, 0.4. diabetes 27.8


Chun-Nan Hsu and Hilmar Schuschel and Ya-Ting Yang. The ANNIGMA-Wrapper Approach to Neural Nets Feature Selection for Knowledge Discovery and Data Mining. Institute of Information Science. 1999.

the technique presented in [10], where it is used to enhance the effectiveness of feature selection. An optimal result is the selection of features "jacketcolor", "holding", and "bodyshape". Real-world Datasets: Breast Cancer Wisconsin (Cancer). This dataset has 699 instances of 10 features: one is the ID number and the 9 others have values within 1 to 10. Each instance has one of the 2 possible classes:


Huan Liu and Hiroshi Motoda and Manoranjan Dash. A Monotonic Measure for Optimal Feature Selection. ECML. 1998.

with unknown relevant attributes, consists of WBC - the Wisconsin Breast Cancer data set, LED-7 - data with 7 Boolean attributes and 10 classes, the set of decimal digits (0..9), Letter - the letter image recognition data, LYM - the lymphography data, and Vote - the U.S. House of


Ykä Huhtala and Juha Kärkkäinen and Pasi Porkka and Hannu Toivonen. Efficient Discovery of Functional and Approximate Dependencies Using Partitions. ICDE. 1998.

and their descriptions are available on the UCI Machine Learning Repository [13]. The number of rows, columns, and minimal dependencies found (N) in each database are shown in Table 1. The datasets labeled "Wisconsin breast cancer × n" are concatenations of n copies of the Wisconsin breast cancer data. The set of dependencies is the same in all of them. To avoid duplicate rows, all


W. Nick Street. A Neural Network Model for Prognostic Prediction. ICML. 1998.

of the models to separate cases with favorable and unfavorable prognoses (see Section 3.3). 3 Experimental Results Computational experiments were performed on two very different breast cancer data sets. The first is known as Wisconsin Prognostic Breast Cancer (WPBC) and is characterized by a small number of cases, relatively high dimensionality, very precise values and almost no missing data. The


Lorne Mason and Peter L. Bartlett and Jonathan Baxter. Direct Optimization of Margins Improves Generalization in Combined Classifiers. NIPS. 1998.

sets were reduced in size to make overfitting more likely, so that complexity regularization with DOOM could have an effect. (The details are given in the full version [MBB98].) In three of the datasets (Credit Application, Wisconsin Breast Cancer and Pima Indians Diabetes), AdaBoost gained no advantage from using more than a single classifier. In these datasets, the number of classifiers was


Richard Maclin. Boosting Classifiers Regionally. AAAI/IAAI. 1998.

used in this paper. Shown are the number of examples and output classes, plus the number of inputs, outputs, hidden units and training epochs used for each network. Data Set Case Out In Hid Epch breast cancer w 699 2 9 5 20 credit-a 690 2 47 10 35 credit-g 1000 2 63 10 30 diabetes 768 2 8 5 30 glass 214 6 9 10 80 heart-cleveland 303 2 13 5 40 hepatitis 155 2 32 10 60


Prototype Selection for Composite Nearest Neighbor Classifiers. Department of Computer Science University of Massachusetts. 1997.

Selection. ... 117. 4.11 Relationships between component accuracy and diversity for the Cleveland Heart Disease, LED-7 Digit, Hepatitis and Breast Cancer Wisconsin data sets for the four boosting algorithms. "c" represents the Coarse Reclassification algorithm; "d", Deliberate Misclassification; "f", Composite Fitness; and "s", Composite Fitness--Feature Selection.


Kristin P. Bennett and Erin J. Bredensteiner. A Parametric Optimization Method for Machine Learning. INFORMS Journal on Computing, 9. 1997.

of the Federal Reserve Bank of Dallas [BS90], has 9 numeric features which range from 0 to 1. The data represent 4311 successful banks and 441 failed banks. Wisconsin Breast Cancer Database This dataset is used to classify a set of 682 patients with breast cancer [WM90]. Each patient is represented by nine integral attributes ranging in value from 1 to 10. The two classes represented are benign and


Pedro Domingos. Control-Sensitive Feature Selection for Lazy Learners. Artif. Intell. Rev, 11. 1997.

used in the empirical study, in particular M. Zwitter and M. Soklic of the University Medical Centre, Ljubljana, for supplying the lymphography, breast cancer and primary tumor datasets, and Robert Detrano, of the V.A. Medical Center, Long Beach and Cleveland Clinic Foundation, for supplying the heart disease dataset. Please see the documentation in the UCI Repository for detailed


Rudy Setiono and Huan Liu. NeuroLinear: From neural networks to oblique decision rules. Neurocomputing, 17. 1997.

A. Detailed analysis 1: The University of Wisconsin Breast Cancer Dataset. This data set has been used as the test data for several studies on pattern classification methods using linear programming techniques [1, 13] and statistical techniques [23]. Each pattern is


Kamal Ali and Michael J. Pazzani. Error Reduction through Learning Multiple Descriptions. Machine Learning, 24. 1996.

search using Likelihood Combination is able to statistically significantly (95% confidence) reduce or maintain error on all domains except the (Ljubljana) breast cancer domain. On that breast cancer data set, few learning methods have been able to get an accuracy significantly higher than that obtained by guessing the most frequent class, suggesting it lacks the attributes relevant for discriminating the


Jennifer A. Blue and Kristin P. Bennett. Hybrid Extreme Point Tabu Search. Department of Mathematical Sciences Rensselaer Polytechnic Institute. 1996.

(Liver); the PIMA Indians Diabetes dataset (Diabetes), the Wisconsin Breast Cancer Database (Cancer) [23], and the Cleveland Heart Disease Database (Heart) [9]. We used 5-fold cross validation. Each dataset was divided into 5 parts. The


Pedro Domingos. Unifying Instance-Based and Rule-Based Induction. Machine Learning, 24. 1996.

included in the listing of empirical results in (Holte, 1993) are referred to by the same codes. In the first phase of the study, the first 15 datasets in Table 4 (from breast cancer to wine) were used to fine-tune the algorithms, choosing by 10-fold cross-validation the most accurate version of each. Since a complete factor analysis would be too


Erin J. Bredensteiner and Kristin P. Bennett. Feature Minimization within Decision Trees. National Science Foundation. 1996.

attributes. Each patient is classified as to whether there is presence or absence of heart disease. There are 137 patients who have a presence of heart disease. Wisconsin Breast Cancer Database: This data set is used to classify 682 patients with breast cancer. Each patient is represented by nine integral attributes ranging in value from 1 to 10. The two classes represented are benign and malignant:


Ismail Taha and Joydeep Ghosh. Characterization of the Wisconsin Breast cancer Database Using a Hybrid Symbolic-Connectionist System. Proceedings of ANNIE. 1996.

and Ordering Procedure ... 10; 3.1 The Rule Ordering Procedure ... 10; 4 Output Integration ... 11; 5 Implementation Results ... 12; 5.1 Breast Cancer Data Set ... 12; 5.2 Methodology ... 12; 5.3 Rule Extraction and


Christophe G. Giraud-Carrier and Tony Martinez. ILA: Combining Inductive Learning with Prior Knowledge and Reasoning. University of Bristol, Department of Computer Science. 1995.

Study Algorithm PA GR ILA 82.7 .49 ILA, T=2 73.9 .20 PDL2 79.7 .66 As expected, results with T=2 show a decrease in PA (about 10%), but also a significant decrease in GR (over 59%). For three of the datasets (zoo, breast cancer and soybean-small), the decrease in PA is less than 1.1% on average, while the decrease in GR is greater than 76%. The threshold T, though not part of the basic model, provides


Ron Kohavi. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. IJCAI. 1995.

Bias is defined as the expected value minus the estimated value. An unbiased estimation method is a method that has zero bias. Figure 1 shows the bias and variance of k-fold cross-validation on several datasets (the breast cancer dataset is not shown). The diagrams clearly show that k-fold cross-validation is pessimistically biased, especially for two and five folds. For the learning curves that have a
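A rough sketch of how the bias discussed in this excerpt can be measured empirically; the synthetic data, the decision-tree learner, and the use of a large hold-out set as a stand-in for the "true" accuracy are all assumptions, not Kohavi's experimental setup.

```python
# Hedged sketch: estimate the bias of k-fold cross-validation as the
# hold-out ("true") accuracy minus the k-fold estimate, for several k.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
X_small, X_rest, y_small, y_rest = train_test_split(X, y, train_size=300, random_state=0)

clf = DecisionTreeClassifier(random_state=0)
true_acc = clf.fit(X_small, y_small).score(X_rest, y_rest)  # proxy for the expected value

for k in (2, 5, 10):
    est = cross_val_score(DecisionTreeClassifier(random_state=0),
                          X_small, y_small, cv=k).mean()
    print(f"{k}-fold estimate {est:.3f}  bias {true_acc - est:+.3f}")
```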


Geoffrey I. Webb. OPUS: An Efficient Admissible Algorithm for Unordered Search. J. Artif. Intell. Res. (JAIR, 3. 1995.

Tic Tac Toe) disabling other pruning had little or no effect under best-first or depth-first search. The largest effects are 2.5 fold increases for the Soybean Large and Wisconsin Breast Cancer data sets under best-first search and for the Audiology, Soybean Large and Wisconsin Breast Cancer data sets under depth-first search. From these results it is apparent that while there are some data sets


K. A. J Doherty and Rolf Adams and Neil Davey. Unsupervised Learning with Normalised Data and Non-Euclidean Norms. University of Hertfordshire.

considered were the Ionosphere, Image Segmentation (training data), Wisconsin Diagnostic Breast Cancer (WDBC) and Wine data sets. These data sets were selected to show our approach on data with a range of classes, dimensionality and data distributions. The basic characteristics of each data set are shown in Table 2.


Adam H. Cannon and Lenore J. Cowen and Carey E. Priebe. Approximate Distance Classification. Department of Mathematical Sciences The Johns Hopkins University.

data before implementing the ADC classification algorithm. Here, only the raw data has been analyzed using the same procedure described above. 5 Conclusions Results on the Wisconsin breast cancer data set and the Fisher iris data set compare very well with previous work on these data. The Pima Indian diabetes results are also nearly competitive with previous work. In all three cases it should be


G. Rätsch and B. Schölkopf and Alex Smola and Sebastian Mika and T. Onoda and K.-R. Müller. Robust Ensemble Learning for Data Mining. GMD FIRST, Kekuléstr.

generalization performance of AdaBoost in the low noise regime. However, AdaBoost performs worse than other learning machines on noisy tasks [6, 7], such as the iris and the breast cancer benchmark data sets [5]. The present paper addresses the overfitting problem of AdaBoost in two ways. Primarily, it makes an algorithmic contribution to the problem of constructing regularized boosting algorithms.


Andrew I. Schein and Lyle H. Ungar. A-Optimality for Active Learning of Logistic Regression Classifiers. Department of Computer and Information Science Levine Hall.

54. The lodgepole pine variety of tree happens to represent about 50% of the observations and so we merge all other tree types into a single category. The Wisconsin Diagnostic Breast Cancer (WDBC) data set consists of evaluation measurements (predictors) and final diagnosis for 569 patients. The goal is to predict the diagnosis using the measurements. The number of predictors is 30. The Thyroid Domain


Huan Liu. A Family of Efficient Rule Generators. Department of Information Systems and Computer Science National University of Singapore.

testing set are randomly selected. The rest are used for training. The data has 22 discrete attributes. Each attribute can have 2 to 10 values. Wisconsin Breast Cancer: The training and testing datasets contain 350 and 349 instances respectively. 350 instances are randomly selected for training, the other half is for testing. There are 9 discrete attributes. Each attribute has 10 values. The


Alexander K. Seewald. Towards Understanding Stacking: Studies of a General Ensemble Learning Scheme. Dissertation carried out for the purpose of obtaining the academic degree of Doctor of Technical Sciences.

Compressed glyph visualizations for the datasets balance-scale, breast cancer, breast-w, colic, credit-a, and further datasets (list-of-figures excerpt).


Rafael S. Parpinelli and Heitor S. Lopes and Alex Alves Freitas. An Ant Colony Algorithm for Classification Rule Discovery. Chapter X in Part Four: Ant Colony Optimization and Immune Systems. CEFET-PR, Curitiba.

2. The numbers after the ± symbol are the standard deviations of the corresponding accuracy rates. As shown in this table, Ant-Miner discovered rules with a better accuracy rate than C4.5 in four data sets, namely Ljubljana breast cancer, Wisconsin breast cancer, Hepatitis and Heart disease. In two data sets, Ljubljana breast cancer and Heart disease, the difference was quite small. In the other two


Paul D. Wilson and Tony R. Martinez. Combining Cross-Validation and Confidence to Measure Fitness. fonix corporation Brigham Young University.

at the bottom of Table 1, CVC had a significantly higher average generalization accuracy on this set of classification tasks than both the static and LCV methods at a 99% confidence level or higher. Datasets: Anneal, Australian, Breast Cancer (WI), Bridges, Crx, Echocardiogram, Flag, Glass, Heart, Heart (Cleveland), Heart (Hungarian), Heart (Long Beach), Heart (More), Heart (Swiss), Hepatitis, Horse Colic, Image Segmentation


Charles Campbell and Nello Cristianini. Simple Learning Algorithms for Training Support Vector Machines. Dept. of Engineering Mathematics.

include a sonar classification problem [14], the Wisconsin breast cancer dataset [35] and a database of handwritten digits collected by the US Postal Service [17]. As examples of the improvements with generalisation ability which can be achieved with a soft margin we will also


Nikunj C. Oza and Stuart J. Russell. Online Bagging and Boosting. Computer Science Division University of California.

Bagging and online bagging performed noticeably better than single decision trees on all except the Breast Cancer dataset. With Naive Bayes, bagging and online bagging never performed noticeably better than Naive Bayes, which we expected because of the stability of Naive Bayes [3]. Boosting and online boosting


Michael R. Berthold and Klaus-Peter Huber. From Radial to Rectangular Basis Functions: A new Approach for Rule Learning from Large Datasets. Institut für Rechnerentwurf und Fehlertoleranz (Prof. D. Schmid), Universität Karlsruhe.

C. Further Results. This approach was also applied to the breast cancer dataset (see [9]), originating from a real world application. This dataset contains about 700 patterns, each pattern described by 9 real-valued attributes. Interestingly the RecBF network trained on the


Bart Baesens and Stijn Viaene and Tony Van Gestel and J. A. K. Suykens and Guido Dedene and Bart De Moor and Jan Vanthienen. An Empirical Assessment of Kernel Type Performance for Least Squares Support Vector Machine Classifiers. Katholieke Universiteit Leuven, Dept. Applied Economic Sciences.

Liver Disorders (bld), German Credit (gcr), Heart Disease (hea), Johns Hopkins Ionosphere (ion), Pima Indians Diabetes (pid), Sonar (snr), Tic-Tac-Toe (ttt) and the Wisconsin Breast Cancer (wbc) data set. We start with presenting the empirical setup used to construct the LS-SVM classifier. This is followed by a discussion of the obtained results. 3.1 Constructing the LS-SVM Classifier The


Rudy Setiono and Huan Liu. Neural-Network Feature Selector. Department of Information Systems and Computer Science National University of Singapore.

are described below. 1. The University of Wisconsin Breast Cancer Diagnosis Dataset. The Wisconsin Breast Cancer Data (WBCD) is a large data set that consists of 699 patterns of which 458 are benign samples and 241 are malignant samples. Each of these patterns consists of nine


Rafael S. Parpinelli and Heitor S. Lopes and Alex Alves Freitas. An Ant Colony Based System for Data Mining: Applications to Medical Data. CEFET-PR, CPGEI Av. Sete de Setembro, 3165.

node). Each of these generated intervals is considered a discrete value for the attribute being discretized. (See the above reference for details.) The Ant-Miner system was tested using the following datasets: Ljubljana breast cancer: this database has 282 cases, two classes and nine predicting attributes (all categorical); Wisconsin breast cancer: this database has 683 cases, two classes and nine


Włodzisław Duch and Rafał Adamczak and Krzysztof Grąbczewski and Grzegorz Żal. A hybrid method for extraction of logical rules from data. Department of Computer Methods, Nicholas Copernicus University.

obtained from the UCI repository [14]. A. Wisconsin breast cancer data. The Wisconsin cancer dataset [17] contains 699 instances, with 458 benign (65.5%) and 241 (34.5%) malignant cases. Each instance is described by the case number, 9 attributes with integer value in the range 1-10 (for example,


Jarkko Salojärvi and Samuel Kaski and Janne Sinkkonen. Discriminative clustering in Fisher metrics. Neural Networks Research Centre, Helsinki University of Technology.

and secondly through the density function estimate that generates the metric used to define the Fisherian Voronoi regions. IV. EXPERIMENTS Experiments were run with the Wisconsin breast cancer data set from the UCI machine learning repository [9]. The 569 samples consisted of 30 attributes, measured from malignant and benign tumors. We chose the ordinary k-means as the baseline reference method.


Rudy Setiono. Extracting M-of-N Rules from Trained Neural Networks. School of Computing National University of Singapore.

of the data were converted to 126 binary inputs before training. In order to reduce computation time, only 2000 randomly selected samples were used. 4. The Wisconsin breast cancer classification dataset [17]. Each of the 699 patterns in the ... TABLE I: The initial network topology (input, hidden and output units) and the average user time required for training and pruning. Figures in parentheses


Ayhan Demiriz and Kristin P. Bennett and John Shawe-Taylor and I. Nouretdinov. Linear Programming Boosting via Column Generation. Dept. of Decision Sciences and Eng. Systems, Rensselaer Polytechnic Institute.

criterion for stopping when an optimal ensemble is found that is reached in relatively few iterations. It uses few weak hypotheses. There are only 81 possible stumps on the Breast Cancer dataset (nine attributes having nine possible values), so clearly AdaBoost may require the same tree to be generated multiple times. LPBoost generates a weak hypothesis only once and can alter the weight on


Liping Wei and Russ B. Altman. An Automated System for Generating Comparative Disease Profiles and Making Diagnoses. Section on Medical Informatics Stanford University School of Medicine, MSOB X215.

profile instead of using all attributes in the original clinical data. The results remain the same. RESULTS: We evaluated the system by applying it to heart disease, diabetes, and breast cancer. All data sets were obtained from the UCI Repository of Machine Learning databases and domain theories. Heart Disease: Four clinical data sets were used. These sets consist of patients who had been referred for


Chotirat Ann and Dimitrios Gunopulos. Scaling up the Naive Bayesian Classifier: Using Decision Trees for Feature Selection. Computer Science Department University of California.

19 classes. Attributes selected by SBC = 12. [Plot: Wisconsin Breast Cancer, Accuracy (%) vs. Training Data (%), for NBC, SBC and C4.5.] Figure 10. Wisconsin Breast Cancer dataset. 699 instances, 9 attributes, 2 classes. Attributes selected by SBC = 4. [Plot: Congressional Voting Records, Accuracy (%) vs. Training Data (%), for NBC, SBC and C4.5.] Figure 11.
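The excerpt refers to selecting attributes with a decision tree before running the naive Bayesian classifier. Below is a generic sketch of that idea, not the SBC algorithm from the cited paper; the dataset loader, tree depth, and split are assumptions chosen for illustration.

```python
# Hedged sketch of the general idea: use a decision tree to pick a subset of
# attributes, then train naive Bayes on that subset. Not the SBC algorithm.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)            # sklearn's WDBC data, as a stand-in
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)
selected = np.where(tree.feature_importances_ > 0)[0]  # attributes the tree actually used

nb = GaussianNB().fit(X_tr[:, selected], y_tr)
print(len(selected), "attributes selected;",
      "accuracy:", round(nb.score(X_te[:, selected], y_te), 3))
```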


John W. Chinneck. Fast Heuristics for the Maximum Feasible Subsystem Problem. Systems and Computer Engineering, Carleton University.

Data Set (Net Points / Number of Features): breast cancer 683, 9; bupa 345, 6; glass (type 2 vs. others) 214, 9; ionosphere 351, 34; iris (versicolor vs. others) 150, 4; iris (virginica vs. others) 150, 4; new thyroid (normal


Sherrie L. W and Zijian Zheng. A Benchmark for Classifier Learning. Basser Department of Computer Science, The University of Sydney.

in the first quartile of all the candidate values. Large and high mean in the last quartile of all the candidate values. Medium means between the small (or low) and the large (or high). Table 1. Datasets in the benchmark Name Description Breast Cancer (W) Medical diagnosis applied to breast cytology (Wisconsin) Diabetes Pima Indians diabetes database for diagnosing diabetes Hepatitis Predicting


M. A. Galway and Michael G. Madden. Evaluation of the Performance of the Markov Blanket Bayesian Classifier Algorithm. Technical Report NUIG-IT-011002, Department of Information Technology, National University of Ireland, Galway.

and all four are equally good on the Breast Cancer dataset. (Naïve / TAN / K2 / MBBC) Chess: 87.63±1.61 / 91.68±1.09 / 94.03±0.87 / 97.03±0.54; WBCD: 97.81±0.51 / 97.47±0.68 / 97.17±1.05 / 97.30±1.01; LED-24: 73.28±0.70 / 73.18±0.63 / 73.14±0.73 / 73.14±0.73; DNA: 94.80±0.44


John G. Cleary and Leonard E. Trigg. Experiences with OB1, An Optimal Bayes Decision Tree Learner. Department of Computer Science University of Waikato.

all the information in vote is contained in one attribute, and for iris two attributes contain all the class information (although most of this can be obtained using only one attribute). Some datasets, such as breast cancer and credit-g appear to contain very little class information. In general, we expect to see OB1 performance increase with tree depth up to a depth that captures the most


Włodzisław Duch and Rafał Adamczak. Statistical methods for construction of neural networks. Department of Computer Methods, Nicholas Copernicus University.

p_i(x) - p_r(x) around x for which the two distributions cross. The simplest network constructed from the FDA solution gives classification error which is as good as the original FDA. For such datasets [12] as Wisconsin breast cancer, hepatitis, Cleveland heart disease or diabetes the network obtains better results already before the learning process starts, but for some datasets this is not the


Rong-En Fan and P.-H. Chen and C.-J. Lin. Working Set Selection Using the Second Order Information for Training SVM. Department of Computer Science and Information Engineering, National Taiwan University.

The data sets image, diabetes, covtype, breast cancer and abalone are from the UCI machine learning repository (Blake and Merz, 1998). Problems a1a and a9a are compiled in (Platt, 1998) from the UCI "adult"


Rong Jin and Yan Liu and Luo Si and Jaime Carbonell and Alexander G. Hauptmann. A New Boosting Algorithm Using Input-Dependent Regularizer. School of Computer Science, Carnegie Mellon University.

with 10% noise. From the results we can see that the AdaBoost algorithm did suffer from overfitting on some of the data sets, such as "German", "Breast cancer" and "Contraceptive", while WeightBoost consistently achieved improvement on all of the eight data sets. In addition, our new algorithm demonstrates great


David Kwartowitz and Sean Brophy and Horace Mann. Session S2D Work In Progress: Establishing multiple contexts for student's progressive refinement of data mining.

version of WEKA became available. Students used this version to complete an end of semester project that asked them to compare and contrast three data mining techniques to analyze the Breast Cancer Data set. Students reported having little difficulty understanding how to use the software and spent most of their time making decisions about how to prepare the data for analysis and analyzing the results.


Geoffrey I. Webb. Generality is more significant than complexity: Toward an alternative to Occam's Razor. School of Computing and Mathematics, Deakin University.

from the UCI repository of machine learning data sets (Murphy & Aha, 1994): breast cancer, echocardiogram, glass type, hepatitis, house votes 84, hypothyroid, iris, lymphography, primary tumor, and soybean large. For all of these data sets, the


Karthik Ramakrishnan. University of Minnesota.

the number of output classes, and the number of continuous and discrete input features. Features Data set Cases Class Continuous Discrete breast cancer w 699 2 9 - credit-a 690 2 6 9 credit-g 1000 2 7 13 glass 214 6 9 - heart-cleveland 303 2 8 5 hypo 3772 5 7 22 ionosphere 351 2 34 - iris 159 3 4 -


Geoffrey I. Webb. Learning Decision Lists by Prepending Inferred Rules. School of Computing and Mathematics, Deakin University.

supported by the Australian Research Council. I am grateful to Mike Cameron-Jones for discussions that helped refine the ideas presented herein. The Breast Cancer, Lymphography and Primary Tumor data sets were compiled by M. Zwitter and M. Soklic at University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia. The Audiology data set was compiled by Professor Jergen at Baylor College of


Adil M. Bagirov and Alex Rubinov and A. N. Soukhojak and John Yearwood. Unsupervised and supervised data classification via nonsmooth and global optimization. School of Information Technology and Mathematical Sciences, The University of Ballarat.

The Australian credit dataset, the Wisconsin breast cancer dataset, the diabetes dataset, the heart disease dataset and the liver-disorder dataset have been used in numerical experiments. The description of these datasets can be


M. V. Fidelis and Heitor S. Lopes and Alex Alves Freitas. Discovering Comprehensible Classification Rules with a Genetic Algorithm. UEPG, CPD / CEFET-PR, CPGEI / PUC-PR, PPGIA. Praça Santos Andrade, s/n; Av. Sete de Setembro.

in the medical domains of dermatology and breast cancer. These data sets were obtained from the UCI (University of California at Irvine) Machine Learning Repository [17]. These data sets have been used extensively for classification tasks using different paradigms,


Chris Drummond and Robert C. Holte. C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling. Institute for Information Technology, National Research Council Canada.

[Figure 3. Credit: Comparing Sampling Schemes; Normalized Expected Cost vs. Probability Cost Function PCF(+).] The breast cancer data set from the Institute of Oncology, Ljubljana. It has 286 instances, 201 non-recurrences and 85 recurrences, with 9 nominal attributes. For this data set, C4.5 only marginally outperforms the cheapest


Włodzisław Duch and Rudy Setiono and Jacek M. Zurada. Computational intelligence methods for rule-based data understanding.

data. A large number of rules will usually lead to poor generalization, and the insight into the knowledge hidden in the data will be lost. C. Wisconsin breast cancer data. The Wisconsin breast cancer dataset [132] is one of the favorite benchmark datasets for testing classifiers (Table V). Properties of cancer cells were collected for 699 cases, with 458 benign (65.5%) and 241 (34.5%) malignant cases of


Maria Salamó and Elisabet Golobardes. Analysing Rough Sets weighting methods for Case-Based Reasoning Systems. Enginyeria i Arquitectura La Salle.

are from our own repository. They deal with diagnosis of breast cancer and synthetic datasets. Datasets related to diagnosis are biopsy and mammogram. Biopsy is the result of digitally processed biopsy images, whereas mammogram consists of detecting breast cancer using the N


Chiranjib Bhattacharyya. Robust Classification of noisy data using Second Order Cone Programming approach. Dept. Computer Science and Automation, Indian Institute of Science.

website [9]. Ionosphere, sonar and Wisconsin breast cancer were the three different datasets. The ionosphere dataset contains 34 dimensional observations, which are obtained from radar signals, while the sonar dataset contains 60 dimensional observation vectors. The Wisconsin dataset


G. Rätsch and B. Schölkopf and Alex Smola and K.-R. Müller and T. Onoda and Sebastian Mika. Arc: Ensemble Learning in the Presence of Outliers. GMD FIRST.

[17] explains the good generalization performance of AdaBoost in the low noise regime. However, AdaBoost performs worse on noisy tasks [10, 11], such as the iris and the breast cancer benchmark data sets [1]. On the latter tasks, a large margin on all training points cannot be achieved without adverse effects on the generalization error. This experimental observation was supported by the study of


D. Randall Wilson and Roel Martinez. Improved Center Point Selection for Probabilistic Neural Networks. Proceedings of the International Conference on Artificial Neural Networks and Genetic Algorithms.

reduction in size can be even more dramatic when there are more instances available. This is especially true when the number of instances is large compared to the complexity of the decision surface. Datasets: Anneal, Audiology, Australian, Breast Cancer (WI), Bridges, Crx, Echocardiogram, Flag, Heart (Hungarian), Heart (More), Heart, Heart (Swiss), Hepatitis, Horse-Colic, Iris, Liver-Bupa, Pima-Indians-Diabetes


Return to Breast Cancer data set page.
