Breast Cancer Wisconsin (Original) Data Set

Below are papers that cite this data set, with the citing context shown. Papers were automatically harvested and associated with this data set, in collaboration with Rexa.info.



Gavin Brown. Diversity in Neural Network Ensembles. The University of Birmingham. 2004.

critical to consider values for the strength parameter outside the originally specified range. Table 5.3 shows the classification error rates of two empirical tests, on the Wisconsin breast cancer dataset from the UCI repository (699 patterns), and the Heart disease dataset from Statlog (270 patterns). An ensemble consisting of two networks, each with five hidden nodes, was trained using NC. We use


Krzysztof Grabczewski and Włodzisław Duch. Heterogeneous Forests of Decision Trees. ICANN. 2002.

< 1.10531) then primary hypothyroid; 2. if TSH > 6.05 ∧ FTI > 64.72 ∧ on_thyroxine = 0 ∧ thyroid_surgery = 0 ∧ TT4 < 150.5 then compensated hypothyroid; 3. else healthy. The Wisconsin breast cancer dataset contains 699 instances, with 458 benign (65.5%) and 241 (34.5%) malignant cases. Each instance is described by 9 attributes with integer values in the range 1-10 and a binary class label. For 16


András Antos and Balázs Kégl and Tamás Linder and Gábor Lugosi. Data-dependent margin-based generalization bounds for classification. Journal of Machine Learning Research, 3. 2002.

attributes were binary coded in a 1-out-of-n fashion. Data points with missing attributes were removed. Each attribute was normalized to have zero mean and 1/√d standard deviation. The four data sets were the Wisconsin breast cancer (n = 683, d = 9), the ionosphere (n = 351, d = 34), the Japanese credit screening (n = 653, d = 42), and the tic-tac-toe endgame (n = 958, d = 27) database.
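
For reference, the preprocessing this excerpt describes is easy to sketch in a few lines of numpy; everything beyond the quoted details (the function name, the toy data) is illustrative:

```python
import numpy as np

# Minimal sketch of the quoted preprocessing: drop points with missing
# attributes, then scale each attribute to zero mean and standard
# deviation 1/sqrt(d), so a typical point has norm on the order of 1.
def preprocess(X):
    X = X[~np.isnan(X).any(axis=1)]        # remove points with missing attributes
    d = X.shape[1]
    X = X - X.mean(axis=0)                 # zero mean per attribute
    std = X.std(axis=0)
    std[std == 0] = 1.0                    # guard against constant attributes
    return X / (std * np.sqrt(d))          # per-attribute std becomes 1/sqrt(d)

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [7.0, 8.0]])
print(preprocess(X).std(axis=0) * np.sqrt(2))   # ~[1. 1.]
```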


Kristin P. Bennett and Ayhan Demiriz and Richard Maclin. Exploiting unlabeled data in ensemble methods. KDD. 2002.

experiments we used simple multilayer perceptrons with a single layer of hidden units. The networks were trained using backpropagation with a learning rate of 0.15 and a momentum value of 0.90. The datasets for the experiments are breast-cancer-wisconsin, pima-indians-diabetes, and letter-recognition, drawn from the UCI Machine Learning repository [3]. The number of units in the hidden layer for the
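
A minimal sketch of this training setup, using scikit-learn as a stand-in for the authors' own backpropagation code. The learning rate (0.15) and momentum (0.90) come from the excerpt; the hidden layer size (10) and the dataset (WDBC via load_breast_cancer, not the original WBC file) are illustrative assumptions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

net = make_pipeline(
    StandardScaler(),                      # scale inputs so plain SGD stays stable
    MLPClassifier(hidden_layer_sizes=(10,), solver="sgd",
                  learning_rate_init=0.15, momentum=0.9,
                  max_iter=500, random_state=0),
)
net.fit(X_tr, y_tr)
print("test accuracy: %.3f" % net.score(X_te, y_te))
```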


Hussein A. Abbass. An evolutionary artificial neural networks approach for breast cancer diagnosis. Artificial Intelligence in Medicine, 25. 2002.

well, compared to the previous studies. In another study, Setiono [26] used his rule extraction from ANNs algorithm [28, 29] to extract useful rules that can predict breast cancer from the Wisconsin dataset. He first needed to train an ANN using BP, which achieved an accuracy level on the test data of approximately 94%. After applying his rule extraction technique, the accuracy of the extracted rule set


Baback Moghaddam and Gregory Shakhnarovich. Boosted Dyadic Kernel Discriminants. NIPS. 2002.

the number of support vectors for the SVM, and #k.ev. the number of kernel evaluations required by a boosted hypercuts classifier. Means and standard deviations in 30 trials are reported for each data set. WBC, WPBC, WDBC are the Wisconsin Breast Cancer, Prognosis, and Diagnosis data sets, respectively. In each experiment, the data set was randomly partitioned into training, validation and test sets of


Robert Burbidge and Matthew Trotter and Bernard F. Buxton and Sean B. Holden. STAR - Sparsity through Automated Rejection. IWANN (1). 2001.

available from the UCI Machine Learning Data Repository [11], are as follows. The breast cancer Wisconsin data set has 699 examples in nine dimensions and is `noise-free'; one feature has 16 missing values, which are replaced with the feature mean. The ionosphere data set has 351 examples in 33 dimensions and is
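
The mean imputation mentioned here can be sketched in numpy; the function name and toy data are illustrative (in the raw WBC file the missing entries appear as '?', here they are encoded as NaN):

```python
import numpy as np

# Minimal sketch: replace each missing entry by the mean of the
# observed values in the same feature.
def impute_feature_mean(X):
    X = X.astype(float).copy()
    col_means = np.nanmean(X, axis=0)      # per-feature mean over observed values
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

X = np.array([[1.0, 4.0], [np.nan, 6.0], [3.0, np.nan]])
print(impute_feature_mean(X))              # NaNs become 2.0 and 5.0
```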


Nikunj C. Oza and Stuart J. Russell. Experimental comparisons of online and batch versions of bagging and boosting. KDD. 2001.

learning and its effect on ensemble performance. 6. ACKNOWLEDGEMENTS The Wisconsin Breast Cancer dataset was obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg. The Forest Covertype is Copyrighted 1998 by Jock A. Blackard and Colorado State University. 7.


Lorne Mason and Peter L. Bartlett and Jonathan Baxter. Improved Generalization Through Explicit Optimization of Margins. Machine Learning, 38. 2000.

chosen as the final solution. In some cases the training sets were reduced in size to make overfitting more likely (so that complexity regularization with DOOM could have an effect). In three of the datasets (Credit Application, Wisconsin Breast Cancer and Pima Indians Diabetes), AdaBoost gained no advantage from using more than a single classifier. In these datasets, the number of classifiers was


P. S. Bradley and K. P. Bennett and A. Demiriz. Constrained K-Means Clustering. Microsoft Research. 2000.

the Johns Hopkins Ionosphere dataset and the Wisconsin Diagnostic Breast Cancer dataset (WDBC) [7]. The Ionosphere dataset contains 351 data points in R^33 and values along each dimension


Endre Boros and Peter Hammer and Toshihide Ibaraki and Alexander Kogan and Eddy Mayoraz and Ilya B. Muchnik. An Implementation of Logical Analysis of Data. IEEE Trans. Knowl. Data Eng, 12. 2000.

the housing value is above or below the median. Using training sets of 80% of the observations, [16] reports correct prediction rates ranging from 82% to 83.2%. Breast Cancer Wisconsin. The dataset, compiled by O. Mangasarian and K. P. Bennett, is widely used in the machine learning community for comparing learning algorithms. It is, however, difficult to use for rigorous comparisons since


Yuh-Jeng Lee. Smooth Support Vector Machines. Preliminary Thesis Proposal Computer Sciences Department University of Wisconsin. 2000.

[37]. To evaluate the efficacy of SSVM, we compared computational times of SSVM with those of RLP and SVM‖·‖1. We ran all tests on six publicly available datasets: the Wisconsin Prognostic Breast Cancer Database [34] and five datasets from the Irvine Machine Learning Database Repository [36]. It turned out that tenfold testing correctness of the SSVM is the


Chun-Nan Hsu and Hilmar Schuschel and Ya-Ting Yang. The ANNIGMA-Wrapper Approach to Neural Nets Feature Selection for Knowledge Discovery and Data Mining. Institute of Information Science. 1999.

the technique presented in [10], where it is used to enhance the effectiveness of feature selection. An optimal result is the selection of the features "jacketcolor", "holding", and "bodyshape". Real-world Datasets: Breast Cancer Wisconsin (Cancer). This dataset has 699 instances of 10 features: one is the ID number and the 9 others have values within 1 to 10. Each instance has one of the 2 possible classes:


Huan Liu and Hiroshi Motoda and Manoranjan Dash. A Monotonic Measure for Optimal Feature Selection. ECML. 1998.

with unknown relevant attributes, consists of WBC - the Wisconsin Breast Cancer data set, LED-7 - data with 7 Boolean attributes and 10 classes, the set of decimal digits (0..9), Letter - the letter image recognition data, LYM - the lymphography data, and Vote - the U.S. House of


Lorne Mason and Peter L. Bartlett and Jonathan Baxter. Direct Optimization of Margins Improves Generalization in Combined Classifiers. NIPS. 1998.

sets were reduced in size to make overfitting more likely, so that complexity regularization with DOOM could have an effect. (The details are given in the full version [MBB98].) In three of the datasets (Credit Application, Wisconsin Breast Cancer and Pima Indians Diabetes), AdaBoost gained no advantage from using more than a single classifier. In these datasets, the number of classifiers was


W. Nick Street. A Neural Network Model for Prognostic Prediction. ICML. 1998.

of the models to separate cases with favorable and unfavorable prognoses (see Section 3.3). 3 Experimental Results Computational experiments were performed on two very different breast cancer data sets. The first is known as Wisconsin Prognostic Breast Cancer (WPBC) and is characterized by a small number of cases, relatively high dimensionality, very precise values and almost no missing data. The


Yk Huhtala and Juha Kärkkäinen and Pasi Porkka and Hannu Toivonen. Efficient Discovery of Functional and Approximate Dependencies Using Partitions. ICDE. 1998.

and their descriptions are available on the UCI Machine Learning Repository [13]. The number of rows, columns, and minimal dependencies found (N) in each database are shown in Table 1. The datasets labeled "Wisconsin breast cancer ×n" are concatenations of n copies of the Wisconsin breast cancer data. The set of dependencies is the same in all of them. To avoid duplicate rows, all


David B. Skalak. Prototype Selection for Composite Nearest Neighbor Classifiers. Department of Computer Science, University of Massachusetts. 1997.

Selection. 4.11: Relationships between component accuracy and diversity for the Cleveland Heart Disease, LED-7 Digit, Hepatitis and Breast Cancer Wisconsin data sets for the four boosting algorithms. "c" represents the Coarse Reclassification algorithm; "d", Deliberate Misclassification; "f", Composite Fitness; and "s", Composite Fitness--Feature Selection.


Kristin P. Bennett and Erin J. Bredensteiner. A Parametric Optimization Method for Machine Learning. INFORMS Journal on Computing, 9. 1997.

of the Federal Reserve Bank of Dallas [BS90], has 9 numeric features which range from 0 to 1. The data represent 4311 successful banks and 441 failed banks. Wisconsin Breast Cancer Database This dataset is used to classify a set of 682 patients with breast cancer [WM90]. Each patient is represented by nine integral attributes ranging in value from 1 to 10. The two classes represented are benign and


Rudy Setiono and Huan Liu. NeuroLinear: From neural networks to oblique decision rules. Neurocomputing, 17. 1997.

A. Detailed analysis 1: The University of Wisconsin Breast Cancer Dataset. This data set has been used as the test data for several studies on pattern classification methods using linear programming techniques [1, 13] and statistical techniques [23]. Each pattern is


Erin J. Bredensteiner and Kristin P. Bennett. Feature Minimization within Decision Trees. National Science Foundation. 1996.

attributes. Each patient is classified as to whether there is presence or absence of heart disease. There are 137 patients who have a presence of heart disease. Wisconsin Breast Cancer Database. This data set is used to classify 682 patients with breast cancer. Each patient is represented by nine integral attributes ranging in value from 1 to 10. The two classes represented are benign and malignant:


Ismail Taha and Joydeep Ghosh. Characterization of the Wisconsin Breast Cancer Database Using a Hybrid Symbolic-Connectionist System. Proceedings of ANNIE. 1996.

rule extraction techniques, the BIO-RE, Partial-RE, and Full-RE, on the breast cancer problem. The extracted rules are presented in order based on the rule ordering algorithm. 5.1 Breast-Cancer Data Set: The Wisconsin breast cancer data set has nine inputs and two output classes [26, 31]. The input features are: X1 = Clump Thickness; X2 = Uniformity of Cell Size; X3 = Uniformity of Cell Shape; X


Jennifer A. Blue and Kristin P. Bennett. Hybrid Extreme Point Tabu Search. Department of Mathematical Sciences Rensselaer Polytechnic Institute. 1996.

(Liver); the PIMA Indians Diabetes dataset (Diabetes), the Wisconsin Breast Cancer Database (Cancer) [23], and the Cleveland Heart Disease Database (Heart) [9]. We used 5-fold cross validation. Each dataset was divided into 5 parts. The
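
The 5-fold protocol described here is the standard one; a minimal scikit-learn sketch follows. The decision tree is only a placeholder classifier (the paper's tabu search method is not reproduced), and the dataset (WDBC via load_breast_cancer) stands in for the original WBC:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
folds = KFold(n_splits=5, shuffle=True, random_state=0)   # divide the data into 5 parts
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=folds)
print("per-fold accuracy:", scores.round(3), "mean: %.3f" % scores.mean())
```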


Geoffrey I. Webb. OPUS: An Efficient Admissible Algorithm for Unordered Search. J. Artif. Intell. Res. (JAIR), 3. 1995.

Tic Tac Toe) disabling other pruning had little or no effect under best-first or depth-first search. The largest effects are 2.5-fold increases for the Soybean Large and Wisconsin Breast Cancer data sets under best-first search, and for the Audiology, Soybean Large and Wisconsin Breast Cancer data sets under depth-first search. From these results it is apparent that while there are some data sets


Charles Campbell and Nello Cristianini. Simple Learning Algorithms for Training Support Vector Machines. Dept. of Engineering Mathematics.

include a sonar classification problem [14], the Wisconsin breast cancer dataset [35] and a database of handwritten digits collected by the US Postal Service [17]. As examples of the improvements in generalisation ability which can be achieved with a soft margin we will also


Chotirat Ann Ratanamahatana and Dimitrios Gunopulos. Scaling up the Naive Bayesian Classifier: Using Decision Trees for Feature Selection. Computer Science Department, University of California.

19 classes. Attributes selected by SBC = 12. Figure 10 (plot of accuracy (%) vs. training data (%) for NBC, SBC, and C4.5): Wisconsin Breast Cancer dataset. 699 instances, 9 attributes, 2 classes. Attributes selected by SBC = 4. Figure 11 (plot of accuracy (%) vs. training data (%) for NBC, SBC, and C4.5): Congressional Voting Records.
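
A rough sketch of the SBC idea named in this entry: grow a decision tree, keep only the features it actually splits on, and train a naive Bayesian classifier on that subset. The selection rule, tree settings, and dataset (WDBC rather than the original WBC) are assumptions for illustration, not the paper's exact procedure (which uses C4.5); note too that selecting features on the full data leaks information a faithful evaluation would avoid:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
selected = np.flatnonzero(tree.feature_importances_)   # features the tree splits on

nbc = cross_val_score(GaussianNB(), X, y, cv=10).mean()            # all features
sbc = cross_val_score(GaussianNB(), X[:, selected], y, cv=10).mean()  # tree-selected
print("%d features kept; NBC %.3f -> SBC %.3f" % (len(selected), nbc, sbc))
```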


Włodzisław Duch and Rudy Setiono and Jacek M. Zurada. Computational intelligence methods for rule-based data understanding.

data. A large number of rules will usually lead to poor generalization, and the insight into the knowledge hidden in the data will be lost. C. Wisconsin breast cancer data. The Wisconsin breast cancer dataset [132] is one of the favorite benchmark datasets for testing classifiers (Table V). Properties of cancer cells were collected for 699 cases, with 458 benign (65.5%) and 241 (34.5%) malignant cases of


Rafael S. Parpinelli and Heitor S. Lopes and Alex Alves Freitas. An Ant Colony Based System for Data Mining: Applications to Medical Data. CEFET-PR, CPGEI Av. Sete de Setembro, 3165.

AntClass 75.13% ± 6.00, 5.20 ± 0.87, 8.80 ± 1.89; C4.5 73.34% ± 3.21, 6.2 ± 4.2, 12.8 ± 9.83.
Wisconsin Breast Cancer Data Set: AntClass 95.47% ± 1.62, 5.60 ± 0.80, 12.50 ± 2.84; C4.5 95.02% ± 0.31, 11.1 ± 1.45, 44.1 ± 7.48.
Hepatitis Data Set: AntClass 88.75% ± 6.73, 2.70 ± 0.46, 7.50 ± 2.01; C4.5 85.96% ± 1.07, 4.4 ± 0.93, 8.5 ± 3.04.


Włodzisław Duch and Rafał Adamczak. Statistical methods for construction of neural networks. Department of Computer Methods, Nicholas Copernicus University.

p_i(x) - p_r(x) around x for which the two distributions cross. The simplest network constructed from the FDA solution gives a classification error which is as good as the original FDA. For such datasets [12] as Wisconsin breast cancer, hepatitis, Cleveland heart disease or diabetes, the network obtains better results even before the learning process starts, but for some datasets this is not the


Rafael S. Parpinelli and Heitor S. Lopes and Alex Alves Freitas. An Ant Colony Algorithm for Classification Rule Discovery (Part Four: Ant Colony Optimization and Immune Systems, Chapter X). CEFET-PR, Curitiba.

2. The numbers after the "±" symbol are the standard deviations of the corresponding accuracy rates. As shown in this table, Ant-Miner discovered rules with a better accuracy rate than C4.5 in four data sets, namely Ljubljana breast cancer, Wisconsin breast cancer, Hepatitis and Heart disease. In two data sets, Ljubljana breast cancer and Heart disease, the difference was quite small. In the other two


Adam H. Cannon and Lenore J. Cowen and Carey E. Priebe. Approximate Distance Classification. Department of Mathematical Sciences The Johns Hopkins University.

data before implementing the ADC classification algorithm. Here, only the raw data has been analyzed using the same procedure described above. 5 Conclusions Results on the Wisconsin breast cancer data set and the Fisher iris data set compare very well with previous work on these data. The Pima Indian diabetes results are also nearly competitive with previous work. In all three cases it should be


Andrew I. Schein and Lyle H. Ungar. A-Optimality for Active Learning of Logistic Regression Classifiers. Department of Computer and Information Science Levine Hall.

54. The lodgepole pine variety of tree happens to represent about 50% of the observations and so we merge all other tree types into a single category. The Wisconsin Diagnostic Breast Cancer (WDBC) data set consists of evaluation measurements (predictors) and final diagnosis for 569 patients. The goal is to predict the diagnosis using the measurements. The number of predictors is 30. The Thyroid Domain


Bart Baesens and Stijn Viaene and Tony Van Gestel and J. A. K. Suykens and Guido Dedene and Bart De Moor and Jan Vanthienen. An Empirical Assessment of Kernel Type Performance for Least Squares Support Vector Machine Classifiers. Dept. of Applied Economic Sciences, Katholieke Universiteit Leuven.

Liver Disorders (bld), German Credit (gcr), Heart Disease (hea), Johns Hopkins Ionosphere (ion), Pima Indians Diabetes (pid), Sonar (snr), Tic-Tac-Toe (ttt) and the Wisconsin Breast Cancer (wbc) data set. We start with presenting the empirical setup used to construct the LS-SVM classifier. This is followed by a discussion of the obtained results. 3.1 Constructing the LS-SVM Classifier The


Adil M. Bagirov and Alex Rubinov and A. N. Soukhojak and John Yearwood. Unsupervised and supervised data classification via nonsmooth and global optimization. School of Information Technology and Mathematical Sciences, The University of Ballarat.

The Australian credit dataset, the Wisconsin breast cancer dataset, the diabetes dataset, the heart disease dataset and the liver-disorder dataset have been used in numerical experiments. The description of these datasets can be


Rudy Setiono and Huan Liu. Neural-Network Feature Selector. Department of Information Systems and Computer Science National University of Singapore.

are described below. 1. The University of Wisconsin Breast Cancer Diagnosis Dataset. The Wisconsin Breast Cancer Data (WBCD) is a large data set that consists of 699 patterns of which 458 are benign samples and 241 are malignant samples. Each of these patterns consists of nine


Huan Liu. A Family of Efficient Rule Generators. Department of Information Systems and Computer Science National University of Singapore.

testing set are randomly selected. The rest are used for training. The data has 22 discrete attributes. Each attribute can have 2 to 10 values. Wisconsin Breast Cancer: The training and testing datasets contain 350 and 349 instances respectively; 350 instances are randomly selected for training, and the other half is used for testing. There are 9 discrete attributes. Each attribute has 10 values. The


Rudy Setiono. Extracting M-of-N Rules from Trained Neural Networks. School of Computing National University of Singapore.

of the data were converted to 126 binary inputs before training. In order to reduce computation time, only 2000 randomly selected samples were used. 4. The Wisconsin breast cancer classification dataset [17]. Each of the 699 patterns in the ... TABLE I: The initial network topology (input, hidden and output units) and the average user time required for training and pruning. Figures in parentheses


Jarkko Salojärvi and Samuel Kaski and Janne Sinkkonen. Discriminative clustering in Fisher metrics. Neural Networks Research Centre, Helsinki University of Technology.

and secondly through the density function estimate that generates the metric used to define the Fisherian Voronoi regions. IV. EXPERIMENTS Experiments were run with the Wisconsin breast cancer data set from the UCI machine learning repository [9]. The 569 samples consisted of 30 attributes, measured from malignant and benign tumors. We chose the ordinary k-means as the baseline reference method.
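
The ordinary k-means baseline mentioned here is simple to reproduce on the same WDBC data (569 samples, 30 attributes); the discriminative clustering method itself is not reproduced, and standardizing the attributes first is our own assumption:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)       # WDBC: 569 samples, 30 attributes
X = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("adjusted Rand index vs. diagnosis labels: %.3f" % adjusted_rand_score(y, labels))
```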


Włodzisław Duch and Rafał Adamczak and Krzysztof Grąbczewski and Grzegorz Żal. A hybrid method for extraction of logical rules from data. Department of Computer Methods, Nicholas Copernicus University.

obtained from the UCI repository [14]. A. Wisconsin breast cancer data. The Wisconsin cancer dataset [17] contains 699 instances, with 458 benign (65.5%) and 241 (34.5%) malignant cases. Each instance is described by the case number and 9 attributes with integer values in the range 1-10 (for example,


