Thyroid Disease Data Set
Below are papers that cite this data set, with context shown.
Papers were automatically harvested and associated with this data set, in collaboration with Rexa.info.
Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin. Linear dimensionality reduction using relevance weighted LDA. School of Electrical and Electronic Engineering Nanyang Technological University. 2005.
to compare LDA, aPAC, WLDR, EWLDR. The six data sets are landsat, optdigits, vehicle, DNA, thyroid disease and vowel data sets. Landsat. The Landsat data set is generated from landsat multi-spectral scanner image data. It has 36 dimensions, 4435
Zhi-Hua Zhou and Yuan Jiang. NeC4.5: Neural Ensemble Based C4.5. IEEE Trans. Knowl. Data Eng, 16. 2004.
To explore this issue, further experiments are performed on data sets (australian, page, thyroid, voting, wdbc, wpbc) where neither NeC4.5 with µ = 100% nor NeC4.5 with µ = 0% is significantly more accurate than C4.5. The results are depicted in Figs. 1 and 2. Note
Xiaoyong Chai and Li Deng and Qiang Yang and Charles X. Ling. Test-Cost Sensitive Naive Bayes Classification. ICDM. 2004.
Table 2. Datasets used in the experiments (dataset, number of attributes): Ecoli 6; Breast 9; Heart 8; Thyroid 24; Australia 15; Cars 6; Voting 16; Mushroom 22. We ran a 3-fold cross validation on these data sets. In the
Vassilis Athitsos and Stan Sclaroff. Boosting Nearest Neighbor Classifiers for Multiclass Recognition. Boston University Computer Science Tech. Report No, 2004-006. 2004.
We did not use four datasets (dermatology, soybean, thyroid, audiology) because they have missing attributes, which our current formulation cannot handle. One dataset (ecoli) contains a nominal attribute, which our current
Michael L. Raymer and Travis E. Doom and Leslie A. Kuhn and William F. Punch. Knowledge discovery in medical and biological datasets using a hybrid Bayes classifier/evolutionary algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 33. 2003.
from the UCI repository, were employed by [40, 41] in a comparative study of classification methods from statistical pattern recognition, neural networks, and machine learning. These two medical data sets, thyroid and appendicitis, are included here to facilitate comparison with these results. The thyroid data consists of 21 clinical test results for a set of patients tested for thyroid dysfunction
Lukasz A. Kurgan and Waldemar Swiercz and Krzysztof J. Cios. Semantic Mapping of XML Tags Using Inductive Machine Learning. ICMLA. 2002.
domains 73.1 course 5 10 15.6 2.3 87.5 faculty 5 10 100 0 100.0 realest 5 10 27.4 20 57.2 mean for real-life domains 81.6 total mean 77.3 Similarly, the thy domain created using the thyroid dataset [28, 29] includes attributes that have very different attribute names, and very similar attribute values between the two sources. 92 percent of the examples from that domain belong to the same
Qiang Yang and Jing Wu. Enhancing the Effectiveness of Interactive Case-Based Reasoning with Clustering and Decision Forests. Appl. Intell, 14. 2001.
from the UCI Repository of Machine Learning Databases and Domain Theories [Keogh et al. 1998] at the University of California at Irvine. Two data sets are first used for our experiments (the Thyroid Disease Database and the Mushroom (agaricus-lepiota) database). In addition, in order to test the system under the condition that there are missing
Erin L. Allwein and Robert E. Schapire and Yoram Singer. Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers. ICML. 2000.
tested were used and then evaluated with Hamming decoding and the appropriate loss-based decoding for SVM. We skipped Audiology, Isolet, Letter-recognition, Segmentation, and Thyroid because these datasets were either too big to be handled by our current implementation of SVM or contained many nominal features with missing values which are problematic for SVM. All datasets have at least six classes.
Petri Kontkanen and Jussi Lahtinen and Petri Myllymäki and Henry Tirri. Unsupervised Bayesian visualization of high-dimensional data. KDD. 2000.
purposes a reasonable approximation is usually quite sufficient. How to find effectively good approximations of the optimal visualization is however a wide research problem on its own, and is not discussed in detail here. In the experiments reported here we used a simple
[Figure 1: The Thyroid Disease data set: an example of the unsupervised visualizations obtained with the suggested method.]
Andreas L. Prodromidis. On the Management of Distributed Learning Agents Ph.D. Thesis Proposal CUCS-032-97. Department of Computer Science Columbia University. 1998.
[Figure: The JAM architecture with 3 datasites (Marmalade.cs, Mango.cs, Strawberry.cs), showing control and data messages and the transfer of learning and classifier agents. The configuration file shown in the figure lists DATASET = thyroid, LEARNER = ID3, META_LEARNER = Bayes, CROSS_VALIDATION_FOLD = 2, META_LEARNING_FOLD = 2, META_LEARNING_LEVEL = 1, IMAGE_URL = http://www.cs....]
Ethem Alpaydin. Voting over Multiple Condensed Nearest Neighbors. Artif. Intell. Rev, 11. 1997.
are given in Table 2. OCR is a handwritten digit database (Guyon et al., 1989). Others are available from the UCI Repository (Murphy, 1994). In the OCR, VOWEL, and THYROID datasets, the training and test sets are separated. In others, we chose the training set small for not to have too large accuracy with NN thus leaving space for improvement. Euclidean distance is used as
Kai Ming Ting and Boon Toh Low. Model Combination in the Multiple-Data-Batches Scenario. ECML. 1997.
are on or off, plus seventeen irrelevant binary attributes. Each attribute value is inverted with a probability of 0.1. The task is to classify the input as one of the ten digits. The euthyroid dataset is one of the sets of Thyroid examples from the Garvan Institute of Medical Research in Sydney described in Quinlan, Compton, Horn and Lazarus (1987). It consists of 3163 case data and diagnoses for
Salvatore J. Stolfo and Andreas L. Prodromidis and Shelley Tselepis and Wenke Lee and David W. Fan and Philip K. Chan. JAM: Java Agents for Meta-Learning over Distributed Databases. KDD. 1997.
is a medical database with records (Merz & Murphy 1996), noted by thyroid in the Data Set panel. Other parameters include the host of the CFM, the Cross-Validation Fold, the Meta-Learning Fold, the Meta-Learning Level, the names of the local learning agent and the local meta-learning
Peter D. Turney. Cost-Sensitive Classification: Empirical Evaluation of a Hybrid Genetic Decision Tree Induction Algorithm. CoRR, csAI/9503102. 1995.
test costs has some limitations. As it is currently implemented, it does not handle the cost of attributes that are calculated from other attributes. For example, in the Thyroid dataset (Appendix A.5), the FTI test is calculated based on the results of the TT4 and T4U tests. If the FTI test is selected, we must pay for the TT4 and T4U tests. If the TT4 and T4U tests have already
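The cost dependency described in this excerpt can be illustrated with a small sketch. It is not Turney's implementation: the cost values and the names `TEST_COST`, `DEPENDS_ON`, and `incremental_cost` are hypothetical, and the sketch assumes that a test already paid for incurs no further charge (the excerpt is cut off before it says so). Only the fact that FTI is calculated from TT4 and T4U comes from the excerpt.

```python
# Hypothetical per-test costs; the numbers are illustrative only.
TEST_COST = {"TT4": 10.0, "T4U": 10.0, "FTI": 0.0}

# FTI is calculated from the results of TT4 and T4U (from the excerpt).
DEPENDS_ON = {"FTI": ("TT4", "T4U")}


def incremental_cost(test, already_paid):
    """Extra cost of selecting `test`, given the set of tests already paid for.

    Assumption: a test that has already been paid for costs nothing more.
    """
    needed = set(DEPENDS_ON.get(test, ())) | {test}
    return sum(TEST_COST[t] for t in needed - set(already_paid))
```

Under these assumed costs, selecting FTI first charges for both TT4 and T4U, while selecting it after both have been bought adds nothing.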
George H. John and Ron Kohavi and Karl Pfleger. Irrelevant Features and the Subset Selection Problem. ICML. 1994.
performance was on parity5+5 and CorrAL using stepwise backward elimination, which reduced the error to 0% from 50% and 18.8% respectively. Experiments were also run on the Iris, Thyroid and Monk1* datasets. The results on these datasets were similar to those reported in this paper. We observed high variance in the 25-fold cross-validation estimates of the error. Since our algorithms depend on
Włodzisław Duch and Rafał Adamczak and Krzysztof Grabczewski. Extraction of crisp logical rules from medical datasets. Department of Computer Methods, Nicholas Copernicus University.
generated 24 rules, but these rules used more features and were not so accurate (98% on the test set) as MLP2LN rules.
TABLE I. Classification results for various classifiers applied to the thyroid dataset.
Method | Training set accuracy % | Test set accuracy %
BP+conjugate gradient | 94.6 | 93.8
Best Backpropagation | 99.1 | 97.6
RPROP | 99.6 | 98.0
Quickprop | 99.6 | 98.3
BP+genetic optimization | 99.4 | 98.4
Local
Sherrie L. W and Zijian Zheng. A BENCHMARK FOR CLASSIFIER LEARNING. Basser Department of Computer Science The University of Sydney.
- None, e.g. Lymphography and NetTalk (Phoneme)
- Few (between 0 and 5.6%), e.g. Mushroom (1.39%) and Breast Cancer (W) (0.25%)
- Many (more than 5.6%), e.g. Soybean (9.78%) and Thyroid (6.74%)
7. Dataset size (3 values):
- Small (less than 210), e.g. Promoter (106) and Lymphography (148)
- Medium (between 210 and 3170), e.g. Diabetes (768) and Thyroid (3163)
- Large (more than 3170), e.g.
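The thresholds quoted in this excerpt can be written as two small binning functions. This is a sketch under assumptions: the function names are hypothetical, and the handling of the exact boundary values (210, 3170, 5.6%) is a guess, since the excerpt does not say whether the ranges are inclusive.

```python
def size_category(n_instances):
    """Bin a dataset by instance count, using the thresholds in the excerpt."""
    if n_instances < 210:
        return "small"
    if n_instances <= 3170:  # assumption: upper bound is inclusive
        return "medium"
    return "large"


def missing_category(pct_missing):
    """Bin a dataset by its percentage of missing values."""
    if pct_missing == 0:
        return "none"
    if pct_missing <= 5.6:  # assumption: 5.6% itself counts as "few"
        return "few"
    return "many"
```

With the figures quoted above, Thyroid (3163 instances, 6.74% missing) comes out as a medium-sized dataset with many missing values.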
Pramod Viswanath and M. Narasimha Murty and Shalabh Bhatnagar. Partition Based Pattern Synthesis Technique with Efficient Algorithms for Nearest Neighbor Classification. Department of Computer Science and Automation, Indian Institute of Science.
We performed experiments with five different datasets, viz., OCR, WINE, VOWEL, THYROID, GLASS and PENDIGITS, respectively. Except the OCR dataset, all others are from the UCI Repository. OCR dataset is also used in [20,18]. The properties of the
Włodzisław Duch and Rafał Adamczak. Statistical methods for construction of neural networks. Department of Computer Methods, Nicholas Copernicus University.
show that it would certainly be worthwhile to try. 3. Logical rules for network construction.
Table 1: Classification results for a number of optimized MLP training algorithms applied to the thyroid dataset; only the best results are shown. BP = Backpropagation.
Method | Training % | Test %
k-NN (Manhattan) | -- | 93.8
Bayes rule | 97.0 | 96.1
BP+conjugate gradient | 94.6 | 93.8
Best BP | 99.1 | 97.6
RPROP | 99.6 | 98.0
Włodzisław Duch and Rudy Setiono and Jacek M. Zurada. Computational intelligence methods for rule-based data understanding.
Simple logical rules are also quite competitive in this case, allowing for understanding of important factors that determine the diagnosis. E. The hypothyroid data. This is a somewhat larger medical dataset, containing screening tests for thyroid problems. The training data have 3772 medical records collected in the first year, and the test data have 3428 cases collected in the next year of the
Je Scott and Mahesan Niranjan and Richard W. Prager. Realisable Classifiers: Improving Operating Performance on Variable Cost Problems. Cambridge University Department of Engineering.
nature of the classifiers used to form it. The classifiers are random variables, whose central tendency will be to lie on the MRROC. [British Machine Vision Conference, p. 312] 3.2 Thyroid data. A medical data set describing patients with abnormal thyroid conditions was obtained from the UCI machine learning repository. The data originally contained 7200 instances and had 3 classes, hyperthyroid,
Pramod Viswanath and M. Narasimha Murty and Shalabh Bhatnagar. A pattern synthesis technique to reduce the curse of dimensionality effect.
We performed experiments with five different datasets, viz., OCR, WINE, THYROID, GLASS and PENDIGITS, respectively. Except the OCR dataset, all others are from the UCI Repository. OCR dataset is also used in [17, 18]. The properties of the
H. Altay Guvenir. A Classification Learning Algorithm Robust to Irrelevant Features. Bilkent University, Department of Computer Engineering and Information Science.
[Figure: classification accuracy of VFI5, 1NN, 3NN and 5NN versus the number of irrelevant features added (0 to 20); panels shown for the New thyroid data set and the Vehicle data set.]
Kai Ming Ting and Boon Toh Low. Theory Combination: an alternative to Data Combination. University of Waikato.
is as follows. The euthyroid dataset is one of the sets of Thyroid examples from the Garvan Institute of Medical Research in Sydney described in Quinlan, Compton, Horn and Lazarus (1987). It consists of 3163 case data and diagnoses for
Michael L. Raymer and William F. Punch and Erik D. Goodman and Leslie A. Kuhn and Anil K. Jain. Brief Papers.
only 49.7%, approximately equivalent to random class assignment. C. Discussion. The integrated feature extraction and classification approach described has proved effective on these three disparate data sets. For the thyroid data, the GA-knn was more effective than all but two of the approaches, but required only three of the features to make the classification. The GA-knn obtained a classification
Andrew I. Schein and Lyle H. Ungar. A-Optimality for Active Learning of Logistic Regression Classifiers. Department of Computer and Information Science Levine Hall.
chosen from the UC Irvine data repository (Blake & Merz, 1998): Forest Cover Type (FCT), Wisconsin Diagnostic Breast Cancer (WDBC), Splice Junction Gene Sequence (SJGS), and Thyroid Domain (TD). The data sets were converted to a binary classification task by merging all but the most representative class label into a single class. Table 1 describes the data set characteristics after formatting while the
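The binary conversion described in this excerpt can be sketched as follows. This is an illustration, not the authors' code: the function name `to_binary` is hypothetical, and the sketch assumes that "most representative" means the most frequent class label, which the excerpt does not state.

```python
from collections import Counter


def to_binary(labels):
    """Keep one class as 1 and merge every other class into 0.

    Assumption: the class that is kept is the most frequent label.
    """
    kept_class, _ = Counter(labels).most_common(1)[0]
    return [1 if y == kept_class else 0 for y in labels]
```

For a three-class label list such as the thyroid classes, every label other than the most frequent one would be merged into a single class.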