Center for Machine Learning and Intelligent Systems

Hepatitis Data Set

Below are papers that cite this data set, with context shown. Papers were automatically harvested and associated with this data set, in collaboration with


Amaury Habrard and Marc Bernard and Marc Sebban. Detecting Irrelevant Subtrees to Improve Probabilistic Learning from Tree-structured Data. Fundamenta Informaticae, IOS Press. 2004.

has 1000 examples with 42 different leaves, but has larger trees than the previous one. We also use a sample proposed for the PKDD'02 discovery challenge 1 (dataset on hepatitis); the transformation of data into trees is described in [19]. This dataset has 4000 examples with 253 leaves. Our experimental setup consists in adding noise in each dataset, as

Jinyan Li and Limsoon Wong. Using Rules to Analyse Bio-medical Data: A Comparison between C4.5 and PCL. WAIM. 2003.

(hypothyroid); and Boosting won the best accuracy on 4 data sets (i.e., hepatitis, lymph, sick and splice). -- Comparing between PCL and C4.5, PCL won on 8 data sets, while C4.5 won on the rest 2 data sets. -- Comparing between PCL and Bagging, PCL won on 6 data

Michael L. Raymer and Travis E. Doom and Leslie A. Kuhn and William F. Punch. Knowledge discovery in medical and biological datasets using a hybrid Bayes classifier/evolutionary algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 33. 2003.

used for this evaluation are described in detail in [3], and at the UCI website [33]. A brief synopsis of each data set follows: Hepatitis -- This data consists of 19 descriptive and clinical test result values for 155 hepatitis patients [34, 35]. The two classes, survivors and patients for whom the hepatitis proved

Zhi-Hua Zhou and Yuan Jiang and Shifu Chen. Extracting symbolic rules from trained neural network ensembles. AI Commun, 16. 2003.

80 2 19 13 6
iris plant | iris | 150 | 3 | 4 | 0 | 4
statlog australian credit approval | credit-a | 690 | 2 | 15 | 9 | 6
statlog german credit | credit-g | 1,000 | 2 | 20 | 13 | 7
Table 2 Fidelity of rules extracted via REFNE
data set | balance | voting | hepatitis | iris | credit-a | credit-g | average
fidelity | 87.88% | 89.26% | 84.50% | 96.25% | 84.13% | 74.10% | 86.02%
Table 3 Comparison of generalization error
data set | REFNE | ensemble | single NN | C4.5

Xiaoli Z. Fern and Carla Brodley. Boosting Lazy Decision Trees. ICML. 2003.

but behave less consistently. For three data sets, Hepatitis, Lympho and Monk2, bagging significantly degrades the performance of the base learner. This is possibly caused by the sub-sampling procedure used by bagging to generate different

Takashi Matsuda and Hiroshi Motoda and Tetsuya Yoshida and Takashi Washio. Mining Patterns from Structured Data by Beam-Wise Graph-Based Induction. Discovery Science. 2002.

attributes and predictive accuracy was evaluated. The best result obtained by this approach was better than the previously best known result. B-GBI was then applied to a real-world data, Hepatitis dataset provided by Chiba University. Our very preliminary results indicate that B-GBI can actually handle graphs with a few thousands nodes and extract discriminatory patterns. 1 Introduction Over the last

Włodzisław Duch and Karol Grudzinski. Ensembles of Similarity-based Models. Intelligent Information Systems. 2001.

(except for the data described here we have tried sonar and hepatitis datasets from UCI [12]) the improvements have been insignificant. This shows that an ensemble of models of similar types may sometimes fail to improve the results. One reason for this may come from

Petri Kontkanen and Petri Myllymäki and Tomi Silander and Henry Tirri and Peter Grünwald. On predictive distributions and Bayesian networks. Department of Computer Science, Stanford University. 2000.

2 9
Iris (IR) | 150 | 5 | 3 | 5
Lymphography (LY) | 148 | 19 | 4 | 5
Australian (AU) | 690 | 15 | 2 | 10
Breast Cancer (BC) | 286 | 10 | 2 | 11
Diabetes (DB) | 768 | 9 | 2 | 12
Glass (GL) | 214 | 10 | 6 | 7
Hepatitis (HE) | 150 | 20 | 2 | 5
Table 1: The datasets used in the experiments
For comparing the predictive accuracy of different predictive distributions, we used two different utility functions: the log-score and the 0/1-score. The log-score of a

Gary M. Weiss and Haym Hirsh. A Quantitative Study of Small Disjuncts: Experiments and Results. Department of Computer Science Rutgers University. 2000.

in the high-ER/medium-EC group, which starts with the Hepatitis dataset, show more improvement, but have more room for improvement due to their higher error rate. The datasets in the high-ER/low-EC group, which start with the Coding dataset, show a net increase in error

David W. Opitz and Richard Maclin. Popular Ensemble Methods: An Empirical Study. J. Artif. Intell. Res. (JAIR), 11. 1999.

then the decision-tree ensemble methods also had lower (or higher) error than their neural network counterpart. The exceptions to this rule generally happened on the same data set for all three ensemble methods (e.g., hepatitis, soybean, satellite, credit-a, and heart-cleveland). These results suggest that (a) the performance of the ensemble methods is dependent on both the

Ykä Huhtala and Juha Kärkkäinen and Pasi Porkka and Hannu Toivonen. Efficient Discovery of Functional and Approximate Dependencies Using Partitions. ICDE. 1998.

out of memory. Table 2 shows performance results for TANE/MEM in the approximate dependency discovery task, for different thresholds ε. Results for the Hepatitis, Wisconsin breast cancer, and Chess data sets are also presented graphically in Figure 3: N_ε / N_0 stands for the number of approximate dependencies found relative to the case for functional dependencies; similarly, Time_ε / Time_0 denotes

Floriana Esposito and Donato Malerba and Giovanni Semeraro. A Comparative Analysis of Methods for Pruning Decision Trees. IEEE Trans. Pattern Anal. Mach. Intell, 19. 1997.

This means that methods requiring a pruning set labor under a disadvantage. Nonetheless, the misclassification rate of the OPTT is not always lower than the error rate of the OPGT. Hence, in some data sets, like Hepatitis, Hungary, and Switzerland above, grown trees can be better starting points for pruning processes than trained trees. Finally, the standard error reported in Table 3 confirms the

Prototype Selection for Composite Nearest Neighbor Classifiers. Department of Computer Science University of Massachusetts. 1997.

4.11 Relationships between component accuracy and diversity for the Cleveland Heart Disease, LED-7 Digit, Hepatitis and Breast Cancer Wisconsin data sets for the four boosting algorithms. "c" represents the Coarse Reclassification algorithm; "d", Deliberate Misclassification; "f", Composite Fitness; and "s", Composite Fitness--Feature Selection.

Ron Kohavi. The Power of Decision Tables. ECML. 1995.

and achieves similar performance on nine out of the 22 datasets (australian, cleve, crx, german, hepatitis, horse-colic, iris, lymphography, and soybean). Running times on a Sparc 10 varied from about one minute for the Monk datasets to 15 hours for the dna

Peter D. Turney. Cost-Sensitive Classification: Empirical Evaluation of a Hybrid Genetic Decision Tree Induction Algorithm. CoRR, csAI/9503102. 1995.

The smallest dataset of the five we examine here is the Hepatitis dataset, which has 155 cases. The training sets had 103 cases and the testing sets had 52 cases. The sub-training and sub-testing sets had 51 or 52

Christophe Giraud and Tony Martinez and Christophe G. Giraud-Carrier. ILA: Combining Inductive Learning with Prior Knowledge and Reasoning. University of Bristol, Department of Computer Science. 1995.

1. If representative voted 'no' on the 'physician-fee-freeze' issue, then rep. is a democrat
hepatitis dataset:
1. If patient is between 21 and 30, then patient lives
2. If patient is between 51 and 60, is a male, uses steroids, has malaise, has a liver that is big and firm, has high bilirubin and high

Gabor Melli. A Lazy Model-Based Approach to On-Line Classification. University of British Columbia. 1989.

DBPredictor achieved a higher error rate on five datasets: liver-disease, hepatitis, heart-c, credit-g, and echocardiogram. Based on this evidence pruning appears to significantly lower DBPredictor's vulnerability to overspecialization. CHAPTER 7. EMPIRICAL


shuttle-exp consists of the complete set of 278 instances resulting from expanding the 15 rules of the shuttle-landing-control (shuttle-l-c) dataset. 6 Reported results for hepatitis and shuttle-exp were gathered using 10-way cross validation. Results for the Monk problems used the provided training and test sets, and results for shuttle-l-c

Federico Divina and Elena Marchiori. Handling Continuous Attributes in an Evolutionary Inductive Learner. Department of Computer Science Vrije Universiteit.

in most of the cases, with simplicity (that is, the number of clauses of the output program) that is second best after ECL-GSD. ECL-LSDf produces best results on the Echocardiogram and Hepatitis dataset, ECL-GSD on Glass2, but the results are only slightly better than those of ECL-LSDc. The unsupervised variant ECL-LUD produces satisfactory approximate solutions, yet of quality inferior to that of

Zhi-Hua Zhou and Xu-Ying Liu. Training Cost-Sensitive Neural Networks with Methods Addressing the Class Imbalance Problem.

Qav Qav echocardiogram .779 ± .135 Multi-class data set Cost (a) Cost (b) Cost (c) hepatitis .552 ± .201 lymphography .365 ± .092 .375 ± .170 .308 ± .142 heart s .774 ± .083 glass .615 ± .108 .615 ± .128 .638 ± .134 heart .790 ± .092 waveform .815 ± .037

Rafael S. Parpinelli and Heitor S. Lopes and Alex Alves Freitas. An Ant Colony Algorithm for Classification Rule Discovery. CEFET-PR, Curitiba.

namely Ljubljana breast cancer, Wisconsin breast cancer, Hepatitis and Heart disease. In two data sets, Ljubljana breast cancer and Heart disease, the difference was quite small. In the other two data sets, Wisconsin breast cancer and Hepatitis, the difference was more relevant. Note that although

Włodzisław Duch and Rafal Adamczak and Geerd H. F Diercksen. Neural Networks from Similarity Based Perspective. Department of Computer Methods, Nicholas Copernicus University.

85.5% (with 20 neurons), and inserting a new value that does not appear in the data, such as -100, decreased accuracy to 81.5% (using 22 neurons). The same behavior has been observed for the Hepatitis dataset taken from the same source. The data contains 155 vectors and 18 attributes; 13 of them are binary, the others have integer values. The last attribute has 67 missing values, attribute 16 has 29 missing

Włodzisław Duch and Karol Grudzinski and Geerd H. F Diercksen. Minimal distance neural methods. Department of Computer Methods, Nicholas Copernicus University.

of the number of neighbors and for vowel the r-NN method gives 57.8% accuracy, but in both
TABLE I: The appendicitis, Wisconsin breast cancer data, hepatitis and the Cleveland heart data.
Dataset and method | Leave-one-out %
The appendicitis data
Bayes rule (statistical) | 83.0
CART, C4.5 (dec. trees) | 84.9
MLP+backpropagation | 85.8
RIAC (prob. inductive) | 86.9
9-NN | 89.6
PVM, C-MLP2LN (logical

Włodzisław Duch and Rafal Adamczak and Krzysztof Grabczewski. Optimization of Logical Rules Derived by Neural Procedures. Department of Computer Methods, Nicholas Copernicus University.

(if there is no test set). All data are from the UCI repository [11], except for the appendicitis, obtained from the authors of [12] paper. Hepatitis dataset contains many missing values and if averages are used meaningless rules are obtained; here only attributes with few missing values were used (no more than 5). NASA shuttle (described below) and the
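The attribute selection described in this excerpt (keeping only attributes with no more than 5 missing values) could be sketched roughly as follows. This is a minimal illustration, not the authors' procedure: the function name `attributes_with_few_missing`, the `"?"` missing-value marker, and the toy rows are all assumptions made for the example.

```python
def attributes_with_few_missing(rows, max_missing=5, missing="?"):
    """Return the column indices whose number of missing markers is <= max_missing.

    `rows` is a list of equal-length records; `missing` is the marker that
    stands for an unknown value (assumed here to be "?").
    """
    n_cols = len(rows[0])
    counts = [0] * n_cols
    for row in rows:
        for j, value in enumerate(row):
            if value == missing:
                counts[j] += 1
    return [j for j, count in enumerate(counts) if count <= max_missing]


# Toy data (hypothetical): column 0 has one missing value, column 1 has two,
# column 2 has none.
rows = [
    ["30", "?", "1"],
    ["50", "?", "2"],
    ["?", "1", "1"],
]
kept = attributes_with_few_missing(rows, max_missing=1)
```

With the threshold set to 1, columns 0 and 2 are kept; the excerpt's threshold of 5 is the default value of `max_missing`.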

Włodzisław Duch and Rafal Adamczak and Geerd H. F Diercksen. Classification, Association and Pattern Completion using Neural Similarity Based Methods. Department of Computer Methods, Nicholas Copernicus University.

85.5% (with 20 neurons), and inserting a new value that does not appear in the data, such as -100, decreased accuracy to 81.5% (using 22 neurons). The same behavior has been observed for the Hepatitis dataset taken from the same source. The data contains 155 vectors and 18 attributes; 13 of them are binary, the others have integer values. The last attribute has 67 missing values, attribute 16 has 29 missing

Elena Smirnova and Ida G. Sprinkhuizen-Kuyper and I. Nalbantis. Unanimous Voting using Support Vector Machines. IKAT, Universiteit Maastricht; ERIM, Universiteit Rotterdam.

using the proportions of correctly and incorrectly classified instances of SVM and VSSVM. All the cases in Table 1 resulted in a considerable information gain. We especially mention the Hepatitis dataset (polynomial kernel) for which the gain is 0.72 and the Sonar dataset (polynomial kernel) for which the gain is 0.68. 1 Note VS(I^+, I^-) = VS(I^+_f, I^-_f) VS(I^+_n, I^-_n) and VS(I^+, I^-

Rafael S. Parpinelli and Heitor S. Lopes and Alex Alves Freitas. An Ant Colony Based System for Data Mining: Applications to Medical Data. CEFET-PR, CPGEI Av. Sete de Setembro, 3165.

AntClass | 95.47% ± 1.62 | 5.60 ± 0.80 | 12.50 ± 2.84
C4.5 | 95.02% ± 0.31 | 11.1 ± 1.45 | 44.1 ± 7.48
Hepatitis Data Set
AntClass | 88.75% ± 6.73 | 2.70 ± 0.46 | 7.50 ± 2.01
C4.5 | 85.96% ± 1.07 | 4.4 ± 0.93 | 8.5 ± 3.04
Dermatology Data Set
AntClass | 84.21% ± 6.34 | 6.00 ± 0.00 | 79.00 ± 3.46
C4.5 | 89.05% ± 0.62 | 23.2 ± 1.99 | 91.7 ±

Suresh K. Choubey and Jitender S. Deogun and Vijay V. Raghavan and Hayri Sever. A comparison of feature selection algorithms in the context of rough classifiers.

different and zero otherwise, and the difference between two quantitative values is normalized into the interval [0,1]. We first consider results from Table 2. Except for Glass, Monks, and Hepatitis data sets, the performance obtained in Predictive Experiments approach those in the case of Upperbound Experiments. This suggests that for Glass, Monks, and Hepatitis data data set Size No. of Attributes
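The attribute-level difference this excerpt alludes to can be illustrated with a short sketch. The excerpt is truncated, so several things here are assumptions rather than the authors' definition: that the "different and zero otherwise" clause refers to nominal values (difference 1 when they differ), that quantitative differences are normalized by dividing by the attribute's range, and that per-attribute differences are summed. All function and parameter names are hypothetical.

```python
def attribute_difference(a, b, is_nominal, value_range=1.0):
    """Difference between two values of one attribute, in [0, 1].

    Nominal values: 1 if different, 0 otherwise (assumed reading of the excerpt).
    Quantitative values: absolute difference divided by the attribute's range
    (one common way to normalize into [0, 1]; an assumption here).
    """
    if is_nominal:
        return 0.0 if a == b else 1.0
    if value_range == 0:
        return 0.0  # a constant attribute cannot separate two records
    return abs(a - b) / value_range


def record_difference(x, y, nominal_flags, ranges):
    """Sum of per-attribute differences (the aggregation rule is an assumption)."""
    return sum(
        attribute_difference(a, b, nominal, value_range)
        for a, b, nominal, value_range in zip(x, y, nominal_flags, ranges)
    )


# Hypothetical records: (sex, age), with age spanning a range of 60 years.
x = ("male", 30.0)
y = ("female", 60.0)
total = record_difference(x, y, nominal_flags=(True, False), ranges=(1.0, 60.0))
```

Here the nominal attribute contributes 1 and the age attribute contributes 30/60, so both kinds of attribute are on a comparable scale.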

Takao Mohri and Hidehiko Tanaka. An Optimal Weighting Criterion of Case Indexing for Both Numeric and Symbolic Attributes. Information Engineering Course, Faculty of Engineering The University of Tokyo.

(vote, soybean, crx, hypo) were in the distribution floppy disk of Quinlan's C4.5 book (Quinlan 1993). The remaining four data sets (iris, hepatitis, led, led-noise) were obtained from the Irvine Machine Learning Database (Murphy & Aha 1994). Including our 3 methods, VDM, PCF, CCF, IB4, and C4.5 are compared. Quinlan's C4.5 is a

Włodzisław Duch and Rafał Adamczak. Statistical methods for construction of neural networks. Department of Computer Methods, Nicholas Copernicus University.

p_i(x) - p_r(x) around x for which the two distributions cross. The simplest network constructed from FDA solution gives classification error which is as good as the original FDA. For such datasets [12] as Wisconsin breast cancer, hepatitis, Cleveland heart disease or diabetes the network obtains better results already before the learning process starts, but for some datasets this is not the

Chris Drummond and Robert C. Holte. C4.5, Class Imbalance, and Cost Sensitivity: Why Under-Sampling beats Over-Sampling. Institute for Information Technology, National Research Council Canada.

as they produced cost curves that captured all the qualitative features we observed in a larger set of experiments (including other UCI data sets: vote, hepatitis, labor, letter-k and glass2). For these data sets, under-sampling combined with C4.5 is a useful baseline to evaluate other algorithms. Over-sampling, on the other hand, is not to

Alexander K. Seewald. Towards Understanding Stacking: Studies of a General Ensemble Learning Scheme. Dissertation submitted for the degree of Doctor of Technical Sciences.

heart-statlog. Compressed glyph visualization for dataset hepatitis. Figure 8.6: Glyph visualization for datasets audiology to hepatitis. Compressed glyph visualization for dataset ionosphere. Compressed glyph visualization for dataset iris.

Ida G. Sprinkhuizen-Kuyper and Elena Smirnova and I. Nalbantis. Reliability yields Information Gain. IKAT, Universiteit Maastricht.

The contribution of the set not covered by VSSVM to the entropy is equal to: (1a nc log 2 a nc11All cases in table 1 resulted in a considerable information gain. We especially mention the hepatitis dataset (the case of polynomial kernel) of which the information gain is 0.72 (we even obtained perfect information!) and the labor dataset (the case of polynomial kernel) of which the information gain is
