Adult Data Set
Below are papers that cite this data set, with context shown.
Papers were automatically harvested and associated with this data set, in collaboration with Rexa.info.
Rakesh Agrawal and Ramakrishnan Srikant and Dilys Thomas. Privacy Preserving OLAP. SIGMOD Conference. 2005.
present among the original set of queried columns. 7. EXPERIMENTS We next present an empirical evaluation of our algorithms on real as well as synthetic data. For real data, we used the Adult dataset, from the UCI Machine Learning Repository [5], which has census information. The Adult dataset contains about 32,000 rows with 4 numerical columns. The columns and their ranges are: age[17-90],
Rich Caruana and Alexandru Niculescu-Mizil. An Empirical Evaluation of Supervised Learning for ROC Area. ROCAI. 2004.
selection is done using the 1k validation sets, SVMs move slightly ahead of the neural nets.) Boosted stumps and plain decision trees are not competitive, though boosted stumps are best on the Adult data set. It is interesting to note that boosting weaker stump models is clearly inferior to boosting full decision trees on most of the test problems: boosting full decision trees yields better performance
Rich Caruana and Alexandru Niculescu-Mizil and Geoff Crew and Alex Ksikes. Ensemble selection from libraries of models. ICML. 2004.
but each ensemble is just a weighted average of models, so the average of a set of ensembles also is a simple weighted average of the base-level models. Bagging is discussed in Section 5.3. 3. Data Sets We experiment with seven problems: ADULT, COVER TYPE, LETTER.p1, and LETTER.p2 from the UCI Repository (Blake & Merz, 1998), MEDIS, a pneumonia data set, SLAC, data from collaborators at the
Bianca Zadrozny. Learning and evaluating classifiers under sample selection bias. ICML. 2004.
3.5. Experimental results To verify the effects of sample selection bias experimentally, we apply Naive Bayes, logistic regression, C4.5 and SVMLight (soft margin) (Joachims, 2000b) to the Adult dataset, available from the UCI Machine Learning repository (Blake & Merz, 1998). We assume that the original dataset is not biased and artificially simulate biasedness by generating a value for s for each
Wei-Chun Kao and Kai-Min Chung and Chia-Liang Sun and Chih-Jen Lin. Decomposition Methods for Linear Support Vector Machines. Neural Computation, 16. 2004.
used are in Table 3.2. The four small problems are from the statlog collection (Michie, Spiegelhalter, and Taylor 1994). The problem adult is compiled by Platt (1998) from the UCI "adult" data set (Blake and Merz 1998). Problem web is also from Platt. Problem ijcnn is from the first problem of IJCNN challenge 2001 (Prokhorov 2001). Note that we use the winner's transformation of the raw data
Saharon Rosset. Model selection via the AUC. ICML. 2004.
to prefer the KNN model most of the time. This illustrates the "bias" in using AUC to select classification models, which we discuss in section 3.2. Finally, we performed experiments on a real-life data set. We used the Adult dataset available from the UCI repository (Blake & Merz, 1998). We used only the first ten variables in this dataset, to make a large-scale experiment feasible, and compared
Andrew W. Moore and Weng-Keen Wong. Optimal Reinsertion: A New Search Operator for Accelerated and More Accurate Bayesian Network Structure Learning. ICML. 2003.
if and only if an odd number of parents have value "True". The nodes are thus noisy exclusive-ors and so it is hard to learn a set of parents incrementally. Synth2 Synth3 Synth4 Figure 3. Synthetic datasets described in Section 3.1. R m AA adult 49K 15 7.7 Contributed to UCI by Ron Kohavi alarm 20K 37 2.8 Data generated from a standard Bayes Net benchmark (Beinlich et al., 1989). biosurv 150K 24 3.5
Alexander J. Smola and Vishy Vishwanathan and Eleazar Eskin. Laplace Propagation. NIPS. 2003.
[...] (15) with the joint minimizer being the average of the individual solutions. 5 Experiments To test our ideas we performed a set of experiments with the widely available Web and Adult datasets from the UCI repository [1]. All experiments were performed on a 2.4 MHz 2 Note that we had to replace the equality with set inclusion due to the fact that Ü is not everywhere differentiable, hence
I. Yoncaci. Maximum a Posteriori Tree Augmented Naive Bayes Classifiers. Institut d'Investigació en Intel·ligència Artificial, CSIC. 2003.
In the rest of the section we discuss and justify these assertions into more detail. 14 Dataset MAPTAN MAPTAN+BMA sTAN sTAN+BMA adult 17.18 ± 0.68 17.19 ± 0.71 17.60 ± 0.82 17.60 ± 0.80 australian 19.91 ± 1.14 19.62 ± 1.13 25.39 ± 1.18 24.96 ± 1.13 breast 17.23 ± 1.21 16.89 ± 1.28 8.73 ± 0.87
Christopher R. Palmer and Christos Faloutsos. Electricity Based External Similarity of Categorical Attributes. PAKDD. 2003.
than the distance functions computed by D fr;P . Since D fr;P has been previously evaluated using single link hierarchical clustering, that is the algorithm that we will use here [3]. The three data sets we evaluated are: 1. Adult - a selection of fields from the 1994 census data collected in the United States. There are 32,561 training examples and 16,281 test examples with 6 numeric fields and 8
S. Sathiya Keerthi and Chih-Jen Lin. Asymptotic Behaviors of Support Vector Machines with Gaussian Kernel. Neural Computation, 15. 2003.
into (training set, test set) partitions. We consider only the first of those realizations. In addition, the problem adult from the UCI "adult" data set (Blake and Merz 1998) and the problem web, both as compiled by Platt (1998), are also included. For each of these two datasets also, there are several realizations. For our study here, we only
Thomas Serafini and Gaetano Zanghirati and Luca Zanni. DIPARTIMENTO DI MATEMATICA. Gradient Projection Methods for. 2003.
the MNIST database of handwritten digits [24] and the UCI Adult data set [27]. These experiments are carried out on a Compaq XP1000 workstation at 667MHz with 1GB of RAM, with standard C codes. All the considered methods compute the projection on the special feasible
Bart Hamers and J. A. K. Suykens and Bart De Moor. Coupled Transductive Ensemble Learning of Kernel Models. 2003.
0.16 5.08e4 0.27 0.28 7.92e4 0.21 0.21 8.37e4 0.15 0.14 5.08e4 0.24 0.24 8.34e4 1 0 1 9.8e5 1 2.7e3 Table 2: Misclassification rates on a test set (Tic-Tac-Toe (TTT), Australian Credit Card Data Set (ACR) and the ADULT Data Set (ADULT)). The number of models is indicated by the second number in Table 1, example TTT11 is an ensemble model based on 11 individual models on, the TTT prediction. We
Ramesh Natarajan and Edwin P D Pednault. Segmented Regression Estimators for Massive Data Sets. SDM. 2002.
Figure 1. A comparison of the lift on the Fingerhut data for the "Consolidated Payout Model" using the LRT methodology (left), and for the "Response Model" using the NBT methodology (right). 6.3 Adult data set This is a standard data set from [4] with the 32561 training and 16281 test data and about 7% missing value records in the training data. The data has 6 continuous and 8 nominal features and the
Bianca Zadrozny and Charles Elkan. Transforming classifier scores into accurate multiclass probability estimates. KDD. 2002.
the number of examples with score s that belong to class c divided by the total number of examples [Figure: Adult Dataset; x-axis: NB Score; y-axis: Empirical class membership probability] The Insurance Company
Nitesh V. Chawla and Kevin W. Bowyer and Lawrence O. Hall and W. Philip Kegelmeyer. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. (JAIR), 16. 2002.
is to distinguish between nasal (class 0) and oral sounds (class 1). There are 5 features. The class distribution is 3,818 samples in class 0 and 1,586 samples in class 1. 3. The Adult dataset (Blake & Merz, 1998) has 48,842 samples with 11,687 samples belonging to the minority class. This dataset has 6 continuous features and 8 nominal features. SMOTE and SMOTE-NC (see Section 6.1)
S. Sathiya Keerthi and Kaibo Duan and Shirish Krishnaj Shevade and Aun Neow Poo. A Fast Dual Algorithm for Kernel Logistic Regression. ICML. 2002.
much faster than the BFGS algorithm. The difference is much higher for large values of C. To see how the cost of the SMO algorithm scales with data size, an experiment was done on the UCI Adult dataset (Merz and Murphy, 1998) by gradually increasing the training set size from 1605 to 22696 in eight steps and observing the training time. A line was then fitted to the plot of the log of the training
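The scaling measurement described in the excerpt above can be sketched as a least-squares line through log(training time) against log(training set size), with the slope read as the empirical scaling exponent. This is only an illustration: the intermediate set sizes (geometric spacing between 1605 and 22696) and the timings (synthetic values generated from an assumed quadratic cost) are assumptions, not figures from the paper.

```python
import math


def fit_loglog_line(sizes, times):
    """Least-squares fit of log(time) = slope * log(size) + intercept."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Assumption: eight geometrically spaced sizes from 1605 to 22696.
# The excerpt gives only the end points and the number of steps.
sizes = [round(1605 * (22696 / 1605) ** (i / 7)) for i in range(8)]

# Assumption: synthetic timings from a hypothetical quadratic cost model,
# standing in for measured training times.
times = [1e-6 * n ** 2 for n in sizes]

slope, intercept = fit_loglog_line(sizes, times)
print(f"estimated scaling exponent: {slope:.3f}")
```

With real measurements in place of the synthetic `times`, a slope near 1 would indicate roughly linear scaling and a slope near 2 roughly quadratic scaling.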
Stephen D. Bay and Michael J. Pazzani. Detecting Group Differences: Mining Contrast Sets. Data Min. Knowl. Discov, 5. 2001.
3 This program is available from http://fuzzy.cs.Uni-Magdeburg.de/~borgelt/. Version 1.8 of his program is incorporated in the data mining tool Clementine. 18 issues. We used the following datasets which are summarized in Table 2. Adult The Adult Census data contains information extracted from the 1994 Current Population Survey. There are variables such as age, working class, education, sex,
Jie Cheng and Russell Greiner. Learning Bayesian Belief Network Classifiers: Algorithms and System. Canadian Conference on AI. 2001.
used in the experiments. Dataset Attributes. Classes Instances Train Test Adult 13 2 32561 16281 Nursery 8 5 8640 4320 Mushroom 22 2 5416 2708 Chess 36 2 2130 1066 DNA 60 3 2000 1186 The experiments were carried out using our
Zhiyuan Chen and Johannes Gehrke and Flip Korn. Query Optimization In Compressed Database Systems. SIGMOD Conference. 2001.
are not compressed. TPC-H data contains 8 tables and 61 attributes, 23 of which are string-valued. The string attributes account for about 60% of the total database size. We also used a 4MB dataset with US census data, the adult data set [5] for experiments on compression strategies. The adult dataset contains a single table with 14 attributes, 8 of them string-valued, accounting for about 80%
Stephen D. Bay. Multivariate Discretization for Set Mining. Knowl. Inf. Syst, 3. 2001.
Data Set #Features # Continuous # Examples Adult 14 5 48812 CensusIncome 5 41 7 199523 SatImage 37 36 6435 Shuttle 10 9 48480 UCI Admissions 19 8 123028 Table 3. Discretization Time in CPU seconds Data Set
Bernhard Pfahringer and Geoffrey Holmes and Richard Kirkby. Optimizing the Induction of Alternating Decision Trees. PAKDD. 2001.
krvskp 3196 0.0 0 36 labor 57 33.6 8 8 mushroom 8124 1.3 0 22 promoters 106 0.0 0 57 sickeuthyroid 3163 6.5 7 18 sonar 208 0.0 60 0 splice 3190 0.0 0 61 vote 435 5.3 0 16 vote1 3 435 5.5 0 15 KDD Datasets coil 5822/4000 0.0 85 0 adult 32561/16281 0.2 6 8 art1 50000/50000 0.0 0 50 art2 50000/50000 0.0 25 25 art3 50000/50000 0.0 50 0 This section compares the performance of the original optimized
Kristin P. Bennett and Ayhan Demiriz and John Shawe-Taylor. A Column Generation Algorithm For Boosting. ICML. 2000.
all points and # i measures the additional margin obtained by each point. AdaBoost also minimizes a margin cost function based on the margin obtained by each point. We ran experiments on two larger datasets: Forest and Adult from UCI (Murphy & Aha, 1992). Forest is a 54-dimension dataset with 7 possible classes. The data are divided into 11340 training, 3780 validation and 565892 testing instances.
Dmitry Pavlov and Jianchang Mao and Byron Dom. Scaling-Up Support Vector Machines Using Boosting Algorithm. ICPR. 2000.
one week of February 1998. The classification task, as we pose it, is to predict whether a user will visit the most popular site S based on his/her visiting pattern of all other sites. The Adult data set is available at UCI machine learning repository [1]. The task is to predict if the income of a person is greater than 50K based on several census parameters, such as age, education, marital status
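The prediction task described in the excerpt above (income greater than 50K from census attributes) can be sketched as a parser that turns one comma-separated record into a feature dictionary and a binary label. The column names are written from memory of the dataset's attribute list and should be checked against the downloaded files. The two sample rows are illustrative of the file layout only, and `parse_record` is a hypothetical helper, not code from any cited paper.

```python
# Assumed column order of the comma-separated Adult files: 14 attributes
# followed by the income class.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]

# The six continuous attributes; the remaining eight are nominal.
NUMERIC = {
    "age", "fnlwgt", "education-num",
    "capital-gain", "capital-loss", "hours-per-week",
}


def parse_record(line):
    """Parse one record into (features, label); label is 1 for income >50K."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    record = {}
    for name, value in zip(COLUMNS, fields):
        if value == "?":                 # "?" marks a missing value
            record[name] = None
        elif name in NUMERIC:
            record[name] = int(value)
        else:
            record[name] = value
    # Test-split labels carry a trailing period, so strip it before comparing.
    label = 1 if record.pop("income").rstrip(".") == ">50K" else 0
    return record, label


# Illustrative rows in the assumed layout (not quoted from the data files).
row_a = ("39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
         "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K")
row_b = ("52, Self-emp-not-inc, 209642, HS-grad, 9, Married-civ-spouse, ?, "
         "Husband, White, Male, 0, 0, 45, United-States, >50K.")

features_a, label_a = parse_record(row_a)
features_b, label_b = parse_record(row_b)
print(label_a, label_b)
```

Missing values are kept as `None` rather than dropped, so that a downstream learner can decide how to handle them.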
Gary M. Weiss and Haym Hirsh. A Quantitative Study of Small Disjuncts: Experiments and Results. Department of Computer Science Rutgers University. 2000.
were compared as the training set size was varied. Because disjuncts of a specific size for most concepts cover very few examples, statistically valid comparisons were possible for only 4 of the 30 datasets (Coding, Move, Adult and Market2); with the other datasets the number of examples covered by disjuncts of a given size is too small. The results for the Coding dataset are shown in Figure 8.
Dmitry Pavlov and Darya Chudova and Padhraic Smyth. Towards scalable support vector machines using squashing. KDD. 2000.
were "The Microsoft Anonymous Web" and the "Forest Cover Type" datasets available at UCI KDD archive [Bay99] and Adult dataset available at UCI machine learning repository [BM98]. Web data reflects the Web pages of www.microsoft.com that each user visited during one
Jie Cheng and Russell Greiner. Comparing Bayesian Network Classifiers. UAI. 1999.
used in the experiments. Instances Dataset Attributes. Classes Train Test Adult 13 2 32561 16281 Nursery 8 5 8640 4320 Mushroom 22 2 5416 2708 Chess 36 2 2130 1066 Car 6 4 1728 CV5 Flare 10 3 1066 CV5 Vote 16 2 435 CV5 Brief descriptions of
Petri Kontkanen and Jussi Lahtinen and Petri Myllymaki and Tomi Silander and Henry Tirri. Proceedings of Pre and Postprocessing in Machine Learning and Data Mining: Theoretical Aspects and Applications, a workshop within Machine Learning and Applications. Complex Systems Computation Group (CoSCo). 1999.
via the CoSCo group home page. 42 Australian Credit Balance Scale Connect4 German Credit Thyroid Disease Vehicle Silhouettes Figure 1: Examples of the two-dimensional visualizations obtained. 43 dataset size #attrs. #classes Adult 32561 15 2 Australian Credit 690 15 2 Balance Scale 625 5 3 Breast Cancer (Wisconsin) 699 11 2 Breast Cancer 286 10 2 Connect4 67557 43 3 Credit Screening 690 16 2 Pima
Ykä Huhtala and Juha Kärkkäinen and Pasi Porkka and Hannu Toivonen. Efficient Discovery of Functional and Approximate Dependencies Using Partitions. ICDE. 1998.
(shown only in the table). Approximate dependencies could not be discovered in the Adult data set with TANE/MEM due to the lack of main memory. To find out how the number of rows affects the algorithms, we ran a series of experiments with increasing number of rows. The relations were formed by
John C. Platt. Using Analytic QP and Sparseness to Speed Training of Support Vector Machines. NIPS. 1998.
can be found in [8, 7]. The first test set is the UCI Adult data set [5]. The SVM is given 14 attributes of a census form of a household and asked to predict whether that household has an income greater than $50,000. Out of the 14 attributes, eight are categorical
Ron Kohavi. Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid. KDD. 1996.
the algorithm, and 20 intervals were used. The error bars show 95% confidence intervals on the accuracy, based on the left-out sample. In most cases it is clear that even with much more 1 The Adult dataset is from the Census bureau and the task is to predict whether a given adult makes more than $50,000 a year based on attributes such as education, hours of work per week, etc.
Ron Kohavi and Barry G. Becker and Dan Sommerfield. Improving Simple Bayes. Data Mining and Visualization Group Silicon Graphics, Inc.
especially larger ones, such as segment, mushroom, letter, and adult for a total of 37. The specific datasets are shown below in Table 2. Our main concern with estimating accuracy is that the estimate should be precise. Therefore, we ran different inducers on these datasets in two forms. If the dataset was
Shi Zhong and Weiyu Tang and Taghi M. Khoshgoftaar. Boosted Noise Filters for Identifying Mislabeled Data. Department of Computer Science and Engineering Florida Atlantic University.
in Table 1. Overall, BBFI significantly outperforms BBFII, except for low ( 20%) noise levels for the adult, car, and nursery datasets. The reason BBFII performs poorly may be that too many clean instances are weighted low. The noise filter constructed in the next round loses strong support from clean data instances, which are
David R. Musicant. DATA MINING VIA MATHEMATICAL PROGRAMMING AND MACHINE LEARNING. Doctor of Philosophy (Computer Sciences) UNIVERSITY.
separability on SOR performance; 3.2 SOR, SMO, and SVM light comparison on the Adult dataset in R^123; 3.3 SOR and LPC comparison on 1 million point dataset in R^32; 3.4 SOR applied to 10 million point dataset in R^32; 4.1 SOR training
William W. Cohen and Yoram Singer. A Simple, Fast, and Effective Rule Learner. AT&T Labs-Research Shannon Laboratory.
average ranks among these three are 1.8, 2.3, and 1.9. The largest ruleset produced by SLIPPER is 49 rules (for coding). Finally, we evaluated the scalability of the rule learners on several large datasets. We used adult; blackjack, with the addition of 20 irrelevant noise variables; and market3, for which many examples were available. C4rules was not run, since it is known to have scalability
Haixun Wang and Philip S. Yu. SSDTNN: A Subspace-Splitting Decision Tree Classifier with Application to Target Selection. IBM T. J. Watson Research Center.
that have biased data distribution. We use 25% of the cases for training and the rest 75% for testing. The results are shown in Table 3. 19 Datasets Dataset Size Bias SPRINT SSDTNN Adult 32561 0.24 68 72.1 Anneal 5 798 0.08 100 100 Anneal U 798 0.04 100 100 breastcancer 466 0.37 76.9 89.2 Vehicle bus 846 0.23 43.3 68.2 Sick 3770 0.06 72.3
S. V. N Vishwanathan and Alexander J. Smola and M. Narasimha Murty. considerably faster than competing methods such as Sequential Minimal Optimization or the Nearest Point Algorithm. Machine Learning Program, National ICT for Australia.
was proposed by Alexis Wieland of MITRE Corporation and it is available from the CMU Artificial Intelligence repository. Both WSPBC and the Adult datasets are available from the UCI Machine Learning repository (Blake & Merz, 1998). We used the same values of # 2 as in Keerthi 3 It is a common misperception that different SV optimization algorithms
Grigorios Tsoumakas and Ioannis P. Vlahavas. Fuzzy Meta-Learning: Preliminary Results. Greek Secretariat for Research and Technology.
from the Machine Learning Repository at the University of Irvine, California (Blake & Merz, 1998). These were the adult and chess data sets, large enough (} 1000 examples) to simulate distributed environment. Only two domains were selected at this stage of our research to investigate the performance of the suggested methodology. The
Josep Roure Alcobe. Incremental Hill-Climbing Search Applied to Bayesian Network Structure Learning. Escola Universitària Politècnica de Mataró.
by means of a parameter nRSS. 3.2 Experimental Results In this section we compare the performance of repeatedly using the batch algorithms against the corresponding incremental approach. We used the datasets Adult (48.842 instances and 13 variables), Mushroom (8.124 inst. and 23 var.) and Nursery (12.960 inst. and 9 var.) from the UCI machine learning repository [9], the Alarm dataset (20.000 inst. and
Ayhan Demiriz and Kristin P. Bennett and John Shawe-Taylor. Linear Programming Boosting via Column Generation. Dept. of Decision Sciences and Eng. Systems, Rensselaer Polytechnic Institute.
of the final set of weak hypotheses. This is just a very simple method of boosting multiclass problems. Further investigation of LP multiclass approaches is needed. We ran experiments on larger datasets: Forest, Adult, USPS, and Optdigits from UCI (Murphy & Aha, 1992). Forest is a 54-dimension dataset with seven possible classes. The data are divided into 11340 training, 3780 validation, and 565892
Chris Giannella and Bassem Sayrafi. An Information Theoretic Histogram for Single Dimensional Selectivity Estimation. Department of Computer Science, Indiana University Bloomington.
was obtained from the UCI machine learning archive [1] (called the adult dataset there). We use the age column of the training dataset. The dataset was extracted from 1994 US census data. The shuttle2 dataset was downloaded from the "Esprit Project 5170 StatLog" archive
Rong-En Fan and P.-H. Chen and C.-J. Lin. Working Set Selection Using the Second Order Information for Training SVM. Department of Computer Science and Information Engineering National Taiwan University.
image, diabetes, covtype, breast-cancer, and abalone are from the UCI machine learning repository (Blake and Merz, 1998). Problems a1a and a9a are compiled in (Platt, 1998) from the UCI adult data set. Problems w1a and w8a are also from (Platt, 1998). The tree data set was originally used in (Bailey et al., 1993). The problem mg is a Mackey-Glass time series. The data sets cpusmall and splice are
Petri Kontkanen and Jussi Lahtinen and Petri Myllymaki and Tomi Silander and Henry Tirri. USING BAYESIAN NETWORKS FOR VISUALIZING HIGH-DIMENSIONAL DATA. Complex Systems Computation Group (CoSCo).
via the CoSCo group home page. 5 Australian Credit Balance Scale Connect4 German Credit Thyroid Disease Vehicle Silhouettes Figure 1: Examples of the two-dimensional visualizations obtained. 6 dataset size #attrs. #classes Adult 32561 15 2 Australian Credit 690 15 2 Balance Scale 625 5 3 Breast Cancer (Wisconsin) 699 11 2 Breast Cancer 286 10 2 Connect4 67557 43 3 Credit Screening 690 16 2 Pima
Ahmed Hussain Khan and Intensive Care. Multiplier-Free Feedforward Networks. 174.
network configuration, and many hundreds for promising ones. Test data results reported here represent the best performance of the optimal configurations. A. Forecasting the Onset of Diabetes This data set 5 is related to a group of adult women belonging to the Pima Indian tribe and was collected by the US National Institute of Diabetes and Digestive and Kidney Diseases [1]. The learning task is to
Luc Hoegaerts and J. A. K. Suykens and J. Vandewalle and Bart De Moor. Subset Based Least Squares Subspace Regression in RKHS. Katholieke Universiteit Leuven Department of Electrical Engineering, ESAT-SCD-SISTA.
side our approach achieves overall a much smaller O(nm) memory cost, compared to the typical O(n^2) and a computational complexity of O(nm^3) compared to the typical O(n^2). The ADULT UCI data set [33] consists of 45222 cases having 14 input variables. The aim is to classify if the income of a person is greater than 50K based on several census parameters, such as age, education, marital
David R. Musicant and Alexander Feinberg. Active Set Support Vector Regression.
Census 30k, is a version of the US Census Bureau Adult dataset, which is publicly available from Silicon Graphics' website [39]. This "Adult" dataset contains nearly 300,000 data points with 11 numeric attributes, and is used for predicting income levels based
Luc Hoegaerts and J. A. K. Suykens and J. Vandewalle and Bart De Moor. Primal Space Sparse Kernel Partial Least Squares Regression for Large Scale Problems (Special Session paper). Katholieke Universiteit Leuven Department of Electrical Engineering, ESAT-SCD-SISTA.
sample with added noise (dots). The subset consists of 5 points, marked with a full dot on the figure. The ADULT UCI data set [24] consists of 45222 cases having 14 input variables. The aim is to classify if the income of a person is greater than 50K based on several census parameters, such as age, education, marital
Kuan-Ming Lin and Chih-Jen Lin. A Study on Reduced Support Vector Machines. Department of Computer Science and Information Engineering National Taiwan University.
for protein secondary structure prediction [26]. Finally, the problem adult from the UCI "adult" data set [1] and compiled by Platt [20], is also included. For the adult dataset, there are several realizations. Here, we only consider the realization with the smallest training set; the full dataset with
Luca Zanni. An Improved Gradient Projection-based Decomposition Technique for Support Vector Machines. Dipartimento di Matematica, Università di Modena e Reggio Emilia.
in [8]. In order to analyze the behaviour of the two solvers within GPDT2 we consider three large test problems of the form (1) derived by training Gaussian SVMs on the well known UCI Adult data set [22], WEB data set [26] and MNIST data set [18]. A detailed description of the test problems generation is reported in the Appendix. All the experiments are carried out with standard C code running
Jeff G. Schneider and Andrew W. Moore. Active Learning in Discrete Input Spaces. School of Computer Science Carnegie Mellon University.
regression, confidence intervals can be obtained using the usual t distributions for the mean response of a linear fit. 3 Experimental Results We test our active learning algorithms using the adult data set from the UCI Irvine machine learning repository [2]. We use the age as a continuous output and all other attributes as inputs. The other continuous attributes were discretized to three levels for
Omid Madani and David M. Pennock and Gary William Flake. Co-Validation: Using Model Disagreement to Validate Classification Algorithms. Yahoo! Research Labs.
unlabeled data does not tend to wildly underestimate error, even though it's theoretically possible. 3 Experiments We conducted experiments on the 20 Newsgroups and Reuters-21578 text categorization datasets 1 , and the Votes, Chess, Adult and Optics datasets from the UCI collection [BKM98]. We chose 1 Available from http://www.ics.uci.edu/ and http://www.daviddlewis.com/resources/testcollections/ two
