Census Income Data Set
Below are papers that cite this data set, with context shown.
Papers were automatically harvested and associated with this data set, in collaboration with Rexa.info.
Aristides Gionis and Heikki Mannila and Panayiotis Tsaparas. Clustering Aggregation. ICDE. 2005.
by 22 categorical attributes, such as shape, color, odor, etc. There is a class label describing if a mushroom is poisonous or edible, and there are 2,480 missing values in total. Finally, the third dataset, census has been extracted from the census bureau database, and it contains demographic information on 32,561 people in the US. There are 8 categorical attributes (such as education, occupation,
Rakesh Agrawal and Ramakrishnan Srikant and Dilys Thomas. Privacy Preserving OLAP. SIGMOD Conference. 2005.
from the UCI Machine Learning Repository, which has census information. The Adult dataset contains about 32,000 rows with 4 numerical columns. The columns and their ranges are: age[17 - 90], fnlwgt[10000 - 1500000], hrsweek[1 - 100] and edunum[1 - 16]. For synthetic data, we used
Stanley Robson de Medeiros Oliveira. Data Transformation For Privacy-Preserving Data Mining. Doctor of Philosophy thesis. University of Alberta Library. 2005.
is also available at the UCI Repository of Machine Learning Databases. 9. Pumsb: The Pumsb dataset contains census data for population and housing. This dataset is available at http://www.almaden.ibm.com/software/quest. There are 49,046 records with 2,113 different data values (distinct items),
Dan Pelleg. Scalable and Practical Probability Density Estimators for Scientific Anomaly Detection. School of Computer Science Carnegie Mellon University. 2004.
would be to first try and estimate # (say, using a model with spherical Gaussians) and use the estimate to set the rectangle tails. Experiments on real-life data were done on the "mpg" and "census" datasets from the UCI repository (Blake & Merz, 1998). The "mpg" data has about 400 records with 7 continuous attributes. Running on this data with the number of components set to three, we get the
Douglas Burdick and Manuel Calimlim and Jason Flannick and Johannes Gehrke and Tomi Yiu. MAFIA: A Performance Study of Mining Maximal Frequent Itemsets. FIMI. 2003.
itemset patterns that peak around 10-25 items (see Figure 4). Chess and Connect4 are gathered from game state information and are available from the UCI Machine Learning Repository. The Pumsb dataset is census data from PUMS (Public Use Microdata Sample). Pumsb-star is the same dataset as Pumsb except all items of 80% support or more have been removed, making it less dense and easier to mine.
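As a rough illustration of the Pumsb-star construction described in this excerpt, the following Python sketch drops every item whose support is 80% or more. The helper name `remove_frequent_items`, the toy transactions, and the representation of each transaction as a set of item ids are assumptions made for illustration; they are not taken from the MAFIA paper.

```python
from collections import Counter


def remove_frequent_items(transactions, max_support=0.8):
    """Return copies of the transactions without items of support >= max_support.

    Hypothetical helper: support is the fraction of transactions containing an item.
    """
    n = len(transactions)
    if n == 0:
        return []
    # Count each item at most once per transaction.
    counts = Counter(item for t in transactions for item in set(t))
    too_frequent = {item for item, c in counts.items() if c / n >= max_support}
    return [set(t) - too_frequent for t in transactions]


# Toy data (assumed): item 1 appears in every transaction, so it is removed.
toy = [{1, 2, 3}, {1, 2}, {1, 3}, {1, 4}, {1, 2, 5}]
filtered = remove_frequent_items(toy)
```

Removing the near-universal items shortens every transaction, which is consistent with the excerpt's remark that the filtered dataset is less dense.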
Bart Hamers and J. A. K. Suykens and Bart De Moor. Coupled Transductive Ensemble Learning of Kernel Models. 2003.
donated by Kohavi. It involves the prediction of whether income exceeds 50,000 dollars a year based on census data. The original data set consists of 48,842 observations, each described by six numerical and eight categorical attributes. All the observations with missing values were removed from consideration. To show the use of our
Ke Wang and Shiyu Zhou and Ada Wai-Chee Fu and Jeffrey Xu Yu. Mining Changes of Classification by Correspondence Tracing. SDM. 2003.
German Credit Data from the UCI Repository of Machine Learning Databases, and IPUMS Census Data from . These data sets were chosen because no special knowledge is required to understand the addressed applications. To verify if the proposed method finds the changes that are supposed to be found, we need to know such
Dennis P. Groth and Edward L. Robertson. An Entropy-based Approach to Visualizing Database Structure. VDB. 2002.
The data we use in our visualization is drawn from a variety of sources, including the U.S. Census, the U.C.I. Machine Learning Repository, and the Wisconsin Benchmark. The specific dataset we used for the Census was the 1990 Indiana Public Use Microdata Sample (PUMS), which has 125 attributes. Our first application is the visualization of frequency distributions. An obvious technique
Eibe Frank and Geoffrey Holmes and Richard Kirkby and Mark A. Hall. Racing Committees for Large Datasets. Discovery Science. 2002.
Dataset        LogitBoost  #Iterations  Racing w/o pruning  Racing w pruning
anonymous      27.00%      60           28.24%              27.56%
adult          13.51%      67           14.58%              14.72%
shuttle        0.01%       86           0.08%               0.07%
census income  4.43%       448          4.90%               4.93%
The next dataset we consider is census-income. The first row of Figure 4 shows the results. The most striking aspect is the effect of pruning with small chunk sizes. In this domain the fluctuation in error is
James Bailey and Thomas Manoukian and Kotagiri Ramamohanarao. Fast Algorithms for Mining Emerging Patterns. PKDD. 2002.
[Figure: accuracy vs. threshold and user time (sec) vs. threshold for census (Ratio-Tree); series: threshold, complete]
Dataset     pt=4   pt=5   pt=6   pt=7   pt=8   pt=9   pt=10  original
mushroom    6.03   6.11   6.28   6.48   6.82   7.38   8.19   138.45
census      16.23  17.46  20.78  27.75  40.61  61.71  91.75  1028.00
ionosphere  1.37   1.43   1.45   1.56   1.67   1.83   1.99
Zhiyuan Chen and Johannes Gehrke and Flip Korn. Query Optimization In Compressed Database Systems. SIGMOD Conference. 2001.
are not compressed. TPC-H data contains 8 tables and 61 attributes, 23 of which are string-valued. The string attributes account for about 60% of the total database size. We also used a 4MB dataset of US census data, the adult data set, for experiments on compression strategies. The adult dataset contains a single table with 14 attributes, 8 of them string-valued, accounting for about 80%
Stephen D. Bay and Michael J. Pazzani. Detecting Group Differences: Mining Contrast Sets. Data Min. Knowl. Discov, 5. 2001.
sized divisions by frequency (e.g. income) or interval width (e.g. age). Finally, we further randomly sampled the data to obtain a 1 in 1000 sample. Federal Census data is one of the most difficult data sets to mine because of the long average record width coupled with the high number of popular attribute-value pairs which occur frequently in many records. These two factors combine to result in many
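The two discretization schemes mentioned in this excerpt (divisions by frequency versus by interval width) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the function names, the bin counts, and the toy values are hypothetical and do not come from the paper, and the equal-frequency version splits on rank only, so tied values may land in different bins.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k bins of equal interval width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


# Toy data (assumed): evenly spread ages, heavily skewed incomes.
ages = [20, 30, 40, 50, 60, 70, 80, 90]
incomes = [1, 2, 3, 4, 100, 1000]
age_bins = equal_width_bins(ages, 2)
income_bins_by_frequency = equal_frequency_bins(incomes, 3)
income_bins_by_width = equal_width_bins(incomes, 3)
```

On the skewed toy incomes, equal-width binning puts almost every value in the first bin, while equal-frequency binning spreads them evenly, which is one plausible reason to bin an attribute like income by frequency.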
Nikunj C. Oza and Stuart J. Russell. Experimental comparisons of online and batch versions of bagging and boosting. KDD. 2001.
used in our experiments. For the Soybean and Census Income datasets, we have given the sizes of the supplied training and test sets. For the remaining datasets, we have given the sizes of the training and test sets in our five-fold cross-validation runs. Data Set
Jinyan Li and Guozhu Dong and Kotagiri Ramamohanarao and Limsoon Wong. DeEPs: A New Instance-based Discovery and Classification System. Proceedings of the Fourth European Conference on Principles and Practice of Knowledge Discovery in Databases. 2001.
of DeEPs over the number of training instances. Decision speed. DeEPs is a fast classifier. Indeed, decision time per instance is typically a small fraction of a second. For only 5 of the 40 data sets (census inc, letter, satimage, pendigits, and waveform) decision time per instance exceeds 1 second. All these five data sets have a very large volume of training instances, high dimensions, or
Dan Pelleg and Andrew W. Moore. Mixtures of Rectangles: Interpretable Soft Clustering. ICML. 2001.
form of a rectangle (in this case a line-segment) with tails. An M-dimensional tailed rectangle is simply a product of these. Experiments on real-life data were done on the "mpg" and "census" datasets from the UCI repository (Blake & Merz, 1998). The "mpg" data has about 400 records with 7 continuous attributes. Running on this data with the number of components set to three, we get the
Stephen D. Bay. Multivariate Discretization for Set Mining. Knowl. Inf. Syst, 3. 2001.
for Census Income. We required differences between adjacent cells to be at least as large as 1% of N. ME-MDL requires a class variable and for the Adult, Census-Income, SatImage, and Shuttle datasets we used the class variable that had been used in previous analyses. For UCI Admissions we used Admit = {yes, no} (i.e. was the student admitted to UCI) as the class variable. 6.1. Execution Time
Jie Cheng and Russell Greiner. Comparing Bayesian Network Classifiers. UAI. 1999.
are given below. Adult dataset: The data was extracted from the census bureau database. Prediction task is to determine whether a person makes over 50K a year. The discretization process ignores "fnlwgt" (which is one of the 14
John C. Platt. Using Analytic QP and Sparseness to Speed Training of Support Vector Machines. NIPS. 1998.
can be found in [8, 7]. The first test set is the UCI Adult data set. The SVM is given 14 attributes of a census form of a household and asked to predict whether that household has an income greater than $50,000. Out of the 14 attributes, eight are categorical
Ron Kohavi. Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid. KDD. 1996.
the algorithm, and 20 intervals were used. The error bars show 95% confidence intervals on the accuracy, based on the leftout sample. In most cases it is clear that even with much more [Footnote 1: The Adult dataset is from the Census bureau and the task is to predict whether a given adult makes more than $50,000 a year based on attributes such as education, hours of work per week, etc.]
Gabor Melli. A Lazy Model-Based Approach to On-Line Classification. University of British Columbia. 1989.
8.1 Number of classification performed against the census year dataset by DBPredictor before C4.5 returns its first classification. 8.2 Number of classification performed against the census-year dataset by IB1 before DBPredictor returns its
Chris Giannella and Bassem Sayrafi. An Information Theoretic Histogram for Single Dimensional Selectivity Estimation. Department of Computer Science, Indiana University Bloomington.
there). We use the age column of the training dataset. The dataset was extracted from 1994 US census data. The shuttle2 dataset was downloaded from the "Esprit Project 5170 StatLog" archive ("Shuttle" heading): www.liacc.up.pt/ML/. It represents data
Masahiro Terabe and Takashi Washio and Hiroshi Motoda. The Effect of Subsampling Rate on S3 Bagging Performance. Mitsubishi Research Institute.
each member classifier induction. A personal computer having the specification of OS: Linux OS, CPU: PentiumIII 700 MHz, and main memory: 256 M bytes is used in this experiment. For the large size data sets, census income (abbreviated here as census), led(10%) and waveform are selected. Census is selected from UCI KDD Table 2. The specification of data sets for experiment 2. Data set # of Attribute
David R. Musicant and Alexander Feinberg. Active Set Support Vector Regression.
problems. It contains 506 data points with 12 numeric attributes, and one binary categorical attribute. The goal is to determine median home values, based on various census attributes. This dataset is available at the UCI repository. The second dataset, Comp-Activ, was obtained from the Delve website. This dataset contains 8192 data points and 25 numeric attributes. We implemented
David R. Musicant. DATA MINING VIA MATHEMATICAL PROGRAMMING AND MACHINE LEARNING. Doctor of Philosophy (Computer Sciences) UNIVERSITY.
were used for testing the methods. The first dataset, Census, is a version of the US Census Bureau "Adult" dataset, which is publicly available from Silicon Graphics' website. This dataset contains nearly 300,000 data points with 11 numeric