Housing Data Set
Below are papers that cite this data set, with context shown.
Papers were automatically harvested and associated with this data set, in collaboration with Rexa.info.
Stanley Robson de Medeiros Oliveira. Data Transformation For Privacy-Preserving Data Mining. Doctor of Philosophy thesis, University of Alberta. 2005.
is also available at the UCI Repository of Machine Learning Databases. 9. Pumsb: The Pumsb dataset contains census data for population and housing. This dataset is available at http://www.almaden.ibm.com/software/quest. There are 49,046 records with 2,113 different data values (distinct items),
Gavin Brown. Diversity in Neural Network Ensembles. The University of Birmingham. 2004.
2 we use a sigmoid output activation function. The ensemble is combined by a uniformly weighted linear combination. Dataset 1: Boston Housing. This regression dataset concerns housing values in suburbs of Boston; the problem is to predict the median house price given a number of demographic features. There are 506
Predrag Radivojac and Zoran Obradovic and A. Keith Dunker and Slobodan Vucetic. Feature Selection Filters Based on the Permutation Test. ECML. 2004.
summarized in Table 1. The first nine were downloaded from the UCI repository, with dataset HOUSING converted into a binary classification problem according to the mean value of the target. Datasets MAMMOGRAPHY and OIL were constructed in  and , respectively, and provided to us by
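The conversion this excerpt describes (thresholding the continuous HOUSING target at its mean to get a binary classification problem) can be sketched as follows; the function name is illustrative, not from the paper:

```python
import numpy as np

def binarize_by_mean(y):
    """Convert a continuous target into binary labels by
    thresholding at its mean value, as described for HOUSING."""
    y = np.asarray(y, dtype=float)
    return (y > y.mean()).astype(int)

# Example: values above the mean (here 3.0) become class 1.
labels = binarize_by_mean([1.0, 2.0, 4.0, 5.0])
print(labels)  # [0 0 1 1]
```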
Glenn Fung and M. Murat Dundar and Jinbo Bi and Bharat Rao. A fast iterative algorithm for fisher discriminant using heterogeneous kernels. ICML. 2004.
used in the literature for benchmarking from the UCI Machine Learning Repository (Murphy & Aha, 1992): Ionosphere, Cleveland Heart, Pima Indians, BUPA Liver and Boston Housing. Additionally, a sixth dataset relates to colorectal cancer diagnosis using virtual colonoscopy derived from computer tomographic images. We will refer to this dataset as the colon CAD dataset. The
Kristiaan Pelckmans and Jos De Brabanter and J. A. K Suykens and Bart De Moor and K. U. Leuven - ESAT. The Differogram: Non-parametric Noise Variance Estimation and its Use for Model Selection. SCDSISTA. 2004.
can therefore be used for picking good starting values for a local search based on a more powerful and computationally intensive way to achieve a good generalization performance. The Boston housing dataset (Blake and Merz, 1998), concerning the housing values in suburbs of Boston, was used to benchmark the proposed method on a real-world dataset. This set contains 506 instances of 12 continuous and 1
Bart Hamers and J. A. K Suykens. Coupled Transductive Ensemble Learning of Kernel Models. Bart De Moor. 2003.
models of both ensembles (uncoupled (left); coupled (right)). This shows the effect and improvement obtained by coupling of the learning processes for the individual submodels. 5.1.2 Boston housing Data Set. The Boston housing data set is a multivariate regression data set of 506 cases in 14 attributes. It has two prototasks: NOX, in which the nitrous oxide level is to be predicted, and price MEDV, in
Christopher K I Williams and Carl Edward Rasmussen and Anton Schwaighofer and Volker Tresp. Observations on the Nystrom Method for Gaussian Process Prediction. Division of Informatics Gatsby Computational Neuroscience Unit University of Edinburgh University College London. 2002.
0.0885 ± 0.0073 0.1171 ± 0.0222 0.0846 400 0.0871 ± 0.0071 0.0843 ± 0.0026 0.0922 ± 0.0193 0.0845 Table 1: Comparison of the Nystrom, SR, just-m and m-eigenvectors methods on the Boston housing data set for values of m of 100, 200, 300, 400. For the first three methods ten replications were used, with random choice of the x points; each entry shows the mean and standard deviation of the 10 MSE
David Hershberger and Hillol Kargupta. Distributed Multivariate Regression Using Wavelet-Based Collective Data Mining. J. Parallel Distrib. Comput, 61. 2001.
use to generate a regression model may not generate the MSE model for that amount of information transfer. The result also supports the MSE model result for the wavelet basis. The second benchmark data set we employ is the Boston Housing data set created by Harrison and Rubinfeld. This data set consists of 506 samples with 13 independent variables, 12 of which are real-valued, and one real-valued
Thomas Melluish and Craig Saunders and Ilia Nouretdinov and Volodya Vovk. The typicalness framework: a comparison with the Bayesian approach. Department of Computer Science. 2001.
Figure 1: Bayesian RR and RRCM on data generated with w ∼ N(0, 1). We also experimented on two benchmark datasets, the auto-mpg dataset and the Boston housing dataset. For each experiment, we show the percentage confidence against the percentage of labels outside the tolerance region predicted for that
Martin H C Law and James T. Kwok. Applying the Bayesian Evidence Framework to ν-Support Vector Regression. ECML. 2001.
task is to predict the ages of the abalones based on 8 input attributes. We used 256 patterns for training and the remaining 3921 patterns for testing. The experiment is repeated 25 times. The third data set is the Boston housing data, and the task is to predict housing values in the Boston suburbs using 13 input attributes. We used 128 patterns for training and the remaining 378 patterns for testing.
Peter L. Hammer and Alexander Kogan and Bruno Simeone and Sandor Szedmák. Rutcor Research Report. Rutgers Center for Operations Research, Rutgers University. 2001.
are obtained by lexicographically (Page 28, RRR 7-2001). Figure 1: Cost of Classification Inaccuracy for # = 0; Figure 2: Cost of Classification Inaccuracy for # = 0.5. Both figures compare the mean cost of LAD, StrongSpanned, StrongPrime and Prime on the Credit, Breast Cancer, Boston Housing, Diabetes, Heart Disease, Oil and Voting datasets.
Zhi-Hua Zhou and Jianping Wu and Weiyu Tang and Zen Chen. Combining Regression Estimators: GA-Based Selective Neural Network Ensemble. International Journal of Computational Intelligence and Applications, 1. 2001.
than that generated by averaging all in most cases. Pairwise one-tailed t-tests also indicate that the generalization ability of GASEN and enumerating is not significantly different on all the data sets except Boston Housing, where GASEN is significantly better than enumerating. Considering that enumerating can hardly work when there are lots of individual networks due to its extensive
Nir Friedman and Iftach Nachman. Gaussian Process Networks. UAI. 2000.
from the UCI machine learning repository. These data sets are: the Boston housing data set, a data set describing different aspects of neighborhoods in the Boston area and the median price of houses in those neighborhoods. The data set contains 506
Endre Boros and Peter Hammer and Toshihide Ibaraki and Alexander Kogan and Eddy Mayoraz and Ilya B. Muchnik. An Implementation of Logical Analysis of Data. IEEE Trans. Knowl. Data Eng, 12. 2000.
in 85.51% of the cases. Furthermore, it is interesting to notice that the inclusion of additional patterns in the discriminant does not seem to improve the prediction accuracy. Boston Housing: This dataset was created by D. Harrison and D. Rubinfeld in 1978 and contains 506 records describing housing values in the suburbs of Boston, depending on observations consisting of one binary and 12 continuous
Rudy Setiono and Huan Liu. A connectionist approach to generating oblique decision trees. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 29. 1999.
351 34 continuous; 9. Iris 150 4 continuous; 10. Pima-diabetes 768 8 continuous; 11. Sonar 208 60 continuous; 12. Australian 690 14 mixed; 13. HeartDisease 297 13 mixed; 14. Housing 506 13 mixed. Table 1: Dataset Summary. #Data - data size, Type - attribute type, and #A - number of attributes. 3. Apply NN-DT as follows: (a) Construct a network using the algorithm MLNNCA for the training dataset. Stop MLNNCA
Jinyan Li and Xiuzhen Zhang and Guozhu Dong and Kotagiri Ramamohanarao and Qun Sun. Efficient Mining of High Confidence Association Rules without Support Thresholds. PKDD. 1999.
rules and some very high (say 90%) confidence rules using approaches similar to mining top rules. Experimental results using the Mushroom, the Cleveland heart disease, and the Boston housing datasets are reported to evaluate the efficiency of the proposed approach. 1 Introduction Association rules  were proposed to capture significant dependence between items in transactional datasets. For
Christopher J. Merz and Michael J. Pazzani. A Principal Components Approach to Combining Regression Estimates. Machine Learning, 36. 1999.
No method in the first block does particularly well for the bodyfat or housing data sets, indicating that a moderate amount of regularization is required there. Examining the more advanced methods for handling multicollinearity in the second block of rows reveals that PCR*, EG, and CR
H. Altay Guvenir and Ilhan Uysal. Regression on feature projections. Department of Computer Engineering, Bilkent University. 1999.
In order to compare the RFP algorithm with the KNN and Rules learning algorithms, we used the abalone, auto-mpg, buying, country, cpu, electric, flare, housing, read and servo real world datasets for function approximation (available at http://funapp.cs.bilkent.edu.tr). The information about the number of instances, number and type of features and presence of missing values is
Ayhan Demiriz and Kristin P. Bennett and Mark J. Embrechts. Semi-Supervised Clustering Using Genetic Algorithms. Dept. 1999.
have all originally a two-class output variable except Housing. The output variable for this dataset was categorized at the level of 21.5. Each dataset was divided into three subsets after a standard normalization. We call these subsets the learning, testing and working sets. Currently 40% of data
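The categorization this excerpt mentions (splitting the Housing output variable at the level of 21.5) can be sketched as below; the names and the strict-inequality convention at the 21.5 boundary are assumptions, not details from the paper:

```python
import numpy as np

MEDV_THRESHOLD = 21.5  # cutoff level quoted in the excerpt above

def categorize_medv(medv):
    """Turn the continuous Housing target into two classes at 21.5.
    Values strictly above the threshold become class 1 (an assumed
    convention; the paper does not state how ties are handled)."""
    medv = np.asarray(medv, dtype=float)
    return np.where(medv > MEDV_THRESHOLD, 1, 0)

print(categorize_medv([15.0, 21.5, 24.0, 50.0]))  # [0 0 1 1]
```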
Huan Liu and Rudy Setiono. Feature Transformation and Multivariate Decision Tree Induction. Discovery Science. 1998.
351 34 continuous; 11 Iris 150 4 continuous; 12 Pima-diabetes 768 8 continuous; 13 Sonar 208 60 continuous; 14 Australian 690 14 mixed; 15 HeartDisease 297 13 mixed; 16 Housing 506 13 mixed. Table 1. Dataset Summary. #Data - data size, Type - attribute type, and #A - number of attributes. neural networks (NN) based on which BMDT builds MDTs. We want to understand whether the difference between every
Mauro Birattari and Gianluca Bontempi and Hugues Bersini. Lazy Learning Meets the Recursive Least Squares Algorithm. NIPS. 1998.
classical mean square error criterion: ŷ_q = x_q^T β̂(k̂), with k̂ = arg min_k MSE(k) = arg min_k [ Σ_{i=1}^{k} ω_i (e_i^cv(k))² / Σ_{i=1}^{k} ω_i ], (9). Table 1: A summary of the characteristics of the datasets considered. Dataset: Housing, Cpu, Prices, Mpg, Servo, Ozone. Number of examples: 506, 209, 159, 392, 167, 330. Number of regressors: 13, 6, 16, 7, 8, 8. where the ω_i are weights that can be conveniently used to discount
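The selection criterion in equation (9) of this excerpt (choose the number of neighbours k minimizing a weighted mean of squared leave-one-out errors) can be sketched as follows; the function names and data structures are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def weighted_mse(errors, weights):
    """Weighted mean squared error: sum_i w_i * e_i^2 / sum_i w_i."""
    e = np.asarray(errors, dtype=float)
    w = np.asarray(weights, dtype=float)
    return float(np.sum(w * e**2) / np.sum(w))

def select_k(cv_errors_by_k, weights_by_k):
    """Pick the k minimizing the weighted MSE of the
    leave-one-out errors e_i^cv(k), as in the criterion above."""
    scores = {k: weighted_mse(cv_errors_by_k[k], weights_by_k[k])
              for k in cv_errors_by_k}
    return min(scores, key=scores.get)
```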
Sreerama K. Murthy and Simon Kasif and Steven Salzberg. A System for Induction of Oblique Decision Trees. Department of Computer Science Johns Hopkins University. 1994.
of three different types of iris flower. Weiss and Kapouleas (1989) obtained accuracies of 96.7% and 96.0% on this data with back propagation and 1-NN, respectively. Housing Costs in Boston. This data set, also available as a part of the UCI ML repository, describes housing values in the suburbs of Boston as a function of 12 continuous attributes and 1 binary attribute (Harrison & Rubinfeld, 1978).
Tapani Raiko and Harri Valpola. Missing Values in Nonlinear Factor Analysis. Helsinki University of Technology, Neural Networks Research Centre.
The training set contains vectors more similar to the test set now. 4. Training and testing sets are permuted and 10 percent of the values are set to be missing independently of any neighbours. The second data set is the Boston housing data, which is publicly available at . It concerns housing values in suburbs of Boston. The data set contains 506 vectors of 13 dimensions, excluding one binary attribute. Four of
Dorian Suc and Ivan Bratko. Combining Learning Constraints and Numerical Regression. National ICT Australia, Sydney Laboratory at UNSW.
which enables a better comparison of Q2 to other methods. These data sets are AutoMpg, AutoPrice, Housing, MachineCpu and Servo. The other three data sets are from dynamic domains where QUIN has typically been applied so far [Suc, 2003; Suc and Bratko, 2002]. It should
Ayhan Demiriz and Kristin P. Bennett. Chapter 1: Optimization Approaches to Semi-Supervised Learning. Department of Decision Sciences and Engineering Systems & Department of Mathematical Sciences, Rensselaer Polytechnic Institute.
of the working set is set to 50 points and the rest of the data are used as the training set. We use the following formula to pick the penalty parameter: 1 The continuous response variable in the Housing dataset was categorized at 21.5. Table 1.2: Average Error Results for Inductive and Transductive SVM Methods.
Luc Hoegaerts and J. A. K Suykens and J. Vandewalle and Bart De Moor. Subset Based Least Squares Subspace Regression in RKHS. Katholieke Universiteit Leuven Department of Electrical Engineering, ESAT-SCD-SISTA.
took unity values. The use of other kernels, like the polynomial or the sigmoidal kernel, did not produce such good results as the Gaussian kernel. 6.2 Real world data examples. The Boston Housing data set consists of 506 cases having p = 13 input variables. The aim is to predict the housing prices. We standardized the data to zero mean and unit variance. We picked at random a training set of
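The preprocessing this excerpt applies to the Boston Housing inputs (standardizing each variable to zero mean and unit variance) can be sketched as a minimal, column-wise transform; the helper name is illustrative:

```python
import numpy as np

def standardize(X):
    """Scale each column to zero mean and unit variance,
    as done for the Boston Housing inputs in the excerpt."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant columns
    return (X - mu) / sigma

Z = standardize([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
# Each column of Z now has mean 0 and standard deviation 1.
```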
S. Sathiya Keerthi. Improvements to SMO Algorithm for SVM Regression.
samples were chosen randomly. The performance of the four algorithms for the polynomial kernel k(x_i, x_j) = (1 + x_i · x_j)^p, where p was chosen to be 3, is shown in Fig. 1. The second dataset is the Boston housing dataset, which is a standard benchmark for testing regression algorithms. This dataset is available at the UCI Repository. The dimension of the input is 13. We used the training
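The polynomial kernel in this excerpt, k(x_i, x_j) = (1 + x_i · x_j)^p with p = 3, evaluates directly; a minimal sketch:

```python
import numpy as np

def poly_kernel(xi, xj, p=3):
    """Polynomial kernel k(x_i, x_j) = (1 + x_i . x_j)^p,
    with p = 3 as in the excerpt above."""
    return (1.0 + np.dot(xi, xj)) ** p

# (1 + 1*1 + 0*1)^3 = 2^3
print(poly_kernel(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # 8.0
```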
Jarkko Tikka. Learning linear dependency trees from multivariate data. Master's thesis, Department of Automation and Systems Technology, Helsinki University of Technology.
and ipkts. The value of the regression coefficient of wio is also negative in the latter case. A positive change in wio decreases the value of ipkts. 4.3 Boston housing data. The second real world data set is called the Boston housing data. The data are from the UCI repository of databases. The data set concerns housing values in suburbs of Boston in the USA. The data were collected in
David R. Musicant. Data Mining via Mathematical Programming and Machine Learning. Doctor of Philosophy (Computer Sciences), University.
We implemented the "cpuSmall prototask", which involves using twelve of these attributes to predict what fraction of a CPU's processing time is devoted to a specific mode ("user mode"). The third dataset, Boston Housing, is a fairly standard dataset used for testing regression problems. It contains 506 data points with 12 numeric attributes, and one binary categorical attribute. The goal is to
Ayhan Demiriz and Kristin P. Bennett and John Shawe-Taylor. Linear Programming Boosting via Column Generation. Dept. of Decision Sciences and Eng. Systems, Rensselaer Polytechnic Institute.
used in decision tree stumps experiments, we use four additional UCI datasets here. These are the House(16,435), Housing(13,506), Pima(8,768), and Spam(57,4601) datasets. As in the decision tree stumps experiments, we report results from 10-fold CV. Since the best # value
Jianping Wu and Zhi-Hua Zhou and Cheng-The Chen. Ensemble of GA based Selective Neural Network Ensembles. National Laboratory for Novel Software Technology Nanjing University.
mean squared error and corresponding standard deviation is also recorded. Experimental results are shown in Table 1. Statistical tests show that on the Friedman#1, Boston Housing and Ozone data sets, GASEN's generalization error is significantly lower than that of the simple ensemble method, and e-GASEN attains still lower generalization errors than GASEN. On the Servo data set, GASEN is
C. Titus Brown and Harry W. Bullen and Sean P. Kelly and Robert K. Xiao and Steven G. Satterfield and John G. Hagedorn and Judith E. Devaney. Visualization and Data Mining in a 3D Immersive Environment: Summer Project 2003.
We decided to explore labeling systems and layouts more in the future. Figure 4.2: A closeup of two of the MPG graphs. 4.3 Housing: This data set was analysed by Robert Xiao. Overview: The housing data set consisted of data regarding 506 houses in Boston, Massachusetts. Thirteen continuous attributes, including a target variable of median
David R. Musicant and Alexander Feinberg. Active Set Support Vector Regression.
were used for the first round of experiments. The first dataset, Boston Housing, is a fairly standard dataset used for testing regression problems. It contains 506 data points with 12 numeric attributes, and one binary categorical attribute. The goal is to
Nir Friedman and Daphne Koller. A Bayesian Approach to Structure Discovery in Bayesian Networks. School of Computer Science & Engineering, Hebrew University.
indicating that model selection is likely to return a fairly representative structure in this case. A second form of support for the non-mixing conjecture is obtained by considering an even smaller data set: the Boston housing data set, from the UCI repository (Murphy and Aha, 1995), is a continuous domain with 14 variables and 506 samples. Here, we considered linear Gaussian networks, and used a
Yin Zhang and W. Nick Street. Bagging with Adaptive Costs. Management Sciences Department University of Iowa Iowa City.
and the out-of-bag margin estimation will result in better generalization as it does in stacking. 3. Computational Experiments. Bacing was implemented using MATLAB and tested on 14 UCI repository data sets: Autompg, Bupa, Glass, Haberman, Housing, Cleveland-heart-disease, Hepatitis, Ion, Pima, Sonar, Vehicle, WDBC, Wine and WPBC. Some of the data sets do not originally depict two-class problems