Soybean (Large) Data Set
Below are papers that cite this data set, with context shown.
Papers were automatically harvested and associated with this data set, in collaboration with Rexa.info.
Rich Caruana and Alexandru Niculescu-Mizil. An Empirical Evaluation of Supervised Learning for ROC Area. ROCAI. 2004.
letters as negative, yielding a very unbalanced binary problem. LETTER.p2 uses letters A-M as positives and the rest as negatives, yielding a well balanced problem. HYPER SPECT is the IndianPine92 data set where the difficult class Soybean mintill is the positive class. SLAC is a problem from the Stanford Linear
Prem Melville and Raymond J. Mooney. Diverse ensembles for active learning. ICML. 2004.
In particular, we used a sample size of two for the primary dataset, and three for breast-w, soybean, diabetes, vowel and credit-g. The primary aim of active learning is to reduce the amount of training data needed to induce an accurate model. To evaluate this, we
Rich Caruana and Alexandru Niculescu-Mizil and Geoff Crew and Alex Ksikes. Ensemble selection from libraries of models. ICML. 2004.
1-13 as class 0 and letters 14-26 as class 1, yielding a difficult, but balanced, problem. HYPER SPECT was converted to binary by treating the large confusable class Soybean Mintil as class 1. These data sets were selected because they are large enough to allow moderate size train and validation sets, and still have data left for large final test sets. For our experiments, we used training sets of 5000
Rich Caruana and Alexandru Niculescu-Mizil. Data Mining in Metric Space: An Empirical Analysis of Supervised Learning Performance Criteria. ROCAI. 2004.
25 letters as negative, yielding a very unbalanced binary problem. LETTER.p2 uses letters A-M as positives and the rest as negatives, yielding a well balanced problem. HYPER SP is the IndianPine92 data set where the difficult class Soybean mintill is the positive class. SLAC is a problem from collaborators at the Stanford Linear Accelerator and MEDIS is a medical data set. The characteristics of
Vassilis Athitsos and Stan Sclaroff. Boosting Nearest Neighbor Classifiers for Multiclass Recognition. Boston University Computer Science Tech. Report No. 2004-006. 2004.
We did not use four datasets (dermatology, soybean, thyroid, audiology) because they have missing attributes, which our current formulation cannot handle. One dataset (ecoli) contains a nominal attribute, which our current
Yuan Jiang and Zhi-Hua Zhou. Editing Training Data for kNN Classifiers with Neural Network Ensemble. ISNN (1). 2004.
of five hidden units. Therefore here the approach is denoted as NNEE(5,5). Table 6 shows that the NNEE approach achieves the best editing effect. In detail, it obtains the best performance on seven data sets, i.e. annealing, credit, liver, pima, soybean, wine and zoo. RemoveOnly obtains the best performance on three data sets, i.e. glass, hayes-roth and wine. It is surprising that Depuration obtains
Geoffrey Holmes and Bernhard Pfahringer and Richard Kirkby and Eibe Frank and Mark A. Hall. Multiclass Alternating Decision Trees. ECML. 2002.
(class sizes 3-13) but struggles against two of the later datasets. For soybean, 1-against-1 uses a tree of size 1710, and for primary-tumor it uses a tree of size 2310. Perhaps the most remarkable result is for half-letter where 1-against-1 using 780 tests has an
Subramani Mani and Marco Porta and Suzanne McDermott. Building Bayesian Network Models in Medicine: the MENTOR Experience. Center for Biomedical Informatics University of Pittsburgh. 2002.
Our validation tests using LED, ALARM and SOYBEAN which are small to large artificial datasets used for Machine Learning research and available from the University of California at the Irvine Machine Learning repository [MuAh94] gave a mean accuracy of 80% over ten runs. The range was from
Marco Porta and Subramani Mani and Suzanne McDermott. MENTOR: Building Bayesian Network Models in Medicine. CSCE Technical Report TR-2002-016. Department of Computer Science and Engineering University of South Carolina. 2002.
Our validation tests using LED, ALARM and SOYBEAN which are small to large artificial datasets used for Machine Learning research and available from the University of California at the Irvine Machine Learning repository [MuAh94] gave a mean accuracy of 80% over ten runs. The range was
Bianca Zadrozny. Reducing multiclass to binary by coupling probability estimates. NIPS. 2001.
better. Figure 1 shows how the MSE is lowered at each iteration of the Hastie-Tibshirani algorithm, for the three types of code matrices. Table 3 shows the results of the same experiments on the datasets pendigits and soybean. Again, the MSE is significantly lowered by the iterative procedure, in all cases. For the soybean dataset, using the sparse random matrix, the iterative method again has a
Rudy Setiono. Feedforward Neural Network Construction Using Cross Validation. Neural Computation, 13. 2001.
and the test set. The average number of hidden units ranged from 2.46 for the labor data set to 19.44 for the soybean data set. Most of the networks for the latter data set contain the maximum 20 hidden units. It might be possible to improve the overall predictive accuracy of these networks
Nikunj C. Oza and Stuart J. Russell. Experimental comparisons of online and batch versions of bagging and boosting. KDD. 2001.
only the first ten examples before being [...] Footnote 2: Recall that we test base model h_m on the training examples in order to adjust their weights before using them to generate base model h_{m+1}. Table 1: The datasets used in our experiments. For the Soybean and Census Income datasets, we have given the sizes of the supplied training and test sets. For the remaining datasets, we have given the sizes of the
Kiri Wagstaff and Claire Cardie. Clustering with Instance-level Constraints. ICML. 2000.
To test the effect of incorporating constraints, we selected three data sets from the UCI repository (soybean, mushroom, and tictactoe) and a fourth "real-world" data set, pos. Footnote: soybean refers to the soybean-small data set, which consists of 47 instances with 34 nominal
Kai Ming Ting and Ian H. Witten. Issues in Stacked Generalization. J. Artif. Intell. Res. (JAIR), 10. 1999.
are given in Table 10, and indicate that the three methods are very competitive. Stacking performs better than both arcing and bagging in three datasets (Waveform, Soybean and Breast Cancer), and is better than arcing but worse than bagging in the Diabetes dataset. Note that stacking performs very poorly on Glass and Ionosphere, two small
Mark A. Hall. Correlation-based Feature Selection for Machine Learning. Doctor of Philosophy thesis, Department of Computer Science, The University of Waikato, Hamilton, New Zealand. 1999.
(chess end-game, horse colic, audiology, and soybean). Plots for the remaining datasets---and for when IB1 and C4.5 are used to measure accuracy---can be found in appendix E. The first thing that is apparent from Figure 6.23 is that a correspondence between merit and actual accuracy
Manoranjan Dash and Huan Liu. Hybrid Search of Feature Subsets. PRICAI. 1998.
having a large N and a small M values such as Lung Cancer, Promoters, Soybean, Splice datasets ABB takes very long time (a number of hours) to terminate. For datasets having large N value and substantially big M value such as Splice dataset FocusM takes many hours to terminate. The
Huan Liu and Rudy Setiono. Incremental Feature Selection. Appl. Intell., 9. 1998.
in two separate files containing 307 and 376 patterns respectively. It contains 35 features describing symptoms of 19 different diseases in soybean plant. • Vote: This dataset includes votes from the U.S. House of Representatives Congress-persons on the 16 key votes identified by the Congressional Quarterly Almanac Volume XL. The dataset consists of 16 features, 300
Hendrik Blockeel and Luc De Raedt and Jan Ramon. Top-Down Induction of Clustering Trees. ICML. 1998.
those obtained with the supervised learner Tilde. We see that TIC obtains high accuracies for these problems. The only clustering result we know of is for COBWEB, which obtained 100% on the Soybean data set. This difference is not significant. Tilde's ac[...] [Figure: accuracy (%) versus size of validation set (%), showing accuracy of pruned tree and accuracy of unpruned tree]
Igor Kononenko and Edvard Simec and Marko Robnik-Sikonja. Overcoming the Myopia of Inductive Learning Algorithms with RELIEFF. Appl. Intell., 7. 1997.
(SOYB, IRIS, and VOTE are obtained from the Irvine database, SAT is obtained from the StatLog database): SOYB: The famous soybean data set used by Michalski & Chilausky. IRIS: The well known Fisher's problem of determining the type of iris flower. MESH3, MESH15: The problem of determining the number of elements for each of the
Nir Friedman and Dan Geiger and Moisés Goldszmidt. Bayesian Network Classifiers. Machine Learning, 29. 1997.
where the unrestricted networks performed substantially worse reveals that in these networks the number of relevant attributes influencing the classification is rather small. While these data sets ("soybean large" and "satimage") contain 35 and 36 attributes, respectively, the classifiers induced relied only on five attributes for the class prediction. We base our definition of relevant
Prototype Selection for Composite Nearest Neighbor Classifiers. Department of Computer Science University of Massachusetts. 1997.
and test accuracy. 3.10 Taxonomy of instances. 3.11 Examples of instances in the taxonomy from the Soybean data set. 3.12 Average number of instances of each type in the taxonomy. 3.13 Classification accuracy of the Prototype Sampling (PS) algorithm versus a variant of the
Guszti Bartfai. VICTORIA UNIVERSITY OF WELLINGTON Te Whare Wananga o te Upoko o te Ika a Maui. Department of Computer Science PO Box 600. 1996.
from the UCI Repository (Merz and Murphy, 1996) were used: • Soybean: this data set contains 307 instances described with 35 nominal attributes (and class information, which was ignored here). The attributes were encoded in generalised complement coding (see above) --- with
Kamal Ali and Michael J. Pazzani. Error Reduction through Learning Multiple Descriptions. Machine Learning, 24. 1996.
to test learned models on noise-free examples (including noisy variants of the KRK and LED domains) but for the natural domains we tested on possibly noisy examples. The large variant of the Soybean data set was used and the 5-class variant of the Heart data set was used. 5.1. Does using multiple rule sets lead to lower error? In this section we present results of an experiment designed to answer the
Ron Kohavi. The Power of Decision Tables. ECML. 1995.
in domains with continuous features indicates that many such features are not very useful, or that they contain few values, or that C4.5 is not using the information contained in them. The soybean dataset contains only one feature with more than four values, even though all are declared continuous. The german dataset contains 21 continuous features that have less than five values each (out of a total
Geoffrey I. Webb. OPUS: An Efficient Admissible Algorithm for Unordered Search. J. Artif. Intell. Res. (JAIR), 3. 1995.
Tic Tac Toe) disabling other pruning had little or no effect under best-first or depth-first search. The largest effects are 2.5 fold increases for the Soybean Large and Wisconsin Breast Cancer data sets under best-first search and for the Audiology, Soybean Large and Wisconsin Breast Cancer data sets under depth-first search. From these results it is apparent that while there are some data sets
Ron Kohavi. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. IJCAI. 1995.
[Figure 4: .632 Bootstrap: standard deviation in accuracy (population). Data sets shown: Mushroom, Chess, Hypo, Breast, Vehicle, soybean, Rand] sets themselves, leading to an increase in variance. This is most apparent for datasets with many categories, such as soybean. In these situations, stratification seems to help, but repeated runs may be a better approach. Our results indicate that stratification is generally a better
Thomas G. Dietterich and Ghulum Bakiri. Solving Multiclass Learning Problems via Error-Correcting Output Codes. CoRR, cs.AI/9501101. 1995.
employed in the study. The glass, vowel, soybean, audiologyS, ISOLET, letter, and NETtalk data sets are available from the Irvine Repository of machine learning databases (Murphy & Aha, 1994). The POS (part of speech) data set was provided by C. Cardie (personal communication); an earlier
Christophe Giraud and Tony Martinez and Christophe G. Giraud-Carrier. ILA: Combining Inductive Learning with Prior Knowledge and Reasoning. University of Bristol, Department of Computer Science. 1995.
are found, ILA degenerates into a restricted form of MBR. This allows ILA to perform well, even with simple (and possibly weak) generalization mechanisms (see for example the soybean small dataset in Section 3). Going further, one may consider extending ILA with some of the well established mechanisms of MBR to further improve ILA's performance in situations where only few rules are
Jitender S. Deogun and Vijay V. Raghavan and Hayri Sever. Exploiting Upper Approximation in the Rough Set Methodology. KDD. 1995.
that SBS + UC relatively performed much worse than UC was small soybean data set. When we continued our experiment on this data set with different `-reduct that was 20th and 21st features of soybean's training set, we obtained accuracy of 96.1%, which was much better than that
Geoffrey I. Webb. OPUS: A systematic search algorithm and its application to categorical attribute-value data-driven machine learning. School of Computing and Mathematics, Deakin University. 1993.
contained the least conjuncts. The OPUS^o algorithm was modified to provide systematic search in this context. Cover was applied using the same experimental design as in the first study to the same data sets as well as the soybean large data set also from the UCI machine learning repository. This data set concerns the diagnosis of soybean plant disease. It has 35 attributes, 135 attribute values,
Nikunj C. Oza and Stuart J. Russell. Online Bagging and Boosting. Computer Science Division University of California.
and their performances relative to a single Naive Bayes classifier consistently improved as the sizes of the datasets grew. On the Balance and Soybean datasets, the boosting algorithms performed significantly worse than Naive Bayes. On the Breast Cancer dataset, AdaBoost performed significantly worse and online
Perry Moerland. A Comparison of Mixture Models for Density Estimation. IDIAP.
The raw data has been pre-processed in various ways. First of all, the ordinal inputs have been normalized to have zero mean and unit standard deviation on the training data. For the soybean data set part of the inputs are categorical and these are mapped to a 1-of-c coding, thus increasing the number of attributes (see the fifth column of Table 1). Finally, for the data sets indicated with a #,
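The preprocessing described in this excerpt can be sketched roughly as follows. This is a minimal illustration, not the cited author's code: the function names are hypothetical, and the use of the population standard deviation (dividing by n) is an assumption, since the excerpt does not say which estimator was used.

```python
import math

def fit_standardizer(train_column):
    """Return (mean, std) estimated on the training data only.

    Assumption: population standard deviation (divide by n).
    """
    n = len(train_column)
    mean = sum(train_column) / n
    variance = sum((x - mean) ** 2 for x in train_column) / n
    return mean, math.sqrt(variance)

def standardize(column, mean, std):
    """Shift and scale a column using statistics from the training data."""
    if std == 0:
        # A constant training column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]

def one_of_c(column, categories):
    """Map each categorical value to a 1-of-c indicator vector."""
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in column:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return rows
```

A categorical attribute with c distinct values becomes c binary inputs, which is why this coding increases the number of attributes.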
Zhi-Hua Zhou and Yang Yu. Ensembling Local Learners Through Multimodal Perturbation.
                    Attribute
Data set  Size  Categorical  Continuous  Class  COEF
soybean   562   0            35          19     0.85
autos     159   15           10          7      0.91
sonar     208   60           0           2      1.73
lymph     148   3            15          4      2.06
glass     214   9            0           7      3.40
anneal    898   6            32          6      3.94
heart-c   296   6            7           5      4.55
Geoffrey I Webb. Generality is more significant than complexity: Toward an alternative to Occam's Razor. School of Computing and Mathematics Deakin University.
(Murphy & Aha, 1994): breast cancer, echocardiogram, glass type, hepatitis, house votes 84, hypothyroid, iris, lymphography, primary tumor, and soybean large. For all of these data sets, the cases are divided into a number of mutually exclusive classes. The induction task is to develop an expert system that can classify an object by reference to the values of its attributes. All
Sherrie L. W and Zijian Zheng. A BENCHMARK FOR CLASSIFIER LEARNING. Basser Department of Computer Science The University of Sydney.
None, e.g. Lymphography and NetTalk (Phoneme) • Few (between 0 and 5.6%), e.g. Mushroom (1.39%) and Breast Cancer (W) (0.25%) • Many (more than 5.6%), e.g. Soybean (9.78%) and Thyroid (6.74%) 7. Dataset size (3 values): • Small (less than 210), e.g. Promoter (106) and Lymphography (148) • Medium (between 210 and 3170), e.g. Diabetes (768) and Thyroid (3163) • Large (more than 3170), e.g.
Alexander K. Seewald. Towards Understanding Stacking: Studies of a General Ensemble Learning Scheme. Dissertation submitted for the degree of Doctor of Technical Sciences.
instances. We rejected Principal Components Analysis after initial experiments because its linear projection on orthogonal axes is less general and the representation is noticeably worse for some datasets, e.g. soybean. We rejected self-organizing [...] Footnote 1: We used the Sammon Mapping implementation from Vesanto, Himberg, Alhoniemi & Parhankangas (2000), which was written in MatLab script language. It worked
Chotirat Ann and Dimitrios Gunopulos. Scaling up the Naive Bayesian Classifier: Using Decision Trees for Feature Selection. Computer Science Department University of California.
106 instances, 57 attributes, 2 classes. Attributes selected by SBC = 5. [Plot: Soybean; Accuracy (%) versus Training Data (%) for NBC, SBC, and C4.5] Figure 9. Soybean-large dataset. 307 instances, 35 attributes, 19 classes. Attributes selected by SBC = 12. [Plot: Wisconsin Breast Cancer; Accuracy (%) versus Training Data (%) for NBC, SBC, and C4.5] Figure 10.
Zhi-Hua Zhou and Xu-Ying Liu. Training Cost-Sensitive Neural Networks with Methods Addressing the Class Imbalance Problem.
annealing, hard-ensemble causes negative effect on one more data set, i.e. soybean. Threshold-moving does not cause negative effect on glass, but it causes negative effect on lymphography and vowel. The sampling methods and SMOTE cause negative effect on more than
Prem Melville and Raymond J. Mooney. Proceedings of the 21st International Conference on Machine Learning. Department of Computer Sciences.
In particular, we used a sample size of two for the primary dataset, and three for breast-w, soybean, diabetes, vowel and credit-g. The primary aim of active learning is to reduce the amount of training data needed to induce an accurate model. To evaluate this, we
Jarinee Chattratichart and John Darlington and Moustafa Ghanem and Yang Guo and Harold Huning and Martin Kohler and Janjao Sutiwaraphun and Hing Wing and Dan Yang. Large Scale Data Mining: The Challenges and The Solutions. Department of Computing.
the data parallel scheme performs far better. In contrast for the Soybean data set, the task parallel versions are significantly superior. The reasons for this can be traced back to the shape of the decision tree which is constructed and the number of predefined classes. The
Daichi Mochihashi and Gen-ichiro Kikui and Kenji Kita. Learning Nonstructural Distance Metric by Minimum Cluster Distortions. ATR Spoken Language Translation research laboratories.
[Plot (d) soybean dataset: Precision versus Dimension] Figure 4: K-means clustering of UCI Machine Learning dataset results. The horizontal axis shows compressed dimensions (rightmost is original). The right bar shows clustering precision using Metric
Miguel Moreira and Alain Hertz and Eddy Mayoraz. Data binarization by discriminant elimination. Proceedings of the ICML-99 Workshop: From Machine Learning to.
that Ideal is eliminative, while SG is constructive: SG is executed until consistency is attained, while in Ideal consistency is never lost, if verified initially. (This may not be the case for datasets with missing values, e.g. soybean.) The other, very important aspect is that Ideal is heavily grounded on the original attributes. Concerning the comparison between Ideal, SG, and decision trees,
Igor Kononenko and Edvard Simec. Induction of decision trees using RELIEFF. University of Ljubljana, Faculty of electrical engineering & computer science.
(SOYB, IRIS, and VOTE are obtained from the Irvine database (Murphy & Aha, 1991)): SOYB: The famous soybean data set used by Michalski & Chilausky (1980). IRIS: The well known Fisher's problem of determining the type of iris flower. MESH3, MESH15: The problem of determining the number of elements for each of the
Pat Langley and Wayne Iba and Kevin Thompson. An Analysis of Bayesian Classifiers. In Proceedings of the Tenth National Conference on Artificial Intelligence.
C4 algorithm (Buntine & Caruana, 1991) and an algorithm that simply predicts the modal class. The five domains, from the UCI database collection (Murphy & Aha, 1992), include the "small" soybean dataset, chess end games involving a king-rook-king-pawn confrontation, [...] two biological [...] data set into 80% training instances and 20% test instances [...] pairs of training and test sets. The table shows the mean accuracy and 95%
YongSeog Kim and W. Nick Street and Filippo Menczer. Optimal Ensemble Construction via Meta-Evolutionary Ensembles. Business Information Systems, Utah State University.
there are four multi-class data sets (iris, hypo, segment, and soybean) while the remaining 11 data sets are bi-class data sets. Out of four multi-class data sets, MEE shows consistently worse performance on "segment" data compared to
Iñaki Inza and Pedro Larrañaga and Basilio Sierra. Bayesian networks for feature subset selection. Department of Computer Sciences and Artificial Intelligence.
on their stop-generation numbers to extract differences between the behaviour of GA-o and GA-u. Thus, observing Table 6, the one-point crossover is better suited for Horsecolic and Soybean large datasets than the uniform crossover; otherwise, we see the opposite behaviour in Ionosphere and Anneal. By the use of FSS-EBNA, we also avoid this tuning among different crossover operators for each
Perry Moerland. Mixtures of latent variable models for density estimation and classification. IDIAP Research Report. Dalle Molle Institute for Perceptual Artificial Intelligence.
was used by imposing a small threshold of 0.01 upon the values of R_j for Mfas and upon the variance parameters for a diagonal Gmm; this was done for the dermatology, NIST, optical, and soybean data sets. 7.2 Experiments: Real-World Data The results of the experiments with Bayes classifiers are listed in Table 6, where the best method and the ones that are not significantly worse (90% on the 5x2cv
Suresh K. Choubey and Jitender S. Deogun and Vijay V. Raghavan and Hayri Sever. A comparison of feature selection algorithms in the context of rough classifiers.
for which BFS + UC performed worse than UC were small soybean and glass data sets. AHS + UC, and HHS + UC performed worse than UC on glass, monks1, and monks2 data sets. KBS+UC performed poorly only in the case of glass data set. In the case of Upperbound Experiments, only monk1
Takao Mohri and Hidehiko Tanaka. An Optimal Weighting Criterion of Case Indexing for Both Numeric and Symbolic Attributes. Information Engineering Course, Faculty of Engineering The University of Tokyo.
that they maximize the variance ratio j 2 . Therefore, they have a theoretical basis and clear meaning. Experiments The experimental results for several benchmark data are shown in Table 3. Four data sets (vote, soybean, crx, hypo) were in the distribution floppy disk of Quinlan's C4.5 book (Quinlan 1993). The remaining four data sets (iris, hepatitis, led, led-noise) were obtained from the Irvine