Yeast Data Set
Below are papers that cite this data set, with context shown.
Papers were automatically harvested and associated with this data set, in collaboration with Rexa.info.
Vassilis Athitsos and Stan Sclaroff. Boosting Nearest Neighbor Classifiers for Multiclass Recognition. Boston University Computer Science Tech. Report No. 2004-006. 2004.
where that method does better than the other methods. There are also two datasets (glass and yeast) where the results of our algorithm and the best results from ECOC-based boosting and naive k-nn classification are quite similar. We should mention that, in the Allwein et al.
Samuel Kaski and Jaakko Peltonen. Informative Discriminant Analysis. ICML. 2003.
of noise in parameter validation. 5. Analysis of Gene Expression Data In this section we demonstrate one way of using the extracted components for exploratory analysis of yeast gene expression. The data set (Hughes et al., 2000) consists of measurements of the expression of each yeast gene in 300 knock-out mutation experiments. After leaving out all genes and experiments without significant expression,
Dmitry Pavlov and Alexandrin Popescul and David M. Pennock and Lyle H. Ungar. Mixtures of Conditional Maximum Entropy Models. ICML. 2003.
from university computer science departments. We used all classes but others and different numbers (up to 1000) of the most frequent words. The Letter recognition, Yeast, MS Web, Vehicle and Vowel data sets were downloaded from the UC Irvine machine learning repository (Blake & Merz, 1998). In the MS Web data set, we predicted whether a user visited the "free downloads" web page, given the rest of his
Aik Choon Tan and David Gilbert. An Empirical Comparison of Supervised Machine Learning Techniques in Bioinformatics. APBC. 2003.
of predicted lipoproteins and 3 different scoring functions on the amino acid contents whether predicted as an outer membrane or inner membrane, cleavable or uncleavable sequence signal. Yeast data set -- The objective is similar to the E.coli data, which is to determine the cellular localisation of the yeast proteins (Horton and Nakai, 1996). There are 10 different sites, which include: CYT
Manoranjan Dash and Kiseok Choi and Peter Scheuermann and Huan Liu. Feature Selection for Clustering - A Filter Solution. ICDM. 2002.
is taken from the recently publicized clustering software CLUTO available from the web site http://www-users.cs.umn.edu/~karypis/cluto/. In this, there is a dataset called Genes2 which has 99 yeast genes (or data points) described using 7 profiles (or features). When ForwardSelect is run over this data, it shows the minimum entropy for subset {F3, F5} (see
Nitesh V. Chawla and Kevin W. Bowyer and Lawrence O. Hall and W. Philip Kegelmeyer. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. (JAIR), 16. 2002.
solvent) was measured. The activity classes are either active --- at least one single yeast strain was inhibited more than 70%, or inactive --- no yeast strain was inhibited more than 70%. The dataset has 53,220 samples with 6,351 samples of active compounds. 5. The Satimage dataset (Blake & Merz, 1998) has 6 classes originally. We chose the smallest class as the minority class and collapsed the
Erin L. Allwein and Robert E. Schapire and Yoram Singer. Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers. ICML. 2000.
the bars in the top row of Figure 6 correspond to large positive values.) One-against-all often results in error rates that are much higher than the error rates of other codes. For instance, for the dataset Yeast the one-against-all code has an error rate of 72% while the error rate of all the other codes is no more than 47.1% (random sparse) and can be as low as 39.6% (random dense). On the very few
Paul Horton and Kenta Nakai. Better Prediction of Protein Cellular Localization Sites with the k Nearest Neighbors Classifier. ISMB. 1997.
with 336 protein sequences labeled according to 8 classes (localization sites) and a yeast dataset with 1462 sequences labeled according to 10 classes. The occurrence of classes for the datasets are summarized in tables 1 and 2. The 1462 yeast sequences were obtained by removing 12 sequences
Alain Rakotomamonjy. Analysis of SVM regression bounds for variable ranking. P.S.I CNRS FRE 2645, INSA de Rouen Avenue de l'Universite.
performance considering that an oracle has selected the optimal number of variables to use. In bold, the best performance with respect to the mean normalized absolute error are highlighted.
Datasets  olitos                yeast                 colon                 lymphoma
Algo      Mean   Var.  p-val    Mean   Var.  p-val    Mean   Var.  p-val    Mean   Var.  p-val
SVM       30.93  25    -        16.50  79    -        53.75  2000  -        58.80  4026  -
CC        30.38  13    0.424    14.73  23    0.045    51.01  300   0.517
Johannes Furnkranz. Round Robin Rule Learning. Austrian Research Institute for Artificial Intelligence.
683 --- 35 0 19 94.1
thyroid (hyper) 3772 --- 21 6 5 2.7
thyroid (hypo) 3772 --- 21 6 5 7.7
thyroid (repl.) 3772 --- 21 6 4 3.3
vehicle 846 --- 0 18 4 74.2
yeast 1484 --- 0 8 10 68.8
Table 1: Data sets used. The first two columns show the training and test set sizes (as specified in the description of the datasets), the next three columns show the number of symbolic and numeric attributes as well
Gaurav Marwah and Lois C. Boggess. Artificial Immune Systems for Classification: Some Issues. Department of Computer Science Mississippi State University.
the accuracy of the classifier varies depending on the characteristics of the problem. The need for these alternatives was realized while testing AIRS on the well-known and publicly available yeast data set, which appears to be a difficult classification problem. The data set was obtained from the repository of the University of California at Irvine (Blake and Merz, 1998) and contained 1484 instances