E. Coli Genes Data Set
Below are papers that cite this data set, with context shown.
Papers were automatically harvested and associated with this data set, in collaboration with Rexa.info.
Aik Choon Tan and David Gilbert. An Empirical Comparison of Supervised Machine Learning Techniques in Bioinformatics. APBC. 2003.
interested readers should refer to the cited papers for details. E coli data set -- The objective of this data set is to predict the cellular localisation sites of E.coli proteins (Horton and Nakai, 1996). There are 8 different cellular sites, which are cytoplasm (cp), inner
Mukund Deshpande and George Karypis. Evaluation of Techniques for Classifying Biological Sequences. PAKDD. 2002.
and contains 10,918 sequences of average length 361.6 symbols, it will be referred as M-EI. The E Coli Genome dataset contains sequences from the genome of prokaryote organism Escherichia coli (ecoli) [NTPP01]. To build this dataset the e-coli genome was first split into two sets of sequences, one containing
Mark A. Hall. Correlation-based Feature Selection for Machine Learning. Doctor of Philosophy thesis, Department of Computer Science, The University of Waikato, Hamilton, New Zealand. 1999.
is to predict whether cancer will recur in patients. There are 9 nominal attributes describing characteristics such as tumour size and location. There are 286 instances. Dna-promoter (dna) A small dataset containing 53 positive examples of E coli promoter gene sequences and 53 negative examples. There are 55 nominal attributes representing the gene sequence. Each attribute is a DNA nucleotide
Prototype Selection for Composite Nearest Neighbor Classifiers. Department of Computer Science University of Massachusetts. 1997.
symbolic values. The data for the competition were divided into a single test and a single training set, but we have used a 10-fold cross-validation partition, for uniformity with the rest of our data sets. 12. Promoter. The task domain is E Coli promoter gene sequences (DNA), and the task is to classify gene sequences as "promoters" or "non-promoters" based on 57 sequential DNA nucleotide
Paul Horton and Kenta Nakai. Better Prediction of Protein Cellular Localization Sites with the k Nearest Neighbors Classifier. ISMB. 1997.
used have been submitted to the UCI Machine Learning Data Repository (Murphy & Aha 1996) and are described in (Horton & Nakai 1996), (Nakai & Kanehisa 1991), and (Nakai & Kanehisa 1992). We used two datasets: an E coli dataset with 336 protein sequences labeled according to 8 classes (localization sites) and a yeast dataset with 1462 sequences labeled according to 10 classes. The occurrence of classes
Andrew Watkins and Jon Timmis and Lois C. Boggess. Artificial Immune Recognition System (AIRS): An Immune-Inspired Supervised Learning Algorithm. Computing Laboratory, University of Kent.
thorough examination of the tie-breaking mechanism in the k-nn algorithm. In the course of this latter experimentation, it was found that AIRS outperforms the best reported accuracy for the E coli data set found in the UCI repository. The majority of AIS techniques use the metaphor of somatic hypermutation or affinity proportional mutation. To date, AIRS does not employ this metaphor but instead
Gaurav Marwah and Lois C. Boggess. Artificial Immune Systems for Classification: Some Issues. Department of Computer Science, Mississippi State University.
data. The classes of antigen occurring more frequently were allocated more resources and those occurring less frequently were allocated fewer resources. Table 5: Accuracy Rates For E coli And Yeast Data Sets Using Different Methods For Resource Allocation. Method used for resource allocation Accuracy (E.Coli) Accuracy (Yeast) Half the resources for in class ARBs and the other half for out of class