Center for Machine Learning and Intelligent Systems

Ecoli Data Set

Below are papers that cite this data set, with context shown. Papers were automatically harvested and associated with this data set, in collaboration with Rexa.info.

Return to Ecoli data set page.


Vassilis Athitsos and Stan Sclaroff. Boosting Nearest Neighbor Classifiers for Multiclass Recognition. Boston University Computer Science Tech. Report No. 2004-006. 2004.

(dermatology, soybean, thyroid, audiology) because they have missing attributes, which our current formulation cannot handle. One dataset ecoli contains a nominal attribute, which our current implementation cannot handle in practice; this Table 3. For each dataset, we count how many variations of AdaBoost.MO gave lower (<), equal (=),


Charles X. Ling and Qiang Yang and Jianning Wang and Shichao Zhang. Decision trees with minimal costs. ICML. 2004.

total cost. Aimed at minimizing the total cost of test and misclassification, our new decision-tree algorithm has several desirable features. We will discuss these features below, using the dataset Ecoli as an example (Blake & Merz 1998). This dataset, after pre-processing, has 332 labelled examples, which are described by six attributes. The numerical attributes are first discretized


Xiaoyong Chai and Li Deng and Qiang Yang and Charles X. Ling. Test-Cost Sensitive Naive Bayes Classification. ICDM. 2004.

were discretized using minimal entropy method as in [8].

Name of datasets   No. of attributes   Name of datasets   No. of attributes
Ecoli              6                   Breast             9
Heart              8                   Thyroid            24
Australia          15                  Cars               6
Voting             16                  Mushroom           22

Table 2. Datasets used in the experiments

We ran a 3-fold cross validation on these


Aik Choon Tan and David Gilbert. An Empirical Comparison of Supervised Machine Learning Techniques in Bioinformatics. APBC. 2003.

interested readers should refer to the cited papers for details. E coli data set -- The objective of this data set is to predict the cellular localisation sites of E.coli proteins (Horton and Nakai, 1996). There are 8 different cellular sites, which are cytoplasm (cp), inner


Mukund Deshpande and George Karypis. Evaluation of Techniques for Classifying Biological Sequences. PAKDD. 2002.

contains sequences from the genome of prokaryote organism Escherichia coli ecoli [NTPP01]. To build this dataset the e-coli genome was first split into two sets of sequences, one containing sequences making up the coding region whereas other containing sequences making up the non-coding region. Next a fixed


Huajie Zhang and Charles X. Ling. An Improved Learning Algorithm for Augmented Naive Bayes. PAKDD. 2001.

we used in our experiment.

Dataset     Attributes   Class   Instances
Ecoli       7            8       336
Vote        16           2       435
Pima        8            2       768
Australia   14           2       690
Breast      10           2       683
Segment     19           7       1540
Vehicle     18           4       846
Bank        20           2       1162

Table 1. Descriptions of domains used in our


Mark A. Hall. Department of Computer Science, Hamilton, New Zealand. Correlation-based Feature Selection for Machine Learning. Doctor of Philosophy at The University of Waikato. 1999.

is to predict whether cancer will recur in patients. There are 9 nominal attributes describing characteristics such as tumour size and location. There are 286 instances. Dna-promoter (dna) A small dataset containing 53 positive examples of E. coli promoter gene sequences and 53 negative examples. There are 55 nominal attributes representing the gene sequence. Each attribute is a DNA nucleotide


Prototype Selection for Composite Nearest Neighbor Classifiers. Department of Computer Science University of Massachusetts. 1997.

symbolic values. The data for the competition were divided into a single test and a single training set, but we have used a 10-fold cross-validation partition, for uniformity with the rest of our data sets. 12. Promoter. The task domain is E. Coli promoter gene sequences (DNA), and the task is to classify gene sequences as "promoters" or "non-promoters" based on 57 sequential DNA nucleotide


Paul Horton and Kenta Nakai. Better Prediction of Protein Cellular Localization Sites with the k Nearest Neighbors Classifier. ISMB. 1997.

used have been submitted to the UCI Machine Learning Data Repository (Murphy & Aha 1996) and are described in (Horton & Nakai 1996), (Nakai & Kanehisa 1991), and (Nakai & Kanehisa 1992). We used two datasets: an E coli dataset with 336 protein sequences labeled according to 8 classes (localization sites) and a yeast dataset with 1462 sequences labeled according to 10 classes. The occurrence of classes


Chotirat Ann and Dimitrios Gunopulos. Scaling up the Naive Bayesian Classifier: Using Decision Trees for Feature Selection. Computer Science Department University of California.

from the UCI databases, 5 of which Naïve Bayesian classifier outperforms C4.5 and the other 5 of which C4.5 outperforms Naïve Bayesian classifier.

Table 1. Descriptions of domains used

Dataset        #Attributes   #Classes   #Instances
Ecoli          8             8          336
GermanCredit   20            2          1,000
KrVsKp         37            2          3,198
Monk           6             2          554
Mushroom       22            2          8,124
Pima           8             2          768
Promoter       57            2          106
Soybean        35            19         307
Wisconsin      9             2          699
Vote           16            2


Andrew Watkins and Jon Timmis and Lois C. Boggess. Artificial Immune Recognition System (AIRS): An Immune-Inspired Supervised Learning Algorithm. (abw5,jt6@kent.ac.uk) Computing Laboratory, University of Kent.

thorough examination of the tie-breaking mechanism in the k-nn algorithm. In the course of this latter experimentation, it was found that AIRS outperforms the best reported accuracy for the E coli data set found in the UCI repository [4]. The majority of AIS techniques use the metaphor of somatic hypermutation or affinity proportional mutation. To date, AIRS does not employ this metaphor but instead


Gaurav Marwah and Lois C. Boggess. Artificial Immune Systems for Classification : Some Issues. Department of Computer Science Mississippi State University.

data. The classes of antigen occurring more frequently were allocated more resources and those occurring less frequently were allocated fewer resources. Table 5: Accuracy Rates For E coli And Yeast Data Sets Using Different Methods For Resource Allocation. Method used for resource allocation Accuracy (E.Coli) Accuracy (Yeast) Half the resources for in class ARBs and the other half for out of class

