Ecoli
Donated on 8/31/1996
This dataset contains protein localization sites.
Dataset Characteristics
Multivariate
Subject Area
Biology
Associated Tasks
Classification
Feature Type
Real
# Instances
336
# Features
7
Dataset Information
Additional Information
The references below describe a predecessor to this dataset and its development. They also give results (not cross-validated) for classification by a rule-based expert system with that version of the dataset.
Reference: "Expert System for Predicting Protein Localization Sites in Gram-Negative Bacteria", Kenta Nakai & Minoru Kanehisa, PROTEINS: Structure, Function, and Genetics 11:95-110, 1991.
Reference: "A Knowledge Base for Predicting Protein Localization Sites in Eukaryotic Cells", Kenta Nakai & Minoru Kanehisa, Genomics 14:897-911, 1992.
Has Missing Values?
No
Introductory Paper
By P. Horton, K. Nakai. 1996
Published in Intelligent Systems in Molecular Biology
Variables Table
Variable Name | Role | Type | Description | Units | Missing Values |
---|---|---|---|---|---|
Sequence | ID | Categorical | Accession number for the SWISS-PROT database | | no |
mcg | Feature | Continuous | McGeoch's method for signal sequence recognition | | no |
gvh | Feature | Continuous | von Heijne's method for signal sequence recognition | | no |
lip | Feature | Binary | von Heijne's Signal Peptidase II consensus sequence score | | no |
chg | Feature | Binary | Presence of charge on N-terminus of predicted lipoproteins | | no |
aac | Feature | Continuous | Score of discriminant analysis of the amino acid content of outer membrane and periplasmic proteins | | no |
alm1 | Feature | Continuous | Score of the ALOM membrane spanning region prediction program | | no |
alm2 | Feature | Continuous | Score of ALOM program after excluding putative cleavable signal regions from the sequence | | no |
class | Target | Categorical | | | no |
Additional Variable Information
1. Sequence Name: Accession number for the SWISS-PROT database.
2. mcg: McGeoch's method for signal sequence recognition.
3. gvh: von Heijne's method for signal sequence recognition.
4. lip: von Heijne's Signal Peptidase II consensus sequence score. Binary attribute.
5. chg: Presence of charge on N-terminus of predicted lipoproteins. Binary attribute.
6. aac: score of discriminant analysis of the amino acid content of outer membrane and periplasmic proteins.
7. alm1: score of the ALOM membrane spanning region prediction program.
8. alm2: score of ALOM program after excluding putative cleavable signal regions from the sequence.
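The variable list above maps naturally onto a typed record. The following is a minimal Python sketch of that mapping. It assumes each row of `ecoli.data` is whitespace-separated, with fields in the order listed (sequence name, the seven scores, then the class label); that layout is an assumption and is not stated on this page. The `EcoliRecord` class, the `parse_row` helper, and the sample line are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass

# Column order follows the variable list above. The whitespace-separated
# layout of ecoli.data is an assumption, not confirmed by this page.
FEATURE_NAMES = ("mcg", "gvh", "lip", "chg", "aac", "alm1", "alm2")


@dataclass(frozen=True)
class EcoliRecord:
    sequence: str    # SWISS-PROT accession number (an ID, not a feature)
    features: tuple  # seven float scores, in FEATURE_NAMES order
    label: str       # localization site, e.g. "cp"


def parse_row(line: str) -> EcoliRecord:
    """Parse one whitespace-separated row into an EcoliRecord."""
    fields = line.split()
    expected = 1 + len(FEATURE_NAMES) + 1  # ID + features + class label
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    scores = tuple(float(value) for value in fields[1:-1])
    return EcoliRecord(sequence=fields[0], features=scores, label=fields[-1])


# Made-up row for illustration; it is not taken from the dataset.
SAMPLE_LINE = "EXAMPLE_ECOLI  0.50  0.40  0.48  0.50  0.45  0.30  0.35  cp"
record = parse_row(SAMPLE_LINE)

# Pair each score with its variable name for readable access.
by_name = dict(zip(FEATURE_NAMES, record.features))
```

Keeping the sequence name separate from the feature tuple mirrors the Role column of the variables table, where `Sequence` is an ID rather than a feature.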
Class Labels
cp (cytoplasm) 143
im (inner membrane without signal sequence) 77
pp (periplasm) 52
imU (inner membrane, uncleavable signal sequence) 35
om (outer membrane) 20
omL (outer membrane lipoprotein) 5
imL (inner membrane lipoprotein) 2
imS (inner membrane, cleavable signal sequence) 2
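The class counts above are heavily skewed, which matters when evaluating a classifier. The short Python sketch below copies the counts from this section and derives the total, the accuracy of always predicting the most common class, and the classes with very few examples. The names `CLASS_COUNTS` and `RARE_THRESHOLD` and the threshold of 10 are illustrative assumptions, not part of the dataset.

```python
# Class counts copied from the "Class Labels" section above.
CLASS_COUNTS = {
    "cp": 143,
    "im": 77,
    "pp": 52,
    "imU": 35,
    "om": 20,
    "omL": 5,
    "imL": 2,
    "imS": 2,
}

# Arbitrary cutoff for flagging small classes (an assumption for illustration).
RARE_THRESHOLD = 10

total = sum(CLASS_COUNTS.values())

# Accuracy obtained by always predicting the most frequent class.
majority_label = max(CLASS_COUNTS, key=CLASS_COUNTS.get)
majority_baseline = CLASS_COUNTS[majority_label] / total

# Share of the dataset held by each class.
proportions = {label: count / total for label, count in CLASS_COUNTS.items()}

# Classes with too few examples to estimate per-class performance reliably.
rare_classes = [label for label, count in CLASS_COUNTS.items() if count < RARE_THRESHOLD]

print(f"total instances: {total}")
print(f"majority class: {majority_label} ({majority_baseline:.1%})")
print(f"rare classes: {rare_classes}")
```

The counts sum to 336, matching the number of instances listed for the dataset, and a classifier that always predicts `cp` would already be right about 42.6% of the time. Because `imL` and `imS` have only two instances each, any split that keeps every class in both training and test data leaves a single example of those classes on each side.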
Dataset Files
File | Size |
---|---|
ecoli.data | 19 KB |
ecoli.names | 3 KB |
Papers Citing this Dataset
By P GaneshKumar, Siva A.P. 2016
Published in International Journal of Computer Science, Engineering and Applications.
By Dustin Tran, Minjae Kim, Finale Doshi-Velez. 2016
Published in ArXiv.
By Ying Shen, Lin Zhang. 2013
Published in ICIC.
By Joaquim Sá, João Gama, Raquel Sebastião, Luís Alexandre. 2009
Published in CAIP.
By Carlos Hernández-Espinosa, Joaquín Torres-Sospedra, Mercedes Fernández-Redondo. 2008
Published in ANNPR.
pip install ucimlrepo
from ucimlrepo import fetch_ucirepo

# fetch dataset
ecoli = fetch_ucirepo(id=39)

# data (as pandas dataframes)
X = ecoli.data.features
y = ecoli.data.targets

# metadata
print(ecoli.metadata)

# variable information
print(ecoli.variables)
Nakai, K. (1996). Ecoli [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5388M.
Creators
Kenta Nakai
DOI
10.24432/C5388M
License
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.