Ecoli

Donated on 8/31/1996

This dataset contains protein localization sites.

Dataset Characteristics

Multivariate

Subject Area

Biology

Associated Tasks

Classification

Feature Type

Real

# Instances

336

# Features

7

Dataset Information

Additional Information

The references below describe a predecessor to this dataset and its development. They also give results (not cross-validated) for classification by a rule-based expert system with that version of the dataset.

Reference: "Expert System for Predicting Protein Localization Sites in Gram-Negative Bacteria", Kenta Nakai & Minoru Kanehisa, PROTEINS: Structure, Function, and Genetics 11:95-110, 1991.

Reference: "A Knowledge Base for Predicting Protein Localization Sites in Eukaryotic Cells", Kenta Nakai & Minoru Kanehisa, Genomics 14:897-911, 1992.

Has Missing Values?

No

Introductory Paper

A Probabilistic Classification System for Predicting the Cellular Localization Sites of Proteins

By P. Horton, K. Nakai. 1996

Published in Intelligent Systems in Molecular Biology

Variables Table

Variable Name | Role | Type | Demographic | Description | Units | Missing Values
Sequence | ID | Categorical | | Accession number for the SWISS-PROT database | | no
mcg | Feature | Continuous | | McGeoch's method for signal sequence recognition | | no
gvh | Feature | Continuous | | von Heijne's method for signal sequence recognition | | no
lip | Feature | Binary | | von Heijne's Signal Peptidase II consensus sequence score | | no
chg | Feature | Binary | | Presence of charge on N-terminus of predicted lipoproteins | | no
aac | Feature | Continuous | | Score of discriminant analysis of the amino acid content of outer membrane and periplasmic proteins | | no
alm1 | Feature | Continuous | | Score of the ALOM membrane spanning region prediction program | | no
alm2 | Feature | Continuous | | Score of ALOM program after excluding putative cleavable signal regions from the sequence | | no
class | Target | Categorical | | | | no


Additional Variable Information

1. Sequence Name: Accession number for the SWISS-PROT database.
2. mcg: McGeoch's method for signal sequence recognition.
3. gvh: von Heijne's method for signal sequence recognition.
4. lip: von Heijne's Signal Peptidase II consensus sequence score. Binary attribute.
5. chg: Presence of charge on N-terminus of predicted lipoproteins. Binary attribute.
6. aac: score of discriminant analysis of the amino acid content of outer membrane and periplasmic proteins.
7. alm1: score of the ALOM membrane spanning region prediction program.
8. alm2: score of ALOM program after excluding putative cleavable signal regions from the sequence.
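
The attribute order listed above can be turned into a small row parser. The sketch below is illustrative only: it assumes each row of the data file is whitespace-delimited, with the sequence name first, the seven scores next in the listed order, and the class label last. The `EcoliRecord` type, the `parse_row` helper, and the sample row are hypothetical and not part of the dataset.

```python
from dataclasses import dataclass

# Column order follows the numbered attribute list above; the class label is
# assumed to be the final field of each row.
FEATURE_NAMES = ["mcg", "gvh", "lip", "chg", "aac", "alm1", "alm2"]


@dataclass
class EcoliRecord:
    sequence_name: str
    features: dict
    label: str


def parse_row(line: str) -> EcoliRecord:
    """Parse one whitespace-delimited row (assumed layout, not confirmed)."""
    fields = line.split()
    expected = 1 + len(FEATURE_NAMES) + 1  # name + 7 scores + class label
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    values = [float(v) for v in fields[1:-1]]
    return EcoliRecord(fields[0], dict(zip(FEATURE_NAMES, values)), fields[-1])


# Hypothetical row for illustration only; it is not taken from the dataset.
sample = "EXAMPLE_ECOLI 0.50 0.40 0.48 0.50 0.55 0.30 0.35 cp"
record = parse_row(sample)
print(record.sequence_name, record.features["lip"], record.label)
```

Keeping the feature names in one list means the same constant can label columns in whatever analysis library is used afterwards.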

Class Labels

cp (cytoplasm) 143
im (inner membrane without signal sequence) 77
pp (periplasm) 52
imU (inner membrane, uncleavable signal sequence) 35
om (outer membrane) 20
omL (outer membrane lipoprotein) 5
imL (inner membrane lipoprotein) 2
imS (inner membrane, cleavable signal sequence) 2
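
The class counts above are heavily imbalanced, which matters when choosing an evaluation scheme. As a rough sketch, the snippet below checks that the counts add up to the 336 instances reported for the dataset, computes the accuracy of always predicting the largest class, and flags classes too small to appear in every fold. The fold count of 5 is an assumption made for illustration, not something the dataset page specifies.

```python
# Class counts copied from the class label list above.
CLASS_COUNTS = {
    "cp": 143, "im": 77, "pp": 52, "imU": 35,
    "om": 20, "omL": 5, "imL": 2, "imS": 2,
}

total = sum(CLASS_COUNTS.values())
print(total)  # 336, matching the stated number of instances

# Accuracy of a trivial classifier that always predicts the majority class.
majority_label = max(CLASS_COUNTS, key=CLASS_COUNTS.get)
baseline = CLASS_COUNTS[majority_label] / total
print(majority_label, round(baseline, 3))  # cp 0.426

# Assumed fold count; classes with fewer samples than folds cannot be
# represented in every fold of a stratified split.
N_FOLDS = 5
rare = [label for label, n in CLASS_COUNTS.items() if n < N_FOLDS]
print(rare)  # ['imL', 'imS']
```

Any classifier evaluated on this data should beat the majority baseline, and per-class metrics are more informative than overall accuracy for the smallest classes.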

Papers Citing this Dataset

Incremental Semi-Supervised Clustering Method Using Neighbourhood Assignment

By P GaneshKumar, Siva A.P. 2016

Published in International Journal of Computer Science, Engineering and Applications.

Spectral M-estimation with Applications to Hidden Markov Models

By Dustin Tran, Minjae Kim, Finale Doshi-Velez. 2016

Published in ArXiv.

Improving Classification Accuracy Using Gene Ontology Information

By Ying Shen, Lin Zhang. 2013

Published in ICIC.

Decision Trees Using the Minimum Entropy-of-Error Principle

By Joaquim Sá, João Gama, Raquel Sebastião, Luís Alexandre. 2009

Published in CAIP.

Researching on Multi-net Systems Based on Stacked Generalization

By Carlos Hernández-Espinosa, Joaquín Torres-Sospedra, Mercedes Fernández-Redondo. 2008

Published in ANNPR.



Creators

Kenta Nakai
