Center for Machine Learning and Intelligent Systems

E. Coli Genes Data Set

Abstract: Data giving characteristics of each ORF (potential gene) in the E. coli genome. Sequence, homology (similarity to other genes) and structural information, and function (if known) are provided.


Original Owner and Donor:

Ross D. King
Department of Computer Science,
University of Wales Aberystwyth,
SY23 3DB, Wales
rdk '@'

Data Set Information:

The data was collected from several sources, including GenProtEC and SWISSPROT. Structure prediction was made by PROF. Homology search was provided by PSI-BLAST.

The data is in Datalog format. Missing values are not explicit, but some genes have more relationships than others.
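Because a missing value shows up only as the absence of a fact, one way to find gaps is to compare which predicates each ORF actually has. The following is a minimal sketch, not part of the data set: it assumes the facts have already been parsed and grouped by ORF into (predicate, arguments) pairs, and the names predicate_counts, absent_predicates and facts_by_orf are hypothetical.

```python
from collections import Counter


def predicate_counts(facts):
    """Count how many facts of each predicate a single ORF has."""
    return Counter(name for name, _args in facts)


def absent_predicates(facts_by_orf):
    """Map each ORF to the predicates that occur for some other ORF
    but never for this one (an implicit missing value)."""
    seen = set()
    for facts in facts_by_orf.values():
        seen.update(name for name, _args in facts)
    return {
        orf: sorted(seen - {name for name, _args in facts})
        for orf, facts in facts_by_orf.items()
    }
```

An ORF with no homology hits, for instance, would simply list e_val among its absent predicates rather than carrying an explicit placeholder value.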

E. coli genes (ORFs) are related to each other by the predicate ecoli_to_ecoli(EcoliNumber,E-value,Psi-blast_iteration). They are related to other (SWISSPROT) proteins by the predicate e_val(AccNo,E-value). All the data for a single gene (ORF) is enclosed between delimiters of the form:


The gene functional classes form a hierarchy (note: the classes may have changed since the original data collection).

There are two Datalog files.

One lists the functional classes and the ORF functions. Lines describing classes are of the following form:

class(5,1,1,'Colicin-related functions').
class(5,1,'Laterally acquirred elements').

Arguments are up to 3 numbers (describing the class at up to 3 different levels of the hierarchy), followed by a string class description. Lines describing ORF functions are of the following form:

function(ecoli210,7,0,0,'b0217','putative aminopeptidase').

Arguments are the ORF number, exactly 3 class numbers, the gene name (or the Blattner number if there is no gene name), and the ORF description.
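The class and function lines above can be read with a few lines of Python. This is a hedged sketch rather than a tool shipped with the data: the names parse_fact and load_functions are hypothetical, each fact is assumed to sit on one line, and quoted atoms are assumed to contain no backslash escapes.

```python
import csv
import re

# One ground fact per line: name(arg, ...). optionally followed by a % comment.
FACT = re.compile(r"^\s*([a-z]\w*)\((.*?)\)\.\s*(?:%.*)?$")


def parse_fact(line):
    """Return (predicate, [args]) for one Datalog fact, or None if no match."""
    m = FACT.match(line)
    if m is None:
        return None
    name, body = m.groups()
    # csv handles commas inside single-quoted descriptions.
    args = next(csv.reader([body], quotechar="'", skipinitialspace=True))
    return name, [a.strip() for a in args]


def load_functions(lines):
    """Collect class descriptions and ORF function assignments."""
    classes = {}    # tuple of 1 to 3 class numbers -> description
    functions = {}  # ORF id -> ((n1, n2, n3), gene name, description)
    for line in lines:
        fact = parse_fact(line)
        if fact is None:
            continue
        name, args = fact
        if name == "class":
            *levels, desc = args
            classes[tuple(int(x) for x in levels)] = desc
        elif name == "function":
            orf, n1, n2, n3, gene, desc = args
            functions[orf] = ((int(n1), int(n2), int(n3)), gene, desc)
    return classes, functions
```

With the three example lines above as input, classes is keyed by (5, 1, 1) and (5, 1), and functions["ecoli210"] holds the class numbers (7, 0, 0), the name 'b0217' and the description. How the three class numbers of a function line map onto the shorter class keys is not stated here, so the sketch leaves them unlinked.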


The other file contains the data for each ORF (gene), delimited by


where X is the ORF number. Other predicates are as follows (examples):
ecoli_orf(ecoliX). % X is ORF number
ecoli_mol_wt(176624.1). % float
ecoli_theo_pI(5.81). % float
ecoli_atomic_comp(c,7940). % {c,h,n,o,s} , int
ecoli_aliphatic_index(69.57). % float
ecoli_hydro(-0.549). % float
sec_struc(1,c,2). % int (start), {a,b,c}, int (length)
sec_struc_coil(1,2). % int (start), int (length)
sec_struc_beta(1,5). % int (start), int (length)
sec_struc_alpha(1,7). % int (start), int (length)
sequence_length(255). % int
amino_acid_ratio(a,8.9). % amino_acid_char, float
amino_acids(ecoli3013,a,70). % ORF_num, amino_acid_char, int
amino_acid_pair_ratio(a,a,9.0). % amino_acid_char, amino_acid_char, float
amino_acid_pairs(a,a,7). % amino_acid_char, amino_acid_char, int
ecoli_to_ecoli(1170,1.0e-105,5). % ORF_num, double (e-value), int (iteration)
e_val(o42893,2.0e-99). % accession_number, double (e-value)
psi_iter(o42893,5). % accession_number, int (iteration)
species(p52494,'candida_albicans__yeast_'). % accession_number, string
mol_wt(p52494,104022). % accession_number, int
classification(p52494,candida). % accession_number, name
keyword(p25195,'plasmid'). % accession_number, string
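The predicates listed above can be loaded into typed Python tuples and then filtered, for example to keep only strong homology hits. This is a minimal sketch under stated assumptions: parse_facts, convert and close_homologs are hypothetical names, the cutoff of 1e-50 is an arbitrary illustration, each fact is assumed to sit on one line, and any argument that looks numeric is converted even if it was quoted in the file.

```python
import csv
import re

# One ground fact per line: name(arg, ...). optionally followed by a % comment.
FACT = re.compile(r"^\s*([a-z]\w*)\((.*?)\)\.\s*(?:%.*)?$")


def convert(token):
    """Turn a Datalog argument into int or float when it parses as one."""
    for cast in (int, float):
        try:
            return cast(token)
        except ValueError:
            pass
    return token


def parse_facts(lines):
    """Yield (predicate, args) for each line holding a single ground fact."""
    for line in lines:
        m = FACT.match(line)
        if m:
            name, body = m.groups()
            row = next(csv.reader([body], quotechar="'", skipinitialspace=True))
            yield name, tuple(convert(a.strip()) for a in row)


def close_homologs(facts, max_e_value=1e-50):
    """Accession numbers whose e_val fact is at or below the cutoff."""
    return [args[0] for name, args in facts
            if name == "e_val" and args[1] <= max_e_value]


sample = [
    "ecoli_mol_wt(176624.1). % float",
    "sec_struc(1,c,2). % int (start), {a,b,c}, int (length)",
    "e_val(o42893,2.0e-99). % accession_number, double (e-value)",
    "species(p52494,'candida_albicans__yeast_'). % accession_number, string",
]
facts = list(parse_facts(sample))
print(close_homologs(facts))  # ['o42893']
```

The same pattern applies to ecoli_to_ecoli, where the E-value is the second argument and the PSI-BLAST iteration the third.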

Relevant Papers:

King, R. and Karwath, A. and Clare, A. and Dehaspe, L. (2001). The Utility of Different Representations of Protein Sequence for Predicting Functional Class, Bioinformatics, 17(5), pages 445--454.

Citation Request:

Usage Restrictions:
Copyright 2000 by R. D. King, A. Karwath, A. Clare, L. Dehaspe

There are no restrictions on data usage. This data is provided "as is" and without any express or implied warranties, including, without limitation, the implied warranties of merchantability and fitness for a particular purpose.

Citation Requests:
Please cite King et al. (2000).

This work was supported by the following grants: G78/6609, BIF08765, GR/L62849 and by PharmaDM, Ambachtenlaan, 54/D, B-3001 Leuven, Belgium
