Molecular Biology (Promoter Gene Sequences)
Donated on 6/29/1990
E. coli promoter gene sequences (DNA) with partial domain theory
Dataset Characteristics
Sequential, Domain-Theory
Subject Area
Biology
Associated Tasks
Classification
Feature Type
Categorical
# Instances
106
# Features
-
Dataset Information
Additional Information
This dataset was developed to help evaluate a "hybrid" learning algorithm ("KBANN") that uses examples to inductively refine preexisting knowledge. Using a "leave-one-out" methodology, various ML algorithms produced the following errors. (See Towell, Shavlik, & Noordewier, 1990, for details.)

System | Errors | Comments |
---|---|---|
KBANN | 4/106 | a hybrid ML system |
BP | 8/106 | standard backpropagation with one hidden layer |
O'Neill | 12/106 | ad hoc technique from the biological literature |
Near-Neigh | 13/106 | a nearest-neighbor algorithm (k=3) |
ID3 | 19/106 | Quinlan's decision-tree builder |

Type of domain: non-numeric, nominal (one of A, G, T, C)

Note: DNA nucleotides can be grouped into a hierarchy, shown here in a text-friendly format:

X (any)
. R (purine)
. . A
. . G
. Y (pyrimidine)
. . T
. . C
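The leave-one-out methodology mentioned above can be sketched as follows: each example is held out in turn, a classifier is built from the remaining examples, and a wrong prediction on the held-out example counts as one error. This is a minimal illustration, not the procedure used by Towell, Shavlik, & Noordewier. It assumes a 3-nearest-neighbor classifier with Hamming distance (the distance measure is an assumption; the source only states k=3), and the toy sequences are made up for the example rather than taken from the dataset.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train, query, k=3):
    """Majority vote among the k training examples closest to the query."""
    neighbors = sorted(train, key=lambda item: hamming(item[1], query))[:k]
    votes = Counter(label for label, _ in neighbors)
    return votes.most_common(1)[0][0]


def leave_one_out_errors(examples, k=3):
    """Hold out each example once and count the misclassifications."""
    errors = 0
    for i, (label, seq) in enumerate(examples):
        train = examples[:i] + examples[i + 1:]
        if knn_predict(train, seq, k) != label:
            errors += 1
    return errors


# Made-up toy data: (class, sequence) pairs, NOT real promoter sequences.
toy = [
    ("+", "ttgacatataat"),
    ("+", "ttgacgtataat"),
    ("+", "ttgacatatact"),
    ("-", "gcgcgcgcgcgc"),
    ("-", "gcgcgcgcgcga"),
    ("-", "gcgcccgcgcgc"),
]

errors = leave_one_out_errors(toy)
print(f"{errors}/{len(toy)} leave-one-out errors")
```

On the full dataset the same loop would run 106 times, which is why the errors in the table are reported out of 106.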
Has Missing Values?
No
Variables Table
Variable Name | Role | Type | Description | Units | Missing Values |
---|---|---|---|---|---|
 | | | | | no |
 | | | | | no |
 | | | | | no |
 | | | | | no |
 | | | | | no |
 | | | | | no |
 | | | | | no |
 | | | | | no |
 | | | | | no |
 | | | | | no |
Additional Variable Information
1. One of {+/-}, indicating the class ("+" = promoter).
2. The instance name (non-promoters named by position in the 1500-long nucleotide sequence provided by T. Record).
3-59. The remaining 57 fields are the sequence, starting at position -50 (p-50) and ending at position +7 (p7). Each of these fields is filled by one of {a, g, t, c}.
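The field layout described above can be illustrated with a small parser. This is a sketch under stated assumptions: it assumes each record is one comma-separated line (class, instance name, sequence) with optional whitespace around the fields, and that the positions run from -50 to +7 with no position 0, which is what makes the count come out to 57. The helper names `position_labels` and `parse_record` and the sample line are hypothetical and not taken from `promoters.data`.

```python
def position_labels(start=-50, end=7):
    """Labels p-50 ... p-1, p1 ... p7 (assumes there is no position 0)."""
    return [f"p{p}" for p in range(start, end + 1) if p != 0]


def parse_record(line):
    """Split one record into (class, name, {position label: nucleotide})."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 3:
        raise ValueError(f"expected 3 comma-separated fields, got {len(fields)}")
    label, name, sequence = fields[0], fields[1], fields[2].lower()

    if label not in {"+", "-"}:
        raise ValueError(f"unexpected class label: {label!r}")

    labels = position_labels()
    if len(sequence) != len(labels):
        raise ValueError(f"expected {len(labels)} nucleotides, got {len(sequence)}")
    if set(sequence) - set("agtc"):
        raise ValueError("sequence contains characters other than a, g, t, c")

    return label, name, dict(zip(labels, sequence))


# Made-up sample line with a 57-character placeholder sequence.
sample = "+,EXAMPLE,\t\t" + "acgt" * 14 + "a"
label, name, features = parse_record(sample)
print(label, name, len(features), features["p-50"], features["p7"])
```

Keeping the position labels as dictionary keys makes it straightforward to refer to individual sequence positions by name.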
Dataset Files
File | Size |
---|---|
promoters.data | 7 KB |
promoters.names | 3.4 KB |
promoters.theory | 1.9 KB |
Index | 172 Bytes |
    pip install ucimlrepo

    from ucimlrepo import fetch_ucirepo

    # fetch dataset
    molecular_biology_promoter_gene_sequences = fetch_ucirepo(id=67)

    # data (as pandas dataframes)
    X = molecular_biology_promoter_gene_sequences.data.features
    y = molecular_biology_promoter_gene_sequences.data.targets

    # metadata
    print(molecular_biology_promoter_gene_sequences.metadata)

    # variable information
    print(molecular_biology_promoter_gene_sequences.variables)
Harley, C., Reynolds, R., & Noordewier, M. (1987). Molecular Biology (Promoter Gene Sequences) [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5S01D.
Creators
C. Harley
R. Reynolds
M. Noordewier
DOI
10.24432/C5S01D
License
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.