Molecular Biology (Promoter Gene Sequences)

Donated on 6/29/1990

E. coli promoter gene sequences (DNA) with partial domain theory

Dataset Characteristics

Sequential, Domain-Theory

Subject Area

Biology

Associated Tasks

Classification

Feature Type

Categorical

# Instances

106

# Features

Dataset Information

Additional Information

This dataset was developed to help evaluate a "hybrid" learning algorithm ("KBANN") that uses examples to inductively refine preexisting knowledge. Using a "leave-one-out" methodology, the following errors were produced by various ML algorithms. (See Towell, Shavlik, & Noordewier, 1990, for details.)

System	Errors	Comments
KBANN	4/106	a hybrid ML system
BP	8/106	std backprop with one hidden layer
O'Neill	12/106	ad hoc technique from the bio. lit.
Near-Neigh	13/106	a nearest-neighbor algo (k=3)
ID3	19/106	Quinlan's decision-tree builder

Type of domain: non-numeric, nominal (one of A, G, T, C)

Note: DNA nucleotides can be grouped into a hierarchy, as shown below:

                X (any)
              /   \
   (purine) R       Y (pyrimidine)
           / \     / \
          A   G   T   C

Here is that hierarchy in a text-friendly format:

X (any)
. R (purine)
. . A
. . G
. Y (pyrimidine)
. . T
. . C
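The nucleotide hierarchy and the reported error counts can be written down in a few lines of Python. This is only a sketch: the names `PARENT`, `ancestors`, `matches`, and `ERRORS` are made up for illustration and are not part of the dataset files, and the counts are copied from the figures quoted above rather than recomputed.

```python
# Parent of each symbol in the nucleotide hierarchy (X, "any", is the root).
PARENT = {"A": "R", "G": "R", "T": "Y", "C": "Y", "R": "X", "Y": "X"}


def ancestors(symbol):
    """Return the chain from a symbol up to the root, e.g. A -> R -> X."""
    chain = [symbol.upper()]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def matches(pattern, base):
    """True if `base` (a, g, t, or c) falls under the node `pattern`."""
    return pattern.upper() in ancestors(base)


# Leave-one-out error counts as quoted in the description (out of 106).
ERRORS = {"KBANN": 4, "BP": 8, "O'Neill": 12, "Near-Neigh": 13, "ID3": 19}
error_rates = {name: errs / 106 for name, errs in ERRORS.items()}

for name, rate in error_rates.items():
    print(f"{name}: {rate:.1%}")
```

A pattern such as `R` then matches both `a` and `g`, which is one way a learner could generalize over purines instead of treating the four bases as unrelated symbols.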

Has Missing Values?

No

Variables Table

Variable Name	Role	Type	Description	Units	Missing Values
					no
					no
					no
					no
					no
					no
					no
					no
					no
					no

Additional Variable Information

1. One of {+/-}, indicating the class ("+" = promoter).
2. The instance name (non-promoters named by position in the 1500-long nucleotide sequence provided by T. Record).
3-59. The remaining 57 fields are the sequence, starting at position -50 (p-50) and ending at position +7 (p7). Each of these fields is filled by one of {a, g, t, c}.
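The field layout above could be parsed along the following lines. This is a minimal sketch under stated assumptions: it assumes each line holds the class, the instance name, and the 57-base sequence separated by commas (possibly padded with whitespace), which should be checked against promoters.data before use. The position labels skip 0, since 57 fields running from -50 to +7 leave no room for a position 0. The names `POSITIONS` and `parse_instance` and the sample line are hypothetical.

```python
# Position labels p-50 .. p-1, p1 .. p7 (57 labels; position 0 is skipped).
POSITIONS = [f"p{i}" for i in range(-50, 8) if i != 0]


def parse_instance(line):
    """Split one 'class,name,sequence' line into a record dict.

    Assumes exactly three comma-separated parts; raises ValueError otherwise.
    """
    label, name, sequence = (part.strip() for part in line.split(","))
    sequence = sequence.lower()
    if label not in {"+", "-"}:
        raise ValueError(f"unexpected class label: {label!r}")
    if len(sequence) != len(POSITIONS) or set(sequence) - set("agtc"):
        raise ValueError(f"bad sequence for instance {name!r}")
    return {
        "is_promoter": label == "+",
        "name": name,
        "bases": dict(zip(POSITIONS, sequence)),
    }


# Made-up line for illustration only; not a real record from the dataset.
sample = "+, EXAMPLE, \t" + "acgt" * 14 + "a"
record = parse_instance(sample)
print(record["name"], record["is_promoter"], record["bases"]["p-50"])
```

Keying the bases by position label keeps the p-50 to p7 numbering visible, which makes it easier to compare parsed records against any position-based rules.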

Dataset Files

File	Size
promoters.data	7 KB
promoters.names	3.4 KB
promoters.theory	1.9 KB
Index	172 Bytes

Creators

C. Harley

R. Reynolds

M. Noordewier

DOI

10.24432/C5S01D

License

This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.