Molecular Biology (Splice-junction Gene Sequences)
Donated on 12/31/1991
Primate splice-junction gene sequences (DNA) with associated imperfect domain theory
Dataset Characteristics
Sequential, Domain-Theory
Subject Area
Biology
Associated Tasks
Classification
Feature Type
Categorical
# Instances
3190
# Features
60
Dataset Information
Additional Information
Problem Description: Splice junctions are points on a DNA sequence at which "superfluous" DNA is removed during the process of protein creation in higher organisms. The problem posed in this dataset is to recognize, given a sequence of DNA, the boundaries between exons (the parts of the DNA sequence retained after splicing) and introns (the parts of the DNA sequence that are spliced out). This problem consists of two subtasks: recognizing exon/intron boundaries (referred to as EI sites), and recognizing intron/exon boundaries (IE sites). (In the biological community, IE borders are referred to as "acceptors" while EI borders are referred to as "donors".)

This dataset has been developed to help evaluate a "hybrid" learning algorithm (KBANN) that uses examples to inductively refine preexisting knowledge. Using a "ten-fold cross-validation" methodology on 1000 examples randomly selected from the complete set of 3190, the following error rates were produced by various ML algorithms (all experiments run at the Univ of Wisconsin, sometimes with local implementations of published algorithms).

System | Neither | EI | IE |
---|---|---|---|
KBANN | 4.62 | 7.56 | 8.47 |
BACKPROP | 5.29 | 5.74 | 10.75 |
PEBLS | 6.86 | 8.18 | 7.55 |
PERCEPTRON | 3.99 | 16.32 | 17.41 |
ID3 | 8.84 | 10.58 | 13.99 |
COBWEB | 11.80 | 15.04 | 9.46 |
Near. Neighbor | 31.11 | 11.65 | 9.09 |
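The sampling and ten-fold split described above can be sketched in plain Python. This is only an illustration of the general procedure: the function name `ten_fold_splits`, the seed, and the round-robin fold assignment are assumptions, and the original experiments' random sample and fold boundaries are not documented here.

```python
import random


def ten_fold_splits(n_total=3190, n_sample=1000, n_folds=10, seed=0):
    """Return a list of (train_indices, test_indices) pairs.

    Hypothetical helper: draws n_sample row indices at random from
    n_total rows, then partitions them into n_folds equal folds.
    The seed is arbitrary and does not reproduce the original study.
    """
    rng = random.Random(seed)
    # Sample without replacement from the full set of examples.
    sample = rng.sample(range(n_total), n_sample)
    # Round-robin assignment gives folds of equal size (100 each here).
    folds = [sample[i::n_folds] for i in range(n_folds)]
    splits = []
    for k in range(n_folds):
        test = folds[k]
        train = [idx for j, fold in enumerate(folds) if j != k for idx in fold]
        splits.append((train, test))
    return splits


splits = ten_fold_splits()
```

Each of the ten pairs holds 900 training indices and 100 held-out indices; a classifier would be trained and scored once per pair and the per-class error rates averaged over the ten runs.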
Has Missing Values?
No
Variables Table
Variable Name | Role | Type | Description | Units | Missing Values |
---|---|---|---|---|---|
class | Target | Categorical | | | no |
instancename | ID | Categorical | | | no |
Base1 | Feature | Categorical | | | no |
Base2 | Feature | Categorical | | | no |
Base3 | Feature | Categorical | | | no |
Base4 | Feature | Categorical | | | no |
Base5 | Feature | Categorical | | | no |
Base6 | Feature | Categorical | | | no |
Base7 | Feature | Categorical | | | no |
Base8 | Feature | Categorical | | | no |
Showing the first 10 of 62 variables.
Additional Variable Information
1. One of {n ei ie}, indicating the class.
2. The instance name.
3-62. The remaining 60 fields are the sequence, starting at position -30 and ending at position +30. Each of these fields is almost always filled by one of {a, g, t, c}. Other characters indicate ambiguity among the standard characters according to the following table:

Character | Meaning |
---|---|
D | A or G or T |
N | A or G or C or T |
S | C or G |
R | A or G |
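The field layout and ambiguity codes above can be captured in a small lookup, sketched below. The helper names `possible_bases` and `field_to_position` are hypothetical. The position mapping assumes that the 60 sequence fields cover -30 to -1 and +1 to +30 with no position 0, which the description implies but does not state outright.

```python
# Ambiguity codes exactly as listed in the table above.
AMBIGUITY = {
    "D": {"A", "G", "T"},
    "N": {"A", "G", "C", "T"},
    "S": {"C", "G"},
    "R": {"A", "G"},
}
STANDARD = {"A", "G", "T", "C"}


def possible_bases(symbol):
    """Return the set of standard bases a sequence character may stand for."""
    s = symbol.strip().upper()
    if s in STANDARD:
        return {s}
    if s in AMBIGUITY:
        return set(AMBIGUITY[s])
    raise ValueError(f"unexpected sequence character: {symbol!r}")


def field_to_position(field_index):
    """Map a sequence field (1..60, i.e. Base1..Base60) to its position.

    Assumption: there is no position 0, so field 30 is -1 and field 31 is +1.
    """
    if not 1 <= field_index <= 60:
        raise ValueError("field_index must be between 1 and 60")
    offset = field_index - 31
    return offset if offset < 0 else offset + 1
```

For example, `possible_bases("S")` yields the two bases C and G, and `field_to_position(1)` gives -30. Under the stated assumption, the junction itself would lie between fields 30 and 31.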
Dataset Files
File | Size |
---|---|
splice.data | 311.5 KB |
splice.data.Z | 83 KB |
splice.names | 5.2 KB |
splice.theory | 2.6 KB |
Index | 205 Bytes |
    pip install ucimlrepo

    from ucimlrepo import fetch_ucirepo

    # fetch dataset
    molecular_biology_splice_junction_gene_sequences = fetch_ucirepo(id=69)

    # data (as pandas dataframes)
    X = molecular_biology_splice_junction_gene_sequences.data.features
    y = molecular_biology_splice_junction_gene_sequences.data.targets

    # metadata
    print(molecular_biology_splice_junction_gene_sequences.metadata)

    # variable information
    print(molecular_biology_splice_junction_gene_sequences.variables)
Molecular Biology (Splice-junction Gene Sequences) [Dataset]. (1991). UCI Machine Learning Repository. https://doi.org/10.24432/C5M888.
License
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.