Molecular Biology (Protein Secondary Structure)
From the CMU Connectionist Bench repository; classifies the secondary structure of certain globular proteins.
Dataset Characteristics
Sequential
Subject Area
Biology
Associated Tasks
Classification
Feature Type
Categorical
# Instances
128
# Features
-
Dataset Information
Additional Information
This data set was used by Ning Qian and Terry Sejnowski in their study of neural-network prediction of the secondary structure of certain globular proteins [1]. The task is to take a linear sequence of amino acids and predict, for each amino acid, which secondary structure it belongs to within the protein. There are three classes: alpha-helix, beta-sheet, and random-coil. The data set contains a large training set and a distinct test set for evaluating the resulting network. Qian and Sejnowski use a NETtalk-like approach and report an accuracy of 64.3% on the test set; they speculate that this is about the best that can be done using only local context. The folder also contains a domain theory, created and donated by Jude Shavlik and Rich Maclin.
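The per-residue, local-context setup described above can be sketched as a sliding window over the sequence. This is only an illustration under stated assumptions, not the authors' implementation: the alphabet, the padding symbol, the window width, the class letters, and the toy sequence are all hypothetical and are not taken from the data set files.

```python
from typing import List

# Illustrative 20-letter amino-acid alphabet; the data set's own symbols may differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "-"  # hypothetical spacer for window positions beyond the chain ends
SYMBOLS = AMINO_ACIDS + PAD


def one_hot(symbol: str) -> List[int]:
    """Return a one-hot vector over SYMBOLS for a single residue."""
    vec = [0] * len(SYMBOLS)
    vec[SYMBOLS.index(symbol)] = 1
    return vec


def windows(sequence: str, half_width: int = 6) -> List[List[int]]:
    """Encode every residue together with its neighbours on both sides.

    half_width is an assumed parameter; 6 yields a 13-residue window.
    """
    padded = PAD * half_width + sequence + PAD * half_width
    encoded = []
    for i in range(len(sequence)):
        window = padded[i : i + 2 * half_width + 1]
        # Concatenate the one-hot vectors of all residues in the window.
        encoded.append([bit for symbol in window for bit in one_hot(symbol)])
    return encoded


def per_residue_accuracy(predicted: str, actual: str) -> float:
    """Fraction of residues whose predicted class matches the true class."""
    if len(predicted) != len(actual):
        raise ValueError("label strings must have equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Toy input: made up for illustration, not drawn from the data set.
toy_sequence = "MKVLAAGIVG"
# Placeholder class letters: H = alpha-helix, E = beta-sheet, C = random-coil.
toy_labels = "CHHHHHHEEC"
features = windows(toy_sequence, half_width=2)
```

Each entry of `features` is the input for one residue, and the matching character of `toy_labels` is its target class. A classifier trained on such pairs sees only the local window, which is the restriction that the 64.3% figure refers to.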
Has Missing Values?
No
Dataset Files
File | Size |
---|---|
protein-secondary-structure.train | 71.8 KB |
protein-secondary-structure.test | 14.2 KB |
protein-secondary-structure.theory | 11.2 KB |
protein-secondary-structure.names | 1.9 KB |
Index | 285 Bytes |
pip install ucimlrepo
from ucimlrepo import fetch_ucirepo

# fetch dataset
molecular_biology_protein_secondary_structure = fetch_ucirepo(id=68)

# data (as pandas dataframes)
X = molecular_biology_protein_secondary_structure.data.features
y = molecular_biology_protein_secondary_structure.data.targets

# metadata
print(molecular_biology_protein_secondary_structure.metadata)

# variable information
print(molecular_biology_protein_secondary_structure.variables)
Sejnowski, T. & Qian, N. (1988). Molecular Biology (Protein Secondary Structure) [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5SP4F.
Creators
Terry Sejnowski
Ning Qian
DOI
10.24432/C5SP4F
License
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.