
Molecular Biology (Protein Secondary Structure)
From the CMU connectionist bench repository; classifies the secondary structure of certain globular proteins.
Dataset Characteristics
Sequential
Subject Area
Life Science
Associated Tasks
Classification
Feature Type
Categorical
# Instances
128
# Features
-
Dataset Information
Additional Information
This is a data set used by Ning Qian and Terry Sejnowski in their study using a neural net to predict the secondary structure of certain globular proteins [1]. The idea is to take a linear sequence of amino acids and to predict, for each amino acid, which secondary structure it is part of within the protein. There are three choices: alpha-helix, beta-sheet, and random-coil. The data set contains both a large set of training data and a distinct set of data that can be used for testing the resulting network. Qian and Sejnowski use a NETtalk-like approach and report an accuracy of 64.3% on the test set, and they speculate that this is about the best that can be done using only local context. There is also a domain theory in the folder, created and donated by Jude Shavlik and Rich Maclin.
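The per-residue prediction from local context described above can be sketched as a sliding-window encoding. This is a minimal illustration, not the authors' code: the window width of 13, the `-` spacer symbol, the helper names `encode_windows` and `per_residue_accuracy`, and the label letters `h`, `e`, and `_` are all assumptions made for the example.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes
PAD = "-"  # hypothetical spacer symbol for window positions past either end
ALPHABET = AMINO_ACIDS + PAD
INDEX = {symbol: i for i, symbol in enumerate(ALPHABET)}


def encode_windows(sequence: str, window: int = 13) -> List[List[int]]:
    """Return one flat one-hot vector per residue, built from a centered window.

    The window width is an assumed default, not a value stated on this page.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so it has a central residue")
    half = window // 2
    # Pad both ends so every residue can sit at the center of a full window.
    padded = PAD * half + sequence + PAD * half
    vectors = []
    for center in range(len(sequence)):
        vector = [0] * (window * len(ALPHABET))
        for offset, symbol in enumerate(padded[center:center + window]):
            # One slot per (window position, symbol) pair.
            vector[offset * len(ALPHABET) + INDEX[symbol]] = 1
        vectors.append(vector)
    return vectors


def per_residue_accuracy(predicted: str, actual: str) -> float:
    """Fraction of positions whose predicted label matches the true label."""
    if len(predicted) != len(actual):
        raise ValueError("label strings must have equal length")
    if not actual:
        raise ValueError("label strings must be non-empty")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Toy sequence and labels, invented for illustration only.
windows = encode_windows("MKVLA", window=5)
score = per_residue_accuracy("hhe__", "hhee_")
```

Each residue yields one input vector covering only its neighbors, which is the sense in which the prediction relies on local context. A classifier trained on these vectors would emit one of the three classes per residue, and `per_residue_accuracy` computes the kind of per-position score that the reported test-set figure refers to.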
Has Missing Values?
No
Sejnowski, Terry and Qian, Ning. Molecular Biology (Protein Secondary Structure). UCI Machine Learning Repository. https://doi.org/10.24432/C5SP4F.
@misc{misc_molecular_biology_(protein_secondary_structure)_68,
  author       = {Sejnowski,Terry and Qian,Ning},
  title        = {{Molecular Biology (Protein Secondary Structure)}},
  howpublished = {UCI Machine Learning Repository},
  note         = {{DOI}: https://doi.org/10.24432/C5SP4F}
}
pip install ucimlrepo
from ucimlrepo import fetch_ucirepo

# fetch dataset
molecular_biology_protein_secondary_structure = fetch_ucirepo(id=68)

# data (as pandas dataframes)
X = molecular_biology_protein_secondary_structure.data.features
y = molecular_biology_protein_secondary_structure.data.targets

# metadata
print(molecular_biology_protein_secondary_structure.metadata)

# variable information
print(molecular_biology_protein_secondary_structure.variables)
Creators
Terry Sejnowski
Ning Qian
DOI
10.24432/C5SP4F
License
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.