9mers from cullpdb
Donated on 8/21/2023
The dataset consists of protein fragments of length nine, called 9mers, derived from 3,733 proteins selected by cullpdb [1]. All proteins have 1) a resolution below 1.6 angstrom, 2) an R-factor below 0.25, and 3) pairwise sequence identity below 20%. In addition, all proteins with sequence identity above 20% to CASP13 targets were removed. All torsion-angle pairs lie in the allowed region of the Ramachandran plot: fragments containing outliers were detected with the Ramalyze function of the crystallography software PHENIX [1] and removed. The dataset has ~158,000 entries, randomly split into train, test, and validation sets in a 60/20/20 ratio.
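The random 60/20/20 partition described above can be illustrated with a short sketch. This is not the authors' procedure, and the dataset already ships with its own split; the function name `split_indices`, the seed, and the rounding rule (any remainder goes to the validation set) are assumptions made for illustration only.

```python
import random

def split_indices(n_items, fractions=(0.6, 0.2, 0.2), seed=0):
    """Randomly partition range(n_items) into train/test/validation index lists.

    Hypothetical helper: the fractions follow the 60/20/20 ratio quoted in the
    description, but the seed and rounding rule are illustrative assumptions.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_train = int(fractions[0] * n_items)
    n_test = int(fractions[1] * n_items)
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    validation = indices[n_train + n_test:]  # receives any rounding remainder
    return train, test, validation

# 158716 is the instance count listed on this page.
train_idx, test_idx, val_idx = split_indices(158716)
print(len(train_idx), len(test_idx), len(val_idx))
```

Shuffling indices rather than the records themselves keeps the sketch independent of how the fragments are stored on disk.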
Dataset Characteristics
Sequential
Subject Area
Biology
Associated Tasks
Classification, Regression
Feature Type
Real
# Instances
158716
# Features
-
Dataset Information
Has Missing Values?
No
Introductory Paper
By Christian B. Thygesen, Ahmad Salim Al-Sibahi, Christian S. Steenmanns, L. S. Moreta, A. B. Sørensen, T. Hamelryck. 2021
Published in bioRxiv
Variable Information
secondary_structure: 9 secondary structure labels
angle1: 9 phi torsion angles ([-pi,pi))
angle2: 9 psi torsion angles ([-pi,pi))
amino_acids: 9 amino acid labels
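Because the phi and psi angles are periodic and stored in the half-open interval [-pi, pi), values near -pi and near pi are close on the circle but far apart numerically. One common way to handle this in regression tasks is to encode each angle as a (sin, cos) pair. The sketch below is an illustrative assumption, not part of the dataset or of the introductory paper; the helper names `wrap_angle`, `encode_angle`, `decode_angle`, and `encode_fragment` are hypothetical.

```python
import math

def wrap_angle(theta):
    """Map an angle in radians to the half-open interval [-pi, pi)."""
    wrapped = (theta + math.pi) % (2 * math.pi) - math.pi
    # Guard against floating-point rounding that could yield exactly +pi.
    return -math.pi if wrapped >= math.pi else wrapped

def encode_angle(theta):
    """Represent a periodic angle as a point (sin, cos) on the unit circle."""
    return math.sin(theta), math.cos(theta)

def decode_angle(sin_value, cos_value):
    """Recover the angle in [-pi, pi) from its (sin, cos) representation."""
    return wrap_angle(math.atan2(sin_value, cos_value))

def encode_fragment(phi_angles, psi_angles):
    """Flatten the 9 phi and 9 psi angles of one 9mer into 36 features."""
    if len(phi_angles) != 9 or len(psi_angles) != 9:
        raise ValueError("a 9mer has exactly nine phi and nine psi angles")
    features = []
    for phi, psi in zip(phi_angles, psi_angles):
        features.extend(encode_angle(phi))
        features.extend(encode_angle(psi))
    return features

# Made-up angles, used only to exercise the helpers.
example_phi = [-1.0] * 9
example_psi = [2.5] * 9
example_features = encode_fragment(example_phi, example_psi)
```

With this encoding, an angle just above -pi and one just below pi map to nearby points, so a squared-error loss no longer penalizes the wrap-around.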
Dataset Files
File | Size |
---|---|
9mers.zip | 44.1 MB |
pip install ucimlrepo

from ucimlrepo import fetch_ucirepo

# fetch dataset
# (a Python identifier cannot start with a digit, so the variable is named "ninemers")
ninemers_from_cullpdb = fetch_ucirepo(id=866)

# data (as pandas dataframes)
X = ninemers_from_cullpdb.data.features
y = ninemers_from_cullpdb.data.targets

# metadata
print(ninemers_from_cullpdb.metadata)

# variable information
print(ninemers_from_cullpdb.variables)
Thygesen, C. (2021). 9mers from cullpdb [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C58024.
Creators
Christian Thygesen
christiank.thygesen@di.ku.dk
University of Copenhagen
DOI
10.24432/C58024
License
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.