9mers from cullpdb

Donated on 8/21/2023

The dataset consists of protein fragments of length nine, called 9mers, derived from 3,733 proteins selected by cullpdb [1]. All proteins satisfy three criteria: (1) resolution better than 1.6 angstrom, (2) R-factor below 0.25, and (3) pairwise sequence identity below 20%. In addition, all proteins with more than 20% sequence identity to CASP13 targets were removed. All torsion-angle pairs lie in the allowed region of the Ramachandran plot: fragments containing outliers were detected with the Ramalyze function of the crystallography software PHENIX [1] and removed. The dataset has 158,716 entries, randomly split into train, test, and validation sets in a 60/20/20 ratio.
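The 60/20/20 random split described above can be illustrated with a minimal sketch. The dataset ships with its own split; the shuffling procedure and seed used by the creators are not documented here, so the function name `split_indices`, the seed, and the rounding rule below are assumptions for illustration only.

```python
import random

N_ENTRIES = 158716  # instance count reported for this dataset


def split_indices(n, fractions=(0.6, 0.2, 0.2), seed=0):
    """Randomly partition range(n) into train, test, and validation index lists.

    Hypothetical helper: the seed and rounding rule are illustrative
    assumptions, not the procedure used to build the published split.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # local RNG, so the result is reproducible
    n_train = int(fractions[0] * n)
    n_test = int(fractions[1] * n)
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    validation = indices[n_train + n_test:]  # remainder absorbs rounding
    return train, test, validation


train, test, validation = split_indices(N_ENTRIES)
print(len(train), len(test), len(validation))
```

Because the sizes are rounded down for the first two subsets, the last subset receives any leftover entries, so the three subsets always cover every index exactly once.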

Dataset Characteristics

Sequential

Subject Area

Biology

Associated Tasks

Classification, Regression

Feature Type

Real

# Instances

158716

# Features

-

Dataset Information

Has Missing Values?

No

Introductory Paper

Efficient Generative Modelling of Protein Structure Fragments using a Deep Markov Model

By Christian B. Thygesen, Ahmad Salim Al-Sibahi, Christian S. Steenmanns, L. S. Moreta, A. B. Sørensen, T. Hamelryck. 2021

Published in bioRxiv

Variable Information

secondary_structure: 9 secondary structure labels
angle1: 9 phi torsion angles ([-pi,pi))
angle2: 9 psi torsion angles ([-pi,pi))
amino_acids: 9 amino acid labels
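As a rough sketch of how one entry with these four variables might be held in memory, the following defines a small record type that checks the fragment length and the [-pi,pi) angle range. The class name `NineMer`, the helper `wrap_angle`, and the use of plain strings for the labels are assumptions; the on-disk file format and the label alphabets are not specified here.

```python
import math
from dataclasses import dataclass
from typing import Sequence

FRAGMENT_LENGTH = 9


def wrap_angle(theta: float) -> float:
    """Map an angle in radians onto the half-open interval [-pi, pi)."""
    return (theta + math.pi) % (2 * math.pi) - math.pi


@dataclass(frozen=True)
class NineMer:
    """Hypothetical in-memory record for one fragment (not an official schema)."""
    secondary_structure: Sequence[str]  # 9 labels, treated as opaque strings
    angle1: Sequence[float]             # 9 phi torsion angles, radians
    angle2: Sequence[float]             # 9 psi torsion angles, radians
    amino_acids: Sequence[str]          # 9 labels, treated as opaque strings

    def __post_init__(self):
        # Every variable must have exactly one value per residue.
        for name in ("secondary_structure", "angle1", "angle2", "amino_acids"):
            if len(getattr(self, name)) != FRAGMENT_LENGTH:
                raise ValueError(f"{name} must have {FRAGMENT_LENGTH} values")
        # Angles must already lie in [-pi, pi); use wrap_angle beforehand if not.
        for name in ("angle1", "angle2"):
            for angle in getattr(self, name):
                if not (-math.pi <= angle < math.pi):
                    raise ValueError(f"{name} contains an angle outside [-pi, pi)")


# Usage with made-up placeholder values (not taken from the dataset).
example = NineMer(
    secondary_structure=["H"] * 9,
    angle1=[wrap_angle(-1.0)] * 9,
    angle2=[wrap_angle(-0.8)] * 9,
    amino_acids=["A"] * 9,
)
```

Wrapping with the modulo operation keeps the interval half-open, so an input of exactly pi maps to -pi rather than producing two representations of the same angle.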

Keywords

protein sequencing

Creators

Christian Thygesen

christiank.thygesen@di.ku.dk

University of Copenhagen
