9mers from cullpdb

Donated on 8/21/2023

The dataset consists of protein fragments of length nine, called 9mers, derived from 3,733 proteins selected by cullpdb [1]. All proteins have (1) a resolution below 1.6 angstrom, (2) an R-factor below 0.25, and (3) sequence identity below 20%. In addition, all proteins with more than 20% sequence identity to CASP13 targets were removed. All torsion-angle pairs lie in the allowed region of the Ramachandran plot: fragments containing outliers were detected with the Ramalyze function of the crystallography software PHENIX [1] and removed. The dataset has 158,716 entries, randomly split into train, test, and validation sets with a 60/20/20 split.
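A random 60/20/20 partition like the one described above can be sketched as follows. This is only an illustration, not the procedure the creators used: the function name split_indices, the seed, and the rounding rule are assumptions, and the archive may already ship with its own fixed split. The instance count is the one listed under "# Instances".

```python
import random


def split_indices(n, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle range(n) and cut it into train/test/validation index lists.

    Hypothetical helper: the seed and the rounding rule are assumptions,
    not taken from the dataset description.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # deterministic for a given seed
    n_train = int(n * fractions[0])
    n_test = int(n * fractions[1])
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    validation = indices[n_train + n_test:]  # remainder absorbs rounding
    return train, test, validation


# 158716 is the instance count listed for this dataset.
train, test, validation = split_indices(158716)
print(len(train), len(test), len(validation))
```

Splitting index lists rather than the records themselves keeps the sketch independent of how the fragments are stored on disk.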

Dataset Characteristics

Sequential

Subject Area

Biology

Associated Tasks

Classification, Regression

Feature Type

Real

# Instances

158716

# Features

Dataset Information

Has Missing Values?

Introductory Paper

Efficient Generative Modelling of Protein Structure Fragments using a Deep Markov Model

By Christian B. Thygesen, Ahmad Salim Al-Sibahi, Christian S. Steenmanns, L. S. Moreta, A. B. Sørensen, T. Hamelryck. 2021

Published in bioRxiv

Variable Information

secondary_structure: 9 secondary structure labels
angle1: 9 phi torsion angles ([-pi,pi))
angle2: 9 psi torsion angles ([-pi,pi))
amino_acids: 9 amino acid labels
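The four variables above can be modelled as one record per fragment. The sketch below is a minimal illustration under stated assumptions: the class name Fragment, the helper wrap_angle, and the use of strings for labels are hypothetical, and the on-disk layout of 9mers.zip is not described here. Only the field names, the fragment length of nine, and the half-open angle range [-pi, pi) come from the variable description.

```python
import math
from dataclasses import dataclass
from typing import Sequence

FRAGMENT_LENGTH = 9  # each fragment is a 9mer


def wrap_angle(theta):
    """Map an angle in radians onto the half-open interval [-pi, pi)."""
    return (theta + math.pi) % (2 * math.pi) - math.pi


@dataclass(frozen=True)
class Fragment:
    """One 9mer; the label encoding (strings here) is an assumption."""
    secondary_structure: Sequence[str]
    angle1: Sequence[float]  # phi torsion angles, radians
    angle2: Sequence[float]  # psi torsion angles, radians
    amino_acids: Sequence[str]

    def __post_init__(self):
        # Every variable must hold exactly nine values.
        for name in ("secondary_structure", "angle1", "angle2", "amino_acids"):
            if len(getattr(self, name)) != FRAGMENT_LENGTH:
                raise ValueError(f"{name} must have {FRAGMENT_LENGTH} entries")
        # Torsion angles must lie in [-pi, pi).
        for name in ("angle1", "angle2"):
            for theta in getattr(self, name):
                if not (-math.pi <= theta < math.pi):
                    raise ValueError(f"{name} contains an angle outside [-pi, pi)")


# Made-up placeholder values for illustration only, not taken from the dataset.
example = Fragment(
    secondary_structure=["H"] * 9,
    angle1=[wrap_angle(-1.0)] * 9,
    angle2=[wrap_angle(-0.8)] * 9,
    amino_acids=["A"] * 9,
)
```

Angles computed elsewhere, for example in degrees or in [0, 2*pi), would need converting and passing through wrap_angle before they satisfy the range check.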

Dataset Files

File	Size
9mers.zip	44.1 MB

Keywords

protein sequencing

Creators

Christian Thygesen

christiank.thygesen@di.ku.dk

University of Copenhagen

DOI

10.24432/C58024

License

This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

This allows for the sharing and adaptation of the dataset for any purpose, provided that appropriate credit is given.