9mers from cullpdb

Donated on 8/21/2023

The dataset consists of protein fragments of length nine, called 9mers, derived from 3,733 proteins selected by cullpdb [1]. All selected proteins have (1) a resolution below 1.6 Å, (2) an R-factor below 0.25, and (3) sequence identity below 20%. In addition, all proteins with more than 20% sequence identity to CASP13 targets were removed. All torsion-angle pairs lie in the allowed region of the Ramachandran plot: fragments containing outliers were detected with the Ramalyze function of the crystallography software PHENIX [1] and removed. The dataset has about 158,000 entries, randomly split into train, test, and validation sets in a 60/20/20 ratio.
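As an illustration of the random 60/20/20 split described above, here is a minimal sketch in Python. It is not the authors' procedure: the function name `split_indices`, the seed, and the choice to give any rounding remainder to the validation set are assumptions made for this example.

```python
import random


def split_indices(n_items, percents=(60, 20, 20), seed=0):
    """Shuffle range(n_items) and cut it into train/test/validation index lists.

    `percents` and `seed` are illustrative defaults, not values from the dataset.
    """
    if sum(percents) != 100:
        raise ValueError("percents must sum to 100")
    indices = list(range(n_items))
    # A dedicated Random instance keeps the shuffle reproducible.
    random.Random(seed).shuffle(indices)
    # Integer arithmetic avoids floating-point rounding in the cut points.
    n_train = n_items * percents[0] // 100
    n_test = n_items * percents[1] // 100
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    # Whatever is left (including any rounding remainder) goes to validation.
    validation = indices[n_train + n_test:]
    return train, test, validation


# 158_000 approximates the "about 158,000 entries" quoted in the description.
train, test, validation = split_indices(158_000)
print(len(train), len(test), len(validation))
```

Because the three lists partition one shuffled index range, no fragment can land in more than one set.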

Dataset Characteristics


Subject Area

Life Science

Associated Tasks

Classification, Regression

Introductory Paper

Efficient Generative Modelling of Protein Structure Fragments using a Deep Markov Model

By Christian B. Thygesen, Ahmad Salim Al-Sibahi, Christian S. Steenmanns, L. S. Moreta, A. B. Sørensen, T. Hamelryck. 2021

Published in bioRxiv

Variable Information

secondary_structure: 9 secondary structure labels
angle1: 9 phi torsion angles, in [-pi, pi)
angle2: 9 psi torsion angles, in [-pi, pi)
amino_acids: 9 amino acid labels
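The four variables above can be pictured as one record per fragment. The sketch below is only an illustration: the `Fragment` class, the `validate` and `wrap_angle` helpers, and the use of plain strings for labels are assumptions, and the actual file layout and label alphabets are not specified here.

```python
import math
from dataclasses import dataclass
from typing import Sequence

FRAGMENT_LENGTH = 9  # each 9mer carries nine values per variable


def wrap_angle(theta):
    """Map an angle in radians onto the half-open interval [-pi, pi)."""
    return (theta + math.pi) % (2 * math.pi) - math.pi


@dataclass
class Fragment:
    """Hypothetical in-memory record mirroring the four listed variables."""
    secondary_structure: Sequence[str]
    angle1: Sequence[float]  # phi torsion angles
    angle2: Sequence[float]  # psi torsion angles
    amino_acids: Sequence[str]


def validate(fragment):
    """Raise ValueError if a field has the wrong length or an angle is out of range."""
    for name in ("secondary_structure", "angle1", "angle2", "amino_acids"):
        if len(getattr(fragment, name)) != FRAGMENT_LENGTH:
            raise ValueError(f"{name} must have {FRAGMENT_LENGTH} entries")
    for name in ("angle1", "angle2"):
        for theta in getattr(fragment, name):
            if not -math.pi <= theta < math.pi:
                raise ValueError(f"{name} value {theta} is outside [-pi, pi)")


# Made-up values for illustration only; they are not taken from the dataset.
example = Fragment(
    secondary_structure=["H"] * 9,
    angle1=[wrap_angle(-1.0)] * 9,
    angle2=[wrap_angle(math.pi)] * 9,  # pi itself wraps to -pi
    amino_acids=["A"] * 9,
)
validate(example)
```

Wrapping before validation matters at the upper boundary: an angle of exactly pi is excluded by the half-open interval and is mapped to -pi.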

1 citation


protein sequencing


Christian Thygesen


University of Copenhagen

