Source: Creator: Christian B. Thygesen, University of Copenhagen. Email: christiank.thygesen '@' di dot ku . dk
Data Set Information: The dataset consists of protein fragments of length nine, called 9mers, derived from 3,733 proteins selected by cullpdb [1]. All proteins satisfy three criteria: 1) resolution below 1.6 Ångström, 2) R-factor below 0.25, and 3) sequence identity below 20%. In addition, all proteins with more than 20% identity to CASP13 targets were removed. All torsion-angle pairs lie in the allowed region of the Ramachandran plot; fragments containing outliers were detected with the Ramalyze function of the crystallography software PHENIX [1] and removed. The dataset has about 158,000 entries, randomly split into train, test, and validation sets in a 60/20/20 ratio.
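The random 60/20/20 split described above can be illustrated with a minimal sketch. It is not the creators' actual procedure: the function name split_60_20_20, the fixed seed, and the use of integer indices in place of real 9mer entries are all assumptions made for illustration.

```python
import random


def split_60_20_20(entries, seed=0):
    """Shuffle entries and cut them into train/test/validation (60/20/20).

    Hypothetical helper: the dataset creators' own splitting code and
    random seed are not given in the source.
    """
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle

    n = len(shuffled)
    n_train = n * 6 // 10  # integer arithmetic avoids float rounding
    n_test = n * 2 // 10

    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # remainder, roughly 20%
    return train, test, validation


# Stand-in for the ~158,000 fragments: plain integer indices.
train, test, validation = split_60_20_20(range(158_000))
print(len(train), len(test), len(validation))
```

Splitting by shuffled index keeps the three sets disjoint and together they cover every entry exactly once.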
Attribute Information: secondary_structure: 9 secondary structure labels
Relevant Papers: Christian B. Thygesen, Ahmad Salim Al-Sibahi, Christian S. Steenmanns, Lys S. Moreta, Anders B. Sørensen, Thomas Hamelryck. 'Efficient Generative Modelling of Protein Structure Fragments using a Deep Markov Model'. International Conference on Machine Learning 2021 (to appear)
Citation Request: Christian B. Thygesen, Ahmad Salim Al-Sibahi, Christian S. Steenmanns, Lys S. Moreta, Anders B. Sørensen, Thomas Hamelryck. 'Efficient Generative Modelling of Protein Structure Fragments using a Deep Markov Model'. International Conference on Machine Learning 2021 (to appear)