9mers from cullpdb Data Set

Abstract: Protein fragments of length nine (9mers) with secondary structure labels, phi/psi torsion angles, and amino acid labels, derived from 3,733 proteins selected by cullpdb.

Data Set Characteristics: Sequential
Number of Instances: 158716
Area: Life
Attribute Characteristics: Real
Number of Attributes: 4
Date Donated: 2021-05-25
Associated Tasks: Classification, Regression
Missing Values? N/A
Number of Web Hits: 23093


Source:

Creator: Christian B. Thygesen, University of Copenhagen. Email: christiank.thygesen '@' di dot ku . dk
Donor: Ola Rønning, University of Copenhagen. Email: ola '@' di . ku dot dk


Data Set Information:

The dataset consists of protein fragments of length nine, called 9mers, derived from 3,733 proteins selected by cullpdb [1]. All selected proteins have (1) a resolution below 1.6 Ångström, (2) an R-factor below 0.25, and (3) sequence identity below 20%. In addition, all proteins with more than 20% sequence identity to CASP13 targets were removed. All torsion angle pairs lie in the allowed region of the Ramachandran plot: fragments containing outliers were detected with the Ramalyze function of the crystallography software PHENIX [2] and removed. The dataset has ∼158,000 entries, randomly split into train, test, and validation sets in a 60/20/20 ratio.
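
The 60/20/20 random split described above can be sketched as follows. This is an illustration only, not the creators' procedure: the function name `split_indices`, the seed, and the way partition sizes are rounded are all assumptions, and the data set is described as already being split.

```python
import random


def split_indices(n, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle range(n) and cut it into train/test/validation index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility (assumed seed)
    n_train = int(n * fractions[0])
    n_test = int(n * fractions[1])
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    validation = indices[n_train + n_test:]  # remainder, so no entry is dropped
    return train, test, validation


# 158716 is the number of instances listed for this data set.
train, test, validation = split_indices(158716)
print(len(train), len(test), len(validation))
```

Assigning the remainder to the last partition keeps the three sets disjoint and exhaustive even when the fractions do not divide the instance count evenly.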

References:
[1] Wang, G., & Dunbrack, R. L. (2005). PISCES: recent improvements to a PDB sequence culling server. Nucleic acids research, 33(suppl_2), W94-W98.
[2] Liebschner, D., Afonine, P. V., Baker, M. L., Bunkóczi, G., Chen, V. B., Croll, T. I., ... & Adams, P. D. (2019). Macromolecular structure determination using X-rays, neutrons, and electrons: recent developments in Phenix. Acta Crystallographica Section D: Structural Biology, 75(10), 861-877.


Attribute Information:

secondary_structure: 9 secondary structure labels
angle1: 9 phi torsion angles ([-pi,pi))
angle2: 9 psi torsion angles ([-pi,pi))
amino_acids: 9 amino acid labels
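
As an illustration of the four attributes listed above, the sketch below models one 9mer as four length-nine sequences and wraps angles into the half-open interval [-pi, pi). The names `Fragment` and `wrap_angle` are hypothetical, the labels are assumed to be stored as integer indices, and the on-disk format of the data files is not specified here.

```python
import math
from dataclasses import dataclass
from typing import List

FRAGMENT_LENGTH = 9  # residues per fragment, as stated in the description


def wrap_angle(theta: float) -> float:
    """Wrap an angle in radians into the half-open interval [-pi, pi)."""
    return (theta + math.pi) % (2 * math.pi) - math.pi


@dataclass
class Fragment:
    # Labels are assumed to be stored as integer indices (see the mapping files).
    secondary_structure: List[int]
    angle1: List[float]  # phi torsion angles, in radians
    angle2: List[float]  # psi torsion angles, in radians
    amino_acids: List[int]

    def __post_init__(self) -> None:
        # Every attribute must hold exactly one value per residue.
        for name in ("secondary_structure", "angle1", "angle2", "amino_acids"):
            if len(getattr(self, name)) != FRAGMENT_LENGTH:
                raise ValueError(f"{name} must have {FRAGMENT_LENGTH} entries")
        for angle in self.angle1 + self.angle2:
            if not -math.pi <= angle < math.pi:
                raise ValueError(f"angle {angle} is outside [-pi, pi)")


# Hypothetical values for illustration only, not taken from the data set.
example = Fragment(
    secondary_structure=[0] * 9,
    angle1=[wrap_angle(math.radians(-60.0))] * 9,
    angle2=[wrap_angle(math.radians(315.0))] * 9,  # 315 degrees wraps to -45 degrees
    amino_acids=[0] * 9,
)
```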

The mapping from amino acid label to index is given in aa_to_index.csv.
The mapping from secondary structure label to index is given in secondary_structure_to_index.csv.
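
Decoding integer indices back to labels might look like the sketch below. The CSV layout is an assumption: the code presumes one (label, index) pair per row, and the inline sample rows are made up for illustration rather than copied from aa_to_index.csv. The helper name `load_index_to_label` is hypothetical.

```python
import csv
import io


def load_index_to_label(csv_file):
    """Read (label, index) rows and return a dict from index to label."""
    mapping = {}
    for row in csv.reader(csv_file):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        label, index = row[0].strip(), row[1].strip()
        if not index.isdigit():
            continue  # skip a header row, if the file has one
        mapping[int(index)] = label
    return mapping


# Made-up sample standing in for the real file (assumed layout: label,index).
sample = io.StringIO("A,0\nC,1\nD,2\n")
index_to_aa = load_index_to_label(sample)
decoded = [index_to_aa[i] for i in [2, 0, 1]]
print(decoded)
```

With the actual file, the `io.StringIO` object would be replaced by `open("aa_to_index.csv", newline="")`, adjusting the column order if the file turns out to be laid out differently.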


Relevant Papers:

Christian B. Thygesen, Ahmad Salim Al-Sibahi, Christian S. Steenmanns, Lys S. Moreta, Anders B. Sørensen, Thomas Hamelryck. 'Efficient Generative Modelling of Protein Structure Fragments using a Deep Markov Model'. International Conference on Machine Learning 2021 (to appear)



Citation Request:

Christian B. Thygesen, Ahmad Salim Al-Sibahi, Christian S. Steenmanns, Lys S. Moreta, Anders B. Sørensen, Thomas Hamelryck. 'Efficient Generative Modelling of Protein Structure Fragments using a Deep Markov Model'. International Conference on Machine Learning 2021 (to appear)

