PubChem Bioassay Data

Donated on 3/28/2011

These highly imbalanced bioassay datasets come from the different types of screening that can be performed using high-throughput screening (HTS) technology. 21 datasets were created from 12 bioassays.

Dataset Characteristics

Multivariate

Subject Area

Biology

Associated Tasks

Classification

Feature Type

Integer, Real

# Instances

# Features

Dataset Information

Additional Information

21 bioassay datasets generated from PubChem, covering both primary and confirmatory bioassays (12 bioassays, 21 mixes). The data is provided in the same train/test split as the original paper. The compound IDs have been provided in separate files in case people wish to generate their own molecular representation. The order of the compound IDs is the same as in the data files.

• AID362 details the results of a primary screening bioassay for Formylpeptide Receptor Ligand Binding from the University of New Mexico Center for Molecular Discovery. It is a relatively small dataset with 4,279 compounds and a ratio of 1 active to 70 inactive compounds (1.4% minority class). The compounds were selected on the basis of preliminary virtual screening of approximately 480,000 drug-like small molecules from Chemical Diversity Laboratories.
• AID456 is a primary screen assay from the Burnham Center for Chemical Genomics for inhibition of TNFα-induced VCAM-1 cell surface expression. It consists of 9,982 compounds with a ratio of 1 active compound to 368 inactive compounds (0.27% minority). The compounds were selected for their known drug-like properties, and 9,431 meet the Rule of 5 [19].
• AID688 is the result of a primary screen for Yeast eIF2B from the Penn Center for Molecular Discovery and contains activity information for 27,198 compounds with a ratio of 1 active compound to 108 inactive compounds (0.91% minority). The screen is a reporter-gene assay, and 25,656 of the compounds have known drug-like properties.
• AID604 is a primary screening bioassay for Rho kinase 2 inhibitors from the Scripps Research Institute Molecular Screening Center. The bioassay contains activity information for 59,788 compounds with a ratio of 1 active compound to 281 inactive compounds (0.35% minority). 57,546 of the compounds have known drug-like properties.
• AID373 is a primary screen from the Scripps Research Institute Molecular Screening Center for endothelial differentiation, sphingolipid G-protein-coupled receptor, 3. 59,788 compounds were screened with a ratio of 1 active compound to 963 inactive compounds (0.1% minority). 57,546 of the compounds screened had known drug-like properties.
• AID746 is a primary screen from the Scripps Research Institute Molecular Screening Center for Mitogen-activated protein kinase. 59,788 compounds were screened with a ratio of 1 active compound to 162 inactive compounds (0.61% minority). 57,546 of the compounds screened had known drug-like properties.
• AID687 is the result of a primary screen for coagulation factor XI from the Penn Center for Molecular Discovery and contains activity information for 33,067 compounds with a ratio of 1 active compound to 350 inactive compounds (0.28% minority). 30,353 of the compounds screened had known drug-like properties.
• AID1608 is a different type of screening assay that was used to identify compounds that prevent HttQ103-induced cell death (National Institute of Neurological Disorders and Stroke Approved Drug Program). The compounds that prevent the release of a certain chemical into the growth medium are labelled as active, and the remaining compounds are labelled as having inconclusive activity. AID1608 is a small dataset with 1,033 compounds and a ratio of 1 active to 14 inconclusive compounds (6.58% minority class).
• AID644 is a confirmatory screen of AID604.
• AID1284 is a confirmatory screen of AID746.
• AID439 is a confirmatory screen of AID373.
• AID721 is a confirmatory screen of AID687.
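The minority-class percentages quoted for the primary screens follow from the active-to-inactive ratios. As a rough worked check, a ratio of 1 active to N inactive compounds gives a minority share of 100 / (1 + N) percent. The function name `minority_percent` and the `ratios` dictionary below are illustrative only and not part of the dataset.

```python
def minority_percent(inactive_per_active: float) -> float:
    """Percentage of active compounds for a 1:N active-to-inactive ratio."""
    return 100.0 / (1.0 + inactive_per_active)


# Inactive compounds per active compound, as quoted in the descriptions above.
ratios = {
    "AID362": 70,
    "AID456": 368,
    "AID688": 108,
    "AID373": 963,
    "AID746": 162,
    "AID687": 350,
}

for aid, n_inactive in ratios.items():
    print(f"{aid}: 1:{n_inactive} -> {minority_percent(n_inactive):.2f}% active")
```

Because the quoted ratios are themselves rounded, small differences from the published percentages are to be expected.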

Has Missing Values?

Variable Information

Each attribute has been fully described in the Open Access publication. The data is a mixture of boolean, integer and real values. There are only two classes, Active and Inactive, and the data is highly imbalanced.
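To see the imbalance in a given file, one could count the class labels directly. The sketch below uses only the Python standard library and rests on assumptions not confirmed by this page: that each CSV has a header row, that the class label sits in the last column, and that the labels are spelled `Active` and `Inactive`. The helper names `class_counts` and `imbalance_ratio` are hypothetical.

```python
import csv
from collections import Counter


def class_counts(csv_path: str, label_column: int = -1) -> Counter:
    """Count class labels in one data file.

    Assumes a header row and the label in `label_column`
    (default: last column); both are assumptions, not documented facts.
    """
    counts = Counter()
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the assumed header row
        for row in reader:
            if row:  # ignore blank lines
                counts[row[label_column].strip()] += 1
    return counts


def imbalance_ratio(counts: Counter, active: str = "Active",
                    inactive: str = "Inactive"):
    """Inactive compounds per active compound, or None if there are no actives."""
    if counts[active] == 0:
        return None
    return counts[inactive] / counts[active]
```

A call such as `class_counts("VirtualScreeningData/AID373red_train.csv")` would then give the per-class totals for that training file, provided the layout assumptions hold.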

Dataset Files

File	Size
VirtualScreeningData/AID373red_train.csv	22.3 MB
VirtualScreeningData/AID373AID439red_train.csv	22.2 MB
VirtualScreeningData/AID604red_train.csv	22.2 MB
VirtualScreeningData/AID746red_train.csv	22.2 MB
VirtualScreeningData/AID746AID1284red_train.csv	22.2 MB

Creators

Amanda Schierz

DOI

10.24432/C5FK62

License

This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.