Similarity Prediction
Donated on 10/27/2022
Molecular similarity assessments by expert chemists. Useful for the prediction of molecular similarity evaluations by humans.
Dataset Characteristics
Tabular, Image
Subject Area
Physics and Chemistry
Associated Tasks
Classification
Feature Type
-
# Instances
200
# Features
-
Dataset Information
For what purpose was the dataset created?
Molecular similarity is a broad topic with implications in many areas of chemistry. Its roots lie in the paradigm that ‘similar molecules have similar properties’. For this reason, methods for determining molecular similarity are widely applied in pharmaceutical companies, e.g., in the context of structure-activity relationships. Similarity evaluation is also used in chemical legislation, specifically in the procedure for judging whether a new molecule can obtain orphan drug status, with the consequent financial benefits. For this procedure, the European Medicines Agency relies on experts’ judgments. Because the perception of similarity depends on the observer, models that reproduce human perception are useful: models built on this dataset can reduce or assist human effort in future evaluations.
Who funded the creation of the dataset?
The dataset was created by Enrico Gandini during his PhD at Università degli Studi di Milano.
What do the instances in this dataset represent?
Two CSV files containing the similarity assessments, the SMILES representations of the molecules, and the molecular descriptors described in the paper. They are accompanied by the 2D and 3D pictures shown to the experts for similarity evaluation.
Are there recommended data splits?
In the paper, the original dataset was used to test the models built on the new dataset, and vice versa. New models can benefit from combining the two datasets.
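The cross-dataset protocol described above can be sketched as follows. This is a minimal illustration, not the authors' code: the `fit` and `score` callables, the toy records, and the binary labels are all hypothetical placeholders, and the majority-class "model" merely stands in for whatever model is actually trained.

```python
from collections import Counter
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical record: (pair identifier, binary similarity label).
Record = Tuple[str, int]


def cross_dataset_eval(
    original: Sequence[Record],
    new: Sequence[Record],
    fit: Callable,
    score: Callable,
) -> Dict[str, float]:
    """Train on one part of the dataset and test on the other, in both directions."""
    return {
        "train_new_test_original": score(fit(new), original),
        "train_original_test_new": score(fit(original), new),
    }


def fit_majority(train: Sequence[Record]) -> int:
    """Placeholder model: always predicts the most common training label."""
    return Counter(label for _, label in train).most_common(1)[0][0]


def accuracy(model: int, test: Sequence[Record]) -> float:
    """Fraction of test records whose label equals the constant prediction."""
    return sum(label == model for _, label in test) / len(test)


# Toy data, invented purely for illustration.
original = [("o1", 1), ("o2", 1), ("o3", 0)]
new = [("n1", 0), ("n2", 0), ("n3", 0), ("n4", 1)]

results = cross_dataset_eval(original, new, fit_majority, accuracy)
print(results)
```

Keeping the two directions in one function makes it straightforward to add a third entry later, for example a model trained on the combination of both parts.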
Was there any data preprocessing performed?
Standardized SMILES representations were obtained with RDKit and MolVS. Molecular descriptors were calculated with KNIME, RDKit, and OpenEye Omega and ROCS.
Additional Information
The dataset is composed of two parts: the original dataset and the new dataset. The molecules from the original dataset were standardized and processed in the same way as the molecules in the new dataset, as described in the paper, which is why both parts are included in this dataset.
Has Missing Values?
No
Introductory Paper
By Enrico Gandini, G. Marcou, F. Bonachéra, A. Varnek, S. Pieraccini, M. Sironi. 2022
Published in International Journal of Molecular Sciences
Dataset Files
File | Size |
---|---|
dataset_Similarity_Prediction/new_dataset/images_2D/image_molecule_054b.svg | 24.8 KB |
dataset_Similarity_Prediction/new_dataset/images_2D/image_molecule_054a.svg | 24.3 KB |
dataset_Similarity_Prediction/new_dataset/images_2D/image_molecule_076b.svg | 23 KB |
dataset_Similarity_Prediction/original_training_set/images_2D/image_molecule_100a.svg | 22.8 KB |
dataset_Similarity_Prediction/original_training_set/images_2D/image_molecule_020b.svg | 22.2 KB |
Showing 5 of 802 files.
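The file names listed above suggest a naming scheme that can be used to group images by molecule pair. The sketch below assumes, without confirmation from the source, that the trailing `a`/`b` letter distinguishes the two molecules of a pair and that the second path component names the dataset part; `group_pairs` is a hypothetical helper written for this example.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# File paths copied from the listing above.
paths = [
    "dataset_Similarity_Prediction/new_dataset/images_2D/image_molecule_054b.svg",
    "dataset_Similarity_Prediction/new_dataset/images_2D/image_molecule_054a.svg",
    "dataset_Similarity_Prediction/new_dataset/images_2D/image_molecule_076b.svg",
    "dataset_Similarity_Prediction/original_training_set/images_2D/image_molecule_100a.svg",
    "dataset_Similarity_Prediction/original_training_set/images_2D/image_molecule_020b.svg",
]

# Assumed pattern: image_molecule_<pair number><a|b>.svg
NAME_RE = re.compile(r"image_molecule_(\d+)([ab])\.svg$")


def group_pairs(paths):
    """Map (dataset part, pair number) -> {member letter: path}."""
    pairs = defaultdict(dict)
    for p in paths:
        path = PurePosixPath(p)
        match = NAME_RE.match(path.name)
        if match is None:
            continue  # skip files that do not follow the assumed pattern
        part = path.parts[1]  # e.g. "new_dataset" or "original_training_set"
        pairs[(part, int(match.group(1)))][match.group(2)] = p
    return dict(pairs)


pairs = group_pairs(paths)

# Keep only pairs for which both the "a" and "b" images are present.
complete = {key: members for key, members in pairs.items() if set(members) == {"a", "b"}}
print(sorted(complete))
```

With only the five listed files, a single pair has both images; run over the full download, the same function should group every image that follows the assumed pattern.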
pip install ucimlrepo
    from ucimlrepo import fetch_ucirepo

    # fetch dataset
    similarity_prediction = fetch_ucirepo(id=750)

    # data (as pandas dataframes)
    X = similarity_prediction.data.features
    y = similarity_prediction.data.targets

    # metadata
    print(similarity_prediction.metadata)

    # variable information
    print(similarity_prediction.variables)
Gandini, E. (2022). Similarity Prediction [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5WW4F.
Creators
Enrico Gandini
enricogandini93@gmail.com
DOI
10.24432/C5WW4F
License
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.