Similarity Prediction

Donated on 10/27/2022

Molecular similarity assessments by expert chemists. Useful for the prediction of molecular similarity evaluations by humans.

Dataset Characteristics

Tabular, Image

Subject Area

Physics and Chemistry

Associated Tasks

Classification

Feature Type

# Instances

200

# Features

Dataset Information

For what purpose was the dataset created?

Molecular similarity is a broad topic with implications across many areas of chemistry. It is rooted in the paradigm that ‘similar molecules have similar properties’. For this reason, methods for determining molecular similarity are widely applied in pharmaceutical companies, e.g., in the context of structure-activity relationships. Similarity evaluation is also used in chemical legislation, specifically in the procedure for judging whether a new molecule can obtain orphan drug status, with the consequent financial benefits. For this procedure, the European Medicines Agency relies on experts’ judgments. Because the perception of similarity depends on the observer, models that reproduce human perception are useful: models built on this dataset can reduce or assist human effort in future evaluations.

Who funded the creation of the dataset?

The dataset was created by Enrico Gandini during his PhD at Università degli Studi di Milano.

What do the instances in this dataset represent?

The dataset comprises two CSV files containing the similarity assessments, the SMILES representations of the molecules, and the molecular descriptors described in the paper. The 2D and 3D images shown to the experts for similarity evaluation are also included.

Are there recommended data splits?

In the paper, the original dataset was used to test the models built on the new dataset, and vice versa. New models can benefit from combining the two datasets.
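The cross-testing scheme described here (train on one part, test on the other, then swap) can be sketched as follows. This is a minimal illustration, not the procedure from the paper: the majority-class baseline, the function names, and the toy rows with binary labels (1 = judged similar, 0 = not) are all hypothetical stand-ins for real models and the real CSV contents.

```python
from collections import Counter


def fit_majority(rows):
    """Toy baseline (hypothetical): return the most common label in the training rows."""
    counts = Counter(label for _, label in rows)
    return counts.most_common(1)[0][0]


def accuracy(predicted_label, rows):
    """Fraction of rows whose label equals the single predicted label."""
    return sum(label == predicted_label for _, label in rows) / len(rows)


def cross_dataset_eval(original, new):
    """Train on one part and test on the other, in both directions."""
    return {
        "train_new_test_original": accuracy(fit_majority(new), original),
        "train_original_test_new": accuracy(fit_majority(original), new),
    }


# Hypothetical toy rows: (pair identifier, label). Not taken from the dataset.
original = [("o1", 1), ("o2", 1), ("o3", 0), ("o4", 1)]
new = [("n1", 0), ("n2", 0), ("n3", 1)]

scores = cross_dataset_eval(original, new)
print(scores)
```

Keeping the two directions as separate scores makes it visible when a model transfers well one way but not the other, which a single pooled score would hide.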

Was there any data preprocessing performed?

Standardized SMILES representations were obtained with RDKit and MolVS. Molecular descriptors were calculated with KNIME, RDKit, and OpenEye Omega and ROCS.

Additional Information

The dataset is composed of two parts: the original dataset and the new dataset. The molecules from the original dataset were standardized and processed in the same way as the molecules in the new dataset, as described in the paper, which is why both parts are included in this dataset.

Has Missing Values?

Introductory Paper

By Enrico Gandini, G. Marcou, F. Bonachéra, A. Varnek, S. Pieraccini, M. Sironi. 2022

Published in International Journal of Molecular Sciences

Dataset Files

File	Size
dataset_Similarity_Prediction/new_dataset/images_2D/image_molecule_054b.svg	24.8 KB
dataset_Similarity_Prediction/new_dataset/images_2D/image_molecule_054a.svg	24.3 KB
dataset_Similarity_Prediction/new_dataset/images_2D/image_molecule_076b.svg	23 KB
dataset_Similarity_Prediction/original_training_set/images_2D/image_molecule_100a.svg	22.8 KB
dataset_Similarity_Prediction/original_training_set/images_2D/image_molecule_020b.svg	22.2 KB
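The paths in the listing follow a regular layout (dataset part, image folder, file name), so they can be split into their components with the standard library. The sketch below uses only the five paths shown above. Reading the trailing letter (`a` or `b`) as the member of a molecule pair and the preceding digits as a pair number is an assumption based on the file names, not something the page states.

```python
from collections import Counter
from pathlib import PurePosixPath

# The five paths shown in the file listing above.
paths = [
    "dataset_Similarity_Prediction/new_dataset/images_2D/image_molecule_054b.svg",
    "dataset_Similarity_Prediction/new_dataset/images_2D/image_molecule_054a.svg",
    "dataset_Similarity_Prediction/new_dataset/images_2D/image_molecule_076b.svg",
    "dataset_Similarity_Prediction/original_training_set/images_2D/image_molecule_100a.svg",
    "dataset_Similarity_Prediction/original_training_set/images_2D/image_molecule_020b.svg",
]


def describe(path):
    """Split one listing path into dataset part, image folder, and file-name tag."""
    p = PurePosixPath(path)
    _root, part, image_dir, _filename = p.parts
    tag = p.stem.rsplit("_", 1)[-1]  # e.g. "054b"
    # Assumption: digits are a pair number, the final letter is the pair member.
    return {"part": part, "view": image_dir, "pair": tag[:-1], "member": tag[-1]}


records = [describe(p) for p in paths]
by_part = Counter(r["part"] for r in records)
print(by_part)
```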

Download (1.6 MB)

Keywords

Chemistry Cheminformatics Molecular Similarity Small Molecule

Creators

Enrico Gandini

enricogandini93@gmail.com

DOI

10.24432/C5WW4F

License

This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.