Similarity Prediction

Donated on 10/27/2022

Molecular similarity assessments by expert chemists. Useful for predicting how humans evaluate molecular similarity.

Dataset Characteristics

Tabular, Image

Subject Area

Physics and Chemistry

Associated Tasks

Classification

Feature Type

-

# Instances

200

# Features

-

Dataset Information

For what purpose was the dataset created?

Molecular similarity is a broad topic with implications in many areas of chemistry. It is rooted in the paradigm that ‘similar molecules have similar properties’. For this reason, methods for determining molecular similarity are widely applied in pharmaceutical companies, e.g., in the context of structure-activity relationships. Similarity evaluation is also used in chemical legislation, specifically in the procedure that decides whether a new molecule can obtain orphan drug status, with the consequent financial benefits. For this procedure, the European Medicines Agency relies on experts’ judgments. Because the perception of similarity depends on the observer, models that reproduce human perception are useful. Models built on this dataset can reduce or assist human effort in future evaluations.

Who funded the creation of the dataset?

The dataset was created by Enrico Gandini during his PhD at Università degli Studi di Milano.

What do the instances in this dataset represent?

Two CSV files containing the similarity assessments, the SMILES representations of the molecules, and the molecular descriptors described in the paper. They are accompanied by the 2D and 3D pictures shown to the experts for similarity evaluation.

Are there recommended data splits?

In the paper, the original dataset was used to test the models built on the new dataset, and vice versa. New models may benefit from combining the two datasets.
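
The two-way evaluation described above can be sketched as follows. This is a minimal illustration, not the procedure used in the paper: the row layout (a feature vector plus a binary label), the majority-class stand-in model, and the toy rows are all assumptions made for the example and do not come from the dataset.

```python
from collections import Counter

# Hypothetical row layout: (feature_vector, label). The real CSV columns differ.


def fit_majority(rows):
    """Toy stand-in for a real model: always predicts the most common training label."""
    majority = Counter(label for _, label in rows).most_common(1)[0][0]
    return lambda features: majority


def accuracy(predict, rows):
    """Fraction of rows whose label matches the prediction."""
    correct = sum(1 for features, label in rows if predict(features) == label)
    return correct / len(rows)


def cross_dataset_scores(original, new, fit=fit_majority):
    """Train on one part and test on the other, in both directions."""
    return {
        "train_new_test_original": accuracy(fit(new), original),
        "train_original_test_new": accuracy(fit(original), new),
    }


# Synthetic placeholder rows, invented for this sketch only.
original = [([0.9], 1), ([0.8], 1), ([0.2], 0)]
new = [([0.1], 0), ([0.3], 0), ([0.7], 1)]

scores = cross_dataset_scores(original, new)
print(scores)
```

Any function that takes training rows and returns a predictor can be passed as `fit`. A model trained on both parts would receive `original + new`, but it would then need a separate held-out set for testing.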

Was there any data preprocessing performed?

Standardized SMILES representations were obtained with RDKit and MolVS. Molecular descriptors were calculated with KNIME, RDKit, and OpenEye Omega and ROCS.

Additional Information

The dataset is composed of two parts: the original dataset and the new dataset. The molecules from the original dataset were standardized and processed in the same way as the molecules in the new dataset, as described in the paper, which is why both parts are included in this dataset.

Has Missing Values?

No

Introductory Paper

Molecular Similarity Perception Based on Machine-Learning Models

By Enrico Gandini, G. Marcou, F. Bonachéra, A. Varnek, S. Pieraccini, M. Sironi. 2022

Published in International Journal of Molecular Sciences

Dataset Files

File                                                                                     Size
dataset_Similarity_Prediction/new_dataset/images_2D/image_molecule_054b.svg              24.8 KB
dataset_Similarity_Prediction/new_dataset/images_2D/image_molecule_054a.svg              24.3 KB
dataset_Similarity_Prediction/new_dataset/images_2D/image_molecule_076b.svg              23 KB
dataset_Similarity_Prediction/original_training_set/images_2D/image_molecule_100a.svg    22.8 KB
dataset_Similarity_Prediction/original_training_set/images_2D/image_molecule_020b.svg    22.2 KB

Showing 5 of 802 files
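
The file paths listed above suggest a naming scheme in which each image carries a numeric identifier followed by `a` or `b`. The sketch below groups such paths by dataset part and identifier. It assumes that `a` and `b` denote the two molecules of one evaluated pair, which the listing itself does not state, and it only handles the `images_2D` layout visible in the listing. The function name `group_image_pairs` is made up for this example.

```python
import re
from collections import defaultdict

# Pattern inferred from the five listed paths; other directories may differ.
PATTERN = re.compile(
    r"^dataset_Similarity_Prediction/(?P<part>[^/]+)/images_2D/"
    r"image_molecule_(?P<pair>\d+)(?P<member>[ab])\.svg$"
)


def group_image_pairs(paths):
    """Map (dataset part, pair id) to a dict of member letter -> path."""
    pairs = defaultdict(dict)
    for path in paths:
        match = PATTERN.match(path)
        if match is None:
            continue  # skip anything that does not follow the assumed scheme
        key = (match["part"], match["pair"])
        pairs[key][match["member"]] = path
    return dict(pairs)


# The five paths shown in the file listing.
listed = [
    "dataset_Similarity_Prediction/new_dataset/images_2D/image_molecule_054b.svg",
    "dataset_Similarity_Prediction/new_dataset/images_2D/image_molecule_054a.svg",
    "dataset_Similarity_Prediction/new_dataset/images_2D/image_molecule_076b.svg",
    "dataset_Similarity_Prediction/original_training_set/images_2D/image_molecule_100a.svg",
    "dataset_Similarity_Prediction/original_training_set/images_2D/image_molecule_020b.svg",
]

pairs = group_image_pairs(listed)

# Keys for which both the "a" and the "b" image are present.
complete = sorted(key for key, members in pairs.items() if set(members) == {"a", "b"})
print(complete)
```

Run on the full download, the same grouping would make it easy to check that every identifier has both images before matching them to rows in the CSV files.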

Creators

Enrico Gandini

enricogandini93@gmail.com