Musk (Version 2)
Donated on 9/11/1994
The goal is to learn to predict whether new molecules will be musks or non-musks.
Dataset Characteristics
Multivariate
Subject Area
Physics and Chemistry
Associated Tasks
Classification
Feature Type
Integer
# Instances
6598
# Features
166
Dataset Information
Additional Information
This dataset describes a set of 102 molecules of which 39 are judged by human experts to be musks and the remaining 63 molecules are judged to be non-musks. The goal is to learn to predict whether new molecules will be musks or non-musks. However, the 166 features that describe these molecules depend upon the exact shape, or conformation, of the molecule. Because bonds can rotate, a single molecule can adopt many different shapes. To generate this data set, all the low-energy conformations of the molecules were generated to produce 6,598 conformations. Then, a feature vector was extracted that describes each conformation. This many-to-one relationship between feature vectors and molecules is called the "multiple instance problem". When learning a classifier for this data, the classifier should classify a molecule as "musk" if ANY of its conformations is classified as a musk. A molecule should be classified as "non-musk" if NONE of its conformations is classified as a musk.
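The ANY/NONE rule above can be sketched as a small aggregation step over per-conformation predictions. This is only an illustration under stated assumptions: the helper name `aggregate_to_molecules` and the toy predictions are hypothetical, and the per-conformation classifier itself is left out.

```python
from collections import defaultdict

def aggregate_to_molecules(molecule_names, conformation_preds):
    """Collapse per-conformation 0/1 predictions into one label per molecule.

    A molecule is labelled 1 (musk) if ANY of its conformations is predicted 1,
    and 0 (non-musk) if NONE of them is.
    """
    grouped = defaultdict(list)
    for name, pred in zip(molecule_names, conformation_preds):
        grouped[name].append(pred)
    return {name: int(any(preds)) for name, preds in grouped.items()}

# Toy input (hypothetical predictions, not real model output):
# three conformations of one molecule, two of another.
names = ["MUSK-188", "MUSK-188", "MUSK-188", "NON-MUSK-jp13", "NON-MUSK-jp13"]
preds = [0, 1, 0, 0, 0]

molecule_labels = aggregate_to_molecules(names, preds)
print(molecule_labels)  # {'MUSK-188': 1, 'NON-MUSK-jp13': 0}
```

Evaluation would then compare these molecule-level labels, not the individual conformation predictions, against the expert judgments.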
Has Missing Values?
No
Variables Table
Variable Name | Role | Type | Description | Units | Missing Values |
---|---|---|---|---|---|
molecule_name | ID | Categorical | Symbolic name of each molecule. Musks have names such as MUSK-188. Non-musks have names such as NON-MUSK-jp13. | | no |
conformation_name | ID | Categorical | Symbolic name of each conformation. These have the format MOL_ISO+CONF, where MOL is the molecule number, ISO is the stereoisomer number (usually 1), and CONF is the conformation number. | | no |
f1 | Feature | Integer | | | no |
f2 | Feature | Integer | | | no |
f3 | Feature | Integer | | | no |
f4 | Feature | Integer | | | no |
f5 | Feature | Integer | | | no |
f6 | Feature | Integer | | | no |
f7 | Feature | Integer | | | no |
f8 | Feature | Integer | | | no |
Showing the first 10 of 169 variables.
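The MOL_ISO+CONF format described for conformation_name can be split into its three parts with a short parser. This is a minimal sketch: the function name `parse_conformation_name` and the example string `211_1+1` are assumptions for illustration, and the parts are kept as strings because the page does not say whether they are always purely numeric.

```python
import re
from typing import NamedTuple

class ConformationId(NamedTuple):
    molecule: str       # MOL: molecule number
    stereoisomer: str   # ISO: stereoisomer number (usually 1)
    conformation: str   # CONF: conformation number

# MOL, then "_", then ISO, then "+", then CONF.
_PATTERN = re.compile(r"^(?P<mol>[^_+]+)_(?P<iso>[^_+]+)\+(?P<conf>[^_+]+)$")

def parse_conformation_name(name: str) -> ConformationId:
    """Split a MOL_ISO+CONF string into its parts; raise ValueError if malformed."""
    match = _PATTERN.match(name.strip())
    if match is None:
        raise ValueError(f"not in MOL_ISO+CONF format: {name!r}")
    return ConformationId(match["mol"], match["iso"], match["conf"])

# Hypothetical example value, chosen only to match the documented format.
parsed = parse_conformation_name("211_1+1")
print(parsed.molecule, parsed.stereoisomer, parsed.conformation)
```

The parsed parts are useful for bookkeeping, such as grouping conformations. Like the raw name, they are identifiers and should not be used as predictive inputs.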
Additional Variable Information
molecule_name: Symbolic name of each molecule. Musks have names such as MUSK-188. Non-musks have names such as NON-MUSK-jp13.
conformation_name: Symbolic name of each conformation. These have the format MOL_ISO+CONF, where MOL is the molecule number, ISO is the stereoisomer number (usually 1), and CONF is the conformation number.
f1 through f162: These are "distance features" along rays (see paper cited above). The distances are measured in hundredths of Angstroms. The distances may be negative or positive, since they are actually measured relative to an origin placed along each ray. The origin was defined by a "consensus musk" surface that is no longer used. Hence, any experiments with the data should treat these feature values as lying on an arbitrary continuous scale. In particular, the algorithm should not make any use of the zero point or the sign of each feature value.
f163: This is the distance of the oxygen atom in the molecule to a designated point in 3-space. This is also called OXY-DIS.
f164: OXY-X: X-displacement from the designated point.
f165: OXY-Y: Y-displacement from the designated point.
f166: OXY-Z: Z-displacement from the designated point.
class: 0 => non-musk, 1 => musk
Please note that the molecule_name and conformation_name attributes should not be used to predict the class.
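One way to honor the warning about the arbitrary zero point of f1 through f162 is to center and scale each feature column before modeling, so that the result does not depend on where the origin was placed. The sketch below is an assumption about preprocessing, not something the dataset authors prescribe; `standardize_column` and the sample values are hypothetical.

```python
from statistics import mean, pstdev

def standardize_column(values):
    """Return z-scores for one feature column: (value - column mean) / column std.

    Subtracting the column mean removes any dependence on the original origin.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical distance values in hundredths of Angstroms, with mixed signs.
raw = [-120, -35, 10, 48, 97]
# The same column measured from a different, equally arbitrary origin.
shifted = [v + 500 for v in raw]

z_raw = standardize_column(raw)
z_shifted = standardize_column(shifted)

# Moving the origin leaves the standardized values unchanged.
print(all(abs(a - b) < 1e-9 for a, b in zip(z_raw, z_shifted)))  # True
```

In a real experiment the mean and standard deviation would be estimated on the training conformations only and then reused for the test conformations.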
Dataset Files
File | Size |
---|---|
clean2.data.Z | 1.4 MB |
clean2.names | 66.1 KB |
clean2.info | 6.9 KB |
Index | 144 Bytes |
pip install ucimlrepo
    from ucimlrepo import fetch_ucirepo

    # fetch dataset
    musk_version_2 = fetch_ucirepo(id=75)

    # data (as pandas dataframes)
    X = musk_version_2.data.features
    y = musk_version_2.data.targets

    # metadata
    print(musk_version_2.metadata)

    # variable information
    print(musk_version_2.variables)
Chapman, D. & Jain, A. (1994). Musk (Version 2) [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C51608.
Creators
David Chapman
Ajay Jain
DOI
10.24432/C51608
License
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.