HEPMASS

Donated on 1/27/2016

The search for exotic particles requires sorting through a large number of collisions to find the events of interest. This data set challenges one to detect a new particle of unknown mass.

Dataset Characteristics

Multivariate

Subject Area

Physics and Chemistry

Associated Tasks

Classification

Feature Type

Real

# Instances

10500000

# Features

27 (plus a mass feature in the 'not1000' and 'all' datasets)

Dataset Information

Additional Information

Machine learning is used in high-energy physics experiments to search for the signatures of exotic particles. These signatures are learned from Monte Carlo simulations of the collisions that produce these particles and the resulting decay products. In each of the three data sets here, the goal is to separate particle-producing collisions from a background source. The mass of the new particle is unknown, so three separate data sets are provided. In each data set, 50% of the data is from a signal process and 50% is from the background process. Each data set is split into a training set of 7 million examples and a test set of 3.5 million examples.

1) In the '1000' dataset, the signal particle has mass=1000. (Note: this dataset does not include a mass feature since all signal examples have the same mass.)
2) In the 'not1000' dataset, the signal particle's mass is drawn uniformly from the set {500, 750, 1250, 1500}. The mass is included as an input feature; for the background examples, the mass is selected randomly from this same set.
3) In the 'all' dataset, the signal particle's mass is drawn uniformly from the set {500, 750, 1000, 1250, 1500}. The mass is included as an input feature; for the background examples, the mass is selected randomly from this same set.
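As a rough illustration of the mass feature described above, the sketch below draws a mass value for each background example from the same set used for the signal. The function name `assign_background_mass`, the seed handling, and the choice of uniform sampling for the background are assumptions made for illustration; this is not the procedure that produced the published files.

```python
import random

# Mass sets taken from the dataset description above.
MASSES_NOT1000 = (500, 750, 1250, 1500)
MASSES_ALL = (500, 750, 1000, 1250, 1500)


def assign_background_mass(n_background, masses, seed=0):
    """Return one mass value per background example.

    Hypothetical helper: background collisions have no true signal mass,
    so each one is given a value picked at random from the same set as
    the signal. Uniform sampling is an assumption of this sketch.
    """
    rng = random.Random(seed)  # seeded so the draw is reproducible
    return [rng.choice(masses) for _ in range(n_background)]


# Example: masses for ten background examples in the 'not1000' setting.
example_masses = assign_background_mass(10, MASSES_NOT1000, seed=42)
```

Because signal and background share the same set of mass values, the mass column on its own is not meant to separate the two classes; it serves as a parameter alongside the other features.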

Has Missing Values?

No

Variables Table

Variable Name | Role | Type | Description | Units | Missing Values

(Rows 0 to 10 of 28 shown; only the Missing Values column is populated, with "no" in every row.)

Additional Variable Information

The first column is the class label (1 for signal, 0 for background), followed by the 27 normalized features (22 low-level features then 5 high-level features), and a 28th mass feature for datasets 2 and 3. See the original paper for more detailed information. There is a header line in each file.
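The column layout above can be read with the Python standard library alone. The following is a minimal sketch, assuming a local copy of one of the gzipped CSV files; the function name `read_hepmass` and its `limit` parameter are hypothetical and not part of any official loader.

```python
import csv
import gzip


def read_hepmass(path, limit=None):
    """Yield (label, features) pairs from a gzipped HEPMASS CSV file.

    Assumes the layout described above: one header line, then rows whose
    first column is the class label and whose remaining columns are the
    numeric features (with the mass last, when the dataset includes it).
    """
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle)
        next(reader)  # skip the header line present in each file
        for index, row in enumerate(reader):
            if limit is not None and index >= limit:
                break
            # The label may be written as a float, so parse it as one first.
            label = int(float(row[0]))
            features = [float(value) for value in row[1:]]
            yield label, features


# Usage (assumes all_train.csv.gz has been downloaded to the working directory):
#     for label, features in read_hepmass("all_train.csv.gz", limit=5):
#         print(label, len(features))
```

Reading row by row keeps memory use flat, which matters for files of this size; the `limit` argument allows a quick look at the first few rows without decompressing the whole file.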

Dataset Files

File | Size
all_train.csv.gz | 1.6 GB
not1000_train.csv.gz | 1.6 GB
1000_train.csv.gz | 1.6 GB
all_test.csv.gz | 839.6 MB
not1000_test.csv.gz | 838.9 MB

(5 of 6 files shown)


Creators

Daniel Whiteson
