Center for Machine Learning and Intelligent Systems

Deepfakes: Medical Image Tamper Detection Data Set
Download: Data Folder, Data Set Description

Abstract: Medical deepfakes: CT scans of human lungs, some of which have been tampered with by adding or removing cancer. Can you find them?

Data Set Characteristics: Multivariate
Number of Instances: 20000
Area: Computer
Attribute Characteristics: Real
Number of Attributes: 200000
Date Donated: 2020-03-11
Associated Tasks: Classification
Missing Values? N/A
Number of Web Hits: 2878


Source:

Yisroel Mirsky
@post.bgu.ac.il
Ben-Gurion University of the Negev


Data Set Information:

Attackers can use deep learning to intercept medical imagery and add or remove medical evidence with high realism. This dataset presents medical deepfakes: 3D CT scans of human lungs, some of which have been tampered with by removing real cancer or injecting fake cancer. The objective is to distinguish real cancers from fake ones and to identify where scans have been tampered with. Three expert radiologists evaluated this dataset and could not reliably tell real cancers from fake ones, which indicates that the fake cancers are realistic and that the detection task is very challenging. For more information, please see our paper 'CT-GAN'.


The dataset consists of two sets, one of 80 scans and one of 20 scans. The 80 scans were used in a blind trial with the radiologists (they were not told that any scans had been tampered with), and the 20 scans were used in an open trial (the radiologists were told about the tampering and asked to identify the tampered scans).

A table with the ground truth is provided with the scans. For each scan, the table gives each location (x, y, and z [slice#]) and its classification. A location can be classified as:
True-Benign (TB): A location that actually has no cancer.
True-Malicious (TM): A location that has real cancer.
False-Benign (FB): A location that had real cancer, but the cancer was removed.
False-Malicious (FM): A location that does not have cancer, but fake cancer was injected there.
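
The four classes can be read as two yes/no questions: was the location tampered with, and does the scan now show cancer there? The following is a minimal sketch of that reading in Python. The names `LABELS`, `is_tampered`, and `shows_cancer` and the example rows are hypothetical and are not part of the dataset or its code repository.

```python
from collections import Counter

# Class code -> (tampered, cancer visible in the scan as distributed).
# This mapping is our reading of the four class definitions above.
LABELS = {
    "TB": (False, False),  # untouched, no cancer
    "TM": (False, True),   # untouched, real cancer
    "FB": (True, False),   # real cancer was removed
    "FM": (True, True),    # fake cancer was injected
}


def is_tampered(label):
    """Return True if the location was modified (FB or FM)."""
    return LABELS[label][0]


def shows_cancer(label):
    """Return True if the scan shows cancer at the location (TM or FM)."""
    return LABELS[label][1]


# Hypothetical ground-truth rows: (x, y, slice, class code).
rows = [(120, 300, 45, "FM"), (210, 180, 90, "TB"), (330, 260, 77, "FB")]

# Count tampered versus untouched locations.
counts = Counter(is_tampered(label) for _, _, _, label in rows)
print("tampered:", counts[True], "untouched:", counts[False])
```

Splitting the class into two flags makes it easy to frame the task either as tamper detection (FB and FM versus TB and TM) or as real-versus-fake cancer classification (TM versus FM).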

Access to the dataset is via this link: [Web Link]


Attribute Information:

Each scan is in the medical DICOM format, but it can be loaded as a 3D matrix in Python using the tools provided in our code repository: [Web Link]

A scan is a series of 512x512 images. The series is usually about 100-300 slices long (the z-axis). A cancer can occupy multiple slices along the z-axis.
The value at each pixel is the radiodensity at that location, measured in Hounsfield units.
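
Because a cancer can span several slices, it is often convenient to cut a small cube around a ground-truth location instead of working with single slices. The sketch below computes the index ranges for such a cube and clips them at the scan borders. The function name `crop_bounds`, the cube size, and the (z, y, x) axis order are assumptions made for illustration; check the axis order against however the scan was actually loaded.

```python
def crop_bounds(x, y, z, shape, half=16):
    """Return (z, y, x) slices for a cube centred on a labelled location.

    shape is (num_slices, height, width). The (z, y, x) axis order is an
    assumption about how the scan was loaded, not something the dataset fixes.
    """
    bounds = []
    for centre, size in zip((z, y, x), shape):
        if not 0 <= centre < size:
            raise ValueError(f"coordinate {centre} is outside an axis of length {size}")
        lo = max(centre - half, 0)     # clip at the low border
        hi = min(centre + half, size)  # clip at the high border
        bounds.append(slice(lo, hi))
    return tuple(bounds)


# Hypothetical location near the border of a scan with 150 slices of 512x512 pixels.
zs, ys, xs = crop_bounds(x=500, y=256, z=5, shape=(150, 512, 512))
print(zs, ys, xs)
```

If the scan is held in a 3D array that accepts slice indexing, such as a NumPy array, the cube would then be `scan[zs, ys, xs]`. Near a border the cube is smaller than `2 * half` along the clipped axis, so pad it if a fixed input size is needed.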


Relevant Papers:

[Web Link]
Mirsky, Yisroel, et al. 'CT-GAN: Malicious tampering of 3D medical imagery using deep learning.' 28th USENIX Security Symposium (USENIX Security 19). 2019.
[Web Link]



Citation Request:

If you use this data, please cite:
Mirsky, Yisroel, et al. 'CT-GAN: Malicious tampering of 3D medical imagery using deep learning.' 28th USENIX Security Symposium (USENIX Security 19). 2019.

The original medical imagery is from:
Armato III, Samuel G., McLennan, Geoffrey, Bidaut, Luc, McNitt-Gray, Michael F., Meyer, Charles R., Reeves, Anthony P., … Clarke, Laurence P. (2015). Data From LIDC-IDRI. The Cancer Imaging Archive. [Web Link]
Published under the Creative Commons Attribution 3.0 Unported License ([Web Link])

