Glioma Grading Clinical and Mutation Features
Donated on 12/13/2022
Gliomas are the most common primary tumors of the brain. They are graded as LGG (Lower-Grade Glioma) or GBM (Glioblastoma Multiforme) according to histological and imaging criteria. Clinical and molecular/mutation factors are also crucial to the grading process, but the molecular tests needed to diagnose glioma patients accurately are expensive. This dataset considers the 20 most frequently mutated genes and 3 clinical features from the TCGA-LGG and TCGA-GBM brain glioma projects. The prediction task is to determine whether a patient is LGG or GBM from the given clinical and molecular/mutation features. The main objective is to find the optimal subset of mutation genes and clinical features for glioma grading, in order to improve performance and reduce costs.
Dataset Characteristics
Tabular, Multivariate
Subject Area
Health and Medicine
Associated Tasks
Classification, Other
Feature Type
Real, Categorical, Integer
# Instances
839
# Features
23
Dataset Information
For what purpose was the dataset created?
Gliomas are the most common primary tumors of the brain. They are graded as LGG (Lower-Grade Glioma) or GBM (Glioblastoma Multiforme) according to histological and imaging criteria. Clinical and molecular/mutation factors are also crucial to the grading process, but the molecular tests needed to diagnose glioma patients accurately are expensive. This dataset considers the 20 most frequently mutated genes and 3 clinical features from the TCGA-LGG and TCGA-GBM brain glioma projects. The prediction task is to determine whether a patient is LGG or GBM from the given clinical and molecular/mutation features. The main objective is to find the optimal subset of mutation genes and clinical features for glioma grading, in order to improve performance and reduce costs.
Who funded the creation of the dataset?
The Cancer Genome Atlas (TCGA) Project – NCI
What do the instances in this dataset represent?
In this dataset, the instances represent the records of patients who have brain glioma. The dataset was constructed based on TCGA-LGG and TCGA-GBM brain glioma projects. Each record is characterized by 20 molecular features (each of which can be mutated or not_mutated (wildtype) depending on the TCGA Case_ID) and 3 clinical features (concerning the demographics of the patient).
Are there recommended data splits?
No. We suggest 10-fold cross-validation for feature selection, classification, and related tasks.
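As a rough illustration of the suggested 10-fold cross-validation, the sketch below builds stratified folds using only the Python standard library. The function name `stratified_k_fold`, the seed, and the toy label list are assumptions made for this example; they are not part of the dataset or of any library. Stratification itself is also an assumption here, since the dataset description only asks for 10-fold cross-validation.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs whose class ratios are roughly preserved."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Deal each class's shuffled indices round-robin across the k folds.
    # The running offset keeps the overall fold sizes balanced.
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[(offset + position) % k].append(idx)
        offset += len(indices)

    # Each fold serves once as the test set; the rest form the training set.
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx


# Hypothetical labels (0 = LGG, 1 = GBM). The 60/40 balance is made up for
# illustration and is NOT the real class balance of the dataset.
labels = [0] * 60 + [1] * 40
splits = list(stratified_k_fold(labels, k=10))
```

In practice the `labels` list would be replaced by the Grade column, and a model would be fitted on each `train_idx` and scored on the matching `test_idx`.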
Does the dataset contain data that might be considered sensitive in any way?
Yes. The dataset includes the race, age, and gender of each patient.
Was there any data preprocessing performed?
Yes. The original and preprocessed files differ in the following ways:
- The original file contains 23 instances in which the Gender, Age_at_diagnosis, or Race value is ‘--’ or ‘not reported’. These instances were filtered out of the preprocessed dataset.
- The Project, Case_ID, and Primary_Diagnosis columns are present in the original dataset but are not included in the preprocessed dataset.
- Age_at_diagnosis values were converted from strings to continuous floating-point values by adding the day information to the corresponding year information.
Both the processed and unprocessed files are available in this directory. The additional columns of the original dataset file are:
- Project: the corresponding TCGA project name (TCGA-LGG or TCGA-GBM).
- Case_ID: the Case_ID of the record within the related project.
- Primary_Diagnosis: the type of primary diagnosis.
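The row filtering and age conversion described above might look roughly like the following sketch. It assumes that raw ages are strings of the form "51 years 108 days" and that days are converted using 365 days per year. Neither detail is stated in the description, and the sample rows are invented for illustration.

```python
import re

# Markers that the description says identify unusable demographic values.
MISSING_MARKERS = {"--", "not reported"}
DEMOGRAPHIC_COLUMNS = ("Gender", "Age_at_diagnosis", "Race")


def age_to_years(text, days_per_year=365.0):
    """Convert a string such as '51 years 108 days' to a float number of years.

    The input format and the 365-day divisor are assumptions of this sketch.
    """
    years = re.search(r"(\d+)\s*year", text)
    days = re.search(r"(\d+)\s*day", text)
    if years is None and days is None:
        raise ValueError(f"unrecognized age string: {text!r}")
    total = float(years.group(1)) if years else 0.0
    if days:
        total += int(days.group(1)) / days_per_year
    return total


def has_missing_demographics(row):
    """Return True if any demographic field holds a missing-value marker."""
    return any(
        row[col].strip().lower() in MISSING_MARKERS for col in DEMOGRAPHIC_COLUMNS
    )


# Invented example rows, not taken from the dataset.
rows = [
    {"Gender": "Male", "Age_at_diagnosis": "51 years 108 days", "Race": "white"},
    {"Gender": "--", "Age_at_diagnosis": "38 years 261 days", "Race": "white"},
    {"Gender": "Female", "Age_at_diagnosis": "--", "Race": "not reported"},
]

kept = [row for row in rows if not has_missing_demographics(row)]
ages = [age_to_years(row["Age_at_diagnosis"]) for row in kept]
```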
Has Missing Values?
No
Introductory Paper
By E. Tasci, Y. Zhuge, Harpreet Kaur, K. Camphausen, A. Krauze. 2022
Published in International Journal of Molecular Sciences
Variables Table
Variable Name | Role | Type | Demographic | Description | Units | Missing Values |
---|---|---|---|---|---|---|
Grade | Target | Categorical | | Glioma grade class information (0 = "LGG"; 1 = "GBM") | N/A | no |
Gender | Feature | Categorical | Gender | Gender (0 = "male"; 1 = "female") | N/A | no |
Age_at_diagnosis | Feature | Continuous | Age | Age at diagnosis with the calculated number of days | years | no |
Race | Feature | Categorical | Race | Race (0 = "white"; 1 = "black or african American"; 2 = "asian"; 3 = "american indian or alaska native") | N/A | no |
IDH1 | Feature | Categorical | | isocitrate dehydrogenase (NADP(+))1 (0 = NOT_MUTATED; 1 = MUTATED) | N/A | no |
TP53 | Feature | Categorical | | tumor protein p53 (0 = NOT_MUTATED; 1 = MUTATED) | N/A | no |
ATRX | Feature | Categorical | | ATRX chromatin remodeler (0 = NOT_MUTATED; 1 = MUTATED) | N/A | no |
PTEN | Feature | Categorical | | phosphatase and tensin homolog (0 = NOT_MUTATED; 1 = MUTATED) | N/A | no |
EGFR | Feature | Categorical | | epidermal growth factor receptor (0 = NOT_MUTATED; 1 = MUTATED) | N/A | no |
CIC | Feature | Categorical | | capicua transcriptional repressor (0 = NOT_MUTATED; 1 = MUTATED) | N/A | no |
Showing 10 of 24 variables.
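The integer codes listed in the variables table can be mapped back to readable labels with a small lookup, sketched below. The names `CODEBOOK` and `decode_record` are hypothetical, the sample record is invented, and only the variables shown in the table are covered.

```python
# Code-to-label mappings copied from the variables table.
MUTATION = {0: "NOT_MUTATED", 1: "MUTATED"}
CODEBOOK = {
    "Grade": {0: "LGG", 1: "GBM"},
    "Gender": {0: "male", 1: "female"},
    "Race": {
        0: "white",
        1: "black or african American",
        2: "asian",
        3: "american indian or alaska native",
    },
}

# Gene columns visible in the table; the remaining genes are not listed there.
for gene in ("IDH1", "TP53", "ATRX", "PTEN", "EGFR", "CIC"):
    CODEBOOK[gene] = MUTATION


def decode_record(record):
    """Replace coded values with labels; leave uncoded columns unchanged."""
    return {
        name: CODEBOOK[name][value] if name in CODEBOOK else value
        for name, value in record.items()
    }


# Invented record for illustration only.
record = {"Grade": 1, "Gender": 0, "Race": 2, "IDH1": 0, "TP53": 1,
          "Age_at_diagnosis": 51.3}
decoded = decode_record(record)
```

Continuous columns such as Age_at_diagnosis pass through untouched because they have no entry in the lookup.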
Dataset Files
File | Size |
---|---|
TCGA_GBM_LGG_Mutations_all.csv | 258.8 KB |
TCGA_InfoWithGrade.csv | 42.7 KB |
pip install ucimlrepo
from ucimlrepo import fetch_ucirepo

# fetch dataset
glioma_grading_clinical_and_mutation_features = fetch_ucirepo(id=759)

# data (as pandas dataframes)
X = glioma_grading_clinical_and_mutation_features.data.features
y = glioma_grading_clinical_and_mutation_features.data.targets

# metadata
print(glioma_grading_clinical_and_mutation_features.metadata)

# variable information
print(glioma_grading_clinical_and_mutation_features.variables)
Tasci, E., Camphausen, K., Krauze, A., & Zhuge, Y. (2022). Glioma Grading Clinical and Mutation Features [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5R62J.
Creators
Erdal Tasci
erdal.tasci@nih.gov
Radiation Oncology Branch (ROB), National Cancer Institute (NCI), National Institutes of Health (NIH), Building 10
Kevin Camphausen
camphauk@mail.nih.gov
Radiation Oncology Branch (ROB), National Cancer Institute (NCI), National Institutes of Health (NIH), Building 10
Andra Valentina Krauze
andra.krauze@nih.gov
Radiation Oncology Branch (ROB), National Cancer Institute (NCI), National Institutes of Health (NIH), Building 10
Ying Zhuge
zhugey@mail.nih.gov
Radiation Oncology Branch (ROB), National Cancer Institute (NCI), National Institutes of Health (NIH), Building 10
DOI
10.24432/C5R62J
License
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.