Glioma Grading Clinical and Mutation Features

Donated on 12/13/2022

Gliomas are the most common primary tumors of the brain. They can be graded as LGG (Lower-Grade Glioma) or GBM (Glioblastoma Multiforme) depending on histological and imaging criteria. Clinical and molecular/mutation factors are also crucial to the grading process. The molecular tests that help diagnose glioma patients accurately are expensive. In this dataset, the 20 most frequently mutated genes and 3 clinical features are considered from the TCGA-LGG and TCGA-GBM brain glioma projects. The prediction task is to determine whether a patient is LGG or GBM from the given clinical and molecular/mutation features. The main objective is to find the optimal subset of mutation genes and clinical features for the glioma grading process, improving performance and reducing costs.

Dataset Characteristics

Tabular, Multivariate

Subject Area

Health and Medicine

Associated Tasks

Classification, Other

Feature Type

Real, Categorical, Integer

# Instances

839

# Features

Dataset Information

For what purpose was the dataset created?

Who funded the creation of the dataset?

The Cancer Genome Atlas (TCGA) Project – NCI

What do the instances in this dataset represent?

In this dataset, the instances represent the records of patients who have brain glioma. The dataset was constructed based on TCGA-LGG and TCGA-GBM brain glioma projects. Each record is characterized by 20 molecular features (each of which can be mutated or not_mutated (wildtype) depending on the TCGA Case_ID) and 3 clinical features (concerning the demographics of the patient).

Are there recommended data splits?

No. We suggest 10-fold cross-validation for feature selection, classification etc.
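As an illustration of the suggested protocol, the following is a minimal sketch of a 10-fold split over the 839 instances using only the Python standard library. The function name `k_fold_indices` and the fixed seed are assumptions made for this example, and the split is not stratified by Grade; it is not necessarily the procedure used by the dataset creators.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Shuffle once with a fixed seed so the folds are reproducible.
    random.Random(seed).shuffle(indices)
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

# 839 is the instance count reported for this dataset.
folds = list(k_fold_indices(839, k=10))
```

Each instance appears in exactly one test fold. For a classification task such as this one, a stratified split that preserves the LGG/GBM ratio in every fold may be preferable.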

Does the dataset contain data that might be considered sensitive in any way?

There is information about race, age, and gender of the patient.

Was there any data preprocessing performed?

Yes. The original and preprocessed files differ in the following ways:
- There are 23 instances in the original file where the Gender, Age_at_diagnosis, or Race feature values are ‘--’ or ‘not reported’. These instances were filtered out of the preprocessed dataset.
- Although present in the original dataset, the Project, Case_ID, and Primary_Diagnosis columns are not included in the preprocessed dataset.
- Age_at_diagnosis feature values were converted from strings to continuous values by adding the day information to the corresponding year information as a floating-point number.

Both the processed and unprocessed files are available in this directory. The additional columns of the original dataset file are described below:
- Project: the corresponding TCGA-LGG or TCGA-GBM project name.
- Case_ID: the related project Case_ID information.
- Primary_Diagnosis: information related to the type of primary diagnosis.
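The preprocessing steps described above can be sketched as follows. This is a hedged illustration, not the creators' actual script: the raw age format ("51 years 108 days"), the 365-day divisor, and the helper names `age_to_years` and `preprocess` are all assumptions made for the example, and the sample rows are invented.

```python
import re

# Placeholder values that mark a record as incomplete (from the description).
MISSING = {"--", "not reported"}
# Columns present only in the original file.
DROPPED_COLUMNS = ("Project", "Case_ID", "Primary_Diagnosis")

def age_to_years(text):
    """Convert a string such as '51 years 108 days' (assumed format) to float years."""
    years = re.search(r"(\d+)\s*years?", text)
    days = re.search(r"(\d+)\s*days?", text)
    if years is None:
        raise ValueError(f"unrecognized age string: {text!r}")
    day_count = int(days.group(1)) if days else 0
    # Assumption: days are converted to a fraction of a 365-day year.
    return int(years.group(1)) + day_count / 365.0

def preprocess(rows):
    """Drop incomplete records, remove extra columns, and convert the age."""
    cleaned = []
    for row in rows:
        if any(row[c] in MISSING for c in ("Gender", "Age_at_diagnosis", "Race")):
            continue  # filtered out, as described for the 23 incomplete instances
        out = {k: v for k, v in row.items() if k not in DROPPED_COLUMNS}
        out["Age_at_diagnosis"] = age_to_years(row["Age_at_diagnosis"])
        cleaned.append(out)
    return cleaned

# Invented sample rows, for illustration only.
sample = [
    {"Project": "TCGA-LGG", "Case_ID": "hypothetical-1", "Primary_Diagnosis": "example",
     "Gender": "Male", "Age_at_diagnosis": "51 years 108 days", "Race": "white"},
    {"Project": "TCGA-GBM", "Case_ID": "hypothetical-2", "Primary_Diagnosis": "example",
     "Gender": "--", "Age_at_diagnosis": "--", "Race": "not reported"},
]
result = preprocess(sample)
```

Only the first sample row survives, with its age stored as a single floating-point value and the three extra columns removed.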

Has Missing Values?

Introductory Paper

Hierarchical Voting-Based Feature Selection and Ensemble Learning Model Scheme for Glioma Grading with Clinical and Molecular Characteristics

By E. Tasci, Y. Zhuge, Harpreet Kaur, K. Camphausen, A. Krauze. 2022

Published in International Journal of Molecular Sciences

Variables Table

Variable Name	Role	Type	Demographic	Description	Units	Missing Values
Grade	Target	Categorical		Glioma grade class information (0 = "LGG"; 1 = "GBM")	N/A	no
Gender	Feature	Categorical	Gender	Gender (0 = "male"; 1 = "female")	N/A	no
Age_at_diagnosis	Feature	Continuous	Age	Age at diagnosis with the calculated number of days	years	no
Race	Feature	Categorical	Race	Race (0 = "white"; 1 = "black or african American"; 2 = "asian"; 3 = "american indian or alaska native")	N/A	no
IDH1	Feature	Categorical		isocitrate dehydrogenase (NADP(+))1 (0 = NOT_MUTATED; 1 = MUTATED)	N/A	no
TP53	Feature	Categorical		tumor protein p53 (0 = NOT_MUTATED; 1 = MUTATED)	N/A	no
ATRX	Feature	Categorical		ATRX chromatin remodeler (0 = NOT_MUTATED; 1 = MUTATED)	N/A	no
PTEN	Feature	Categorical		phosphatase and tensin homolog (0 = NOT_MUTATED; 1 = MUTATED)	N/A	no
EGFR	Feature	Categorical		epidermal growth factor receptor (0 = NOT_MUTATED; 1 = MUTATED)	N/A	no
CIC	Feature	Categorical		capicua transcriptional repressor (0 = NOT_MUTATED; 1 = MUTATED)	N/A	no
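
The integer codes in the table above can be mapped back to their labels with a small codebook. The sketch below covers only the variables listed in this table; the names `CODEBOOK` and `decode_record` and the example record are hypothetical, introduced for illustration.

```python
# Code-to-label mappings copied from the Variables Table.
GRADE = {0: "LGG", 1: "GBM"}
GENDER = {0: "male", 1: "female"}
RACE = {
    0: "white",
    1: "black or african American",
    2: "asian",
    3: "american indian or alaska native",
}
MUTATION = {0: "NOT_MUTATED", 1: "MUTATED"}
# Only the gene columns shown in the table above.
GENES = ("IDH1", "TP53", "ATRX", "PTEN", "EGFR", "CIC")

CODEBOOK = {"Grade": GRADE, "Gender": GENDER, "Race": RACE}
CODEBOOK.update({gene: MUTATION for gene in GENES})

def decode_record(record):
    """Replace integer codes with labels; leave other fields unchanged."""
    decoded = {}
    for name, value in record.items():
        if name in CODEBOOK:
            # Values read from a CSV file arrive as strings, so cast first.
            decoded[name] = CODEBOOK[name][int(value)]
        else:
            decoded[name] = value
    return decoded

# Invented record, for illustration only.
example = {"Grade": "0", "Gender": "1", "Age_at_diagnosis": "38.72",
           "Race": "0", "IDH1": "1", "TP53": "0"}
decoded = decode_record(example)
```

Fields without a codebook entry, such as Age_at_diagnosis, pass through untouched.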

Dataset Files

File	Size
TCGA_GBM_LGG_Mutations_all.csv	258.8 KB
TCGA_InfoWithGrade.csv	42.7 KB

Keywords

Mutation Tumor grading

Creators

Erdal Tasci

erdal.tasci@nih.gov

Radiation Oncology Branch (ROB), National Cancer Institute (NCI), National Institutes of Health (NIH), Building 10

Kevin Camphausen

camphauk@mail.nih.gov

Radiation Oncology Branch (ROB), National Cancer Institute (NCI), National Institutes of Health (NIH), Building 10

Andra Valentina Krauze

andra.krauze@nih.gov

Radiation Oncology Branch (ROB), National Cancer Institute (NCI), National Institutes of Health (NIH), Building 10

Ying Zhuge

zhugey@mail.nih.gov

Radiation Oncology Branch (ROB), National Cancer Institute (NCI), National Institutes of Health (NIH), Building 10

DOI

10.24432/C5R62J

License

This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

This allows for the sharing and adaptation of the datasets for any purpose, provided that the appropriate credit is given.
