Glioma Grading Clinical and Mutation Features

Donated on 12/13/2022

Gliomas are the most common primary tumors of the brain. They can be graded as LGG (Lower-Grade Glioma) or GBM (Glioblastoma Multiforme) depending on histological/imaging criteria. Clinical and molecular/mutation factors are also crucial for the grading process, but the molecular tests that help diagnose glioma patients accurately are expensive. In this dataset, the 20 most frequently mutated genes and 3 clinical features are considered from the TCGA-LGG and TCGA-GBM brain glioma projects. The prediction task is to determine whether a patient is LGG or GBM from the given clinical and molecular/mutation features. The main objective is to find the optimal subset of mutation genes and clinical features for the glioma grading process, in order to improve performance and reduce costs.

Dataset Characteristics

Tabular, Multivariate

Subject Area

Health and Medicine

Associated Tasks

Classification, Other

Feature Type

Real, Categorical, Integer

# Instances

839

# Features

23

Dataset Information

Who funded the creation of the dataset?

The Cancer Genome Atlas (TCGA) Project – NCI

What do the instances in this dataset represent?

The instances in this dataset are the records of patients who have brain glioma. The dataset was constructed from the TCGA-LGG and TCGA-GBM brain glioma projects. Each record is characterized by 20 molecular features (each of which is either mutated or not_mutated (wildtype) for the given TCGA Case_ID) and 3 clinical features (concerning the demographics of the patient).

Are there recommended data splits?

No. We suggest 10-fold cross-validation for feature selection, classification, etc.
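
As an illustration of the suggested 10-fold cross-validation, here is a minimal standard-library sketch that produces stratified train/test index splits. Stratifying by Grade is an assumption on our part (the dataset page only says "10-fold"), and the function name `stratified_k_fold` and the toy labels are hypothetical. In practice a library routine such as scikit-learn's `StratifiedKFold` would normally be used instead.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for stratified k-fold cross-validation.

    Each class's indices are shuffled and dealt round-robin into k folds,
    so every fold keeps roughly the same class ratio as the full label list.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0  # continue dealing where the previous class stopped
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[(pos + offset) % k].append(idx)
        offset += len(indices)

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx


# Toy labels only (0 = LGG, 1 = GBM); NOT the real class distribution.
toy_labels = [0] * 60 + [1] * 40
splits = list(stratified_k_fold(toy_labels, k=10))
```

Each element of `splits` is one fold: fit feature selection and the classifier on `train_idx` only, then score on `test_idx`, so that the selected feature subset is never tuned on the held-out patients.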

Does the dataset contain data that might be considered sensitive in any way?

There is information about race, age, and gender of the patient.

Was there any data preprocessing performed?

Yes. The original and preprocessed files differ in the following ways:

- The original file contains 23 instances in which the Gender, Age_at_diagnosis, or Race value is ‘--’ or ‘not reported’. These instances were filtered out of the preprocessed dataset.
- The Project, Case_ID, and Primary_Diagnosis columns are present in the original dataset but are not included in the preprocessed dataset.
- Age_at_diagnosis values were converted from strings to continuous floating-point values by adding the day information to the corresponding year information.

Both the processed and unprocessed files are available in this directory. The additional columns of the original dataset file are:

- Project: the corresponding TCGA-LGG or TCGA-GBM project name.
- Case_ID: the Case_ID information of the related project.
- Primary_Diagnosis: information on the type of primary diagnosis.
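
The filtering and age-conversion steps described above could be sketched roughly as follows. This is not the authors' preprocessing code. It assumes that raw ages look like "51 years 108 days" and that a year is counted as 365 days; neither detail is stated on this page. The names `age_to_years` and `is_complete` are hypothetical.

```python
import re

# Assumed raw format, e.g. "51 years 108 days"; either part may be absent.
_AGE_PATTERN = re.compile(r"^\s*(?:(\d+)\s+years?)?\s*(?:(\d+)\s+days?)?\s*$")

# Markers that the dataset page says denote unusable clinical values.
MISSING_MARKERS = {"--", "not reported"}


def age_to_years(raw, days_per_year=365.0):
    """Convert an age string to fractional years (days_per_year is an assumption)."""
    match = _AGE_PATTERN.match(raw)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognised age string: {raw!r}")
    years = int(match.group(1) or 0)
    days = int(match.group(2) or 0)
    return years + days / days_per_year


def is_complete(record):
    """Return True if none of the three clinical fields holds a missing marker."""
    return all(
        record[field].strip().lower() not in MISSING_MARKERS
        for field in ("Gender", "Age_at_diagnosis", "Race")
    )
```

A record would first be checked with `is_complete` and dropped if it fails; only then would `age_to_years` be applied, since the missing markers are deliberately rejected by the age parser.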

Has Missing Values?

No

Introductory Paper

Hierarchical Voting-Based Feature Selection and Ensemble Learning Model Scheme for Glioma Grading with Clinical and Molecular Characteristics

By E. Tasci, Y. Zhuge, Harpreet Kaur, K. Camphausen, A. Krauze. 2022

Published in International Journal of Molecular Sciences

Variables Table

Variable Name | Role | Type | Demographic | Description | Units | Missing Values
--- | --- | --- | --- | --- | --- | ---
Grade | Target | Categorical | | Glioma grade class information (0 = "LGG"; 1 = "GBM") | N/A | no
Gender | Feature | Categorical | Gender | Gender (0 = "male"; 1 = "female") | N/A | no
Age_at_diagnosis | Feature | Continuous | Age | Age at diagnosis with the calculated number of days | years | no
Race | Feature | Categorical | Race | Race (0 = "white"; 1 = "black or african American"; 2 = "asian"; 3 = "american indian or alaska native") | N/A | no
IDH1 | Feature | Categorical | | isocitrate dehydrogenase (NADP(+)) 1 (0 = NOT_MUTATED; 1 = MUTATED) | N/A | no
TP53 | Feature | Categorical | | tumor protein p53 (0 = NOT_MUTATED; 1 = MUTATED) | N/A | no
ATRX | Feature | Categorical | | ATRX chromatin remodeler (0 = NOT_MUTATED; 1 = MUTATED) | N/A | no
PTEN | Feature | Categorical | | phosphatase and tensin homolog (0 = NOT_MUTATED; 1 = MUTATED) | N/A | no
EGFR | Feature | Categorical | | epidermal growth factor receptor (0 = NOT_MUTATED; 1 = MUTATED) | N/A | no
CIC | Feature | Categorical | | capicua transcriptional repressor (0 = NOT_MUTATED; 1 = MUTATED) | N/A | no

Only the first 10 of 24 variables are shown.
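
Because every categorical variable is stored as an integer code, a small lookup can make records readable again. The sketch below uses only the codes given in the Variables Table; the helper name `decode_record` is hypothetical, and `GENE_COLUMNS` covers only the six genes listed above, not all 20.

```python
GRADE_LABELS = {0: "LGG", 1: "GBM"}
GENDER_LABELS = {0: "male", 1: "female"}
RACE_LABELS = {
    0: "white",
    1: "black or african American",
    2: "asian",
    3: "american indian or alaska native",
}
MUTATION_LABELS = {0: "NOT_MUTATED", 1: "MUTATED"}

# Only the genes visible in the Variables Table; the other 14 are not listed here.
GENE_COLUMNS = ("IDH1", "TP53", "ATRX", "PTEN", "EGFR", "CIC")

CODE_BOOK = {"Grade": GRADE_LABELS, "Gender": GENDER_LABELS, "Race": RACE_LABELS}
CODE_BOOK.update({gene: MUTATION_LABELS for gene in GENE_COLUMNS})


def decode_record(record):
    """Replace integer codes with their labels; other fields pass through unchanged."""
    return {
        name: CODE_BOOK[name][int(value)] if name in CODE_BOOK else value
        for name, value in record.items()
    }


# Made-up record for illustration only.
example = decode_record(
    {"Grade": 1, "Gender": 0, "Age_at_diagnosis": 51.3, "Race": 2, "IDH1": 1, "TP53": 0}
)
```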

Dataset Files

File | Size
--- | ---
TCGA_GBM_LGG_Mutations_all.csv | 258.8 KB
TCGA_InfoWithGrade.csv | 42.7 KB
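
A minimal sketch of reading the preprocessed file and separating the Grade target from the features, using only the standard library. It assumes that TCGA_InfoWithGrade.csv is the preprocessed file, that it has a header row whose names match the Variables Table, and that every value is numeric; none of this is confirmed on this page. The function name `load_features_and_target` and the in-memory sample are hypothetical.

```python
import csv
import io


def load_features_and_target(handle, target="Grade"):
    """Read CSV rows from an open text handle; return (features, targets)."""
    features, targets = [], []
    for row in csv.DictReader(handle):
        targets.append(int(row[target]))
        features.append(
            {name: float(value) for name, value in row.items() if name != target}
        )
    return features, targets


# Tiny made-up sample standing in for the real file.
sample = io.StringIO(
    "Grade,Gender,Age_at_diagnosis,Race,IDH1\n"
    "0,0,51.3,0,1\n"
    "1,1,62.0,2,0\n"
)
X, y = load_features_and_target(sample)
```

With the real file, the same function would be called as `load_features_and_target(open("TCGA_InfoWithGrade.csv", newline=""))`, ideally inside a `with` block so the file is closed afterwards.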


Creators

Erdal Tasci

erdal.tasci@nih.gov

Radiation Oncology Branch (ROB), National Cancer Institute (NCI), National Institutes of Health (NIH), Building 10

Kevin Camphausen

camphauk@mail.nih.gov

Radiation Oncology Branch (ROB), National Cancer Institute (NCI), National Institutes of Health (NIH), Building 10

Andra Valentina Krauze

andra.krauze@nih.gov

Radiation Oncology Branch (ROB), National Cancer Institute (NCI), National Institutes of Health (NIH), Building 10

Ying Zhuge

zhugey@mail.nih.gov

Radiation Oncology Branch (ROB), National Cancer Institute (NCI), National Institutes of Health (NIH), Building 10

License
