SUPPORT2
Linked on 9/14/2023
This dataset comprises 9105 critically ill patients across 5 United States medical centers, accessioned during 1989-1991 and 1992-1994. Each row is the record of a hospitalized patient who met the inclusion and exclusion criteria for one of nine disease categories: acute respiratory failure, chronic obstructive pulmonary disease, congestive heart failure, liver disease, coma, colon cancer, lung cancer, multiple organ system failure with malignancy, and multiple organ system failure with sepsis. The goal is to determine these patients' 2- and 6-month survival rates based on physiologic, demographic, and disease severity information. The problem is important because it addresses the growing national concern over patients' loss of control near the end of life. It enables earlier decisions and planning to reduce the frequency of a mechanical, painful, and prolonged dying process.
Dataset Characteristics
Tabular, Multivariate
Subject Area
Health and Medicine
Associated Tasks
Classification, Regression, Other
Feature Type
Real, Categorical, Integer
# Instances
9105
# Features
42
Dataset Information
For what purpose was the dataset created?
To develop and validate a prognostic model that estimates survival over a 180-day period for seriously ill hospitalized adults (phase I of SUPPORT) and to compare this model's predictions with those of an existing prognostic system and with physicians' independent estimates (SUPPORT phase II).
Who funded the creation of the dataset?
Funded by the Robert Wood Johnson Foundation
What do the instances in this dataset represent?
The instances represent records of critically ill patients admitted to United States hospitals with advanced stages of serious illness.
Are there recommended data splits?
No specific split is recommended; a standard train-test split can be used. For model selection, a three-way holdout split (i.e., train-validation-test) can be used.
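A three-way holdout split can be sketched with the standard library alone. The function name `three_way_split`, the 15% validation and test fractions, and the seed are illustrative assumptions, not recommendations from the dataset authors.

```python
import random


def three_way_split(n_rows, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle row indices and cut them into train/validation/test lists.

    The fractions and the seed are hypothetical defaults chosen for illustration.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded shuffle for reproducibility

    n_test = int(n_rows * test_frac)
    n_val = int(n_rows * val_frac)

    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]  # everything left over
    return train_idx, val_idx, test_idx


# 9105 is the number of instances reported for this dataset.
train_idx, val_idx, test_idx = three_way_split(9105)
print(len(train_idx), len(val_idx), len(test_idx))
```

The returned index lists can be passed to `DataFrame.iloc` to select the corresponding rows of the features and targets.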
Does the dataset contain data that might be considered sensitive in any way?
Yes. There is information about race, gender, income, and education level.
Was there any data preprocessing performed?
No. Due to the high percentage of missing values, fill-in values are recommended for some variables. According to the HBiostat Repository (https://hbiostat.org/data/repo/supportdesc, Professor Frank Harrell), the following default values have been found to be useful in imputing missing baseline physiologic data:
Baseline Variable | Normal Fill-in Value |
---|---|
Serum albumin (alb) | 3.5 |
PaO2/FiO2 ratio (pafi) | 333.3 |
Bilirubin (bili) | 1.01 |
Creatinine (crea) | 1.01 |
bun | 6.51 |
White blood count (wblc) | 9 (thousands) |
Urine output (urine) | 2502 |
There are 159 patients surviving 2 months for whom there were no patient or surrogate interviews. These patients have missing sfdm2.
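The fill-in values listed above could be applied as follows. This is a minimal sketch that assumes the data is held in a pandas DataFrame whose column names match the abbreviations given in the text; the helper name `impute_baseline` is hypothetical.

```python
import pandas as pd

# Fill-in values copied from the list above; keys are the variable
# abbreviations as written there.
FILL_IN_VALUES = {
    "alb": 3.5,
    "pafi": 333.3,
    "bili": 1.01,
    "crea": 1.01,
    "bun": 6.51,
    "wblc": 9.0,
    "urine": 2502.0,
}


def impute_baseline(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with missing baseline physiologic values filled in.

    Only columns that are present in df and listed in FILL_IN_VALUES are
    touched; every other column keeps its missing values.
    """
    present = {col: val for col, val in FILL_IN_VALUES.items() if col in df.columns}
    return df.fillna(value=present)


# Toy example with made-up rows (not taken from the dataset).
toy = pd.DataFrame({"alb": [2.9, None], "bun": [None, 12.0], "age": [63.0, None]})
print(impute_baseline(toy))
```

Because `fillna` returns a new DataFrame, the original data stays unchanged, which makes it easy to compare models trained with and without imputation.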
Additional Information
Data sources are medical records, personal interviews, and the National Death Index (NDI). For each patient, administrative records, clinical data, and survey data were collected. The objective of the SUPPORT project was to improve decision-making in order to address the growing national concern over the loss of control that patients have near the end of life and to reduce the frequency of a mechanical, painful, and prolonged process of dying. SUPPORT comprised a two-year prospective observational study (Phase I) followed by a two-year controlled clinical trial (Phase II). Phase I of SUPPORT collected data from patients accessioned during 1989-1991 to characterize the care, treatment preferences, and patterns of decision-making among critically ill patients. It also served as a preliminary step for devising an intervention strategy for improving critically ill patients' care and for the construction of statistical models for predicting patient prognosis and functional status. An intervention was implemented in Phase II of SUPPORT, which accessioned patients during 1992-1994. The Phase II intervention provided physicians with accurate predictive information on future functional ability, survival probability to six months, and patients' preferences for end-of-life care. Additionally, a skilled nurse was provided as part of the intervention to elicit patient preferences, provide prognoses, enhance understanding, enable palliative care, and facilitate advance planning. The intervention was expected to increase communication, resulting in earlier decisions to have orders against resuscitation, decrease time that patients spent in undesirable states (e.g., in the Intensive Care Unit, on a ventilator, and in a coma), increase physician understanding of patients' preferences for care, decrease patient pain, and decrease hospital resource use.
Data collection in both phases of SUPPORT consisted of questionnaires administered to patients, their surrogates, and physicians, plus chart reviews for abstracting clinical, treatment, and decision information. Phase II also collected information regarding the implementation of the intervention, such as patient-specific logs maintained by nurses assigned to patients as part of the intervention. SUPPORT patients were followed for six months after inclusion in the study. Those who did not die within six months or were lost to follow-up were matched against the National Death Index to identify deaths through 1997. Patients who did not die within one year or were lost to follow-up were matched against the National Death Index to identify deaths through 1997. The study population comprised all patients in five United States medical centers who met inclusion and exclusion criteria for nine disease categories: acute respiratory failure, chronic obstructive pulmonary disease, congestive heart failure, liver disease, coma, colon cancer, lung cancer, multiple organ system failure with malignancy, and multiple organ system failure with sepsis. SUPPORT is a combination of patients from 2 studies, each of which lasted 2 years. The first phase concerns 4,301 patients, whereas the second phase concerns 4,804 patients. Patients were accessioned from June 12, 1989 through June 11, 1991 for phase I and from January 7, 1992 through January 24, 1994 for phase II.
Has Missing Values?
Yes
Introductory Paper
By The SUPPORT Principal Investigators. 1995
Published in the Journal of the American Medical Association, 274(20):1591–1598
Variables Table
Variable Name | Role | Type | Demographic | Description | Units | Missing Values |
---|---|---|---|---|---|---|
id | ID | Integer | | | | no |
age | Feature | Continuous | Age | Age of the patients in years | years | no |
death | Target | Continuous | | Death at any time up to National Death Index (NDI) data on December 31, 1994. Some patients are discharged before the end of the study and are not followed up. The authors looked up the information about death. | | no |
sex | Feature | Categorical | Sex | Gender of the patient. Listed values are {male, female}. | | no |
hospdead | Target | Binary | | Death in hospital | | no |
slos | Other | Continuous | | Days from Study Entry to Discharge | | no |
d.time | Other | Continuous | | Days of follow-up | | no |
dzgroup | Feature | Categorical | | The patient's disease subcategory amongst ARF/MOSF w/Sepsis, CHF, COPD, Cirrhosis, Colon Cancer, Coma, Lung Cancer, MOSF w/Malig. | | no |
dzclass | Feature | Categorical | | The patient's disease category amongst "ARF/MOSF", "COPD/CHF/Cirrhosis", "Cancer", "Coma". | | no |
num.co | Feature | Continuous | | The number of simultaneous diseases (or comorbidities) exhibited by the patient. Values are ordinal with higher values indicating worse condition and chances of survival. | | no |
Showing the first 10 of 48 variables.
Additional Variable Information
Class Labels
According to the HBiostat Repository (https://hbiostat.org/data/repo/supportdesc, Professor Frank Harrell), the following tasks have been found to be useful for education purposes:
- Binary classification: Hospital death.
- Ordinal regression: The functional disability of the patient (variable sfdm2) on a 5-point scale (with 5 being the most severely disabled), measured 2 months after study entry through patient or surrogate interviews. It uses the Sickness Impact Profile (SIP), a behavior-based measure of health status. The variable has 5 levels mapped as follows:
  1: No signs of moderate to severe functional disability from the interview.
  2: Patient was unable to do 4 or more activities of daily living.
  3: Sickness Impact Profile total score at 2 months is greater than or equal to 30.
  4: Patient intubated or in coma.
  5: Patient died before 2 months after study entry.
  For more detail on the scale used, refer to https://www.sciencedirect.com/science/article/pii/089543569090224D?via%3Dihub
- Regression: The total hospital costs per patient or the length of stay for the patients can be predicted.
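For the ordinal task, one common way to keep the ordering of the five sfdm2 levels is to turn each level into cumulative binary targets ("is the level greater than k?"). The sketch below illustrates that general encoding; it is not described in the source, it assumes the levels are already coded as the integers 1 to 5, and the names `SFDM2_LEVELS` and `cumulative_targets` are hypothetical.

```python
# Level descriptions as given in the text above.
SFDM2_LEVELS = {
    1: "No signs of moderate to severe functional disability",
    2: "Unable to do 4 or more activities of daily living",
    3: "Sickness Impact Profile total score >= 30 at 2 months",
    4: "Intubated or in coma",
    5: "Died before 2 months after study entry",
}


def cumulative_targets(level, n_levels=5):
    """Encode an ordinal level as n_levels - 1 binary indicators.

    Indicator k (k = 1 .. n_levels - 1) is 1 when level > k, so a more
    severe level always switches on a superset of the indicators of a
    less severe one.
    """
    if level not in range(1, n_levels + 1):
        raise ValueError(f"level must be in 1..{n_levels}, got {level!r}")
    return [int(level > k) for k in range(1, n_levels)]


for level, description in SFDM2_LEVELS.items():
    print(level, cumulative_targets(level), description)
```

Each of the four indicators can then be fit with an ordinary binary classifier, which is one simple alternative to a dedicated ordinal regression model.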
pip install ucimlrepo
from ucimlrepo import fetch_ucirepo

# fetch dataset
support2 = fetch_ucirepo(id=880)

# data (as pandas dataframes)
X = support2.data.features
y = support2.data.targets

# metadata
print(support2.metadata)

# variable information
print(support2.variables)
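Once the features are loaded as a pandas DataFrame, the share of missing values per column can be checked before choosing an imputation strategy. This is a minimal sketch; the helper name `missing_share` is hypothetical and the toy rows are made up so the block runs without downloading the dataset.

```python
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Fraction of missing entries in each column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


# Made-up rows for illustration only.
toy = pd.DataFrame({
    "alb": [3.1, None, None, 2.8],
    "age": [63.0, 71.5, 48.2, 80.1],
})
print(missing_share(toy))
```

With the dataset fetched as shown above, `missing_share(X)` would give the same summary for the real feature table.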
Harrell, F. (1995). SUPPORT2 [Dataset]. UCI Machine Learning Repository. https://doi.org/10.3886/ICPSR02957.v2.
Citations/Acknowledgements
If you use this dataset, please follow the acknowledgment policy on the original dataset website.
Creators
Frank Harrell
fh@fharrell.com
Department of Biostatistics