libuplift.datasets.Gusto#
The GUSTO-I clinical trial dataset.
This is a randomized clinical trial of thrombolytic strategies for acute myocardial infarctions.
This version is from the predtools R package. See: https://cran.r-project.org/web/packages/predtools/index.html for details.
The study results were originally published in https://www.nejm.org/doi/10.1056/NEJM199309023291001
The specific dataset used here together with a logistic model is described in https://www.ahajournals.org/doi/full/10.1161/01.cir.91.6.1659 and in the book https://www.clinicalpredictionmodels.org/extra-material/chapter-22
Functions#
| fetch_GUSTO | Load the GUSTO-I clinical trial dataset. |
Module Contents#
- libuplift.datasets.Gusto.fetch_GUSTO(include_location_vars=True, include_ttr=False, data_home=None, download_if_missing=True, random_state=None, shuffle=False, categ_as_strings=False, return_X_y=False, as_frame=False)
Load the GUSTO-I clinical trial dataset.
Download it if necessary.
This is a randomized clinical trial dataset of thrombolytic strategies for acute myocardial infarctions.
WARNING: the ttr attribute breaks randomization, so it is excluded by default. Pass include_ttr=True to add it, e.g. to obtain a nonrandom treatment assignment that is easy to control.
This version comes from the predtools R package. See: https://cran.r-project.org/web/packages/predtools/index.html for details.
The study results were originally published in https://www.nejm.org/doi/10.1056/NEJM199309023291001
The specific dataset used here together with a logistic model is described in https://www.ahajournals.org/doi/full/10.1161/01.cir.91.6.1659 and in the book https://www.clinicalpredictionmodels.org/extra-material/chapter-22
Variables
- day30 (target)
death within 30 days
- sho
whether cardiac shock was present
- hig
indicator of non-anterior MI location
- dia
diabetes
- hyp
high blood pressure indicator, seems to indicate sysbp >= 100 (5 exceptions)
- hrt
tachycardia (indicator of pulse>80)
- ttr
Time To Relief of chest pain > 1h
- sex
patient’s sex
- Killip
Killip Class (I, II, III, IV)
- age
patient’s age [years]
- ste
number of ECG leads with ST Elevation
- pulse
Heart Rate [beats/min]
- sysbp
Systolic Blood Pressure [mmHg]
- miloc
MI Location (Inferior, Anterior, Other)
- height
patient’s height [cm]
- weight
patient’s weight [kg]
- pmi
previous MI
- htn
history of hypertension
- smk
smoking status (never, quit, current)
- pan
previous angina pectoris
- fam
family history of MI
- prevcvd
previous cardiovascular disease
- prevcabg
previous coronary artery bypass graft surgery
- regl
region (probably country)
- grpl
location code 2, refinement of regl
- grps
location code 3, refinement of regl
- tx
treatment (SK, SK+tPA, tPA)
Changes to the original dataset
reversed the hyp indicator variable so that 1 corresponds to sysbp >= 100
removed tpa, an indicator of tPA treatment (can be inferred from tx)
removed ant, an indicator of anterior MI (included in miloc)
changed pmi to a {0,1} binary indicator
subtracted 1 from htn to make it a {0,1} binary indicator
subtracted 1 from pan to make it a {0,1} binary indicator
subtracted 1 from fam to make it a {0,1} binary indicator
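The recoding steps listed above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the `recode_row` helper and the sample record are hypothetical, and the sketch assumes the raw file codes htn, pan and fam as {1, 2}, which the "subtract 1" wording suggests but the text does not state outright. The pmi step is left out because its raw coding is not described.

```python
def recode_row(raw):
    """Apply the documented changes to one raw record (a dict)."""
    row = dict(raw)  # copy so the input record is left untouched
    # Reverse hyp so that 1 corresponds to sysbp >= 100.
    row["hyp"] = 1 - row["hyp"]
    # Drop columns that duplicate information kept elsewhere.
    row.pop("tpa", None)  # can be inferred from tx
    row.pop("ant", None)  # covered by miloc
    # Assumed raw coding is {1, 2}; subtracting 1 yields {0, 1}.
    for name in ("htn", "pan", "fam"):
        row[name] = row[name] - 1
    return row


# Hypothetical raw record, for illustration only.
raw = {"hyp": 0, "sysbp": 120, "tpa": 1, "ant": 0, "tx": "tPA",
       "miloc": "Inferior", "htn": 2, "pan": 1, "fam": 2}
clean = recode_row(raw)
print(clean)
```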
Functional dependencies
sho can be inferred from Killip
hrt can be inferred from pulse
hyp can be inferred from sysbp >= 100 (except 5 cases)
hig: there is a functional dependency: hig=0 -> miloc=Anterior
grpl is a refinement of regl
grps is a refinement of regl
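The dependencies listed above can be checked row by row. The sketch below is an illustration on a few made-up records, not part of the library: `check_dependencies` is a hypothetical helper, and the thresholds (pulse > 80, sysbp >= 100) are taken from the variable descriptions above. The third record deliberately mimics one of the hyp exceptions.

```python
def check_dependencies(rows):
    """Count records that violate each documented dependency."""
    violations = {"hrt": 0, "hyp": 0, "hig": 0}
    for r in rows:
        # hrt should equal the indicator pulse > 80.
        if r["hrt"] != int(r["pulse"] > 80):
            violations["hrt"] += 1
        # hyp should equal the indicator sysbp >= 100.
        if r["hyp"] != int(r["sysbp"] >= 100):
            violations["hyp"] += 1
        # hig=0 -> miloc=Anterior
        if r["hig"] == 0 and r["miloc"] != "Anterior":
            violations["hig"] += 1
    return violations


# Made-up records, for illustration only.
rows = [
    {"pulse": 95, "hrt": 1, "sysbp": 130, "hyp": 1, "hig": 0, "miloc": "Anterior"},
    {"pulse": 70, "hrt": 0, "sysbp": 90, "hyp": 0, "hig": 1, "miloc": "Inferior"},
    {"pulse": 60, "hrt": 0, "sysbp": 95, "hyp": 1, "hig": 1, "miloc": "Other"},
]
violations = check_dependencies(rows)
print(violations)
```

Running the same kind of check on the real data is one way to confirm the handful of hyp exceptions mentioned above before dropping a redundant column.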
- Parameters:
- include_location_vars : boolean, default=True
Whether to include the variables describing hospital locations (regl, grpl, grps). These are categorical variables with a large number of levels; they are removed when this is False.
- data_home : string, optional
Specify another download and cache folder for the datasets. By default all scikit-learn data is stored in ‘~/scikit_learn_data’ subfolders.
- download_if_missing : boolean, default=True
If False, raise an IOError if the data is not locally available instead of trying to download it from the source site.
- random_state : int, RandomState instance or None (default)
Determines random number generation for dataset shuffling. Pass an int for reproducible output across multiple function calls.
- shuffle : bool, default=False
Whether to shuffle the dataset.
- categ_as_strings : bool, default=False
Whether to return categorical variables as strings.
- return_X_y : boolean, default=False
If True, returns (data.data, data.target) instead of a Bunch object.
- as_frame : boolean, default=False
If True, features are returned as a pandas DataFrame. If False, features are returned as an object or float array; a float array is returned if all features are floats.
- Returns:
- dataset : dict-like object with the following attributes:
- dataset.data : numpy array
Each row holds the feature values of one patient.
- dataset.target : numpy array
Each value is 1 if the patient died within 30 days (day30), 0 otherwise.
- dataset.DESCR : string
Description of the dataset.
- (data, target) : tuple if return_X_y is True
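A possible way to call the function, using only the parameters documented above, is sketched here. It assumes libuplift is installed and that the import path matches the module name shown in the signature; the `load_gusto` wrapper is a hypothetical convenience, not part of the library. Nothing is downloaded until the wrapper is called.

```python
def load_gusto():
    """Return (X, y) for GUSTO-I, or None if libuplift is not installed."""
    try:
        # Import path assumed from the documented module name.
        from libuplift.datasets.Gusto import fetch_GUSTO
    except ImportError:
        return None
    return fetch_GUSTO(
        include_location_vars=False,  # drop regl, grpl, grps
        include_ttr=False,            # keep the randomization intact
        shuffle=True,
        random_state=0,               # reproducible shuffling
        return_X_y=True,              # (data, target) instead of a Bunch
        as_frame=True,                # features as a pandas DataFrame
    )
```

Since the wrapper returns None when the package is missing, check the result before unpacking it into features and target.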