libuplift.datasets.Gusto

The GUSTO-I clinical trial dataset.

This is a randomized clinical trial of thrombolytic strategies for acute myocardial infarction.

This version is from the predtools R package. See: https://cran.r-project.org/web/packages/predtools/index.html for details.

The study results were originally published in https://www.nejm.org/doi/10.1056/NEJM199309023291001

The specific dataset used here together with a logistic model is described in https://www.ahajournals.org/doi/full/10.1161/01.cir.91.6.1659 and in the book https://www.clinicalpredictionmodels.org/extra-material/chapter-22

Functions

fetch_GUSTO([include_location_vars, include_ttr, ...])

Load the GUSTO-I clinical trial dataset.

Module Contents

libuplift.datasets.Gusto.fetch_GUSTO(include_location_vars=True, include_ttr=False, data_home=None, download_if_missing=True, random_state=None, shuffle=False, categ_as_strings=False, return_X_y=False, as_frame=False)

Load the GUSTO-I clinical trial dataset.

Download it if necessary.

This is a randomized clinical trial dataset of thrombolytic strategies for acute myocardial infarction.

WARNING: the ttr attribute breaks randomization, so it is excluded by default. It can be added with include_ttr=True, e.g. to obtain a nonrandom assignment that is easy to control.

This version comes from the predtools R package. See: https://cran.r-project.org/web/packages/predtools/index.html for details.

The study results were originally published in https://www.nejm.org/doi/10.1056/NEJM199309023291001

The specific dataset used here together with a logistic model is described in https://www.ahajournals.org/doi/full/10.1161/01.cir.91.6.1659 and in the book https://www.clinicalpredictionmodels.org/extra-material/chapter-22

Variables

day30 (target)

death within 30 days

sho

whether cardiac shock was present

hig

indicator of non-anterior MI location

dia

diabetes

hyp

high blood pressure indicator, seems to indicate sysbp >= 100 (5 exceptions)

hrt

tachycardia (indicator of pulse>80)

ttr

Time To Relief of chest pain > 1h

sex

patient’s sex

Killip

Killip Class (I, II, III, IV)

age

patient’s age [years]

ste

number of ECG leads with ST Elevation

pulse

Heart Rate [beats/min]

sysbp

Systolic Blood Pressure [mmHg]

miloc

MI Location (Inferior, Anterior, Other)

height

patient’s height [cm]

weight

patient’s weight [kg]

pmi

previous MI

htn

history of hypertension

smk

smoking (never, quit, current)

pan

previous angina pectoris

fam

family history of MI

prevcvd

previous cardiovascular disease

prevcabg

previous coronary artery bypass graft surgery

regl

region (probably country)

grpl

location code 2, refinement of regl

grps

location code 3, refinement of regl

tx

treatment (SK, SK+tPA, tPA)

Changes to the original dataset

  • reversed the hyp indicator variable so that 1 corresponds to sysbp >= 100

  • removed tpa, an indicator of tPA treatment (can be inferred from tx)

  • removed ant, an indicator of anterior MI (included in miloc)

  • changed pmi to a {0,1} binary indicator

  • subtracted 1 from htn to make it a {0,1} binary indicator

  • subtracted 1 from pan to make it a {0,1} binary indicator

  • subtracted 1 from fam to make it a {0,1} binary indicator
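
The recoding steps listed above can be sketched on a single toy record. This is only an illustration, not the library's own code: the record values are made up, the helper name recode_record is hypothetical, and the original {1,2} coding of htn, pan and fam is an assumption inferred from the "subtract 1" items. The hyp and pmi steps are left out because their original coding is not stated here.

```python
def recode_record(record):
    """Return a copy of `record` with some of the documented recodings applied."""
    out = dict(record)
    # tpa and ant are dropped: they are recoverable from tx and miloc.
    for name in ("tpa", "ant"):
        out.pop(name, None)
    # htn, pan and fam are shifted by one so they become {0,1} indicators
    # (assumption: the original coding was {1,2}).
    for name in ("htn", "pan", "fam"):
        out[name] = out[name] - 1
    return out


# Made-up record in the assumed original coding.
raw = {"tx": "tPA", "miloc": "Anterior", "tpa": 1, "ant": 1,
       "htn": 2, "pan": 1, "fam": 2}
clean = recode_record(raw)
print(clean)
```

The input record is copied rather than modified, so the raw values stay available for comparison.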

Functional dependencies

  • sho can be inferred from Killip

  • hrt can be inferred from pulse

  • hyp can be inferred from sysbp >= 100 (except 5 cases)

  • hig: there is a functional dependency: hig=0 -> miloc=Anterior

  • grpl is a refinement of regl

  • grps is a refinement of regl
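
Two of the dependencies above are stated precisely enough to check row by row. The sketch below counts violations on a few made-up rows; the helper names are hypothetical, and it assumes the indicators are stored as integers and miloc as a string, which may differ from what the loader returns depending on categ_as_strings.

```python
def count_violations(rows, rule):
    """Count the rows for which `rule(row)` is False."""
    return sum(1 for row in rows if not rule(row))


def hrt_matches_pulse(row):
    # hrt is documented as an indicator of pulse > 80.
    return row["hrt"] == int(row["pulse"] > 80)


def hig_implies_anterior(row):
    # Documented dependency: hig=0 -> miloc=Anterior.
    return row["hig"] != 0 or row["miloc"] == "Anterior"


# Made-up rows; the last one deliberately breaks the hrt rule.
rows = [
    {"pulse": 95, "hrt": 1, "hig": 0, "miloc": "Anterior"},
    {"pulse": 70, "hrt": 0, "hig": 1, "miloc": "Inferior"},
    {"pulse": 60, "hrt": 1, "hig": 1, "miloc": "Other"},
]
hrt_bad = count_violations(rows, hrt_matches_pulse)
hig_bad = count_violations(rows, hig_implies_anterior)
print(hrt_bad, hig_bad)  # prints "1 0"
```

The same pattern could be used for hyp and sysbp, where a handful of violations is expected according to the list above.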

Parameters:
include_location_vars : boolean, default=True

Whether to include the variables describing hospital locations. These are categorical variables with a large number of levels. When False, the variables regl, grpl and grps are removed.

include_ttr : boolean, default=False

Whether to include the ttr variable. It is excluded by default because it breaks randomization.

data_home : string, optional

Specify another download and cache folder for the datasets. By default all scikit-learn data is stored in ‘~/scikit_learn_data’ subfolders.

download_if_missing : boolean, default=True

If False, raise an IOError if the data is not locally available instead of trying to download the data from the source site.

random_state : int, RandomState instance or None (default)

Determines random number generation for dataset shuffling. Pass an int for reproducible output across multiple function calls.

shuffle : bool, default=False

Whether to shuffle the dataset.

categ_as_strings : bool, default=False

Whether to return categorical variables as strings.

return_X_y : boolean, default=False

If True, returns (data.data, data.target) instead of a Bunch object.

as_frame : boolean, default=False

If True, features are returned as a pandas DataFrame. If False, features are returned as an object or float array. A float array is returned if all features are floats.

Returns:
dataset : dict-like object with the following attributes:
dataset.data : numpy array

Each row corresponds to the features in the dataset.

dataset.target : numpy array

Each value is 1 if the patient died within 30 days, 0 otherwise.

dataset.DESCR : string

Description of the dataset.

(data, target) : tuple if return_X_y is True
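
A minimal usage sketch, assuming the import path shown in the signature above and a numeric 0/1 target. The wrapper gusto_summary and its fetch argument are hypothetical: they are not part of libuplift and exist only so the example can run offline with a stand-in loader that returns made-up values.

```python
from types import SimpleNamespace


def gusto_summary(fetch=None, **kwargs):
    """Load the dataset and return (number of rows, fraction with target == 1).

    `fetch` is a hypothetical hook for running this sketch without the
    library; by default the real loader is imported and used.
    """
    if fetch is None:
        from libuplift.datasets.Gusto import fetch_GUSTO as fetch
    ds = fetch(**kwargs)
    target = list(ds.target)
    return len(target), sum(target) / len(target)


# Typical call (needs libuplift and, on first use, network access):
# n, rate = gusto_summary(include_location_vars=False, shuffle=True, random_state=0)

# Offline demonstration with a stand-in loader and made-up values.
def fake_fetch(**kwargs):
    return SimpleNamespace(data=[[60], [71], [55], [80]],
                           target=[0, 1, 0, 0], DESCR="toy data")


n, rate = gusto_summary(fetch=fake_fetch)
print(n, rate)  # prints "4 0.25"
```

Passing an int as random_state together with shuffle=True should give the same row order on every call, as described for the random_state parameter.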