Source code for libuplift.datasets.Gusto

"""The GUSTO-I clinical trial dataset.

This is a randomized clinical trial of thrombolytic strategies for
acute myocardial infarctions.

This version is from the predtools R package.  See:
https://cran.r-project.org/web/packages/predtools/index.html for
details.

The study results were originally published in
https://www.nejm.org/doi/10.1056/NEJM199309023291001

The specific dataset used here together with a logistic model is
described in
https://www.ahajournals.org/doi/full/10.1161/01.cir.91.6.1659 and in
the book
https://www.clinicalpredictionmodels.org/extra-material/chapter-22

"""

import numpy as np

from .base import _fetch_remote_csv
from .base import RemoteFileMetadata


ARCHIVE = RemoteFileMetadata(
    filename="GUSTO.csv.gz",
    url=('https://github.com/jszymon/uplift_sklearn_data/'
         'releases/download/GUSTO/GUSTO.csv.gz'),
    checksum=('0f9ce4c74769ccc110bebe6a1f8beaee'
              '255c4b49811d8fea2e6eac0397da8bd9'))


def fetch_GUSTO(include_location_vars=True, include_ttr=False,
                data_home=None, download_if_missing=True,
                random_state=None, shuffle=False,
                categ_as_strings=False, return_X_y=False,
                as_frame=False):
    """Load the GUSTO-I clinical trial dataset.

    Download it if necessary.

    This is a randomized clinical trial dataset of thrombolytic
    strategies for acute myocardial infarctions.

    WARNING: the ttr attribute breaks randomization.  It is excluded
    by default and can be added using include_ttr=True, e.g. to get
    nonrandom assignment which is easy to control.

    This version comes from the predtools R package.  See:
    https://cran.r-project.org/web/packages/predtools/index.html for
    details.

    The study results were originally published in
    https://www.nejm.org/doi/10.1056/NEJM199309023291001

    The specific dataset used here together with a logistic model is
    described in
    https://www.ahajournals.org/doi/full/10.1161/01.cir.91.6.1659 and in
    the book
    https://www.clinicalpredictionmodels.org/extra-material/chapter-22

    **Variables**

    day30     (target) death within 30 days
    sho       whether cardiac shock was present
    hig       indicator of non-anterior MI location
    dia       diabetes
    hyp       high blood pressure indicator, seems to indicate
              sysbp >= 100 (5 exceptions)
    hrt       tachycardia (indicator of pulse > 80)
    ttr       Time To Relief of chest pain > 1h
    sex       patient's sex
    Killip    Killip Class (I, II, III, IV)
    age       patient's age [years]
    ste       number of ECG leads with ST Elevation
    pulse     Heart Rate [beats/min]
    sysbp     Systolic Blood Pressure [mmHg]
    miloc     MI Location (Inferior, Anterior, Other)
    height    patient's height [cm]
    weight    patient's weight [kg]
    pmi       previous MI
    htn       history of hypertension
    smk       smoking (never, quit, current)
    pan       previous angina pectoris
    fam       family history of MI
    prevcvd   previous cardiovascular disease
    prevcabg  previous coronary artery bypass graft surgery
    regl      region (probably country)
    grpl      location code 2, refinement of regl
    grps      location code 3, refinement of regl
    tx        treatment (SK, SK+tPA, tPA)

    **Changes to the original dataset**

    - reversed the hyp indicator variable such that 1 corresponds to
      sysbp >= 100
    - removed tpa, an indicator of tPA treatment (can be inferred
      from tx)
    - removed the ant variable, an indicator of anterior MI
      (included in miloc)
    - changed pmi to a {0,1} binary indicator
    - subtracted 1 from htn to make it a {0,1} binary indicator
    - subtracted 1 from pan to make it a {0,1} binary indicator
    - subtracted 1 from fam to make it a {0,1} binary indicator

    **Functional dependencies**

    - sho can be inferred from Killip
    - hrt can be inferred from pulse
    - hyp can be inferred from sysbp >= 100 (except 5 cases)
    - hig: there is a functional dependency: hig=0 -> miloc=Anterior
    - grpl is a refinement of regl
    - grps is a refinement of regl

    Parameters
    ----------
    include_location_vars : boolean, default=True
        Should variables describing hospital locations be included.
        These are categorical variables with a large number of
        levels.  The removed variables are regl, grpl, grps.

    include_ttr : boolean, default=False
        Should the ttr variable be included.  It breaks
        randomization, so it is excluded by default.

    data_home : string, optional
        Specify another download and cache folder for the datasets.
        By default all scikit-learn data is stored in
        '~/scikit_learn_data' subfolders.

    download_if_missing : boolean, default=True
        If False, raise an IOError if the data is not locally
        available instead of trying to download the data from the
        source site.

    random_state : int, RandomState instance or None (default)
        Determines random number generation for dataset shuffling.
        Pass an int for reproducible output across multiple function
        calls.

    shuffle : bool, default=False
        Whether to shuffle the dataset.

    categ_as_strings : bool, default=False
        Whether to return categorical variables as strings.

    return_X_y : boolean, default=False
        If True, returns ``(data.data, data.target)`` instead of a
        Bunch object.

    as_frame : boolean, default=False
        If True, features are returned as a pandas DataFrame.  If
        False, features are returned as an object or float array.
        A float array is returned if all features are floats.

    Returns
    -------
    dataset : dict-like object with the following attributes:

    dataset.data : numpy array
        Each row corresponds to the features in the dataset.

    dataset.target : numpy array
        Each value is 1 if the patient died within 30 days,
        0 otherwise.

    dataset.DESCR : string
        Description of the dataset.

    (data, target) : tuple if ``return_X_y`` is True

    """

    # dictionaries
    treatment_values = ["SK", "SK+tPA", "tPA"]
    sex_values = ["male", "female"]
    Killip_values = ["I", "II", "III", "IV"]
    miloc_values = ["Inferior", "Anterior", "Other"]
    smk_values = ["never", "quit", "current"]
    regl_values = [str(i+1) for i in range(16)]
    grpl_values = [str(i+1) for i in range(48)]
    grps_values = [str(i+1) for i in range(121)]

    # attribute descriptions
    treatment_descr = [("treatment", treatment_values, "tx")]
    target_descr = [("target", np.int32, "day30"), ]
    feature_descr = [('sho', np.int32),
                     ('hig', np.int32),
                     ('dia', np.int32),
                     ('hyp', np.int32),
                     ('hrt', np.int32),
                     ('ttr', np.int32),
                     ('sex', sex_values),
                     ('Killip', Killip_values),
                     ('age', float),
                     ('ste', np.int32),
                     ('pulse', np.int32),
                     ('sysbp', np.int32),
                     ('miloc', miloc_values),
                     ('height', float),
                     ('weight', float),
                     ('pmi', np.int32),
                     ('htn', np.int32),
                     ('smk', smk_values),
                     ('pan', np.int32),
                     ('fam', np.int32),
                     ('prevcvd', np.int32),
                     ('prevcabg', np.int32),
                     ('regl', regl_values),
                     ('grpl', grpl_values),
                     ('grps', grps_values),
                     ]

    arch = ARCHIVE
    dataset_name = "GUSTO"

    # collect the optional variables that should be dropped
    remove_vars = []
    if not include_ttr:
        remove_vars += ['ttr']
    if not include_location_vars:
        remove_vars += ['regl', 'grpl', 'grps']
    if len(remove_vars) == 0:
        remove_vars = None

    ret = _fetch_remote_csv(arch, dataset_name,
                            feature_attrs=feature_descr,
                            treatment_attrs=treatment_descr,
                            target_attrs=target_descr,
                            categ_as_strings=categ_as_strings,
                            return_X_y=return_X_y, as_frame=as_frame,
                            download_if_missing=download_if_missing,
                            random_state=random_state, shuffle=shuffle,
                            total_attrs=27, all_num=False,
                            remove_vars=remove_vars
                            )
    if not return_X_y:
        ret.descr = __doc__
    return ret
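

# Illustration only, not part of the libuplift module: a minimal sketch
# of which feature names remain for each combination of include_ttr and
# include_location_vars.  It copies the remove_vars logic of fetch_GUSTO
# and assumes that _fetch_remote_csv simply drops the variables named in
# remove_vars; that behaviour is not shown in this file.  The names
# FEATURE_NAMES and selected_features are hypothetical.

```python
# Feature names in the order used by feature_descr in fetch_GUSTO.
FEATURE_NAMES = [
    "sho", "hig", "dia", "hyp", "hrt", "ttr", "sex", "Killip", "age",
    "ste", "pulse", "sysbp", "miloc", "height", "weight", "pmi", "htn",
    "smk", "pan", "fam", "prevcvd", "prevcabg", "regl", "grpl", "grps",
]


def selected_features(include_location_vars=True, include_ttr=False):
    """Hypothetical helper: feature names kept for the given flags."""
    remove_vars = []
    if not include_ttr:
        remove_vars += ["ttr"]
    if not include_location_vars:
        remove_vars += ["regl", "grpl", "grps"]
    # Assumption: removal is a plain filter by variable name.
    return [name for name in FEATURE_NAMES if name not in remove_vars]


# Defaults: location variables kept, ttr dropped.
default_features = selected_features()
# Smallest feature set: no ttr and no location variables.
minimal_features = selected_features(include_location_vars=False)
# Every feature, including ttr, which breaks randomization.
all_features = selected_features(include_ttr=True)
```

# The 25 names above plus tx and day30 give the total_attrs=27 that
# fetch_GUSTO passes to _fetch_remote_csv.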