Beyond Composite Indices: Comprehensive Social Determinants Improve Heart Failure Readmission Prediction
This repository contains the code for the article "Beyond Composite Indices: Comprehensive Social Determinants Improve Heart Failure Readmission Prediction".
Heart failure 30-day hospital readmission prediction:
python 3.9
imblearn==0.0
joblib==1.2.0
numpy==1.24.4
pandas==2.0.0
pymongo==4.7.0
scikit_learn==1.4.2
shap==0.45.0
tqdm==4.65.0
xgboost==1.7.6

The social determinants of health (SDOH) datasets used in this study can be found below:
| Dataset | Number of SDOH Variables Used |
|---|---|
| AHRQ SDOHD | 760 |
Run
/data/patient_inclusion.Rmd
to apply the study inclusion and exclusion criteria.
Then run
/data/merge_SDOH.Rmd
to merge the SDOH data with the patient records.
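The merge itself is done in R by merge_SDOH.Rmd. As an illustration only, the sketch below shows the general shape of such a merge in Python: a left join of patient rows onto area-level SDOH rows. It assumes the join key is an 11-digit census tract identifier stored as an integer or string. The column names (`tract_fips`, `median_income`, `patient_id`) and both function names are hypothetical and are not taken from the repository.

```python
from typing import Any, Dict, List


def normalize_tract_fips(value: Any) -> str:
    """Return an 11-character census tract GEOID, restoring leading zeros.

    Tract identifiers read as integers lose their leading zero for states
    whose FIPS code starts with 0, so they are padded back to 11 characters.
    """
    return str(value).strip().zfill(11)


def left_join_sdoh(
    patients: List[Dict[str, Any]],
    sdoh_rows: List[Dict[str, Any]],
    key: str = "tract_fips",  # hypothetical column name
) -> List[Dict[str, Any]]:
    """Left-join patients to SDOH rows; unmatched patients get None values."""
    sdoh_by_tract = {normalize_tract_fips(r[key]): r for r in sdoh_rows}
    sdoh_columns = sorted({c for r in sdoh_rows for c in r if c != key})

    merged = []
    for patient in patients:
        tract = normalize_tract_fips(patient[key])
        sdoh = sdoh_by_tract.get(tract, {})
        row = dict(patient)
        row[key] = tract
        for column in sdoh_columns:
            row[column] = sdoh.get(column)  # None marks a missing SDOH value
        merged.append(row)
    return merged


# Made-up example rows (not real patient data).
patients = [
    {"patient_id": "p1", "tract_fips": 1001020100},     # leading zero lost
    {"patient_id": "p2", "tract_fips": "36061000100"},  # no SDOH match
]
sdoh_rows = [{"tract_fips": "01001020100", "median_income": "50000"}]
merged = left_join_sdoh(patients, sdoh_rows)
```

A left join keeps every patient even when no SDOH row matches, so missing area-level values stay visible downstream instead of silently shrinking the cohort.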
The patient dataset is unavailable for privacy reasons. However, the following commands demonstrate the steps we used to train and evaluate binary classification models (using clinical and public SDOH data):
To train binary classification models for HF 30-day hospital readmission prediction (choose the classification algorithm and features inside the file):
python classification_driver_nestKfold.py
To analyze results of the HF models:
python analyze_classification_perf_results.py
To analyze fairness of the HF models:
python fairness_analyze_results.py
To get feature importance information (from trained XGBoost models):
python analyze_XGB_SHAP.py

First, pull readmission prediction model results from your local MongoDB collections:
- To gather prediction performance values:
  python /scripts/analyze_classification_perf_results.py
- To gather prediction fairness values:
  python /scripts/fairness_analyze_results.py
- To gather feature importance values for XGBoost models:
  python /scripts/analyze_XGB_SHAP.py
- Then, to generate the patient characteristics table, use:
  /data/table1_generate.Rmd
- To tabulate model performance and fairness, run:
  /data/calculate_HF_performance.Rmd
- To tabulate feature importance, run:
  /data/plot_HF_SHAP.Rmd

Note that all SDOH features used from AHRQ SDOHD can be found below in Related Documents.
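The training driver is named classification_driver_nestKfold.py, which suggests nested K-fold cross-validation. As an illustration only, the sketch below shows how nested splits keep hyperparameter tuning separate from the held-out evaluation fold. The function names, the fold counts, and the plain (unstratified) shuffling are assumptions for this example and are not taken from the repository's scripts.

```python
import random
from typing import Iterator, List, Tuple

Split = Tuple[List[int], List[int]]


def k_fold_indices(indices: List[int], k: int, seed: int = 0) -> Iterator[Split]:
    """Yield (train, held_out) index lists for k shuffled, near-equal folds."""
    shuffled = indices[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out


def nested_k_fold(
    n_samples: int, outer_k: int = 5, inner_k: int = 3, seed: int = 0
) -> Iterator[Tuple[List[int], List[int], List[Split]]]:
    """Yield (outer_train, outer_test, inner_splits) for each outer fold.

    Inner splits are drawn only from outer_train, so the outer test fold is
    never seen while hyperparameters are being selected.
    """
    all_indices = list(range(n_samples))
    for outer_train, outer_test in k_fold_indices(all_indices, outer_k, seed):
        inner_splits = list(k_fold_indices(outer_train, inner_k, seed + 1))
        yield outer_train, outer_test, inner_splits


# Example: 20 samples, 5 outer folds, 3 inner folds per outer training set.
splits = list(nested_k_fold(20, outer_k=5, inner_k=3))
```

In practice, each inner split would be used to pick hyperparameters, the chosen configuration would be refit on the full outer training set, and performance would be reported on the outer test fold only. With an imbalanced outcome such as 30-day readmission, a stratified splitter is usually preferable to the plain shuffle used here.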
Rates of missingness for all expanded SDOH variables (i.e., from AHRQ SDOHD) can be found in summary_statistics/:
- TotalCohort_missing_rates_by_race.csv
- TotalCohort_missing_rates_by_readmission.csv
- TotalCohort_missing_rates_by_readmission_black.csv
- TotalCohort_missing_rates_by_readmission_white.csv
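For readers who want to compute comparable tables on their own data, a minimal sketch of a per-group missingness calculation follows. It is not the code that produced the files above. The column names (`race`, `var_a`, `var_b`), the set of tokens treated as missing, and the function name are hypothetical, and values are assumed to be strings as read by `csv.DictReader`.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping, Optional

# Tokens treated as missing; an assumption for this example.
MISSING_TOKENS = {None, "", "NA", "NaN"}


def missing_rates_by_group(
    rows: Iterable[Mapping[str, Optional[str]]],
    group_col: str,
    variables: List[str],
) -> Dict[str, Dict[str, float]]:
    """Return {group: {variable: fraction of rows where it is missing}}."""
    group_sizes: Dict[str, int] = defaultdict(int)
    missing_counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))

    for row in rows:
        group = row[group_col]
        group_sizes[group] += 1
        for variable in variables:
            # An absent column counts as missing, same as an empty cell.
            if row.get(variable) in MISSING_TOKENS:
                missing_counts[group][variable] += 1

    return {
        group: {v: missing_counts[group][v] / size for v in variables}
        for group, size in group_sizes.items()
    }


# Made-up example rows (not real patient data).
rows = [
    {"race": "Black", "var_a": "0.12", "var_b": ""},
    {"race": "Black", "var_a": "", "var_b": ""},
    {"race": "White", "var_a": "0.30", "var_b": "5"},
    {"race": "White", "var_a": "0.31", "var_b": "NA"},
]
rates = missing_rates_by_group(rows, "race", ["var_a", "var_b"])
# rates["Black"] == {"var_a": 0.5, "var_b": 1.0}
# rates["White"] == {"var_a": 0.0, "var_b": 0.5}
```

Grouping by the readmission outcome instead of race only requires passing a different `group_col`.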
data/
|-- adi-download-2020-tract/
|-- sdi-download-2019-tract/
|-- feat_base.json
|-- feat_column.json
|-- subgroup_cols_fast.json
|-- 2010-18-all-granularities-AHRQ-dict.xlsx
|-- count_features.py
|-- num_unique_SDOH_features.py
|-- calculate_HF_performance.Rmd
|-- merge_SDOH.Rmd
|-- patient_inclusion.Rmd
|-- plot_HF_SHAP.Rmd
|-- table1_generate.Rmd
|-- data_cleaners.R
|-- gen_chars.R
scripts/
|-- analyze_classification_perf_results.py
|-- analyze_XGB_SHAP.py
|-- classification_driver_nestKfold.py
|-- evalHelper.py
|-- fairness_analyze_results.py
|-- fake_patient_data.csv
summary_statistics/
|-- AHRQ_used_county_metadata.csv
|-- AHRQ_used_tract_metadata.csv
|-- missing_rates_final_allstates_modelinput.csv
|-- TotalCohort_missing_rates_by_race.csv
|-- TotalCohort_missing_rates_by_readmission_black.csv
|-- TotalCohort_missing_rates_by_readmission_white.csv
|-- TotalCohort_missing_rates_by_readmission.csv
|-- domains_lists.json