Publications

Machine Learning Methods for Healthcare

Abstract. State-of-the-art clustering algorithms provide little insight into the rationale for cluster membership, limiting their interpretability. In complex real-world applications, this lack of interpretability poses a barrier to machine learning adoption when experts are asked to provide detailed explanations of their algorithms’ recommendations. We present a new unsupervised learning method that leverages Mixed Integer Optimization techniques to generate interpretable tree-based clustering models. Utilizing a flexible optimization-driven framework, our algorithm approximates the globally optimal solution, leading to high-quality partitions of the feature space. The method can optimize for a range of internal clustering validation metrics and naturally determines the optimal number of clusters. It successfully addresses the challenge of mixed numerical and categorical data and achieves comparable or superior performance to other clustering methods on both synthetic and real-world datasets, while offering significantly higher interpretability.
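As a concrete illustration of metric-driven, tree-based clustering, the sketch below generates candidate partitions, scores each with the silhouette coefficient (an internal validation metric), and explains the winning partition with a shallow decision tree. It uses scikit-learn, synthetic data, and KMeans-seeded labels as stand-ins; it is not the Mixed Integer Optimization algorithm described above.

```python
# Hedged sketch: approximating interpretable, tree-based clustering by
# (1) generating candidate partitions, (2) scoring them with an internal
# validation metric (silhouette), and (3) explaining the winning partition
# with a shallow decision tree. This is NOT the paper's MIO algorithm.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.tree import DecisionTreeClassifier, export_text

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)  # toy data

best_k, best_score, best_labels = None, -1.0, None
for k in range(2, 8):                          # candidate numbers of clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)        # internal validation metric
    if score > best_score:
        best_k, best_score, best_labels = k, score, labels

# A shallow tree gives axis-aligned rules that describe cluster membership.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, best_labels)
print(f"chosen k = {best_k}, silhouette = {best_score:.3f}")
print(export_text(tree, feature_names=["x1", "x2"]))
```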

Abstract. Missing data is a common problem in longitudinal datasets, which include multiple instances of the same individual observed at different points in time. We introduce a new approach, MedImpute, for imputing missing clinical covariates in multivariate panel data. This approach integrates patient-specific information into an optimization formulation that can be adjusted for different imputation algorithms. We present the formulation for a K-nearest neighbors model and derive a corresponding scalable first-order method, med.knn. Our algorithm provides imputations for datasets with both continuous and categorical features and observations occurring at arbitrary points in time. In computational experiments on three real-world clinical datasets, we test its performance on imputation and downstream predictive tasks, varying the percentage of missing data, the number of observations per patient, and the mechanism of missing data. The proposed method improves both imputation accuracy and downstream predictive performance relative to the best of the benchmark imputation methods considered. We show that this edge is consistently present in both longitudinal and electronic health records datasets, as well as in binary classification and regression settings. In computational experiments on synthetic data, we test the scalability of the algorithm on large datasets and show that an efficient method for hyperparameter tuning scales to datasets with 10,000s of observations and 100s of covariates while maintaining high imputation accuracy.
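The sketch below conveys the flavor of patient-aware K-nearest-neighbor imputation for panel data: donors from the same patient are favored when filling in a missing value. The distance discount, helper function, and toy data are illustrative assumptions and do not reproduce the med.knn optimization formulation.

```python
# Hedged sketch of a KNN-style imputation for panel data in which
# observations from the same patient are favored as donors.
import numpy as np

def knn_impute(X, patient_ids, k=5, same_patient_discount=0.5):
    """Impute NaNs in X (n_obs x n_features), one row at a time.

    Row-to-row distances use the features both rows observe; rows from the
    same patient have their distance multiplied by `same_patient_discount`
    (< 1), making them more likely to serve as donors.
    """
    X = X.copy()
    n, _ = X.shape
    for i in range(n):
        missing = np.isnan(X[i])
        if not missing.any():
            continue
        dists = np.full(n, np.inf)
        for j in range(n):
            if j == i:
                continue
            shared = ~np.isnan(X[i]) & ~np.isnan(X[j])
            if not shared.any():
                continue
            d = np.sqrt(np.mean((X[i, shared] - X[j, shared]) ** 2))
            if patient_ids[j] == patient_ids[i]:
                d *= same_patient_discount   # favor same-patient donors
            dists[j] = d
        donors = np.argsort(dists)[:k]
        for f in np.where(missing)[0]:
            vals = X[donors, f]
            vals = vals[~np.isnan(vals)]
            if vals.size:
                X[i, f] = vals.mean()
    return X

# Toy panel: two patients with three visits each and one missing value each.
X = np.array([[1.0, 2.0], [1.1, np.nan], [0.9, 2.2],
              [5.0, 9.0], [np.nan, 9.5], [5.2, 9.1]])
ids = np.array([0, 0, 0, 1, 1, 1])
print(knn_impute(X, ids, k=2))
```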

Abstract. Tree-based models are increasingly popular due to their ability to identify complex relationships that are beyond the scope of parametric models. Survival tree methods adapt these models to allow for the analysis of censored outcomes, which often appear in medical data. We present a new Optimal Survival Trees (OST) algorithm that leverages mixed-integer optimization (MIO) and local search techniques to generate globally optimized survival tree models. We demonstrate that the OST algorithm improves on the accuracy of existing survival tree methods, particularly in large datasets.
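The basic ingredient that survival trees apply recursively is a censoring-aware split criterion. The sketch below scores candidate splits on a single covariate with the log-rank statistic (via the lifelines package, on synthetic data); it illustrates that building block only and is not the OST algorithm's MIO and local-search procedure.

```python
# Hedged sketch: scoring a single candidate split on censored data with the
# log-rank statistic, the basic ingredient survival trees apply recursively.
import numpy as np
from lifelines.statistics import logrank_test

rng = np.random.default_rng(0)
age = rng.uniform(40, 90, size=200)                          # single covariate
time = rng.exponential(scale=np.where(age > 65, 2.0, 5.0))   # survival times
event = rng.binomial(1, 0.7, size=200)                       # 1 = observed, 0 = censored

def split_score(threshold):
    left = age <= threshold
    res = logrank_test(time[left], time[~left],
                       event_observed_A=event[left],
                       event_observed_B=event[~left])
    return res.test_statistic

# Grid-search the covariate threshold that best separates the survival curves.
thresholds = np.quantile(age, np.linspace(0.1, 0.9, 17))
best = max(thresholds, key=split_score)
print(f"best split: age <= {best:.1f} (log-rank stat = {split_score(best):.1f})")
```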

Prescriptive Analytics for Personalized Medicine

Abstract. Current clinical practice guidelines for managing Coronary Artery Disease (CAD) account for general cardiovascular risk factors. However, they do not present a framework that considers personalized patient-specific characteristics. Using the electronic health records of 21,460 patients, we create data-driven models for personalized CAD management that significantly improve health outcomes relative to the standard of care. We develop binary classifiers to detect whether a patient will experience an adverse event due to CAD within a 10-year time frame. Combining the patients’ medical history and clinical examination results, we achieve 81.5% AUC. For each treatment, we also create a series of regression models based on different supervised machine learning algorithms. We are able to estimate, with an average R^2 = 0.801, the outcome of interest: the time from diagnosis to a potential adverse event (TAE). Leveraging combinations of these models, we present ML4CAD, a novel personalized prescriptive algorithm. Considering the recommendations of multiple predictive models at once, ML4CAD identifies for every patient the therapy with the best expected TAE using a voting mechanism. We evaluate its performance by measuring prescription effectiveness and robustness under alternative ground truths. We show that our methodology improves the expected TAE upon the current baseline by 24.11%, increasing it from 4.56 to 5.66 years. The algorithm performs particularly well for the male (24.3% improvement) and Hispanic (58.41% improvement) subpopulations. Finally, we create an interactive interface, providing physicians with an intuitive, accurate, readily implementable, and effective tool.
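A minimal sketch of a prescriptive voting scheme of this kind is shown below: several regression families each predict TAE under every candidate therapy, and the therapy most often ranked best is recommended. The synthetic data, the two model families, and the treatment names are illustrative assumptions, not the ML4CAD pipeline.

```python
# Hedged sketch of counterfactual regression plus voting: each model family
# predicts the time-to-adverse-event (TAE) under every candidate therapy,
# and the therapy most often ranked best wins. Data and models are toy.
import numpy as np
from collections import Counter
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
n, treatments = 600, ["medical", "PCI", "CABG"]          # illustrative options
X = rng.normal(size=(n, 5))                              # patient covariates
t = rng.integers(len(treatments), size=n)                # observed therapy
y = 4 + X[:, 0] + 0.5 * t + rng.normal(scale=0.5, size=n)  # observed TAE (years)

families = {"ridge": Ridge(), "forest": RandomForestRegressor(random_state=0)}
models = {(name, k): clone(proto).fit(X[t == k], y[t == k])
          for name, proto in families.items() for k in range(len(treatments))}

def prescribe(x):
    """Each model family votes for the therapy with the best predicted TAE."""
    votes = [int(np.argmax([models[(name, k)].predict(x[None, :])[0]
                            for k in range(len(treatments))]))
             for name in families]
    return treatments[Counter(votes).most_common(1)[0][0]]

print(prescribe(X[0]))
```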

Abstract. In the midst of the COVID-19 pandemic, healthcare providers and policy makers are wrestling with unprecedented challenges. How to treat COVID-19 patients with equipment shortages? How to allocate resources to combat the disease? How to plan for the next stages of the pandemic? We present a data-driven approach to tackle these challenges. We gather comprehensive data from various sources, including clinical studies, electronic medical records, and census reports. We develop algorithms to understand the disease, predict its mortality, forecast its spread, inform social distancing policies, and re-distribute critical equipment. These algorithms provide decision support tools that have been deployed on our publicly available website, and are actively used by hospitals, companies, and policy makers around the globe.
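For the forecasting component, a minimal discrete-time SEIR update of the kind commonly used in compartmental epidemic models is sketched below. The parameter values are illustrative, and this is not the specific model developed in this work.

```python
# Hedged sketch: a minimal discrete-time SEIR update standing in for the
# kind of epidemic forecasting mentioned above. Parameter values are toy.
import numpy as np

def seir_forecast(days=120, N=1_000_000, beta=0.35, sigma=1/5, gamma=1/10, I0=100):
    S, E, I, R = N - I0, 0.0, float(I0), 0.0
    history = []
    for _ in range(days):
        new_exposed    = beta * S * I / N    # transmission
        new_infectious = sigma * E           # incubation completed
        new_recovered  = gamma * I           # recovery / removal
        S -= new_exposed
        E += new_exposed - new_infectious
        I += new_infectious - new_recovered
        R += new_recovered
        history.append(I)
    return np.array(history)

infectious = seir_forecast()
print(f"peak infectious: {infectious.max():.0f} on day {infectious.argmax()}")
```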

Abstract. The COVID-19 pandemic has prompted an international effort to develop and repurpose medications and procedures to effectively combat the disease. Several groups have focused on the potential treatment utility of angiotensin-converting enzyme inhibitors (ACEIs) and angiotensin-receptor blockers (ARBs) for hypertensive COVID-19 patients, with inconclusive evidence thus far. We couple electronic medical record (EMR) and registry data of 3,643 patients from Spain, Italy, Germany, Ecuador, and the US with a machine learning framework to personalize the prescription of ACEIs and ARBs to hypertensive COVID-19 patients. Our approach leverages clinical and demographic information to identify hospitalized individuals whose probability of mortality or morbidity can decrease by prescribing this class of drugs. In particular, the algorithm proposes increasing ACEI/ARB prescriptions for patients with cardiovascular disease and decreasing prescriptions for those with low oxygen saturation at admission. We show that personalized recommendations can improve patient outcomes by 1.0% compared to the standard of care when applied to external populations. We develop an interactive interface for our algorithm, providing physicians with an actionable tool to easily assess treatment alternatives and inform clinical decisions. This work offers the first personalized recommendation system to accurately evaluate the efficacy and risks of prescribing ACEIs and ARBs to hypertensive COVID-19 patients.
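A simplified sketch of risk-based treatment personalization is shown below: one outcome model is fit for treated and one for untreated patients, and the option with the lower predicted adverse-outcome probability is recommended. The synthetic data and gradient-boosting models are placeholders, and the sketch omits the matching and validation steps of the actual framework.

```python
# Hedged sketch of risk-based treatment personalization: fit one outcome
# model for treated and one for untreated patients, then recommend the
# option with the lower predicted probability of an adverse outcome.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(2)
n = 1000
X = rng.normal(size=(n, 4))                 # toy covariates (e.g. age, SpO2)
treated = rng.integers(2, size=n)           # 1 = received ACEI/ARB
# Toy outcome: treatment helps when feature 0 is high (illustrative only).
risk = 1 / (1 + np.exp(-(X[:, 1] - treated * X[:, 0])))
y = rng.binomial(1, risk)                   # 1 = adverse outcome

m_treated   = GradientBoostingClassifier().fit(X[treated == 1], y[treated == 1])
m_untreated = GradientBoostingClassifier().fit(X[treated == 0], y[treated == 0])

def recommend(x):
    p1 = m_treated.predict_proba(x[None, :])[0, 1]
    p0 = m_untreated.predict_proba(x[None, :])[0, 1]
    return ("prescribe ACEI/ARB" if p1 < p0 else "withhold ACEI/ARB", p1, p0)

print(recommend(X[0]))
```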

Abstract. Due to its prevalence and association with cardiovascular diseases and premature death, hypertension is a major public health challenge. Proper prevention and management measures are needed to effectively reduce the pervasiveness of the condition. Current clinical guidelines for hypertension provide physicians with general suggestions for first-line pharmacologic treatment, but do not take patient-specific characteristics into account. In this study, longitudinal Electronic Health Record (EHR) data are utilized to determine the optimal antihypertensive treatment for a patient using his or her individual characteristics and clinical condition. Given the observational nature of the data, we address potential confounding through generalized propensity score evaluation and optimal matching. We use multiple machine learning algorithms to estimate counterfactual predictions for a patient under each treatment option and then apply a voting mechanism among the different models to recommend a treatment based on the best expected outcome. We report results on both the unmatched version of the dataset and the matched dataset. We obtain final out-of-sample R^2 values of 0.60 [95% CI, 0.56-0.64] and 0.55 [95% CI, 0.52-0.59] on the unmatched and matched data, respectively. The final R^2 metric is based on instances for which the treatment suggested by the algorithm matches the patient’s actual treatment, thereby allowing us to know the ground truth outcome for comparison. For patients for whom the algorithm recommendation differs from the standard of care, we demonstrate an approximate 15% decrease in next blood pressure based on the predicted outcome under the recommended treatment. Additionally, we develop an interactive dashboard to be used by physicians as a clinical support tool.
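The sketch below illustrates propensity-score matching for a binary treatment, a simplified stand-in for the generalized propensity scores and optimal matching used in the study. The synthetic data, the logistic propensity model, and the greedy nearest-neighbor pairing are assumptions made for brevity.

```python
# Hedged sketch: propensity-score matching for a binary treatment as a
# simplified stand-in for generalized propensity scores and optimal matching.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 500
X = rng.normal(size=(n, 3))                           # confounders
treat = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))   # treatment depends on X

ps = LogisticRegression().fit(X, treat).predict_proba(X)[:, 1]  # propensity

# Greedy 1:1 nearest-neighbor matching on the propensity score.
treated_idx = np.where(treat == 1)[0]
control_idx = list(np.where(treat == 0)[0])
pairs = []
for i in treated_idx:
    j = min(control_idx, key=lambda c: abs(ps[c] - ps[i]))
    pairs.append((i, j))
    control_idx.remove(j)
    if not control_idx:
        break

print(f"matched {len(pairs)} treated/control pairs")
```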

Predictive Analytics for Healthcare Data

Abstract. Current stroke risk assessment tools presume the impact of risk factors is linear and cumulative. However, both novel risk factors and their interplay influencing stroke incidence are difficult to reveal using traditional additive models. The goal of this study was to improve upon the established Revised Framingham Stroke Risk Score and design an interactive Non-Linear Stroke Risk Score. Leveraging machine learning algorithms, our work aimed at increasing the accuracy of event prediction and uncovering new relationships in an interpretable fashion. A two-phase approach was used to create our stroke risk prediction score. First, clinical examinations of the Framingham offspring cohort were utilized as the training dataset for the predictive model. Optimal Classification Trees were used to develop a tree-based model to predict 10-year risk of stroke. Unlike classical methods, this algorithm adaptively changes the splits on the independent variables, introducing non-linear interactions among them. Second, the model was validated with a multi-ethnicity cohort from the Boston Medical Center. Our stroke risk score suggests a key dichotomy between patients with history of cardiovascular disease and the rest of the population. While it agrees with known findings, it also identified 23 unique stroke risk profiles and highlighted new non-linear relationships, such as the role of T-wave abnormality on electrocardiography and hematocrit levels in a patient’s risk profile. Our results suggest that the non-linear approach significantly improves upon the baseline in the c-statistic (training 87.43% (CI 0.85–0.90) vs. 73.74% (CI 0.70–0.76); validation 75.29% (CI 0.74–0.76) vs. 65.93% (CI 0.64–0.67)), even in multi-ethnicity populations. The clinical implications of the new risk score include prioritization of risk factor modification and personalized care at the patient level with improved targeting of interventions for stroke prevention.
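As a toy analogue of this two-phase approach, the sketch below fits a shallow classification tree (an ordinary CART tree standing in for Optimal Classification Trees) and reports the c-statistic on held-out data. The synthetic data and depth limit are illustrative assumptions.

```python
# Hedged sketch: a shallow classification tree as a stand-in for Optimal
# Classification Trees, evaluated with the c-statistic (AUC) on toy data.
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=2000, n_features=8, n_informative=4,
                           weights=[0.9, 0.1], random_state=0)  # rare outcome
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=50, random_state=0)
tree.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1])
print(f"c-statistic: {auc:.3f}")
print(export_text(tree))   # each leaf corresponds to a readable risk profile
```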

Abstract. Accurate, automated extraction of clinical stroke information from unstructured text has several important applications. ICD-9/10 codes can misclassify ischemic stroke events and do not distinguish acuity or location. Expeditious, accurate data extraction could provide considerable improvement in identifying stroke in large datasets, triaging critical clinical reports, and quality improvement efforts. In this study, we developed and report a comprehensive framework studying the performance of simple and complex stroke-specific Natural Language Processing (NLP) and Machine Learning (ML) methods to determine presence, location, and acuity of ischemic stroke from radiographic text. We collected 60,564 Computed Tomography and Magnetic Resonance Imaging radiology reports from 17,864 patients from two large academic medical centers. We used standard techniques to featurize unstructured text and developed neurovascular-specific GloVe word embeddings. We trained various binary classification algorithms to identify stroke presence, location, and acuity using 75% of 1,359 expert-labeled reports. We validated our methods internally on the remaining 25% of reports and externally on 500 radiology reports from an entirely separate academic institution. In our internal population, GloVe word embeddings paired with deep learning (Recurrent Neural Networks) had the best discrimination of all methods for our three tasks (AUCs of 0.96, 0.98, 0.93, respectively). Simpler NLP approaches (Bag of Words, BOW) performed best with interpretable algorithms (Logistic Regression) for identifying ischemic stroke (AUC of 0.95), MCA location (AUC 0.96), and acuity (AUC of 0.90). Similarly, GloVe and Recurrent Neural Networks (AUC 0.92, 0.89, 0.93) generalized better in our external test set than BOW and Logistic Regression for stroke presence, location, and acuity, respectively (AUC 0.89, 0.86, 0.80). Our study demonstrates a comprehensive assessment of NLP techniques for unstructured radiographic text. Our findings suggest that NLP/ML methods can be used to discriminate stroke features from large data cohorts for both clinical and research-related investigations.
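The simpler end of the spectrum described above can be sketched as a bag-of-words featurization followed by logistic regression. The tiny example "reports" below are invented placeholders, not text from the study cohorts.

```python
# Hedged sketch of the simpler pipeline described above: bag-of-words
# featurization of report text plus logistic regression. The toy "reports"
# are invented placeholders, not data from the study.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reports = [
    "acute infarct in the left MCA territory",
    "no evidence of acute intracranial hemorrhage or infarct",
    "chronic lacunar infarct, no acute findings",
    "restricted diffusion consistent with acute ischemic stroke",
]
labels = [1, 0, 0, 1]   # 1 = acute ischemic stroke present

clf = make_pipeline(CountVectorizer(ngram_range=(1, 2), stop_words="english"),
                    LogisticRegression(max_iter=1000))
clf.fit(reports, labels)
print(clf.predict_proba(["findings suggestive of acute mca infarct"])[0, 1])
```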

Abstract. Background: Timely identification of COVID-19 patients at high risk of mortality can significantly improve patient management and resource allocation within hospitals. This study seeks to develop and validate a data-driven personalized mortality risk calculator for hospitalized COVID-19 patients. Methods: De-identified data were obtained for 3,927 COVID-19 positive patients from six independent centers, comprising 33 different hospitals. Demographic, clinical, and laboratory variables were collected at hospital admission. The COVID-19 Mortality Risk (CMR) tool was developed using the XGBoost algorithm to predict mortality. Its discrimination performance was subsequently evaluated on three validation cohorts. Findings: The derivation cohort of 3,062 patients had an observed mortality rate of 26.84%. Increased age, decreased oxygen saturation (≤ 93%), elevated levels of C-reactive protein (≥ 130 mg/L), blood urea nitrogen (≥ 18 mg/dL), and blood creatinine (≥ 1.2 mg/dL) were identified as primary risk factors, validating clinical findings. The model obtains an out-of-sample AUC of 0.90 (95% CI, 0.87-0.94) on the derivation cohort. In the validation cohorts, the model obtains AUCs of 0.92 (95% CI, 0.88-0.95) on Seville patients, 0.87 (95% CI, 0.84-0.91) on Hellenic COVID-19 Study Group patients, and 0.81 (95% CI, 0.76-0.85) on Hartford Hospital patients. The CMR tool is available as an online application at covidanalytics.io/mortality_calculator and is currently in clinical use. Interpretation: The CMR model leverages machine learning to generate accurate mortality predictions using commonly available clinical features. This is the first risk score trained and validated on a cohort of COVID-19 patients from Europe and the United States.
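A minimal sketch of the modeling approach, assuming synthetic admission-level features in place of the study data, is shown below: a gradient-boosted classifier is trained on age, oxygen saturation, and laboratory values and evaluated by AUC. It is not the CMR model itself.

```python
# Hedged sketch: a gradient-boosted classifier on admission features with
# AUC evaluation, mirroring the modeling approach described above. The
# synthetic data and feature order are placeholders, not the CMR model.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n = 3000
X = np.column_stack([
    rng.uniform(20, 95, n),      # age
    rng.uniform(80, 100, n),     # oxygen saturation (%)
    rng.exponential(60, n),      # C-reactive protein (mg/L)
    rng.exponential(20, n),      # blood urea nitrogen (mg/dL)
    rng.exponential(1.0, n),     # creatinine (mg/dL)
])
logit = 0.05 * (X[:, 0] - 60) - 0.15 * (X[:, 1] - 93) + 0.01 * X[:, 2]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))          # toy mortality label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1,
                      eval_metric="logloss")
model.fit(X_tr, y_tr)
print(f"AUC: {roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]):.3f}")
```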

Abstract. Background: Current Society of Thoracic Surgery (STS) risk models for predicting outcomes of mitral valve surgery (MVS) assume a linear and cumulative impact of risk factors. We evaluated post-operative MVS outcomes and designed mortality and morbidity risk calculators using machine learning algorithms. Methods: Data from the STS Adult Cardiac Surgery Database for MVS from 2008 to 2017 were used. The data included 383,550 procedures with over 300 risk factors, including demographic and preoperative variables. Logistic Regression (Log.Reg), Random Forest (RF), Optimal Classification Trees (OCT), and eXtreme Gradient Boosting (XGBoost) were employed to train models predicting postoperative outcomes for MVS patients. Each model’s discrimination and calibration performance were validated using unseen data against the STS risk score. Results: Comprehensive mortality and morbidity risk assessment scores were derived from a training set of 287,662 observations. The AUC for the mortality task ranged from 0.77 to 0.83, corresponding to a 3% increase in predictive accuracy compared to the STS score. Log.Reg and XGBoost achieved the highest AUCs for predicting prolonged ventilation (0.82) and deep sternal wound infection (0.78 and 0.77), respectively. XGBoost performed best, with an AUC of 0.815, for the renal failure task. For the prediction of permanent stroke, all models performed similarly, with an AUC around 0.67. The models for mortality, prolonged ventilation, and renal failure had improved calibration performance compared to the STS calculator. Conclusions: The proposed risk models could help health care providers as well as patients more accurately assess a patient’s risk of morbidity and mortality when undergoing MVS.
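The calibration assessment mentioned above can be illustrated with scikit-learn's calibration_curve, which bins predicted probabilities and compares them with observed event rates. The synthetic data and logistic model below are placeholders for the actual risk models.

```python
# Hedged sketch of a calibration check: predicted probabilities are binned
# and compared with observed event rates. Synthetic data, illustrative model.
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(5000, 6))
y = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0] + 0.5 * X[:, 1])))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
probs = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

obs_rate, pred_mean = calibration_curve(y_te, probs, n_bins=10)
for p, o in zip(pred_mean, obs_rate):
    print(f"predicted {p:.2f}  observed {o:.2f}")   # well calibrated if close
```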

Algorithmic Insurance

Abstract. As machine learning algorithms become integrated into the decision-making processes of companies and organizations, insurance products will be developed to protect their owners from risk. We introduce the concept of algorithmic insurance and present a quantitative framework to enable the pricing of the derived insurance contracts. We propose an optimization formulation to estimate the risk exposure and price for a binary classification model. Our approach outlines how properties of the model, such as accuracy, interpretability, and generalizability, can influence the insurance contract evaluation. To showcase a practical implementation of the proposed framework, we present a case study of medical malpractice in the context of breast cancer detection. Our analysis focuses on measuring the effect of the model parameters on the expected financial loss and identifying the aspects of algorithmic performance that predominantly affect the price of the contract.
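A back-of-the-envelope sketch of how a classifier's error rates might translate into an expected loss and a premium is given below. The prevalence, costs, and loading factor are illustrative assumptions, and the calculation is far simpler than the optimization formulation proposed in the paper.

```python
# Hedged sketch: translating a classifier's error rates into an expected
# loss and a premium with a loading factor. All figures are illustrative
# assumptions, not values from the paper's case study.
prevalence  = 0.10          # probability a screened case is positive
sensitivity = 0.92          # P(model flags case | positive)
specificity = 0.88          # P(model clears case | negative)
cost_fn     = 500_000.0     # liability from a missed diagnosis (false negative)
cost_fp     = 20_000.0      # cost of an unnecessary work-up (false positive)
n_cases     = 10_000        # cases covered per contract period
loading     = 0.25          # insurer's margin over expected loss

p_fn = prevalence * (1 - sensitivity)          # false-negative probability
p_fp = (1 - prevalence) * (1 - specificity)    # false-positive probability
expected_loss = n_cases * (p_fn * cost_fn + p_fp * cost_fp)
premium = expected_loss * (1 + loading)

print(f"expected loss: ${expected_loss:,.0f}")
print(f"premium:       ${premium:,.0f}")
```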

In Preparation