Improved prediction and risk stratification of major adverse cardiovascular events using an explainable machine learning approach combining plasma biomarkers and traditional risk factors

Study designs and populations

The UK Biobank is an ongoing population-based prospective cohort study that features extensive phenotypic data and biomarkers measure. Between 2006 and 2010, more than 500,000 participants aged 40–69 years from the general population were recruited at 22 assessment centers across the United Kingdom. Sociodemographic information, lifestyle and environmental factors, medications information and family history were collected through touchscreen questionnaire or verbal interview at recruitment. Pre-existing health and medical conditions were obtained from the baseline questionnaire and interview, as well as hospital records and death registries. Anthropometric measurements and blood sample collection and laboratory measures were taken at recruitment. The protocol for handling and storing blood samples has been detailed in prior publications [3, 6, 18, 19]. For the purpose of our analyses, we excluded individuals with any type of MACE or entirely missing biochemistry marker data at baseline from those randomly selected for NMR-based metabolomic profiling, resulting in an analytic cohort of 229,352 participants.

Conventional predictor and biomarker set

The overview of the study workflow, as well as clinical predictor sets and biomarker sets were presented in Fig. 1. We investigated three sets of conventional clinical predictor variables: Age + Sex, ASCVD and PANEL. Cardiovascular predictors set (ASCVD) were selected according to the cardiovascular risk scores for primary prevention recommended by the ESC and AHA, including the ESC SCORE2 and the AHA-ASCVD score [4, 5]. The PANEL predictor set included a comprehensive range of demographics, lifestyle, physical measurements, medical history, medications and the CVD family history.

Fig. 1figure 1

Study overview. ASCVD: Age, Sex, Smoking, T2DM, SBP, TC, HDL-C; PANEL: Age, Sex, Education, Income, Ethnicity, BMI, Smoking status, Alcohol intake, Physical activity, Fruit consumption, Vegetable consumption, Sleep duration, Waist circumference, Hip circumference, CVD family history, SBP, DBP, T2DM, Hypertension, Blood pressure medication, Insulin, Cholesterol lowering medication. NMR nuclear magnetic resonance, MACE major adverse cardiovascular events, BMI body mass index, SHAP SHapley Additive exPlanations, BRI biomarker risk index, BRS biomarker risk score, SD standard deviation, CVD cardiovascular disease, TC total cholesterol, HDL-C HDL cholesterol, SBP systolic blood pressure, DBP diastolic blood pressure, T2DM type 2 diabetes mellitus, Bioche biochemical markers, Met metabolic markers, Nonov Met high-throughput nuclear magnetic resonance (NMR) metabolic markers, excluding those that overlap with clinical biochemical markers, AUC receiver operating characteristic curve, JMIM minimal joint mutual information maximization, Cor0.95 Spearman correlation coefficients less than 0.95, Bio a total of 26 biochemical markers combined with Cor0.95 of Nonov Met

A list of all clinical predictors applied in this study is presented in Supplementary Table 1 and a list of all metabolomic predictors in Supplementary Table 2. A total of 30 conventional clinical biochemistry markers were measured from blood samples collected at recruitment. After excluding variables with over 20% missing values, substantial sex differences, or substantial variations based on fasting duration, 26 biochemical markers were included in the analysis. Furthermore, a random subset of 274,236 participants was characterized using a high-throughput NMR metabolomics platform developed by Nightingale Health Ltd. The NMR assay covers 170 original biomarkers spanning multiple metabolic pathways including amino acids, fatty acids, glycolysis metabolites, ketone bodies, inflammation markers as well as cholesterol subtypes and lipoprotein lipids in 14 subclasses [3, 12]. For consistency, the same exclusion criteria applied to the conventional clinical biochemistry markers were also applied to the NMR metabolomic biomarkers. Specifically, missing values for any metabolite did not exceed 20%. For the clinical biochemistry markers, only testosterone was excluded due to substantial sex differences (Supplementary Table 3), which led to the information provided by this marker in the SHAP dependence plot being primarily attributed to sex differences (Supplementary Fig. 1). No similar issues were observed with the NMR metabolomic biomarkers.

MACE outcome definition

The new-onset MACE was defined as a composite of the following events: incident myocardial infarction, incident stroke classified as either ischemic or hemorrhagic, death from CVD, and the occurrence of unstable angina, representing a range of serious cardiovascular outcomes that may significantly impact patient prognosis [9, 20]. We investigated the composite endpoint defined as MACE, along with each of its components, derived from hospital episode statistics and national death registries information. Incident and mortality outcomes were determined based on the first documented occurrence of relevant events, identified using the International Classification of Diseases, 10th revision (ICD-10) or ICD-9 codes [13, 21]. The follow-up of hospitalizations ended on October 31, 2022 in England, August 31, 2022 in Scotland, and May 31, 2022 in Wales. The follow-up for the death registry ended on November 30, 2022. Follow-up visits lasted from the initial attendance at assessment center to the earliest date of relevant outcome diagnosis, death, or the latest hospital inpatient data, whichever occurred first.

Statistical analysisData imputation and baseline characteristics summary

For the potential conventional risk factor, we assumed that the missing data occurred at random and conducted multiple imputations using chained equations with random forests [22, 23]. This approach generated five datasets, from which one was randomly selected for subsequent analysis. Missing values in any covariate were less than 20%. To minimize the impact of outliers for the original metabolites, plasma biomarker values exceeding four interquartile ranges from the median were replaced with their corresponding boundary values [12]. Missing values for any metabolite were below 5% and were imputed using a random forest algorithm. The baseline characteristics of the overall population and specific datasets (incident MACE, stroke, death from CVD, myocardial infarction, unstable angina) were presented as median (interquartile range [IQR]) for continuous variables, and number (percentage [%]) for categorical variables.

Biomarkers selection and model predictive performance

To preliminarily identify biomarkers for enhancing prediction and risk stratification of new-onset MACE, we considered several sets of biomarkers, each incorporating different combinations of clinical biochemical and metabolic markers. These sets included: (1) all selected conventional clinical biochemical markers (All Bioche), (2) all measured original NMR metabolic markers (All Met), (3) all selected biochemical markers combined with the top 50 non-overlapping metabolic markers selected based on the area under the receiver operating characteristic curve (AUC) (All Bioche + AUCTop50 of Nonov Met), (4) all selected biochemical markers combined with the top 50 non-overlapping metabolic markers selected based on minimal joint mutual information maximization (JMIM) values (All Bioche + JMIMTop50 of Nonov Met), and (5) all selected biochemical markers combined with non-overlapping metabolic markers having Spearman correlation coefficients less than 0.95 (All Bioche + Cor0.95 of Nonov Met) [24]. The naming of the biomarker sets and their contents is further illustrated in Fig. 1C.

These sets were incorporated into Cox proportional hazards model, and Harrell’s C-index along with 95% confidence interval (CI) was employed to assess model’s discrimination performance. The 95% CI was calculated using the bootstrap method, with resampling performed 1000 times [25]. Non-overlapping metabolic markers were selected using the Filter method from the mlr3 package, which serves as a preprocessing step prior to model training. For the AUCTop50 of Nonov Met set, we applied the flt(“auc”) method to separately calculate the AUC for each non-overlapping metabolite in predicting the target variable. The top 50 metabolites with the highest AUC values were then selected. Similarly, for the JMIMTop50 of Nonov Met, we used the flt(“jmim”) method to separately compute JMIM score for each feature, which indicates the degree of information shared between independent variable and the target variable. The top 50 features with the highest JMIM score were selected. In the Cor0.95 of Nonov Met subset, we performed a correlation-based variable selection approach to identify and eliminate highly correlated predictors. We computed the Spearman correlation matrix among all non-overlapping metabolites and applied a threshold of 0.95 to identify variable pairs with high correlation. Variables with correlation coefficients of 0.95 or higher were considered redundant. To improve computational efficiency, we used an exact method for correlation matrices with fewer than 100 variables and an approximate method for larger matrices. This strategy effectively reduced multicollinearity, ensuring the robustness of our subsequent analyses. Ultimately, 48 metabolites with Spearman correlation coefficients less than 0.95 were retained. Taking into account both performance and complexity of established model, we selected combined “All Bioche + Cor0.95 of Nonov Met” set as biomarkers for the risk stratification of new-onset MACE and each of its individual components. Subsequently, we modeled disease risk for each endpoint using Cox proportional hazards models based on the previously defined conventional clinical predictor sets: Age + Sex, ASCVD, and PANEL. For all sets, the performance of the Cox proportional hazards models was benchmarked against those based on the combinations of these conventional sets with the selected biomarkers sets, which including 26 biochemistry and 48 metabolic markers, making us to assess whether the selected biomarkers could improve the predictive performance of the models [3].

Feature importance and BRI extraction

The selected biomarkers, along with sex, age, and body mass index (BMI), were incorporated as features to develop CatBoost classifiers for predicting incident MACE and each of its components. While the relationship between BMI and vascular health can be complex and potentially non-linear, given the well-established association between BMI and cardiovascular outcomes [21, 26,27,28,29,30], as well as its role as a key indicator of body composition, and its correlation with various metabolic factors [31, 32], we believe BMI should be considered a critical variable in cardiovascular risk prediction. CatBoost is an unbiased boosting algorithm with support for categorical features. It processes data using ordered boosting and permutation-based statistics, which helps minimize prediction shift, reduces overfitting, and improves model generalization. Additionally, CatBoost’s optimized GPU implementation enables faster training on large datasets, making it scalable for large-scale tasks. Empirical results have demonstrated that CatBoost consistently outperforms other gradient boosting models, such as XGBoost and LightGBM, both in terms of accuracy and efficiency. These advantages, combined with its ability to provide interpretable results through SHAP values, make CatBoost a reliable and efficient choice for practical machine learning applications, especially in fields requiring high transparency and accuracy [33,34,35].

For each endpoint, the data was randomly split into training (70%) and validation (30%) sets. Hyperparameter tuning and model fitting were conducted on the training set, while the validation set was used to evaluate model performance and determine optimal hyperparameters. The final model was then fitted on the entire dataset using the identified the optimal hyperparameters, and the SHAP value [17, 33, 36] were calculated to interpret the output of CatBoost models [33,34,35]. SHAP values fairly distribute the “contribution” of each feature to the model's prediction, quantifying how much each feature increases or decreases a specific prediction compared to the average across the dataset. Feature importance was visualized using the SHAP summary plot, which aggregates SHAP values to identify the most influential predictors in the model. The BRI is derived from the probabilities predicted by the CatBoost classifier for each endpoint. Specifically, the CatBoost classifier outputs the risk probabilities for the target variable, where each individual’s risk is expressed as a probability ranging between 0 and 1. These probabilities represent the likelihood that an individual will experience the endpoint of interest. For each endpoint, we categorized individuals into five risk groups based on the quintiles of the BRI and plotted the cumulative risk for individuals in the first, third, and fifth quintile groups.

Notably, the CatBoost model we employed utilizes gradient boosting decision trees, which inherently capture complex relationships between variables through the splitting mechanism. This is particularly useful for handing high-dimensional or nonlinear data. As a result, even with a nonlinear relationship between continuous variables and the target variable, CatBoost can automatically learn these patterns without the need for additional feature engineering [17]. To explore potential non-linear relationships between BMI and new-onset MACE, as well as its components, we conducted sensitivity analyses by fitting restricted cubic splines to a multivariable-adjusted Cox proportional hazards model (RCS-Cox). We evaluated the Akaike Information Criterion (AIC) for models with 3–7 knots and selected the model with 3 knots (at the 10th, 50th, and 90th percentiles) based on the principle of minimizing AIC.

SHAP-based feature interpretation, BRS and risk assessment

We first identified several meaningful variables that were consistently ranked in the top 50 of importance for predicting different types of MACE outcomes using CatBoost classifiers and SHAP analysis. The multivariable Cox proportional hazards model, adjusted for age, sex, and BMI, was then applied to estimate the risk of new-onset MACE and its components for each one standard deviation (SD) change in these variables. Next, SHAP dependence plots were generated to visualize the importance of features, along with their approximate optimal binary thresholds and the direction associated with MACE and each of its components. The X-axis represents the original values of each given feature, while the Y-axis shows the corresponding SHAP value. A SHAP value greater than zero indicates that the feature contributes to a higher probability prediction (e.g., predicting ‘1’); while the SHAP value less than zero drives the prediction toward a lower probability (e.g., predicting'0'). That is to say a SHAP value above zero suggests that the feature may increase the risk of new-onset MACE. Based on whether the SHAP value is greater than or equal to zero, we categorize the feature as a binary variable. Each feature was then binarized based on its estimated optimal threshold. We further applied the Cox proportional hazards model, adjusted for sex, age, and BMI, to evaluate the association between the top 20 features (categorized as greater than or equal to, or less than, their respective optimal thresholds) and the risk of incident MACE and its components, compared to values below the threshold.

Next, we calculated the BRS for MACE and its components for each individual, using the top 20 features and their respective optimal binary thresholds identified through SHAP analysis [16, 37]. The formula for BRS is as follows:

where \(}_}\) represents the BRS for individual i. \(}_}}} = \left\l} \hfill & } \;\;X_ < 0} \hfill \\ \hfill & }\;\;X_ \ge 0 } \hfill \\ \end } \right.\), which denotes the binary risk score assigned to the j-th feature for the i-th individual. The binary risk score for each feature is assigned based on the association with MACE or its components, as identified through CatBoost classifiers and SHAP analysis. The top 20 features identified are used for calculating the BRS. The Xshap,ij is the estimated SHAP value for the j-th feature of the i-th individual.

For a participant, if a feature exhibits a positive association or a trend toward a positive relationship with the target variable (MACE and each of its components), a score of 1 is assigned when the feature's original value is greater than or equal to the binary cutoff value defined in the SHAP dependent plot (in this case, the feature's estimated SHAP value is located in the upper-right portion of the SHAP dependence plot, greater than or equal to 0); otherwise, the score is set to 0. Conversely, if a feature demonstrates a negative association or a trend toward a negative relationship, a score of 1 is assigned when the original feature value is less than the cutoff value (in this case, the feature's estimated SHAP value is located in the upper-left portion of the SHAP dependence plot, greater than or equal to 0); the score is 0 otherwise.

For example, since age is positively associated with the risk of incident MACE, individuals aged 60 years or older (if 60 is the binary cutoff value for age) would be assigned a score of 1, while those younger than 60 years would receive a score of 0. In contrast, if insulin-like growth factor 1 (IGF-1) exhibits a negative association with incident MACE, individuals with IGF-1 levels greater than or equal to the cutoff value would be assigned a score of 0, while those with levels below the cutoff value would receive a score of 1. To illustrate this, suppose an individual has the following binary scores for the top 20 features, feature1 = 1, feature2 = 0, feature3 = 1, feature4 = 1, ⋯feature20 = 0, the BRS for this individual would be:

$$\begin } &= \left( } \right) \, + \, 0\left( } \right) \, + \left( } \right) \,\\& + \left( } \right) \, + \cdots + \, 0\left( }0} \right). \end$$

Study participants were then classified into low-, moderate-, and high-risk groups according to the tertiles of the BRS for further analysis. We calculated both the incidence proportion and incidence density (ID), which represent the absolute risk [38,39,40], for MACE and its components across low-, medium-, and high-risk groups based on the BRS. The number of new-onset cases and the incidence proportion, also referred to as cumulative incidence, were presented as No. of cases (%), calculated using the following formula:

$$ \begin & Incidence\; proportion \\ & \quad = \frac} \\ \end $$

The ID, also referred to as incidence rate, was calculated using the formula:

$$ \begin & Incidence\; density \\ & \quad = \frac} \\ \end $$

Total person-time was defined as the sum of the time periods of observation each person. A multivariable-adjusted Cox proportional hazards model was used to estimate hazard ratios (HR) and 95% CI for incident MACE and each of its components in the moderate- and high-risk groups, with the low-risk group serving as the reference. The model was adjusted for a comprehensive set of covariates, including demographic factors (age, sex, ethnicity, education, and the average household income), lifestyle factors (smoking status, alcohol consumption frequency, weekly minutes of moderate physical activity, total fruit intake, and total vegetable consumption), and health condition and medication history (systolic blood pressure [SBP], diastolic blood pressure [DBP], antihypertensive medication use, insulin use, cholesterol-lowering medication use, the family history of CVD, and the presence of type 2 diabetes mellitus [T2DM] and hypertension at baseline).

Additionally, SHAP dependence plots enable the visualization of interaction effects between features. For each feature, the plot not only shows its individual contribution but also illustrates how its impact on the prediction changes with the values of other features. The color gradient typically represents the interaction with another key feature, such as sex, revealing how the relationship between a given feature and the endpoint varies depending on the values of other features [17]. Furthermore, stratified analyses were conducted by sex using a multivariable-adjusted Cox proportional hazards model. The interactions between stratification factors and each biomarker of interest with respect to the endpoint were assessed using the likelihood ratio test.

All statistical analyses were performed using R software version 4.3.1 (R Development Core Team, Vienna, Austria). All tests in our study were two-sided. P-values were adjusted for multiple comparisons using the Benjamini-Hochberg (BH) method to control the false discovery rate (FDR). The adjusted P-values are reported as P-FDR. A P-FDR of < 0.05 was considered statistically significant.

Comments (0)

No login
gif