Table 1 provides an overview of each step in our framework—pragmatic subgroup discovery. We also demonstrate an applied example of pragmatic subgroup discovery using registered data from the Look AHEAD (Action for Health in Diabetes) trial [17].
Table 1 The overview of pragmatic subgroup discovery

PDR framework of interpretability

The PDR framework defines the interpretability of machine learning models in terms of 3 metrics: (i) predictive accuracy, (ii) descriptive accuracy, and (iii) relevancy [14]. First, predictive accuracy refers to the degree to which the statistical model captures the relationships that researchers aim to understand [14]. Second, descriptive accuracy is the extent to which the interpretation of analysis results objectively reflects the relationships computed from statistical models [14]. Third, relevancy indicates that the insights derived from statistical models are described in a manner that the human audience/decision-makers can reasonably understand [14]. Thus, in the context of HTE analysis with enhanced interpretability, researchers expect (i) the statistical model to capture the underlying heterogeneities in the counterfactual framework (predictive accuracy), (ii) the interpretation method to faithfully summarize the trends of heterogeneity derived from the model (descriptive accuracy), and (iii) the discovered heterogeneities to be presented in a practically relevant format (relevancy).
Step 1: CATE estimation

The first step is the estimation of CATEs. The aim of this step is to let the chosen statistical model explore the complexity of the input data through the estimation of CATEs. Any modeling approach could be applied at this stage; for example, meta-learners, causal forest, and Bayesian causal forest are popular methods for CATE estimation [2, 3]. As each approach has its own strengths and weaknesses, researchers should judge which model is most suitable given the dataset and domain knowledge. Details and comparisons of meta-learners can be found elsewhere [18–20]. Most importantly, CATE estimation can be performed using covariates in any format. For example, researchers can retain a continuous covariate in the model as is, rather than categorizing it, to model a non-linear relationship. In such cases, the covariates might not be in a directly interpretable form. The key idea is to guide the chosen model to capture the underlying heterogeneity through a reasonable model fit, thereby achieving the first metric of interpretability—predictive accuracy. One could describe this model fitting process as a “model-driven” estimation of CATEs.
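As an illustration of this step, the following is a minimal sketch of CATE estimation with a causal forest via the grf package. The data frame df and the column names Y and W are hypothetical placeholders, not objects from our analysis.

```r
# Minimal sketch: CATE estimation with a causal forest (grf package).
# Assumes a data frame `df` with a numeric outcome `Y`, a binary
# treatment indicator `W`, and covariates in the remaining columns.
library(grf)

X <- as.matrix(df[, setdiff(names(df), c("Y", "W"))])

cf <- causal_forest(X = X, Y = df$Y, W = df$W,
                    num.trees = 2000,  # honesty is applied by default
                    seed = 1)

# Out-of-bag CATE estimate for each sample
tau_hat <- predict(cf)$predictions
```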
To assess potential overfitting and convergence failure in the statistical model, it is crucial to examine whether the model fits the input data successfully, although some algorithms inherently have features to minimize overfitting, such as splitting samples into training and test data (i.e., honesty), cross-fitting, and Bayesian regularization [2, 21, 22]. Here we introduce some techniques to assess calibration performance. First, the Qini curve compares the frequency of the outcome in the treatment arm with that in the control arm among samples estimated to have similar levels of CATE [23]. The x-axis sorts all samples from the lowest to the highest estimated CATE and shows the proportion of the samples (from 0 to 100%). The y-axis represents the cumulative difference in outcome occurrence between the treatment and control arms. With reasonable calibration of the CATE estimates, we anticipate the cumulative difference to be higher than in a scenario where samples are sorted at random. The Qini coefficient quantifies this difference between the Qini curve and the random ordering. Additionally, C-for-benefit is a statistic representing the probability of concordance between the predicted and observed benefit in samples [24]. A Qini coefficient > 0 and a C-for-benefit statistic > 0.5 each provide evidence of better-than-chance calibration [25]. Alternative approaches for testing calibration include best linear predictor analysis and the non-parametric rank-consistency test [6, 21, 22, 26].
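The sketch below computes a simplified Qini curve and coefficient from scratch; dedicated packages may differ in implementation details. It ranks samples from the highest to the lowest predicted benefit, a common convention in the uplift literature, and assumes the hypothetical inputs df and tau_hat from the previous sketch.

```r
# Simplified Qini curve sketch (binary outcome `y`, treatment `w`,
# estimated CATEs `tau_hat`; all names are illustrative).
qini_curve <- function(y, w, tau_hat) {
  ord <- order(tau_hat, decreasing = TRUE)  # largest predicted benefit first
  y <- y[ord]; w <- w[ord]
  n_t <- cumsum(w); n_c <- cumsum(1 - w)
  y_t <- cumsum(y * w); y_c <- cumsum(y * (1 - w))
  # Cumulative incremental outcomes, rescaling the control arm to the
  # treated sample size at each cutoff
  uplift <- y_t - y_c * ifelse(n_c > 0, n_t / n_c, 0)
  data.frame(frac = seq_along(y) / length(y), uplift = uplift)
}

qc <- qini_curve(df$Y, df$W, tau_hat)

# Qini coefficient: approximate area between the curve and the line
# expected under random ordering
qini_coef <- mean(qc$uplift - qc$frac * tail(qc$uplift, 1))

plot(qc$frac, qc$uplift, type = "l",
     xlab = "Proportion of samples targeted",
     ylab = "Cumulative incremental outcomes")
abline(a = 0, b = tail(qc$uplift, 1), lty = 2)  # random-ordering reference
```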
Step 2: subgroup discovery

The second step aims to identify subgroups characterized by a set of interpretable variables; we refer to these covariates as interpretable covariates. While the first step primarily focuses on fitting a statistical model to the input data to estimate the CATE, in this stage researchers synthesize interpretable covariates by selecting and defining covariates such that the identified subgroups are reasonably related to practical decision-making. For example, 3 age subgroups (e.g., < 40, 40–59, and ≥ 60) could be more helpful for clinicians considering treatment options for patients with diabetes than somewhat arbitrary age thresholds derived from a purely data-driven analysis of continuous age. In such a case, researchers might include pre-specified age subgroups among the interpretable covariates rather than using age on a continuous scale. The selection and modification of covariates occur in this step rather than the first because removing or reducing information during CATE estimation might prevent the statistical model from incorporating information that contributes to the underlying heterogeneity, limiting predictive accuracy. Of note, researchers might apply distinct classifications of variables for different study objectives even when using an identical dataset. Moreover, when covariates such as the propensity score for treatment are informative but difficult to translate into practical terms, one might exclude them from the list of interpretable covariates. The selection of interpretable covariates can be considered a reduction of data complexity guided by principled, subject-matter knowledge to ensure the last element of interpretability—relevancy.
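For instance, the age bands above could be constructed as follows; df and age are hypothetical names.

```r
# Illustrative construction of an interpretable covariate: replacing
# continuous age with pre-specified clinical bands.
df$age_group <- cut(df$age,
                    breaks = c(-Inf, 40, 60, Inf),
                    labels = c("<40", "40-59", ">=60"),
                    right = FALSE)  # intervals are [lower, upper)
```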
One simple approach to identifying subgroups with interpretable covariates is to fit classification and regression trees (CART) to the estimated CATEs [2, 27]. CART is an algorithm that divides data into subgroups by creating decision trees. At each split of the input data, CART recursively derives criteria that maximize the homogeneity of a targeted value [27]. In our discovery framework, CART divides samples into subgroups based on if-then rules, which resemble human decision-making processes, using the estimated CATEs and the chosen set of covariates [15]. The identified subgroups are defined by mutually exclusive sets of characteristics; for example, if gender and age were applied, CART might divide samples into (i) male & age < 40, (ii) male & age ≥ 40, (iii) female & age < 40, and (iv) female & age ≥ 40 [28]. Researchers can control the number of criteria used to characterize subgroups by setting the splitting depth in CART; a greater depth results in subgroups defined by more covariates. The direct use of CATE estimates from the chosen statistical model in CART helps to enhance the second metric of interpretability—descriptive accuracy. By sorting the CATEs with interpretable covariates, this step integrates the estimates derived from statistical modeling with practically meaningful classifications without limiting the roles of each. Alternatively, one can apply other tree models to synthesize subgroups [15, 16, 29–32].
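A minimal sketch of this step with the rpart package, assuming tau_hat from the earlier sketch and a hypothetical data frame interp_df holding the interpretable covariates:

```r
# Sketch: CART on the estimated CATEs using interpretable covariates only.
library(rpart)

fit <- rpart(tau_hat ~ .,
             data = cbind(tau_hat = tau_hat, interp_df),
             method = "anova",                      # regression tree on CATE
             control = rpart.control(maxdepth = 2,  # at most 2 if-then rules
                                     cp = 0.001))

print(fit)  # displays the if-then splitting rules

# Leaf membership defines mutually exclusive subgroups
subgroup <- factor(fit$where)
```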
After subgroup discovery using CART, one can assess whether the computed decision rules reasonably reflect the effect heterogeneity by estimating the ATE within each subgroup using separate regression models. Code examples are provided on GitHub (https://github.com/Toshi934/Interpretability/blob/main/Simulation.R).
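Continuing the sketch above, a simple version of this check regresses the outcome on the treatment within each leaf; with a binary outcome, the coefficient on W from a linear model is the risk difference (names remain hypothetical).

```r
# Sketch: subgroup-specific ATEs from separate regression models
subgroup_ate <- sapply(levels(subgroup), function(g) {
  idx <- subgroup == g
  coef(lm(Y ~ W, data = df[idx, ]))["W"]  # risk difference within leaf
})
subgroup_ate
```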
Example: pragmatic subgroup discovery in the Look AHEAD trial

In this section, we demonstrate the practical utility of the proposed two-step framework for pragmatic subgroup discovery using registered data from the Look AHEAD trial [17]. In brief, Look AHEAD is a randomized trial of individuals with diabetes comparing intensive lifestyle intervention with diabetes support and education [17]. After treatment assignment, participants were followed up for up to 13.5 years (median follow-up: 9.6 years) [17]. We obtained data on 4,901 individuals from the Look AHEAD trial through the National Institute of Diabetes and Digestive and Kidney Diseases Repository. Of these 4,901 individuals, we excluded 304 (6.2%) who were lost to follow-up within 7 years without developing the outcome, resulting in an analytic sample of 4,597. Supplementary Fig. 1 illustrates the sample selection flow.
Intervention

Individuals with type 2 diabetes were randomly assigned to either intensive lifestyle intervention (treatment) or diabetes support and education (control). The treatment group received an intervention focused on weight loss through dietary management and physical activity.
Outcome

The outcome of the present study was the trial’s primary outcome, defined as the first occurrence of death from cardiovascular causes, non-fatal myocardial infarction, non-fatal stroke, or hospitalization for angina. We assessed outcome events within 7 years after treatment assignment. In the analysis, the outcome event was reverse coded so that a higher CATE indicates a greater benefit of treatment.
Covariates

We used 47 covariates from the registered dataset. All covariates were used in CATE estimation. For subgroup discovery, we selected 42 interpretable covariates after categorizing and removing some of the variables used in CATE estimation [33–41]. Table 2 summarizes the covariates that were modified or removed for subgroup discovery. Of note, the manner of modification and removal might differ depending on the purpose and audience of the analysis. See Appendix 1 for the rationale behind each definition. Missing data were imputed using a random forest via the missRanger package [42].
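For reference, a random-forest imputation of this kind can be sketched as follows; the pmm.k and num.trees settings are illustrative, not the values used in our analysis.

```r
# Sketch: random-forest imputation with chained ranger models.
# pmm.k > 0 adds predictive mean matching so imputed values are
# drawn from observed ones.
library(missRanger)

df_imputed <- missRanger(df, pmm.k = 3, num.trees = 100, seed = 1)
```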
Table 2 Coding of interpretable covariates

Statistical analysis

First, we applied a machine learning approach called Bayesian causal forest (BCF) to estimate the CATEs of intensive lifestyle intervention on the reduction of the primary outcome on the risk difference (RD) scale [2]. Details of BCF can be found in Appendix 2. Using all 47 covariates, we grew 300 regression trees for each of the 2 BART functions constituting the BCF model. We used all samples to build the model to avoid potential variability in the results [43]. The model was trained with 300 iterations after 300 burn-in iterations. To account for potential selection bias due to attrition within the 7 years of follow-up, the BCF model was adjusted with inverse probability of censoring weighting (IPCW) conditional on the treatment and the 47 covariates [46]. To assess the calibration performance of the trained BCF model, we plotted the Qini curve and estimated the Qini coefficient and C-for-benefit.
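A minimal sketch of such a BCF fit with the bcf package is shown below. The outcome Y, treatment W, covariate matrix X, and weight vector ipcw are hypothetical, and passing weights via w assumes a bcf version (≥ 2.0) that supports observation weights.

```r
# Sketch: Bayesian causal forest with the bcf package.
library(bcf)

# Propensity score estimate required by bcf
pihat <- glm(W ~ X, family = binomial)$fitted.values

fit_bcf <- bcf(y = Y, z = W,
               x_control = X, x_moderate = X,
               pihat = pihat,
               nburn = 300, nsim = 300,  # burn-in and kept draws
               ntree_control = 300,
               ntree_moderate = 300,
               w = ipcw)  # assumption: a bcf build accepting weights

# Posterior mean CATE per individual
tau_hat <- colMeans(fit_bcf$tau)
```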
Second, to identify subgroups associated with HTE, we built a CART model predicting the estimated CATEs from the 42 interpretable covariates [27]. We performed node splitting to a depth of 2 to characterize interpretable subgroups by 2 if-then rules based on their CATEs. For comparison with the result using interpretable covariates, we additionally performed subgroup discovery using all 47 covariates, without defining interpretable covariates (i.e., without categorization). After subgroup discovery, we evaluated whether each interpretable subgroup reflected the effect heterogeneity identified in the first step. For simplicity, we computed the group-specific ATE of the intervention via augmented inverse probability weighting (AIPW) using the estimated CATE in each interpretable subgroup.
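A manual AIPW estimate of the subgroup-specific ATE can be sketched as follows, assuming predicted potential outcomes mu1 and mu0 (e.g., posterior means from the BCF fit), a propensity score e, and the subgroup labels from the CART sketch; all names are illustrative.

```r
# Sketch: doubly robust (AIPW) estimate of the ATE within a subgroup
aipw_ate <- function(y, w, mu1, mu0, e) {
  scores <- mu1 - mu0 +
    w * (y - mu1) / e -
    (1 - w) * (y - mu0) / (1 - e)
  c(est = mean(scores),
    se  = sd(scores) / sqrt(length(scores)))
}

# Group-specific ATEs over the CART-defined subgroups
by_group <- sapply(levels(subgroup), function(g) {
  idx <- subgroup == g
  aipw_ate(Y[idx], W[idx], mu1[idx], mu0[idx], e[idx])
})
by_group
```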