Sample average treatment effect on the treated (SATT) analysis using counterfactual explanation identifies BMT and SARS-CoV-2 vaccination as protective risk factors associated with COVID-19 severity and survival in patients with multiple myeloma

Study cohort

Our N3C myeloma cohort included patients (both inpatients and outpatients) from contributing sites who were diagnosed with COVID-19 between January 1st, 2020, and our cut-off date of May 16th, 2022 (N3C release v76). All myeloma patients without COVID-19 encountered during this period at the contributing sites were also included initially to build the overall myeloma cohort. Historical patient data from January 1st, 2018, onward were included for each patient from the same health system, wherever available.

Indicator variables

The N3C clinical data set is a limited dataset that includes protected health information, which may include dates of service and patient ZIP codes. Details regarding data quality and harmonization checks, cohort definitions, and the Malignant Neoplastic Disease standard (SNOMED) concept codes used for primary cancer diagnosis have been published earlier. Briefly, cancer patients within the N3C registry were identified using the SNOMED Code 3633460000 with the Observational Health Data Sciences and Informatics (OHDSI) Atlas tool. For COVID-19 status, we used N3C positive phenotyping guidelines based on concept definitions and logic provided in Supplementary Tables 1A and 1B. For the purpose of this study, we limited our analysis to the period from 30 days before the COVID-19 diagnosis to 30 days after the start of the index encounter. Further, we used available data to calculate indicator variables on the Charlson Comorbidity Index (CCI) adjusted for cancer diagnosis, primary cancer diagnosis, and cancer therapies.

Myeloma diagnosis

International Staging System (ISS) for Multiple Myeloma stage was calculated using the revised guidelines provided by the International Myeloma Foundation (https://www.myeloma.org/international-staging-system-iss-reivised-iss-r-iss) as Stage 1: Alb ≥ 3.5, B2M < 3.5; Stage 2: Everything else (B2M 3.5–5.5, Albumin any); Stage 3: B2M > 5.5 [13].
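The staging rule above can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: the function name `iss_stage` and the handling of missing values are hypothetical, and the units (albumin in g/dL, B2M in mg/L) are assumed because the text does not state them.

```python
from typing import Optional


def iss_stage(albumin_g_dl: Optional[float], b2m_mg_l: Optional[float]) -> Optional[int]:
    """Return ISS stage 1, 2 or 3 using the thresholds quoted in the text.

    Assumed units: albumin in g/dL, beta-2 microglobulin (B2M) in mg/L.
    Returns None when the stage cannot be determined from the available values.
    """
    if b2m_mg_l is None:
        return None
    if b2m_mg_l > 5.5:
        return 3  # Stage 3: B2M > 5.5, regardless of albumin
    if b2m_mg_l >= 3.5:
        return 2  # B2M 3.5-5.5, any albumin
    # B2M < 3.5: albumin decides between Stage 1 and Stage 2
    if albumin_g_dl is None:
        return None
    return 1 if albumin_g_dl >= 3.5 else 2
```

Checking B2M first lets the sketch still assign Stage 2 or Stage 3 when albumin is missing, since those stages do not depend on albumin under the stated rule.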

Myeloma therapies

A list of currently approved and used anti-myeloma therapies was derived from previously published clinical literature. Treatment with standard anti-myeloma chemotherapeutic regimens for each myeloma patient was assessed using a string search of each cancer therapy in the concept name and manually reviewed for correctness. Bone marrow transplantation/BMT (Hematopoietic Stem Cell Transplantation) was identified using SNOMED code 5960049, which included the vocabulary descendants of the SNOMED codes 42537745 (Bone Marrow Transplant present) and 23719005 (Transplantation of Bone Marrow).
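As a rough sketch of the string search described above, the following Python function flags which therapy names occur in a concept name. The therapy names listed are illustrative placeholders only (the study's actual list is not reproduced here), and `match_therapies` is a hypothetical helper; any hits would still need the manual review described in the text.

```python
from typing import Iterable, List

# Illustrative placeholders, NOT the study's curated therapy list.
EXAMPLE_THERAPIES = ("bortezomib", "lenalidomide", "daratumumab")


def match_therapies(concept_name: str,
                    therapies: Iterable[str] = EXAMPLE_THERAPIES) -> List[str]:
    """Return the therapy names found as case-insensitive substrings of concept_name."""
    name = concept_name.lower()
    return [t for t in therapies if t.lower() in name]
```

A plain substring match is deliberately permissive: it favors recall and leaves false positives to be removed during manual review.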

Severity and outcome measures

For the purpose of this myeloma patient cohort study, the outcomes of interest were: all-cause mortality (including discharge to hospice) during the index encounter, as well as clinical indicators of severity requiring hospitalization (inpatient/emergency room/intensive care unit/ICU or intensive coronary care unit/ICCU visit), or use of mechanical ventilation (N3C Procedure Concept Set ID 179437741) or extracorporeal membrane oxygenation (ECMO; N3C Procedure Concept Set ID 415149730).

Statistical analysis and data visualization

All analyses were performed on the Palantir platform in the N3C data enclave. Summary statistics from the descriptive analyses are reported as counts and percentages for categorical variables. The risk of severe and mild outcomes was estimated using multivariate logistic regression analysis. The models were controlled for age group, gender, race and ethnicity, smoking status, vaccination status, treatment, BMT, and CCI variables. Adjusted odds ratios were estimated with 95% confidence intervals for potential risk factors. All tests were two-sided. Finally, Cox proportional hazards models with time to death from COVID-19 infection were used to estimate the risk of death, adjusted for age group, gender, race and ethnicity, smoking status, vaccination status, treatment, BMT, and CCI variables. As per N3C policy, counts of <20 were not reported, to protect patient privacy.
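For readers less familiar with how an adjusted odds ratio and its confidence interval follow from a fitted logistic regression, the sketch below converts a coefficient and its standard error into an odds ratio with a Wald-type 95% interval. It assumes the coefficient and standard error have already been estimated elsewhere; the function name `odds_ratio_ci` is hypothetical, and the Wald interval is an assumption, since the text does not say which interval method was used.

```python
import math
from typing import Tuple


def odds_ratio_ci(beta: float, se: float, z: float = 1.959964) -> Tuple[float, float, float]:
    """Convert a logistic regression coefficient into (odds ratio, lower, upper).

    beta: estimated log-odds coefficient for the risk factor.
    se:   standard error of beta.
    z:    normal quantile; the default gives a two-sided 95% Wald interval.
    """
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)
```

An interval that excludes 1 corresponds to a two-sided test rejecting the null of no association at the 5% level.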

Causal effect analysis

In this study, we performed a matched-sample analysis to compute the sample average treatment effect on the treated (SATT) as the measure of the causal effect of the top associated risk factors. Regression models are associative in nature, not causal. As an illustrative example, patients who did not receive BMT may have more associated risk factors, such as older age, diabetes, or high-risk cytogenetics. It is therefore possible that patients who did not receive BMT have an inherently higher risk of mortality. Although the multivariate regression models control for many significant covariates, a full causal argument is not possible because of the potential endogenous relationship between mortality risk and BMT status. The same can be said of many other risk factors in our analysis. A comparison should therefore be made across ‘matched’ samples, i.e., patients with similar characteristics other than the risk factor of interest. Accordingly, for every risk factor of interest (for example, BMT status or vaccination), we divided the sample into two subsamples: (i) individuals with a higher level of the risk factor, and (ii) individuals with a relatively lower level of the risk factor. Examples include individuals who received BMT versus those who did not, and individuals who did not receive vaccination versus those who did. Note that we performed this subsampling separately for each risk factor. For each individual in the high-risk-factor group, we used coarsened exact matching (CEM; cem package in R, see Iacus et al., 2009 [14]) to match them to individuals in the low-risk group. The matching was performed on all covariates except the risk factor of interest. For example, for the BMT status variable, we matched individuals who did not receive BMT with individuals who received BMT on all variables except BMT.
In this manner, the effect of other covariates on the outcome variable (mortality) is minimized. Note that in CEM, categorical covariates are matched exactly, whereas continuous covariates are matched approximately, using coarsened bins based on rough estimates of their quantiles. Each individual in the high-risk group is therefore matched with a small number (at least one) of low-risk individuals on all covariates except the risk factor of interest. Based on this matched sample, we computed the average difference in mortality rates between the two groups to estimate the SATT, as described below. We also used propensity score-based matching to check the robustness of our results. The propensity score uses a logistic regression fit with the risk factor of interest as the response to estimate the probability of each individual having a high or low level of the risk factor. As an illustration, for BMT status, we first estimated a logistic regression model on all covariates to estimate the probability of an individual receiving BMT. We then grouped the patients by their propensity (probability) to receive BMT and compared mortality within groups of patients with similar propensities. The results of the propensity score analysis (not reported) are very similar to those of the CEM-matched sample analysis.
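The matching logic can be illustrated with a small, self-contained Python sketch. It is not the R cem package and does not reproduce its automatic binning: the decade-wide age bins, the field names (`age`, `sex`, `bmt`, `died`), and the toy records are all assumptions made for illustration. Each treated individual is compared with the mean outcome of the controls in the same coarsened stratum, and strata lacking either group are dropped.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Hashable, Iterable, Optional

Record = Dict[str, Any]


def coarsen_age(age: float) -> str:
    """Hypothetical coarsening: map age in years to a decade-wide bin label."""
    low = int(age // 10) * 10
    return f"{low}-{low + 9}"


def cem_satt(records: Iterable[Record], treatment_key: str, outcome_key: str,
             stratum_of: Callable[[Record], Hashable]) -> Optional[float]:
    """Estimate the SATT by exact matching on coarsened strata.

    Returns None when no stratum contains both treated and control records.
    """
    strata = defaultdict(lambda: {"treated": [], "control": []})
    for r in records:
        group = "treated" if r[treatment_key] else "control"
        strata[stratum_of(r)][group].append(float(r[outcome_key]))

    diffs = []
    for cell in strata.values():
        if cell["treated"] and cell["control"]:  # keep matched strata only
            control_mean = sum(cell["control"]) / len(cell["control"])
            diffs.extend(y - control_mean for y in cell["treated"])
    return sum(diffs) / len(diffs) if diffs else None


# Toy, made-up records (not study data).
toy = [
    {"age": 62, "sex": "F", "bmt": 1, "died": 0},
    {"age": 65, "sex": "F", "bmt": 0, "died": 1},
    {"age": 68, "sex": "F", "bmt": 0, "died": 0},
    {"age": 45, "sex": "M", "bmt": 1, "died": 1},  # no control in its stratum
    {"age": 71, "sex": "M", "bmt": 0, "died": 1},  # no treated in its stratum
]
estimate = cem_satt(toy, "bmt", "died", lambda r: (coarsen_age(r["age"]), r["sex"]))
```

In the toy data only the stratum of women aged 60-69 contains both groups, so the estimate rests on that stratum alone. This shows how CEM trades sample size for comparability.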

The design and development of the SATT method are non-trivial and mathematically involved. For details, please refer to Athey and Imbens 2016 [15]. Briefly, let us consider patients \(i\) who received treatment \(T\) (for example, BMT or vaccination). Let \(Y_i\) denote the response of patient \(i\) (for example, probability of death from COVID-19, referred to as “mortality or discharge to hospice”, adhering to the spirit of using sensitive language around COVID-related mortality). The causal effect of the treatment is defined as the difference between the response had the patient received the treatment and the response had the patient not received the treatment. Therefore, the causal effect of the treatment on the treated \(\tau_i\) is defined as

$$\tau_i = Y_i\left(T=1\right) - Y_i\left(T=0\right).$$

However, in observational data that are not experimentally generated, it is often not possible to observe the response measures under both the treatment and no-treatment conditions. For example, for a patient in the dataset who received BMT, we only observe the response under treatment \(Y_i\left(T=1\right)\), but we do not observe the response under no treatment \(Y_i\left(T=0\right)\). Let \(\mathbf{X}_i\) denote covariates (such as patient characteristics, disease conditions, etc.) that determine the patients’ likelihood of receiving the treatment. In experimental data, treatments are usually randomized across observation units. In observational data, however, treatments are not usually randomized; rather, treatments are decided based on the covariates, which determine both the treatment assignment and the response outcomes. We assume that the treatment assignment \(T_i\) is independent of the outcomes given the covariates [15], that is,

$$T_i \perp \left(Y_i\left(T_i=1\right),\, Y_i\left(T_i=0\right)\right) \mid \mathbf{X}_i.$$

It can be assumed that the response outcome of the patients in the control group reasonably approximates the untreated response outcome of the patients in the treatment group, given that the patients are matched on the covariates. Therefore, the treatment effect on the treated patient \(i\), with response \(Y_i\) and covariates \(\mathbf{X}_i\), can be estimated using a matched control patient \(j\) as

$$\tau_i = Y_i\left(T=1\right) - Y_j\left(T=0\right)\Big|_{\mathbf{X}_i \approx \mathbf{X}_j}.$$

The sample average treatment effect on the treated is then estimated as

$$\mathrm{SATT} = \frac{1}{n}\sum_{i=1}^{n} \tau_i,$$

where \(\tau_i\) is the estimated treatment effect for treated patient \(i\) and \(n\) is the number of patients who received treatment. In the empirical estimation, we used propensity score-based matching. First, we estimated a logistic regression model with the treatment status as the response and covariates such as age, sex, disease stage, and all other relevant variables as explanatory variables to predict the likelihood of patients receiving the treatment. Then we matched each treatment group patient with the control group patient whose predicted likelihood of receiving the treatment was closest. The SATT is then estimated as the sample average of the difference in the response between the treatment and the matched control groups.
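The final averaging step can be sketched as follows, assuming propensity scores have already been predicted by a logistic regression fitted elsewhere. The function name `psm_satt`, the (score, outcome) tuple layout, and the choice of one-to-one nearest-neighbor matching with replacement are illustrative assumptions; the text does not specify these details.

```python
from typing import List, Sequence, Tuple

ScoredUnit = Tuple[float, float]  # (propensity score, observed outcome)


def psm_satt(treated: Sequence[ScoredUnit], control: Sequence[ScoredUnit]) -> float:
    """Estimate the SATT by nearest-neighbor propensity score matching.

    Each treated unit is paired with the control unit whose propensity score
    is closest (controls may be reused), and the outcome differences are averaged.
    """
    if not treated or not control:
        raise ValueError("both treated and control groups must be non-empty")
    diffs: List[float] = []
    for p_t, y_t in treated:
        # Closest control by absolute difference in propensity score
        _, y_c = min(control, key=lambda c: abs(c[0] - p_t))
        diffs.append(y_t - y_c)
    return sum(diffs) / len(diffs)
```

With a binary outcome such as mortality, a negative value indicates lower mortality among treated patients than among their matched controls.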

The role of the institutional review board

Prior institutional review board approvals were obtained from the respective institutions to access the N3C data. Further, all authors who had access to N3C data in the Enclave and performed analyses were approved by the N3C Data Use Request committee to access the limited-use dataset (Level 3).
