A simulation study of regression approaches for estimating risk ratios in the presence of multiple confounders

Regression approaches for estimating risk ratios

We consider a cohort study of n subjects involving binary outcome Yi (1 for event and 0 for no event), binary exposure Ai (1 for exposure and 0 for no exposure), and column vector of confounders Li for each subject i. Logistic regression is commonly used to control for confounders and assess the influence of exposure for this type of data. The logistic regression model with first-order terms of exposure and confounders is expressed as follows (we assume that the regression models are correctly specified below unless otherwise noted):

$$\text\frac_|_,_;\alpha \right)}_|_,_;\alpha \right)}=_+__+_^_=_^\alpha ,$$

where \(\alpha =_,_,_^\right)}^\) is the unknown parameter vector, and \(_=_,_^\right)}^\). Under this model, the exponentiated exposure coefficient, \(\text\text\text\left(_\right)\), indicates the adjusted odds ratio, which is often interpreted as the risk ratio under the assumption of rare events [1]. Assuming that Yi follows a binomial distribution given Ai and Li, the parameter α is estimated using the standard maximum likelihood estimation method.

To estimate the risk ratio directly, the log-binomial regression model, the linear model of the log-transformed mean, may be straightforward [4]:

$$\textE\left(_|_,_;\beta \right)=_+__+_^_=_^\beta ,$$

where \(\beta =_,_,_^\right)}^\) is the unknown parameter vector. Under this model, the exponentiated exposure coefficient, \(\text\text\text\left(_\right)\), can be interpreted as the adjusted risk ratio without the assumption of rare events. Assuming that Yi follows a binomial distribution given Ai and Li, the standard maximum likelihood estimation method yields a consistent and asymptotically efficient estimator for β. However, in real-world applications, the iterative procedures for log-binomial regression models often fail to converge [5, 6] and were thus excluded from our simulation experiments.

One proposed solution, modified Poisson regression, estimates parameter β of the log-binomial regression model by solving the following estimating equations for β [7]:

$$U\left(\beta \right)=\sum _^_\left\_-\text\left(_^\beta \right)\right\}=0.$$

The estimating equations of the modified Poisson regression are equivalent to the score equations of the Poisson regression. The estimator for the standard error is obtained from the robust sandwich variance estimator:

$$\begin \widehat\left(\widehat\right) =^_\text\left(_^\widehat\right)_}^\right\}}^\left[\sum _^__-\text\left(_^\widehat\right)\right\}}^_}^\right]^_\text\left(_^\widehat\right)_}^\right\}}^. \end$$

The estimator obtained from the estimating equation is consistent and asymptotically normal, albeit without asymptotic efficiency. For rare events, the modified Poisson regression estimators approximate maximum likelihood estimators of log-binomial and logistic regression, and the efficiency loss should be small [22]. Previous simulation results suggest that the modified Poisson regression estimates are generally close to the maximum likelihood counterparts [7, 23]. The modified Poisson regression is also reported to be less sensitive to outliers [24] and less biased when the mean structure is misspecified [25]. Potential predicted probabilities above 1 [26, 27] are not fatal if the analysis aims to estimate the adjusted risk ratio and not the individual predicted probabilities.

Another approach for estimating the risk ratio is regression standardization using logistic regression [13, 14]. Instead of directly interpreting the logistic regression coefficients, the risk ratio among the entire population is calculated based on predicted probabilities estimated from logistic regression, which are constrained to fall between 0 and 1. Using maximum likelihood estimates of logistic regression \(\widehat\), the predicted risk if subject i was exposed is given by

$$}_=\frac\text\text\left(_^\widehat\right)}\text\text\left(_^\widehat\right)},$$

where \(_=_^\right)}^\), and the predicted risk if subject i was not exposed is given by

$$}_=\frac\text\text\left(_^\widehat\right)}\text\text\left(_^\widehat\right)},$$

where \(_=_^\right)}^\). The risk ratio for exposure is computed by taking the ratio of these risks averaged over the population:

$$\widehat\text}=\frac^}_}^}_}.$$

The estimator for the standard error of \(\text\widehat\text}\) is easily obtained using the delta method [28]:

$$\widehat\left(\text\widehat\text}\right)=^\widehat\left(\widehat\right)R,$$

where

$$\beginR= & \frac^}_}\left[\sum _^\frac\text\text\left(_^\widehat\right)}\text\text\left(_^\widehat\right)\right\}}^}_\right]\\ & - \frac^}_}\left[\sum _^\frac\text\text\left(_^\widehat\right)}\text\text\left(_^\widehat\right)\right\}}^}_\right],\end$$

and \(\widehat\left(\widehat\right)\) is the estimated variance-covariance matrix of logistic regression.

Simulation methods

We conducted a simulation study to evaluate statistical performance in the presence of multiple confounders of three approaches for estimating risk ratios: (1) risk ratio interpretation of logistic regression coefficients, (2) modified Poisson regression, and (3) regression standardization using logistic regression. We simulated 270 scenarios (10,000 iterations) with settings varying systematically on sample size (2500, 5000, and 10,000), number of binary confounders (5, 10, and 20), exposure proportion (20% and 50%), risk ratio for exposure (1, 1.3, and 2), and outcome proportion (1%, 2%, 4%, 8%, and 16%). The simulation was carried out using SAS version 9.4 (SAS Institute, Inc.).

Data generation

We generated binary confounders, exposure, and outcome for each of the 2500, 5000, or 10,000 subjects. To create binary confounders, 5-, 10-, or 20-dimensional Gaussian variables with mean 0, variance 1, and pairwise correlations 0.33 were discretized into 0 and 1 at predefined points (Additional file 1: Table S1). The exposure was generated from a logistic regression model with first-order terms of confounders so that the specified exposure proportion (20% or 50%) was achieved on average (Additional file 1: Table S2). The outcome was generated from a log-binomial regression model with first-order terms of exposure and confounders using three different risk ratios for exposure (1, 1.3, or 2). Confounder-outcome associations were weakened for increased confounders so that the maximum possible individual risk did not exceed 1 at any number of confounders (Additional file 1: Table S3). We also conducted additional simulation experiments keeping the same moderate confounder-outcome associations regardless of the number of confounders. The parameter settings and the results of the additional simulation are provided in Additional file 2. The intercept was adjusted so that the specified outcome proportion (1%, 2%, 4%, 8%, or 16%) was achieved on average.

Although in some previous studies of logistic regression, the number of events was fixed across data sets assuming retrospective samplings such as in a case-control study [17, 19], we generated outcomes so that the specified proportion would be achieved only on average assuming prospective samplings such as in a cohort study. The expected number of events and events per confounder can be calculated using the sample size, outcome proportion, and the number of confounders (Table 1).

Table 1 The expected number of events and the expected number of events per confounderAnalytical approaches

For each data set, we obtained the point estimate, standard error, and 95% Wald confidence interval of the log risk ratio for exposure from the three approaches. We performed logistic and modified Poisson regression with first-order terms of exposure and all confounders using the SAS GENMOD procedure. Note that the logistic regression misspecified the mean structure, and the degree of misspecification was larger with higher outcome proportions as more subjects had relatively high risks (Additional file 1: Fig. S1). We left the default settings unchanged for the optimization algorithm, starting values, and convergence criteria. If the algorithm for logistic regression was deemed to have converged based on criteria stated below, the results for the regression standardization were computed using the SAS program written by the authors.

Performance measures

For each scenario, we summarized the results in terms of convergence proportion, bias, Monte Carlo standard error, mean estimated standard error, and 95% confidence interval coverage for the log risk ratio for exposure. Because the software may falsely report convergence, giving invalid parameter estimates [29], the convergence of logistic and modified Poisson regression was evaluated based on the estimated standard errors of coefficients instead of the procedure’s reports. If the estimated standard error of any coefficient was missing, 0, or above 1000, the algorithm was deemed not to have converged. Results from the converged data sets were used to compute the following performance measures of the three approaches. To provide an intuitive understanding of bias, the mean estimated log risk ratio was transformed back to a linear scale. The Monte Carlo standard error was calculated as the standard deviation of the estimated log risk ratios. The mean estimated standard error was calculated as the average of the standard error estimates of the log risk ratio. The 95% confidence interval coverage was calculated as the proportion of estimated confidence intervals covering the true value.

Comments (0)

No login
gif