Modern medical research is increasingly built on modeling of high-dimensional data. Sparse regression methods, such as the Lasso (Tibshirani, 1996), Generalized Lasso (Tibshirani et al., 2011), Grouped Lasso (Yuan & Lin, 2006), adaptive Lasso (Zou, 2006), and Elastic Net (Zou & Hastie, 2005), have been widely applied to perform estimation and variable selection at the same time. However, high-dimensional data sets often contain less precise measurements of phenotypes than those that might be available in smaller studies. For example, large biobanks often use billing codes from electronic health care records as proxy measures for a physician-made diagnosis. It is well known that applying naïve regression methods to predictor variables that are measured with error can lead to attenuation of effect estimates (Chesher, 1991; Rosenbaum et al., 2010). Analogously, questionnaire data from large cohorts often contain many missing values (Obermeyer & Emanuel, 2016). Removing subjects who are missing at least one measurement can easily lead to removal of most subjects when data are high dimensional.
Many error-in-variables solutions have been proposed. In addition to simple complete case analysis and pairwise deletion, more rigorous methods, such as expectation-maximization algorithms (Dempster, 1977; Schafer, 1997), multiple imputation methods (Buuren, 2011), and full information maximum likelihood estimation (Enders, 2001; Friedman et al., 2010), have been developed, but these computationally expensive methods cannot be easily extended to high-dimensional settings. In contrast, Loh and Wainwright (2011) developed a penalized method for error-in-variables regression. Within a properly chosen constraint radius, a projected gradient descent algorithm will converge to a small neighborhood of the set of all global minimizers, and is promising for variable selection in a high-dimensional setting (Loh & Wainwright, 2011). Nevertheless, proper choice of this constraint radius depends on knowledge of the parameters yet to be estimated (Datta & Zou, 2017). Hence, Datta and Zou (2017) developed the Convex Conditioned Lasso (CoCoLasso) that does not require prior knowledge of the unknown parameters. The CoCoLasso algorithm is able to correct for both additive measurement error and missing data, and showed a substantial increase in estimation accuracy and stability compared with the naïve Lasso.
However, when the data are only partially corrupted (i.e., some features are free of measurement error), the CoCoLasso still performs estimation for all features in an undifferentiated manner, limiting the implementation of the approach for large feature sets due to the intensive matrix computations required. Such circumstances of partial corruption are common for genetic epidemiology studies based on large genotyped cohorts, where the genotypes are accurately measured by highly reliable high-throughput sequencing or microarrays, but lifestyle or clinical risk factors (except for age and sex) are measured with various types of error. For instance, in the UK Biobank, one of the largest health registries to date, hundreds of thousands of single nucleotide polymorphisms (SNPs) were accurately measured for each participant with little missing data, but most covariates based on questionnaires or health care records contained missing data (Bycroft et al., 2018). Samples with such corrupted covariates are usually discarded, potentially leading to underuse of information. Therefore, inspired by the CoCoLasso, we propose here a Block coordinate Descent Convex Conditioned Lasso (BDCoCoLasso) algorithm that makes it possible to perform higher-dimensional error-in-variables regressions by separately optimizing the parameter estimates for uncorrupted and corrupted features in an iterative manner. Our proposal requires the implementation of a carefully calibrated cross-validation strategy. Furthermore, we build the smoothly clipped absolute deviation (SCAD) penalty (Fan & Li, 2001) into the new algorithm. In simulations, we confirm that our algorithm provides results equivalent to those of the CoCoLasso, and demonstrates better performance than the naïve Lasso, with increasing benefit as the dimension increases. Although this approach will still encounter computational limitations when many features are corrupted, we substantially enlarge the magnitude of problems that can be analyzed with an error-in-variables approach. We demonstrate the potential practical utility of the BDCoCoLasso by deriving covariate-adjusted genetic risk scores to predict body mass index, bone mineral density, and lifespan in a subset of the UK Biobank (Bycroft et al., 2018).
The rest of the manuscript is organized as follows. In Section 2, we briefly review the CoCoLasso method, and then we describe our new version that allows blocks of features with different corruption states—BDCoCoLasso. We describe simulation settings and results in Section 3. Section 4 illustrates the performance of our algorithm on the UK Biobank data.
2 METHODS

In this section, we first review the principles of the CoCoLasso. We then seek to improve its computational efficiency and stability when the covariate matrix is partially corrupted or when different types of measurement error exist simultaneously, by implementing a block coordinate descent algorithm (Rosenbaum et al., 2013). We also implement a SCAD penalty (Fan & Li, 2001) to avoid overshrinkage when some features have strong effects.
2.1 The CoCoLasso

Suppose a true covariate matrix $X \in \mathbb{R}^{n \times p}$ underlies the linear model $y = X\beta + \varepsilon$, but that $X$ is not observed directly. Instead, we observe a corrupted matrix $Z$ under one of two settings:

Additive error: $Z = X + A$, where $A$ represents additive error;

Missing data: $Z_{ij} = X_{ij}$ when entry $(i, j)$ is observed, or $Z_{ij}$ is missing (NA) otherwise.
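In this setting, the naïve Lasso applied to $Z$ yields biased estimates. The CoCoLasso (Datta & Zou, 2017) instead minimizes a corrected objective; a sketch of the published estimator in the present notation is

$$
\hat{\beta} = \arg\min_{\beta} \left\{ \tfrac{1}{2}\, \beta^{\top} \big(\tilde{\Sigma}\big)_{+}\, \beta \;-\; \tilde{\rho}^{\top} \beta \;+\; \lambda \lVert \beta \rVert_{1} \right\},
$$

where $\tilde{\Sigma}$ and $\tilde{\rho}$ are unbiased surrogates of $\tfrac{1}{n} X^{\top} X$ and $\tfrac{1}{n} X^{\top} y$ built from $Z$ and $y$, and $\big(\tilde{\Sigma}\big)_{+}$ denotes the nearest positive semidefinite matrix to $\tilde{\Sigma}$, which ensures that the optimization problem remains convex.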
2.2 The BDCoCoLasso

When only some of the features are corrupted, we partition the observed covariates as $Z = [Z_1, Z_2]$, where $Z_1$ ($n \times p_1$) contains the uncorrupted features and $Z_2$ ($n \times p_2$) the corrupted features, with corresponding coefficient blocks $\beta = (\beta_1^{\top}, \beta_2^{\top})^{\top}$. The joint estimation problem is

$$
\min_{\beta_1, \beta_2} \left\{ \tfrac{1}{2}\, \beta^{\top} \tilde{\Sigma}\, \beta \;-\; \tilde{\rho}^{\top} \beta \;+\; \lambda \lVert \beta \rVert_{1} \right\}, \qquad
\tilde{\Sigma} = \begin{pmatrix} \tilde{\Sigma}_{11} & \tilde{\Sigma}_{12} \\ \tilde{\Sigma}_{21} & \tilde{\Sigma}_{22} \end{pmatrix}, \quad
\tilde{\rho} = \begin{pmatrix} \tilde{\rho}_{1} \\ \tilde{\rho}_{2} \end{pmatrix}, \tag{5}
$$

where the blocks are surrogates of the corresponding blocks of $\tfrac{1}{n} X^{\top} X$ and $\tfrac{1}{n} X^{\top} y$, and only the blocks involving $Z_2$ require correction. We solve (5) by block coordinate descent.

We first consider $\beta_2$ fixed, and we solve

$$
\hat{\beta}_1 = \arg\min_{\beta_1} \left\{ \tfrac{1}{2}\, \beta_1^{\top} \tilde{\Sigma}_{11}\, \beta_1 \;-\; \big(\tilde{\rho}_1 - \tilde{\Sigma}_{12}\, \hat{\beta}_2\big)^{\top} \beta_1 \;+\; \lambda \lVert \beta_1 \rVert_{1} \right\}, \tag{6}
$$

with $\tilde{\Sigma}_{11} = \tfrac{1}{n} Z_1^{\top} Z_1$ and $\tilde{\rho}_1 = \tfrac{1}{n} Z_1^{\top} y$. In the additive error setting, $\tilde{\Sigma}_{12} = \tfrac{1}{n} Z_1^{\top} Z_2$; in the missing-error setting, missing entries of $Z_2$ are set to zero and, specifically, we define a ratio matrix indicating the presence or absence of data, whose entries are the proportions of observations available for each cross-product; $\tfrac{1}{n} Z_1^{\top} Z_2$ is then divided elementwise by this ratio matrix to obtain $\tilde{\Sigma}_{12}$.

We next consider $\beta_1$ fixed, with a value optimized in the previous step, and we solve

$$
\hat{\beta}_2 = \arg\min_{\beta_2} \left\{ \tfrac{1}{2}\, \beta_2^{\top} \big(\tilde{\Sigma}_{22}\big)_{+}\, \beta_2 \;-\; \big(\tilde{\rho}_2 - \tilde{\Sigma}_{21}\, \hat{\beta}_1\big)^{\top} \beta_2 \;+\; \lambda \lVert \beta_2 \rVert_{1} \right\}, \tag{7}
$$

with $\tilde{\Sigma}_{21} = \tilde{\Sigma}_{12}^{\top}$. In the additive error setting, $\tilde{\Sigma}_{22} = \tfrac{1}{n} Z_2^{\top} Z_2 - \Sigma_A$ and $\tilde{\rho}_2 = \tfrac{1}{n} Z_2^{\top} y$, where $\Sigma_A$ is a known variance–covariance matrix for features measured with additive error; in the missing error setting, $\tilde{\Sigma}_{22} = \big(\tfrac{1}{n} Z_2^{\top} Z_2\big) \oslash M$ and $\tilde{\rho}_2 = \big(\tfrac{1}{n} Z_2^{\top} y\big) \oslash m$, where $M$ and $m$ are the corresponding ratio matrix and vector of observed proportions. Here, $\oslash$ represents elementwise division.
We then alternate between the two steps until convergence. Following arguments similar to those in Datta and Zou (2017), we can ensure that both problems are equivalent to a Lasso problem. The complete optimization procedure is described in Algorithm 1.
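As an illustration of this alternating scheme, the following Python sketch implements the two marginal updates for the additive error setting only. It assumes a known error covariance matrix and a fixed penalty level, uses a plain coordinate descent solver and an eigenvalue-clipping projection, and omits cross-validation; the function names and simplifications are illustrative and this is not the released implementation.

```python
import numpy as np

def nearest_psd(S, eps=1e-8):
    """Project a symmetric matrix toward positive (semi)definiteness
    by clipping its eigenvalues -- one simple choice of projection."""
    w, V = np.linalg.eigh((S + S.T) / 2)
    return (V * np.maximum(w, eps)) @ V.T

def soft(x, t):
    """Soft-thresholding operator."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def marginal_lasso(S, r, lam, beta, n_iter=200, tol=1e-7):
    """Coordinate descent for 0.5*b'Sb - r'b + lam*||b||_1, S positive definite."""
    beta = beta.copy()
    for _ in range(n_iter):
        max_delta = 0.0
        for j in range(len(beta)):
            partial = r[j] - S[j] @ beta + S[j, j] * beta[j]
            new_bj = soft(partial, lam) / S[j, j]
            max_delta = max(max_delta, abs(new_bj - beta[j]))
            beta[j] = new_bj
        if max_delta < tol:
            break
    return beta

def bd_cocolasso_sketch(Z1, Z2, y, sigma_A, lam, n_outer=50, tol=1e-6):
    """Alternate between the uncorrupted block (beta1) and the corrupted
    block (beta2); additive-error setting with known error covariance sigma_A."""
    n, p1 = Z1.shape
    p2 = Z2.shape[1]
    # Surrogates of the blocks of (1/n) X'X and (1/n) X'y.
    S11 = Z1.T @ Z1 / n
    S12 = Z1.T @ Z2 / n                          # unbiased: error has mean zero
    S22 = nearest_psd(Z2.T @ Z2 / n - sigma_A)   # corrected, then projected
    rho1 = Z1.T @ y / n
    rho2 = Z2.T @ y / n
    beta1, beta2 = np.zeros(p1), np.zeros(p2)
    for _ in range(n_outer):
        old = np.concatenate([beta1, beta2])
        beta1 = marginal_lasso(S11, rho1 - S12 @ beta2, lam, beta1)   # step 1
        beta2 = marginal_lasso(S22, rho2 - S12.T @ beta1, lam, beta2) # step 2
        if np.max(np.abs(np.concatenate([beta1, beta2]) - old)) < tol:
            break
    return beta1, beta2
```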
Of note, the estimation problem can be defined as finding the global solution of (5), and our two-step procedure can be seen as equivalent to replacing $\tilde{\Sigma}_{22}$ by its nearest positive definite matrix, $\big(\tilde{\Sigma}_{22}\big)_{+}$, in (5). Use of this substitution might not lead to a jointly convex problem. However, since both marginal problems (6) and (7) are convex, and both have suitable properties (i.e., both are strongly convex and smooth), our generalized alternating minimization algorithm can guarantee global minimization (Jain & Kar, 2017; Kelley, 1999).
Selection of the tuning parameter $\lambda$ also requires care, because prediction error on held-out data cannot be evaluated directly when $X$ is unobserved. We therefore compute a corrected cross-validation error from the same unbiased surrogates evaluated on the held-out folds: in the additive error setting, where the additive error is centered to have zero mean, the held-out cross-products can be used directly; in the missing error setting, the held-out cross-products are divided elementwise by the corresponding ratio matrices.
Although either an additive error setting or a missing error setting can be approached in the aforementioned two-step manner, data often contain variables subject to both types of errors. Therefore, we further propose a generalized algorithm that copes with a mixed error setting, described in Supporting Information.
2.3 Implementation of a SCAD penalty

For potential application in scenarios where the causal variables are few but have large effect sizes, using the Lasso penalty may lead to overshrinkage (Fan & Li, 2001). To resolve this potential issue, we have also implemented a nonconcave SCAD penalty (Fan & Li, 2001). The SCAD penalty is given by

$$
p_{\lambda}(|\beta|) =
\begin{cases}
\lambda |\beta|, & |\beta| \le \lambda,\\[4pt]
\dfrac{2 a \lambda |\beta| - \beta^{2} - \lambda^{2}}{2(a - 1)}, & \lambda < |\beta| \le a\lambda,\\[4pt]
\dfrac{\lambda^{2}(a + 1)}{2}, & |\beta| > a\lambda.
\end{cases}
$$

In principle, the hyperparameter $a$ in the SCAD penalty should be estimated through cross-validation. However, the resulting two-dimensional cross-validation would be computationally expensive. Fan and Li (2001) proposed that $a = 3.7$ should be suitable for many problems, and that the algorithm performance does not improve significantly with $a$ selected by data-driven approaches. We therefore set $a = 3.7$ in all simulations described below.
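For concreteness, a small Python helper (with illustrative names, not part of the method's released code) evaluating the SCAD penalty and its derivative with the default $a = 3.7$ might look as follows:

```python
import numpy as np

def scad_penalty(beta, lam, a=3.7):
    """Elementwise SCAD penalty value (Fan & Li, 2001), default a = 3.7."""
    b = np.abs(np.asarray(beta, dtype=float))
    small = b <= lam
    mid = (b > lam) & (b <= a * lam)
    large = b > a * lam
    out = np.empty_like(b)
    out[small] = lam * b[small]
    out[mid] = (2 * a * lam * b[mid] - b[mid] ** 2 - lam ** 2) / (2 * (a - 1))
    out[large] = lam ** 2 * (a + 1) / 2
    return out

def scad_derivative(beta, lam, a=3.7):
    """Derivative p'_lam(|beta|); can be used to reweight an l1 penalty."""
    b = np.abs(np.asarray(beta, dtype=float))
    return lam * ((b <= lam) + np.maximum(a * lam - b, 0.0) / ((a - 1) * lam) * (b > lam))
```

One common way to handle this nonconvex penalty in practice is a local linear approximation, iteratively reweighting an $\ell_1$ penalty by the derivative above evaluated at the current coefficient estimates; this is shown here as a generic device rather than as the specific strategy used in BDCoCoLasso.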
In addition to the SCAD penalty, other weighting schemes, such as the minimax concave penalty (Zhang, 2010), could be implemented in the future for improved generalizability.
3 SIMULATION STUDY

Simulations were designed to explore the performance of BDCoCoLasso as a function of the number and proportion of corrupted features. Furthermore, we wanted to ensure that our results matched CoCoLasso when both methods could be implemented, that is, for a fairly modest number of features $p$ and a single type of error.
Since we anticipate that this algorithm will be useful in large cohorts where the sample size exceeds the number of features, and anticipating multiple associated features with small effect sizes, we simulated more scenarios with $n > p$. We assigned different fractions of the features to be causal, and created higher dimensionality by increasing $p$, while sampling the true covariates $X$ from a standardized normal distribution $\mathcal{N}(0, 1)$.
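As an illustration only, corrupted data of this kind can be generated along the following lines in Python; all numeric settings below (sample size, dimension, numbers of corrupted and causal features, error scale, missingness probability) are placeholder values and not the settings used in the study.

```python
import numpy as np

rng = np.random.default_rng(1)

n, p, p_corrupt, n_causal = 1000, 200, 40, 10    # illustrative values only
X = rng.standard_normal((n, p))                  # true covariates, N(0, 1)
beta = np.zeros(p)
beta[rng.choice(p, n_causal, replace=False)] = rng.normal(0.0, 0.5, n_causal)
y = X @ beta + rng.standard_normal(n)

# Additive measurement error on the last p_corrupt columns.
sigma_a = 0.3
Z_add = X.copy()
Z_add[:, -p_corrupt:] += rng.normal(0.0, sigma_a, (n, p_corrupt))

# Missing data on the last p_corrupt columns (missing completely at random).
miss_prob = 0.2
Z_miss = X.copy()
mask = rng.random((n, p_corrupt)) < miss_prob
Z_miss[:, -p_corrupt:][mask] = np.nan
```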
For the additive error setting, the corrupted design matrix was generated as $Z = X + A$, where the additive error $A$ has mean zero and variance $\sigma_A^{2}$. We explored different $\sigma_A$ parameters in combination with different fractions of corrupted features (at least