A review of geospatial exposure models and approaches for health data integration

This section describes the diverse landscape of models used in geospatial exposure science. Table 1 provides an overview of the models and their basic functional forms, such as Y = f(x) + ε.

Table 1 Summary of the model names, general formulation, section number, and equation number.Proximity

Proximity exposure metrics are the most basic form of an exposure assessment because they rely only on the distance between a pollution source and the observed outcome location. Proximity exposure metrics have been used to elucidate the impacts of environmental exposures on human health, including asthma [60], cardiovascular disease [61], and reproductive fertility [62]. From a linear perspective, a proximity model is simply a deterministic covariate:

where X is the deterministic quantity calculated at location p (note the lack of an error term). Here, we describe the most common and easily calculated proximity metrics. Given a distance matrix, dij, the minimum distance is:

where di,⋅ is the i-th row indicating the distance between outcome i and every pollution source. The average distance is:

$$\overline_}=\frac_}\mathop_^_}_$$

(4.1.3)

where nj is the number of pollution sources. Buffer variables are a useful class of proximity metrics. Buffer variables can be calculated for areal, point, and line sources. Summary statistics such as the mean or fraction within a given area around the location of interest can be calculated. Figure 4 illustrates the most common buffer variables.

Fig. 4: Illustration of common buffer variables applied to land cover classification.figure 4

A Within four isotropic, circular buffers with radii Rj, ; (B) Within an aniostropic, wind-rose based buffer common in air quality studies; (C) Within two upstream contributing areas as commonly in water pollution studies.

A proximity metric that incorporates the distance, density, and potential emissions is the sum of the exponentially decaying contribution of point sources, [37]:

$$_=_^_\exp \left(\!-\frac_}\right)$$

(4.1.4)

where Xi is the quantity at location i, C0j is an initial value such as concentration or emissions at source j, dij is the distance between site i and source j, r is the exponential decay range, and J is the number of sources.

Proximity metrics provide a simple exposure assessment model for point, line, and grid data. From an exposure model validation perspective, however, they are limited because there is no observed data to validate the model. For epidemiological studies, proximity metrics are tested directly against health outcome data with model selection or evaluation of multiple models [63]. If monitoring data are available for an exposure of interest, such as a chemical exposure, developing a model that directly predicts the chemical concentration is recommended. Proximity metrics are routinely developed and applied as geographic covariates in other types of exposure models, such as land use regression models.

Land-use regression

Briggs et al. [64] is largely credited with developing land-use regression (LUR) as a method for estimating air pollution exposure. Coincidentally, a similar method for estimating nutrient loads in river reaches was also introduced in the same year [65]. Land-use regression is simply a linear or nonlinear regression model with spatially-referenced, geographic covariates. The most common linear LUR is:

$$}}(p)=X(p)\beta +\varepsilon$$

(4.2.1)

where Y(p) are the n × 1 observations for the variable of interest (e.g., PM\(_,N_^\)) with space-time locations s and t, X(p) is an n × k design matrix of k spatial and/or spatiotemporal geographic covariates, β is a k × 1 vector of linear regression coefficients, and ε is the n × 1 vector of independent and identically distributed (i.i.d.) errors typical of a classical linear regression. Surface water [65] and similar groundwater models [22, 41] use nonlinear regression with linear source terms multiplied by exponential attenuation and transport terms.

A strength of LUR models is their flexibility to include other models and data sources as interpretable geographic covariates. The exposure models discussed in subsequent sections can be included as covariates in LUR models. Another strength of LUR geographic covariates is their flexibility with respect to distance parameters (i.e., distance hyperparameter). The distance hyperparameter is typically unknown, and thus calculating many of the same variables with varying distance hyperparameters is recommended [37, 40]. Model selection or dimension reduction is used to determine the best covariates and corresponding distance hyperparameters and provide insight into the spatial and temporal scales of the process of interest.

LUR models can be used to make exposure predictions at any location where geographic covariates exist. LUR model prediction mean and variance follow the same formulation as standard linear regression models. The prediction mean, \(\hat\), at new location, p*, is:

$$\hat(_)=X(_)\hat$$

(4.2.2)

where \(\hat\) is the estimated coefficient vector in equation (4.2.1). The LUR prediction variance at a new location, p*, depends on the variation in the residuals and the variation of estimating the true mean with the predictions [66]. It follows that the prediction variance is:

$$Var(\hat(_))=Var(X(_)\hat)+Var(\varepsilon )$$

(4.2.3)

The variance of the residuals is the sample variance, \(_^\). Expanding each component of equation (4.2.3), the general formulation can be written as [66]:

$$Var(\hat(_))=_^[1+X(_)^X(p)]}^X_)}^]$$

(4.2.4)

This general formulation for prediction variance is also seen in other regression-based models (Sections “Geographically Weighted Regression” and “Geostatistical Models: Gaussian Processes, Kriging, and BME”).

Geographically weighted regression

Geographically weighted regression (GWR) is an extension of linear regression and LUR models that allows for spatially and/or spatiotemporally varying coefficients [67]. The main principal behind GWR is that the coefficients in spatial models are non-stationary; that is, the properties or values of model parameters vary depending on local conditions. The GWR extension of linear regression can be written as [68]:

$$}}(p)=X(p)\beta (p)+\varepsilon$$

(4.3.1)

where Y(p), X(p), and ε are the spatiotemporally varying outcome, spatiotemporal covariate, and i.i.d. error and are the same as in equation (4.2.1). β(p) are now spatially and temporally referenced. Mathematically, the coefficient estimates can be considered a version of generalized least squares [68] or a random effect model [69]. For the former, the coefficients are:

$$\beta (p)=^W^X(p)\right)}^X^W^Y(p)$$

(4.3.2)

where W(p) is a spatiotemporal weight matrix. Gelfand et al. [69] provide the general specification of the random effects approach for spatially varying coefficients as:

$$\tilde_}(p)=_+_(p)$$

(4.3.3)

which can interpreted as a spatially-varying random adjustment, βk(p), at locations p to the overall slope βk.

While GWR is not as popular as LUR in environmental exposure assessment, it is gaining favor and has been successfully implemented in a few cases. Hu et al. [70] and Van Donkelaar et al. [71] both implemented a GWR for a PM2.5 model that integrated a variety of geospatial covariates. van Donkelaar et al. [72] utilized GWR to estimate PM2.5 chemical composition, such as nitrates, sulfate, and organic matter. Kloog et al. [73] and Kloog et al. [74] developed random-effects models for prediction of PM2.5.

Brunsdon et al. [67] and Fotheringham et al. [68] provided frameworks for estimating the GLS style GWR models, including a spatiotemporally varying weight matrix, W(p). Briefly, choices on bandwidth or the distance at which coefficients are smoothed, must be made. This includes choices on the bandwidth distance and the smoothing function (e.g., inverse-distance, Gaussian kernel). Algorithms are available to estimate these parameters systematically through a cross-validation procedure; however, more flexibility in estimation is directly related to increased computational burden. To simplify choices and computation time, Van Donkelaar et al. [71] weight coefficient estimates according to an inverse-distance from observations. Gelfand et al. [69] provide the details such as the likelihood derivations and Bayesian estimation approaches for the spatially-varying random effects approach including the posterior predictives estimates.

GLS-style GWR models also have a straightforward approach for making exposure predictions at any location where the geographic covariates exist. GWR model prediction mean and variance are similar to LUR, but with modifications for the spatially varying coefficients and weights matrix. The prediction mean, \(\hat\), at new location, p*, is:

$$\hat(_)=X(_)\hat(_)$$

(4.3.4)

where \(\hat(_)\) is the estimated coefficient vector in equation (4.3.1) at location p* computed with (4.3.3). The GWR prediction variance also follows the general formulation of equation (4.2.4). The prediction variance for GWR, which accounts for both estimation of the mean uncertainty and point prediction uncertainty is [75]:

$$Var(\hat_})=_^\left[1+_^_X]}^[^_^X]^_X]}^_^\right]$$

(4.3.5)

where the spatiotemporal index, (\(_}\)), is implied for brevity (i.e., \(\hat(_)\) in equation (4.3.4) is equivalent to \(\hat_}\) in equation (4.3.5)).

Geostatistical models: Gaussian Processes, Kriging, and BME

Geostatistical models are models that contain explicit error terms to model spatial, temporal, or spatiotemporal auto-correlation in the data. In other words, they are a function to interpolate, extrapolate, or smooth the dependent variable. They have a rich history across many scientific and computational fields including forestry [1], geology [3], engineering [76], statistics [4, 24], machine learning [77] and most recently in environmental health [78]. For this reason, there is often confusion in terminology as nominally equivalent methods were developed in parallel among siloed disciplines. For example, the term “Kriging” is most popular in the engineering and public health literature whereas “Gaussian Process” is more often used in the spatial statistics and machine learning literature.

By definition, a Gaussian process (GP) is a collection of random variables, a finite number of which have a joint Gaussian distribution [77]. A GP is defined by a mean, μ(p), and covariance between locations, Σ(p, p*)

$$}}(p)=GP(\mu (p),_(p,_))$$

(4.4.1)

Each location is defined by the marginal Gaussian distribution, and thus the number of parameters in the model increases along with an increase in sample size. Hence, GP theoretically has an infinite parameter space and is considered non-parametric. Σθ is a covariance matrix that is modeled with kernel functions with parameters θ.

Geostatistical models can also be written as a mixed-effect model where the covariance between points is contained in the random effects term:

$$}}(p)=\mu (p)+\eta (p).$$

(4.4.2)

μ(p) can take many forms such as linear, nonlinear, or even ML models such as random forest [79]. Here, μ(p) is the form of a simple linear model, Xβ, and η(p) is an error term, which can decomposed into independent and identically distributed error and spatiotemporally correlated error represented as a GP, η ~ GP(0, Σθ + τ2I). Σθ is a covariance matrix with parameters, θ, that accounts for correlation between spatial and temporal locations. Given that yi has a Gaussian distribution, the vector of space-time observations, Y, has a multivariate Gaussian distribution. Thus, we can utilize the probability distribution function of a multivariate Gaussian density to define the likelihood of equation (4.4.2) as [24]:

$$L(\beta ,\theta ;}})=^| _^\exp \}}-X\beta )}^_^(}}-X\beta )/2\}$$

(4.4.3)

where ∣Σθ∣ is the determinant of the covariance matrix, a positive-definite matrix parameterized by the covariance or kernel function. The choice of covariance or kernel functions is an active area of research, but, in exposure science, stationary, symmetric kernel functions such as exponential, Gaussian (squared-exponential), and Matérn are the most common and are recommended. The squared exponential or Gaussian covariance with variance σ2, length scale (i.e., decay range), parameter r, and distance between locations, d, is one of the simplest and most common choices:

$$K(d| ^,r)=^exp(-^/r)$$

(4.4.4)

Letting σ = 1, note that as d moves toward 0, the correlation moves toward 1: this implies that the covariance and correlation between locations increases as points are closer together, with the rate and overall distance determined by the estimated covariance parameters. The squared exponential also has a special property that ensures functions will be very smooth, or infinitely differentiable.

The Matérn kernel is a generalization of the Gaussian function that introduces a smoothness parameter to control how many times the sample paths can be differentiated. As the smoothness parameter approaches infinity, the squared exponential covariance is recovered. Natural phenomenon tend to have finite differentiability as opposed to infinite differentiability, and the theoretical properties are good [80], so the Matérn covariance is considered an appropriate choice for exposure and health applications.

Bayesian maximum entropy (BME) is a popular geostatistical approach that can be considered an extension of classical geostatistics methods. Like classical geostatistical methods, covariance parameters are estimated based on an empirical estimation of covariance and the approach does not utilize distributional assumptions. The key aspect differentiating BME from classical geostatistics methods is that predictions can include non-Gaussian uncertainty in predictions at new locations. He and Kolovos [81] extensively reviewed BME, including its successful applications in geospatial exposure modeling.

MLE and Bayesian estimation can simultaneously estimate the mean and variance components of equation (4.4.1), which is statistically more optimal than the empirical approach but introduces computational challenges. With a large number of locations, the likelihood, as in equation (4.4.3) becomes computationally difficult to evaluate because the inverse of the covariance matrix Σ is dense. Nearest neighbor approximations as predictive processes [82] and general Vecchia approximations [83] are among the simplest and most effective techniques: correlations between points that are far away from each other are essentially ignored. Nearest neighbor approximations are suitable for MLE or Bayesian inference. A popular approach for efficient Bayesian inference is the integrated nested Laplace approximation (INLA), which uses a stochastic partial differential approximation of a multivariate Gaussian random field and the Laplace approximation for posterior distributions [84]. Moran and Wheeler [85] developed a Gibbs sampling algorithm for rapid Bayesian inference utilizing hierarchical matrix approximations. Combined mean and GP estimation is further discussed in section “Hybrid” since the distinction between some GP and hybrid methods is blurry.

Kriging is often referred to as the explicit step using geostatistical models for prediction at new locations. For the covariance matrix Σ, we notate the dimension representing new predictions locations with subscript * and observations otherwise. The Kriging prediction mean, \(\hat\), assuming a linear mean, is:

$$\hat(_)=X(_)\hat+\left[K(_,p)^I]}^[Y(p)-X(p)\hat]\right]$$

(4.4.5)

where \(X(_)\hat\) is the linear mean at prediction locations, K(p*, p) (e.g., Equation (4.4.4)) is the estimated covariance matrix between prediction locations and observations, K(p, p) is the estimated covariance matrix between observations, τ2I is the independent error added to the K(p, p) diagonal, and [\(Y(p)-X(p)\hat\)] is the residual at the observations. The Kriging prediction mean can be described as the linear regression model mean at prediction locations plus the interpolated residuals of the exposure data observations. The strength of the residual interpolation is based on the covariance parameters. The Kriging/GP prediction variance is the following:

$$Var(\hat(_))=K(_,_)-K(_,p)_p)+^]}^K(p,_)$$

(4.4.6)

where K(p*, p*) is the covariance matrix between prediction locations, and \(K(p,_)=K_,p)}^\) is the transpose of the covariance between prediction locations and observations. The Kriging/GP variance can be described as the total estimated variance at the prediction locations minus the variance from the additional information contributed by the observation residual interpolations.

Machine learning

Machine learning (ML) describes predictive modeling focused on a learning algorithm and out-of-sample prediction generalization [86]. ML methods have fewer assumptions and are highly parameterized and thus more flexible for capturing complex non-linearity. ML methods for geospatial exposure assessment utilize the same geographic covariates as predictor variables such as LUR (section “Land-use regression”), GWR (section “Geographically weighted regression”), and geostatistical models (section “Geostatistical Models: Gaussian Processes, Kriging, and BME”) and can capture non-linear relationships within and across covariates. Nonetheless, great care is needed to estimate a ML model properly as they are often susceptible to over-fitting. Their success in a wide variety of computational applications has led to their adoption in geospatial exposure modeling. A general ML equation is [87]:

$$}}(p)=f(X(p))$$

(4.5.1)

where f( ⋅ ) is a difficult-to-express function and encompasses a wide variety of forms in ML. Prediction for ML models varies, but, in general, ML prediction utilizes parameters from the estimation process and geographic covariates at prediction locations:

$$\hat(_)=f(X(_))$$

(4.5.2)

ML models typically estimate a central tendency and do not have an explicit prediction variance equation; however, bootstrapping techniques provide a straightforward way to determine approximate prediction variance. For details on the methodologies of ML models and algorithms including generalized additive models, tree methods, boosted and additive trees, support vector machines, and neural networks, we refer readers to Hastie et al. [88]. Additionally, Yan [87] summarized and provided examples of ML for chemical safety applications, including geospatial exposure assessments. Here, we discuss the basic equations and properties of two classes of ML methods that have been successfully applied to geospatial exposure modeling: neural networks and ensemble models.

Neural networks

Neural networks, or artificial neural networks (ANN), are inspired by the structure and function of the human brain. Neural networks consist of layers of interconnected nodes, known as artificial neurons, that process and transmit information through the network. The connections between neurons are weighted, and the weights are updated during the training process to improve the accuracy of the network’s predictions. At their simplest, ANN are essentially repeated logistic regression models. However, ANN represent a modern frontier in statistics and ML where improvements and new models abound. More complicated ANN, referred to as deep learning, allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of non-linearity and abstraction. For comprehensive explanations of neural networks and deep learning, we refer the readers to Bishop [89], LeCun et al. [90], and Goodfellow et al. [91]. Additionally, Yan [87] provided an overview of ANN in exposure science applications.

ANN have been successfully applied in geospatial exposure modeling. Di et al. [92] and Di et al. [93] developed highly accurate, annual average PM2.5 and ozone predictions, respectively. They noted that convolutional layers have the attractive property of accounting for spatial autocorrelation and spatial scale hierarchies (i.e., long range vs. short range). Pyo et al. [94] used ANN to predict cyanobacteria algae blooms in surface water, which are important for human and ecological health applications. Müller et al. [95] and Azimi et al. [96] used ANN to predict groundwater quantity and quality, respectively. ANN have improved predictions of social determinants and their associations with health outcomes [97]. Lastly, Weichenthal et al. [98] discussed future research directions for ANN in exposure science applications.

Ensemble methods

Individual geospatial exposure models, no matter how sophisticated, have strengths and weaknesses compared to alternative model choices. For example, one model may capture low concentrations better than high concentrations while another may capture regional variability better than fine, local-scale variability. Ensemble models are a class of ML algorithms based on the simple concept that a large committee of models is better than an individual model. Here, we describe ML methods based on an ensemble of base or weak models. This differs from ensemble models that are combinations of multiple other full models, often referred to as meta-learners, super-learners, or hybrid models. These are discussed in section “Hybrid”.

In an ensemble model, the final prediction is a weighted average of multiple models:

$$Y(p)=_^__(X(p)),^_=1$$

(4.5.3)

where fm is an individual model, m = 1, …, M, and the weights, wm, sum to 1 and are typically estimated through an optimization procedure. If the weights are equivalent, wm = 1/M, then a simple average of base models can be used. If a weighted average is desired, a simple linear model or GAM can serve as the meta-learner.

Hastie et al. [88] extensively explained tree-based ensembles for regression and classification. Briefly, a tree-based model splits data into hierarchical subsets based on certain features and at each split, applies a decision rule to partition the data and fit a simple model such as a constant. The prediction for a new sample is determined by traversing the tree, applying the decision rules at each node until a terminal node is reached, and using the average target value associated with that terminal node as the prediction. Breiman [99] introduced bootstrap aggregating and random forests, where a given data sample is bootstrapped M times (i.e., randomly sampled with replacement). A tree-based model is then fit on a given bootstrap sample, and the final prediction the average of all bootstrap model predictions. Random forest and bootstrapped aggregated models can be efficient as the individual models are easily parallelizable.

An alternative approach to developing ensemble models is to build the final model sequentially with base or weak-learner models, where each model attempts to improve slightly over the previous aggregated models. Informally, in gradient boosting [100], simple models are added in a stage-wise manner optimized via gradient descent on the current model’s residuals:

where Ym is the gradient-boosted model at iteration m, Ym−1 is the previous iteration’s full model, ym is the current base or weak-learner model, and ν is a penalization parameter between 0 and 1 that prevents the algorithm from proceeding too quickly and thus reducing effectiveness.

Ensemble models have been used with great success in geospatial exposure modeling. Random forest has been used in multiple studies to predict groundwater nitrate vulnerability [10, 101,102,103]. Ransom et al. [104] utilized boosted regression trees to predict groundwater nitrate concentrations in the Central Valley aquifer in California, USA. Gradient and extreme gradient boosting have also been used extensively to model spatiotemporal concentration of air pollutants such as PM2.5 [105,106,107]. Zhan et al. [108] added a spatial weighting to gradient boosting and reported better results than without geographic weighting for the spatiotemporal prediction of PM2.5. Lastly, new approaches have added Gaussian processes to random forest [109] and gradient boosting [109] to improve ensemble models with spatial data.

Mechanistic or chemical transport

The exposure models discussed in previous sections are considered statistical models. Conversely, mechanistic or chemical-transport models (CTM) represent a class of models derived from basic principles of physics and chemistry, such as conservation of energy, resulting in a system of partial differential equations [110]. Many CTM used in exposure science are derived from a general advection-dispersion model based on the principle of conservation of mass [111,

Comments (0)

No login
gif