Reconstructing individual-level exposures in cohort analyses of environmental risks: an example with the UK Biobank

UK Biobank

The UK Biobank cohort is a longitudinal study of adults aged 40–69 at recruitment in the United Kingdom between 2006 and 2010 [8]. Overall, 503,325 participants were recruited; each attended an assessment centre and completed questionnaires on socio-economic factors, lifestyle, and medical history, among other information. They also underwent a wide range of physical measurements and provided biological samples. The study is periodically enriched with follow-up assessments, new data originating from research projects, and updates from external databases. These include linkage with electronic health records (EHR) and national health system registers, covering deaths, cancer occurrences, hospitalisations, and primary care visits. The environmental exposure information currently available in the UK Biobank consists of annual averages of air pollutants and noise for single years between 2006 and 2010. Air pollution measures are limited to a sub-group of participants and were obtained from Europe-wide land-use regression models [14].

Linking new environmental data to cohort participants requires three sources of information, exemplified by the pseudo-data in Table 1. These simulated data are used in this and the following sections to describe the linkage process and the epidemiological analyses. The first source is the baseline cohort information, illustrated in Table 1a. These data are represented here by the enrolment and last follow-up dates for each participant, identified by a pseudo-ID. They are usually linked to other information collected at baseline or during follow-up assessments, such as personal characteristics and socio-economic factors, which are not shown here. The second source is the health data, some of which are accessible to UK Biobank researchers through a standard application. For instance, the main database includes inpatient records of the first occurrences of a series of adverse clinical events. An example with pseudo-data is provided in Table 1b, including the same pseudo-IDs of the subjects, as well as the ICD-10 codes and dates of the events.

Table 1 Example of pseudo cohort data, including a baseline cohort information, b health outcomes, and c residential histories.

The final source is the residential histories of the subjects. In the UK Biobank, these are limited-access data, represented by the dates and locations of the participants’ residential addresses, where the location represents the centroid of a 1 km and 100 m buffer that contains the exact location. These data were collected during the baseline interview and are continuously updated via self-report or new registrations with general practices of the National Health Service (NHS). Residential pseudo-data are shown in Table 1c, including pseudo-IDs for subjects and locations and the start and end dates of the period the subject stayed at each address, alongside the corresponding geographical coordinates (Easting and Northing values of the British National Grid).
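
As an illustration only, the three sources described above could be held in record types such as the following. This is a minimal sketch: the class names, field names, and the helper `residences_of` are hypothetical and do not correspond to UK Biobank field identifiers.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Participant:
    """Baseline cohort information (cf. Table 1a); hypothetical layout."""
    pid: int
    enrolment: date
    last_follow_up: date


@dataclass(frozen=True)
class HealthEvent:
    """Health outcome record (cf. Table 1b); hypothetical layout."""
    pid: int
    icd10: str
    event_date: date


@dataclass(frozen=True)
class Residence:
    """Residential history record (cf. Table 1c); hypothetical layout."""
    pid: int
    location_id: int
    start: date
    end: date
    easting: float   # British National Grid, metres
    northing: float  # British National Grid, metres


def residences_of(pid, residences):
    """Return the residences of one participant, ordered by start date."""
    return sorted((r for r in residences if r.pid == pid),
                  key=lambda r: r.start)
```

The shared pseudo-ID (`pid` here) is what allows the three sources to be joined in the steps that follow.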

Spatio-temporal exposure maps

Advances in exposure assessment have been achieved through important developments in two areas. The first is the increasing availability of data resources with high spatial and temporal resolution and extended coverage, in particular from remote sensing sources. The second is the provision of innovative analytical techniques, for instance machine learning algorithms or atmospheric and climate models with increasingly better performance and reliability. These technological advancements make it possible to produce fine-scale spatio-temporal maps of environmental exposures applicable in population-based epidemiological studies [15]. These state-of-the-art tools have rapidly replaced classical exposure assessment methods, such as assignment to the closest monitoring station or traditional land-use regression models, as the latter fail to provide accurate estimates for large areas and over long periods of time [16].

In this contribution, we consider a dataset that is currently used to assign daily exposures to fine particulate matter (PM2.5, in µg/m³) to the participants’ locations in the UK Biobank. This product was generated by a multi-stage machine learning model applied to predict daily PM2.5 concentrations on a 1 × 1 km grid across Great Britain during the period 2008–2018. The model was trained on data from 581 monitoring stations, using a long list of spatial and spatio-temporal predictors including remote sensing satellite observations, traffic data, weather simulations, road characteristics, and land-use information, among others. The model had a good overall performance, with a cross-validated R² of 0.767. Details are provided elsewhere [16].

This resource is used in the next sections to exemplify the linkage process of PM2.5 measures to participants of the UK Biobank.

Spatial linkage (Step 1)

Geographical information systems (GIS) have become a staple technique for constructing environmental databases. In this context, GIS provide a binding framework between environmental measures and cohort data collected at the individual level, combining different layers of information at a single point in space [17]. These techniques are employed in epidemiological analyses by overlaying geographical reference grids on which investigators can jointly map exposure information with individual or area-level variables. This maximises the available information by downscaling or upscaling measurements across levels of aggregation, as well as by combining measurements across space and time.

We discuss the application of GIS techniques and related problems by illustrating the linkage of environmental exposures to the UK Biobank. The cohort database includes the locations of the residential addresses of each participant. An example is provided in Fig. 1, which shows the PM2.5 levels for one day from the 1 × 1 km gridded spatio-temporal map presented in the previous section. The map also includes the three residential addresses for Subject 2 listed in Table 1c and, for one address, a magnified detail of the 1 × 1 km cells surrounding the location.

Fig. 1: The maps display PM2.5 levels on a specific day over Great Britain, with three locations (large black dots) that represent the residential addresses of a specific subject (ID 2 in Table 1).

The magnified area on top represents the exact location at higher resolution, surrounded by the four nearest centroids (small indigo dots) of the overlaid PM2.5 grid. Without interpolation, the residential exposure value (small black dot) would be represented by the value of the nearest centroid. The magnified area below illustrates the process of reconstructing the residential value as a bilinear interpolation of the four nearest centroids.

A simple linkage option is to assign the value of the grid cell containing the location. However, this option has two main drawbacks. First, it does not account for the information in the neighbouring cells, which can complement the cell-level measurement with details on small-scale variability and improve the exposure assignment. Second, and more importantly, the direct linkage of cell-specific values can result in the potential privacy breaches described above, by allowing the location to be traced back using geographic information from the original gridded environmental data, if these data are publicly available and at sufficiently high resolution.

In lieu of the simple linkage approach described above, other methods of varying complexity can be used, and the choice depends on the type of exposure data and the underlying objective of the data linkage. For example, in the presence of ground monitor data, a simple strategy would be to assign exposure as the inverse-distance weighted average of the nearby monitors. For gridded exposure data, established routines such as simple spatial averaging, bilinear interpolation, and kriging exist in the two-dimensional case, while more specific methods have been investigated more recently as a consequence of the rise of new forms of spatial data [18]. Here, we propose the use of bilinear interpolation, which consists of a repeated linear interpolation across the two geographical dimensions and is graphically represented in Fig. 1. We deem this method to be an effective but simple option for several reasons. The process addresses the two drawbacks of the simpler linkage described above. First, it preserves the exposure information by spatially combining measurements across multiple grid cells. Second, and more importantly, it generates a continuous exposure field with values that cannot be linked back to the original sources, preventing the identification of the residential locations even when using highly resolved and public exposure databases. Compared to other interpolation methods, bilinear interpolation does not require a choice of parameters (e.g., search radius or number of neighbours), and it is more accurate than simple spatial averaging as it accounts for the distances among the points in the computation of the interpolated value [19]. Moreover, its deterministic nature makes it computationally inexpensive even for very large datasets, for instance in comparison to kriging [20]. Finally, bilinear interpolation is commonly implemented in data analysis and geographical software and is therefore easy to apply.
It must be highlighted that, regardless of the method, the accuracy of this linkage depends on the spatial resolution of the original exposure data and on the precision of the coordinates of the locations.
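
The repeated linear interpolation can be sketched in a few lines. The example below is a minimal illustration, not the implementation used for the UK Biobank linkage. It assumes a regular grid stored as a list of rows, with `grid[r][c]` holding the value at the centroid located at `(x0 + c * cell, y0 + r * cell)`; the function name and all parameter names are hypothetical.

```python
def bilinear_interpolate(grid, x0, y0, cell, x, y):
    """Interpolate a regularly gridded field at the point (x, y).

    grid[r][c] is the value at the centroid (x0 + c * cell, y0 + r * cell).
    The grid must contain at least 2 x 2 centroids.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    # Position of the point in units of cells, relative to the first centroid
    fx = (x - x0) / cell
    fy = (y - y0) / cell
    if not (0 <= fx <= n_cols - 1 and 0 <= fy <= n_rows - 1):
        raise ValueError("point lies outside the area covered by the centroids")
    # Lower-left centroid of the four that surround the point
    c0 = min(int(fx), n_cols - 2)
    r0 = min(int(fy), n_rows - 2)
    tx, ty = fx - c0, fy - r0
    # First pass: linear interpolation along x on the two bracketing rows
    low = (1 - tx) * grid[r0][c0] + tx * grid[r0][c0 + 1]
    high = (1 - tx) * grid[r0 + 1][c0] + tx * grid[r0 + 1][c0 + 1]
    # Second pass: linear interpolation along y between the two results
    return (1 - ty) * low + ty * high
```

With four centroids valued 10, 20, 30 and 40 on a 1 km grid, a point at the centre of the square receives the value 25, whereas nearest-centroid assignment would return one of the four original values. Because the result varies continuously with the coordinates, it generally does not coincide with any published cell value.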

Reconstruction of individual-level exposure series (Step 2)

The linkage-interpolation operation in the previous section can be performed for each residential location of each participant of the cohort. The output data, combined with the residential histories, allow reconstructing subject-specific series representing individual exposure profiles.

This step is illustrated in Fig. 2 for Subject 2 in our case study. Specifically, the residential histories of this subject reported in Table 1c, combined with the interpolated series for the three residential locations obtained following the procedure in Fig. 1, allow extracting blocks of exposure series corresponding to the timeline of each subject’s residence at specific addresses. These blocks are then merged into a single individual series that represents a detailed residential exposure profile for an individual, accounting for exposure levels experienced at different locations during a defined time interval.

Fig. 2: The top three series represent the sequences of daily exposures at the residential addresses of subject ID 2.

At the bottom, the final subject-specific exposure series is assembled by concatenating the three series above based on the respective residential periods.
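
The concatenation step can be sketched as follows. This is a minimal sketch under stated assumptions: residential periods are given as inclusive date ranges, the interpolated series for each location is a dictionary keyed by date, and days without a recorded residence or value are kept as `None`. The function name and data layout are hypothetical.

```python
from datetime import date, timedelta


def build_exposure_series(residences, location_series, start, end):
    """Assemble one subject's daily exposure series from start to end.

    residences: list of (location_id, first_day, last_day), inclusive dates.
    location_series: dict location_id -> dict date -> interpolated value.
    Returns a dict date -> value, with None where no value can be assigned.
    """
    series = {}
    day = start
    while day <= end:
        value = None
        for location_id, first_day, last_day in residences:
            if first_day <= day <= last_day:
                # Take the value interpolated at the address occupied that day
                value = location_series[location_id].get(day)
                break
        series[day] = value
        day += timedelta(days=1)
    return series
```

Keeping gaps explicit, rather than silently dropping them, lets later steps decide how to treat exposure windows with incomplete residential information.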

Definition of individual summaries for epidemiological studies (Step 3)

The reconstruction of exposure profiles in the previous section offers detailed individual-level time series characterised by a fine temporal disaggregation, allowing the definition of various exposure summaries. In epidemiological analyses, this is of particular relevance as such summaries can be flexibly tailored to the specific research questions and study designs, resulting in more informative inferential procedures and reducing exposure misclassification.

The definition of the exposure summaries first requires assumptions on the temporal dependency between exposure and outcomes, determined by underlying biological mechanisms. Two intertwined aspects are particularly relevant: the timescale of the association and the related exposure window. The former differentiates short-term risks associated with daily variation from long-term effects due to chronic exposures experienced over years or decades. The latter determines the maximal temporal interval over which the exposure exerts its action, within a specific timescale.

We use our case study to illustrate the definition of exposure summaries for two different study designs for individual-level data: a survival analysis based on Cox proportional hazard models to assess long-term effects [21], and a case-crossover analysis to investigate short-term associations [22]. The two examples are represented in Fig. 3, using the pseudo-data related to specific health events in Table 1b.

Fig. 3: The graph presents the use of the exposure data in two examples of study designs used in environmental epidemiology.

The top figure illustrates a risk set within a study on the incidence of lung cancer (ICD-10: C34), with a case (subject 2) and controls matched by age, used in a Cox proportional hazard model to estimate long-term risks. The event (aquamarine star) and control (blue star) times are used to reconstruct backwards the exposure profiles of the three subjects, defined as 365-day (lag 0–364) averages of PM2.5 (light blue boxes). The bottom figure displays the same process to define risk sets for a time-stratified case-crossover analysis to estimate short-term risks. The graph shows three separate subjects (unrelated to Table 1) with the event (aquamarine star) and control (blue star) days matched on the day of the week in the same month, with exposure profiles defined as averages of lag 0–3.

The Cox proportional hazard model is based on a between-subject comparison, defining separate risk sets for each event. Each risk set includes the case subject as well as a series of control subjects who are at risk at the time of the event. An example of a single risk set is shown at the top of Fig. 3. The composition of the risk set depends on the time axis of interest, which in this case is represented by the age of the subjects. The controls are therefore sampled when they reach the same age that the case had when experiencing the event. For each subject, we retrieve their exposure history backwards with a lag period equal to the exposure window, and therefore define the related exposure summary.
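
The extraction for one risk set can be sketched as follows. This is an illustrative sketch only: it assumes that exact birth dates are available, measures age in days, and treats each control's exit date as the earlier of their own event or end of follow-up. The function names and the data layout are hypothetical.

```python
from datetime import date, timedelta


def index_dates_by_age(case_birth, case_event, controls):
    """Return the index date of each control who is at risk in the risk set.

    controls: dict id -> (birth, enrolment, exit), with exit taken as the
    earlier of the control's own event and the end of follow-up.
    A control's index date is the day they reach the case's age at event.
    """
    age_at_event = case_event - case_birth  # age in days, as a timedelta
    index = {}
    for cid, (birth, enrolment, exit_day) in controls.items():
        day = birth + age_at_event
        if enrolment <= day <= exit_day:  # under follow-up at that age
            index[cid] = day
    return index


def window_mean(series, index_day, window_days=365):
    """Average exposure over lags 0 to window_days - 1 before index_day.

    series: dict date -> value. Returns None if any day is missing.
    """
    values = [series.get(index_day - timedelta(days=lag))
              for lag in range(window_days)]
    if any(v is None for v in values):
        return None
    return sum(values) / window_days
```

Applying `window_mean` to the case at the event date and to each control at their index date yields the exposure summaries compared within that risk set.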

A case-crossover design follows a similar extraction procedure. However, in contrast to the survival model above, it is based on a within-subject comparison, with the case and controls represented by different times within the follow-up period of the same subject. Several control sampling schemes have been proposed in the literature [23], the most common being the time-stratified scheme, with controls sampled within pre-specified strata. An example with three subjects representing three separate risk sets with an exposure window of four days (lag 0–3) is provided at the bottom of Fig. 3.
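
A time-stratified selection with strata defined by calendar month and day of the week, as in Fig. 3, can be sketched as follows. This is a minimal sketch, not the authors' code; the function names and the dictionary layout of the series are assumptions.

```python
import calendar
from datetime import date, timedelta


def time_stratified_controls(event_day):
    """Return the control days for one event: all days in the same calendar
    month that fall on the same day of the week, excluding the event day."""
    n_days = calendar.monthrange(event_day.year, event_day.month)[1]
    candidates = (date(event_day.year, event_day.month, d)
                  for d in range(1, n_days + 1))
    return [d for d in candidates
            if d.weekday() == event_day.weekday() and d != event_day]


def lag_mean(series, day, max_lag=3):
    """Average exposure over lags 0 to max_lag before the given day.

    series: dict date -> value. Returns None if any day is missing.
    """
    values = [series.get(day - timedelta(days=lag))
              for lag in range(max_lag + 1)]
    if any(v is None for v in values):
        return None
    return sum(values) / (max_lag + 1)
```

For an event on 15 July 2015, the control days are 1, 8, 22 and 29 July 2015. The exposure summary is then computed with `lag_mean` for the event day and for each control day of the same subject.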

The availability of finely stratified temporal profiles allows higher precision in the definition of the exposure windows, before any potential aggregations are performed. For instance, multiple lag terms can be defined using daily, monthly, or yearly strata, thus allowing the application of distributed lag models over different timescales [24].
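
As a sketch of how such lag terms might be derived from a daily series, the function below averages the exposure within consecutive backward strata of a chosen length. It only builds the lagged terms that a distributed lag model would take as input and does not fit any model. The function name and the fixed-length strata (for example, 365 days for a year) are simplifying assumptions.

```python
from datetime import date, timedelta


def lag_strata_means(series, index_day, stratum_days, n_strata):
    """Return the mean exposure in each of n_strata backward strata.

    Stratum k covers lags k * stratum_days to (k + 1) * stratum_days - 1
    before index_day, so stratum_days=1 gives single daily lags.
    series: dict date -> value. A stratum with a missing day yields None.
    """
    means = []
    for k in range(n_strata):
        lags = range(k * stratum_days, (k + 1) * stratum_days)
        values = [series.get(index_day - timedelta(days=lag)) for lag in lags]
        if any(v is None for v in values):
            means.append(None)
        else:
            means.append(sum(values) / stratum_days)
    return means
```

Calling the function with `stratum_days=1` returns daily lag terms, while larger values return coarser strata from the same underlying series.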
