Integration of datasets for individual prediction of DNA methylation-based biomarkers

Lothian Birth Cohorts (LBC) of 1921 and 1936 — DNAm quality control

DNAm data were assessed using the Illumina 450k array for 499 individuals from LBC1936 and 436 individuals from LBC1921 [14]. Prior to normalisation, each sample went through a number of filtering checks. Probes predicted to cross-hybridise or to target a site containing a polymorphism (n = 54,192) were removed [18]. P-values to quantify signal reliability (detection P-values) were computed for each CpG probe. Probes with a detection P-value greater than 0.05 in more than 1% of samples were removed (7366 probes in LBC1921; 1495 probes in LBC1936). Samples in which more than 1% of probes had a detection P-value greater than 0.05 were removed (49 LBC1921 samples; 1 LBC1936 sample). Finally, we removed probes with a bead count of less than three in more than 5% of samples (191 LBC1921 probes; 362 LBC1936 probes). Following quality control, there were 443,339 remaining probes common across all datasets. There were 387 individuals remaining in the LBC1921 cohort and 498 individuals remaining in the LBC1936 cohort; the combined cohort had 885 individuals.
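The filtering logic above can be sketched as boolean masks over a probes × samples matrix. This is an illustrative toy example with made-up data, not the pipeline actually used (which was run in R on the real intensity files); the thresholds match the text.

```python
import numpy as np

# Toy stand-ins for detection P-values and bead counts (probes x samples);
# values and dimensions are hypothetical, thresholds follow the text.
rng = np.random.default_rng(0)
n_probes, n_samples = 1000, 50
det_p = rng.uniform(0, 0.04, size=(n_probes, n_samples))
det_p[:5, :10] = 0.2           # plant a few unreliable probes
beads = rng.integers(5, 20, size=(n_probes, n_samples))
beads[5:8, :5] = 2             # plant a few low-bead-count probes

# Probe filter: detection P > 0.05 in more than 1% of samples
bad_p = (det_p > 0.05).mean(axis=1) > 0.01
# Sample filter: more than 1% of probes with detection P > 0.05
bad_s = (det_p > 0.05).mean(axis=0) > 0.01
# Probe filter: bead count < 3 in more than 5% of samples
bad_b = (beads < 3).mean(axis=1) > 0.05

keep_probes = ~(bad_p | bad_b)
keep_samples = ~bad_s
```

The same masks applied to each cohort, followed by an intersection of surviving probe IDs, give the common probe set described above.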

Normalisation methods

Sixteen normalisation methods from the Minfi and WateRmelon packages [7,8,9,10,11,12,13] were applied to LBC1921, LBC1936, and the combined LBC dataset with the pipeline depicted in Fig. 1.

WateRmelon is an R package that implements several quantile normalisation (QN) methods under the systematic nomenclature described in [7]. Methods whose names begin with a ‘d’ apply background adjustment (‘n’ indicates no adjustment). The third letter specifies whether between-array normalisation was applied to Type I and Type II probes separately (‘s’), together (‘t’), or not at all (‘n’). The final letter indicates whether dye-bias correction was applied to Type I and Type II probes separately (‘s’), together (‘t’), or not at all (‘n’). The differences between the normalisation methods are described in Additional file 2: Table S2.
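The core operation shared by these QN methods is between-array quantile normalisation: every sample is forced onto a common reference distribution, taken as the mean of the sorted columns. A minimal sketch (illustrative only; the wateRmelon implementations additionally handle probe types and background as encoded in the method names):

```python
import numpy as np

def quantile_normalise(x):
    """Force every column (sample) of x to share the same empirical
    distribution: the mean of the per-column sorted values."""
    ranks = x.argsort(axis=0).argsort(axis=0)   # rank of each value within its column
    ref = np.sort(x, axis=0).mean(axis=1)       # mean quantile distribution
    return ref[ranks]                           # substitute reference value at each rank

x = np.array([[2.0, 4.0],
              [5.0, 1.0],
              [3.0, 6.0]])
qn = quantile_normalise(x)
```

After normalisation both columns contain exactly the values {1.5, 3.5, 5.5}, in the order given by each sample's original ranks.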

Minfi is an R package [19] that implements three additional normalisation techniques: Noob, Funnorm and subset-quantile within array normalisation (SWAN). Normal-exponential out-of-band (Noob) is a within-sample background correction method with dye-bias normalisation for DNAm arrays [11]. Noob uses a normal-exponential convolution to estimate the background distribution, measuring non-specific fluorescence from out-of-band Type I probe intensities (i.e. measurements in the opposite colour channel, Cy3 vs. Cy5). Funnorm, a between-sample normalisation method, makes use of 848 internal control probes and the out-of-band probes on the Illumina array to estimate 42 summary measures of technical variation [12]. The first two principal components of these summary measures are then used as covariates for intensity adjustment. SWAN consists of two steps [8]. The first step takes a subset of probes, defined to be biologically similar based on CpG content, and determines an average quantile distribution from this subset. The second step adjusts the intensities of the remaining probes by linear interpolation onto the distribution of the subset probes.
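SWAN's second step, interpolating remaining probes onto the subset distribution, can be sketched as follows. This is a simplified illustration of the interpolation idea only; the real SWAN operates on CpG-content-matched subsets within each probe type, which is not reproduced here.

```python
import numpy as np

def interpolate_to_reference(values, ref):
    """Map each value's empirical quantile onto a reference
    distribution by linear interpolation (the spirit of SWAN's
    second step, in simplified form)."""
    n = len(values)
    # mid-rank quantiles in (0, 1) for the values being adjusted
    q = (np.argsort(np.argsort(values)) + 0.5) / n
    ref_sorted = np.sort(ref)
    ref_q = (np.arange(len(ref)) + 0.5) / len(ref)
    return np.interp(q, ref_q, ref_sorted)

adjusted = interpolate_to_reference(np.array([10.0, 20.0, 30.0]),
                                    np.array([0.0, 1.0, 2.0, 3.0]))
```

The adjusted values inherit the reference distribution's scale while preserving the original ordering.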

In addition to the wateRmelon- and Minfi-implemented functions, we applied three widely cited methods in our comparison: BMIQ, peak-based correction and subset quantile normalisation [9, 10, 13]. BMIQ (a within-array method) first fits a three-state beta mixture model (0%, 50% and 100% methylation) to Type I and Type II probes separately, assigning each probe to the state with maximum probability. Type II probes are then normalised to the distributions of the Type I probes in the same state. Peak-based correction independently estimates M-value peaks for Type I and Type II probes, then rescales the Type II assays to match the estimates obtained for the Type I assays. Subset quantile normalisation (the Tost method) normalises the signal from Type II assays based on a set of Type I ‘anchor’ probes, which are considered more reliable and stable.
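The rescaling step of peak-based correction can be sketched in a few lines. This toy version assumes the unmethylated and methylated peak locations have already been estimated (e.g. by density-mode finding, which is omitted here) and simply stretches each side of the Type II M-value distribution to match the Type I peaks:

```python
import numpy as np

def peak_rescale(m_typeII, peaks_I, peaks_II):
    """Rescale Type II M-values so their unmethylated (negative) and
    methylated (positive) peaks align with the Type I peak estimates.
    peaks_* are (unmethylated_peak, methylated_peak) tuples."""
    lo1, hi1 = peaks_I
    lo2, hi2 = peaks_II
    return np.where(m_typeII < 0,
                    m_typeII * (lo1 / lo2),   # stretch the unmethylated side
                    m_typeII * (hi1 / hi2))   # stretch the methylated side

# Hypothetical peaks: Type II peaks at +/-2 are stretched to +/-4
rescaled = peak_rescale(np.array([-2.0, 2.0, 1.0]), (-4.0, 4.0), (-2.0, 2.0))
```

After rescaling, the Type II peaks coincide with the Type I peaks, removing the well-known compression of the Type II assay dynamic range.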

Normalisation assessment metrics

Three previously published performance metrics were considered [7]. Differentially Methylated Regions Standard Error (DMRSE) measures variation at sites in uniparentally (imprinted) methylated regions, which have an expected β value of 0.5; the standard error is computed by dividing the standard deviation of the β values at these regions by the square root of the number of samples. Genotype Combined Standard Error (GCOSE) examines highly polymorphic SNPs, which have three genotype groups: heterozygous, or homozygous for the major or minor allele. This metric clusters observations into the three groups by genotype, computes a mean-squared error for each cluster, and then averages the three values. Finally, the Seabird metric computes the area under the curve (AUC) for a predictor trained on sex differences in X-chromosome methylation, one copy of which is hypermethylated in females. Each normalisation method was ranked on each of the three metrics; the ranks were then averaged to compute a mean overall rank.
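The first two metrics reduce to short formulas; a minimal sketch, with hypothetical toy β values in place of real array data:

```python
import numpy as np

def dmrse(betas):
    """DMRSE: SD of beta values at imprinted-DMR CpGs divided by the
    square root of the number of samples. betas: (probes x samples)."""
    return betas.std() / np.sqrt(betas.shape[1])

def gcose(betas, genotypes):
    """GCOSE: cluster samples by SNP genotype (0/1/2), compute a
    mean-squared error within each cluster, and average the three."""
    mses = []
    for g in (0, 1, 2):
        cluster = betas[:, genotypes == g]
        mses.append(np.mean((cluster - cluster.mean()) ** 2))
    return np.mean(mses)

# Perfectly behaved toy data: iDMR betas exactly 0.5 and genotype
# clusters with no within-cluster spread give zero for both metrics.
idmr = np.full((3, 4), 0.5)
snp = np.array([[0.1, 0.1, 0.5, 0.5, 0.9, 0.9]])
geno = np.array([0, 0, 1, 1, 2, 2])
```

Lower values indicate better performance on both metrics; ranks per metric are then averaged across the three metrics per method.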

DNAm predictor of BMI

A DNAm predictor of body mass index (BMI) was derived using elastic net penalised regression (α = 0.5) on 18,413 participants from the Generation Scotland study [20]. The lambda value that minimised the mean error in a 10-fold cross-validation analysis resulted in a weighted linear predictor containing 3506 CpGs (see Additional file 3: Table S3). As the Generation Scotland DNAm resource was generated using the EPIC array, CpGs were first subset to the 445,962 sites that were common to the 450k array and that passed QC in the LBC analyses. They were further pruned to the 200,000 most variable CpG features (ranked by standard deviation) to avoid a memory allocation error in the elastic net model. R’s biglasso package was used to implement the elastic net regression model [21,22,23]. The input to the model was a 200,000 × 18,413 matrix containing the CpG M-values for each individual. The target variable was the residuals from a linear regression model of log(BMI) adjusted for age, sex and 10 genetic principal components. The distributions of BMI in the two LBC studies and in Generation Scotland are presented in Additional file 1: Fig S3.
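The residualise-then-penalise pipeline can be illustrated at toy scale. This sketch uses scikit-learn's `ElasticNetCV` rather than R's biglasso, simulated data in place of Generation Scotland, and tiny dimensions (the real fit was 200,000 CpGs × 18,413 people); note scikit-learn's `l1_ratio` plays the role of the α = 0.5 mixing parameter above.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV, LinearRegression

# Simulated stand-ins: CpG M-values, covariates, and a log(BMI) that
# depends on the first CpG. All values here are hypothetical.
rng = np.random.default_rng(1)
n, p = 120, 50
m_values = rng.normal(size=(n, p))
covars = rng.normal(size=(n, 3))             # stand-ins for age, sex, genetic PCs
log_bmi = 0.5 * m_values[:, 0] + 0.3 * covars[:, 0] + rng.normal(0, 0.1, n)

# Step 1: residualise log(BMI) on the covariates
resid = log_bmi - LinearRegression().fit(covars, log_bmi).predict(covars)

# Step 2: elastic net (mixing 0.5) with 10-fold CV to pick lambda
enet = ElasticNetCV(l1_ratio=0.5, cv=10, random_state=0).fit(m_values, resid)
weights = enet.coef_                          # CpG weights for the linear predictor
```

The nonzero entries of `weights` correspond to the CpGs retained in the final predictor (3506 in the real model).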

Prediction and robustness

Predictions of BMI were performed on both LBC datasets and the combined LBC dataset. An individual’s BMI was predicted by weighting their CpG values by the CpG weights from the Generation Scotland elastic net model. Overall model prediction performance was evaluated by Pearson’s correlation coefficient.
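Applying the predictor is a single dot product per individual; a minimal sketch with hypothetical weights and methylation values:

```python
import numpy as np

def predict_bmi(methylation, weights, intercept=0.0):
    """Weighted linear DNAm predictor: each person's score is the dot
    product of their CpG values with the trained CpG weights."""
    return intercept + methylation @ weights

# Toy example: 2 people x 3 CpGs, made-up weights and observed values
meth = np.array([[0.1, 0.5, 0.9],
                 [0.2, 0.4, 0.8]])
w = np.array([1.0, -2.0, 0.5])
scores = predict_bmi(meth, w)

# Performance: Pearson correlation between predicted and observed
observed = np.array([-0.4, -0.2])
r = np.corrcoef(scores, observed)[0, 1]
```

In the real analysis the correlation is computed across all individuals in each LBC dataset.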

Prediction robustness measures a normalisation method’s invariance to datasets being normalised independently, or jointly with another dataset. Robustness was calculated as the median absolute difference between the independent and joint predictions across all individuals. The goal is to identify how the test datasets behave when predictions are made using data normalised jointly or separately. Small median differences indicate normalisation methods that provide similar outputs irrespective of the data being normalised separately or together. Normalisation methods with large median absolute differences result in inconsistent predictions depending on whether new individuals are normalised jointly with previous data or not.
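The robustness statistic itself is one line; a sketch with hypothetical predicted-BMI values for four individuals:

```python
import numpy as np

def robustness(pred_separate, pred_joint):
    """Median absolute difference between predictions made on data
    normalised separately vs jointly; smaller values indicate a more
    robust normalisation method."""
    return np.median(np.abs(np.asarray(pred_separate) - np.asarray(pred_joint)))

sep = np.array([25.1, 27.3, 31.0, 22.8])     # cohort normalised on its own
joint = np.array([25.0, 27.6, 31.4, 22.8])   # cohort normalised with the other data
rob = robustness(sep, joint)
```

A method yielding `rob` near zero gives essentially the same prediction for an individual regardless of which other samples were normalised alongside them.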
