The best linear unbiased prediction (BLUP) method as a tool to estimate the lifetime risk of pancreatic ductal adenocarcinoma in high-risk individuals with no known pathogenic germline variants

PANGENFAM inclusion criteria and data collection

Inclusion criteria: (1) FPC families with ≥ 2 affected first or second degree relatives; (2) Hereditary breast and ovarian cancer (HBOC) families with at least one case of PDAC; (3) Families with ATM mutation and at least one case of PDAC; (4) Familial atypical multiple mole melanoma (FAMMM) families with at least one case of PDAC; (5) Hereditary Non Polyposis Colorectal Cancer (HNPCC) or Lynch Syndrome families with at least one case of PDAC; (6) Peutz Jeghers families; (7) Hereditary Pancreatitis (with pathogenic variants in the genes PRSS1 and SPINK1); and (8) Families with PDAC cases diagnosed at ≤ 50 years of age [20].

Some high-risk individuals underwent routine genetic testing in the clinic for known familial cancer associated genes, including BRCA2, BRCA1, CDKN2A, MLH1, ATM, PALB2, CHEK2 and SPINK. Of the 71 individuals tested, 21 (30%) were positive for a pathogenic variant, most frequently in the BRCA2 gene (57%). The data for the study are stored in a secure sever in a custom designed database in REDCap (13.4.11, 2024 Vanderbilt University). The database used for this study was downloaded on 5 October 2022 and consisted of 4602 individuals from 125 families. Each family has a 9-digit unique identifier, and each individual has a unique 12-digit identifier. The kinship of each individual was used to construct the correlation matrix based on the offspring relationship and off-kindred individuals were excluded. Information available for each individual on clinical history, sex, age, family identifier and generation were selected to estimate variance components and define heritability. The first phase of the analysis included 3780 individuals, 752 cases of any cancer, of which 213 correspond to PDAC.

Generation of the BLUP mixed model

Data from the families within PANGENFAM were used, assuming a genetic component, although the calculated risk encompassed all cancers with a possible genetic component and also those specifically with PDAC. First, we estimated the heritability of all cancer types with the information PANGENFAM pedigrees. The dependent variable was cancer yes or cancer no. Three models were used, adding sequentially a random effect, model I (individual), II (individual + family) and III (individual + family + generation), to which effects were added, taking as a reference the models described in our previous study that used the same model [23]. Model I was a very simple and biologically implausible model and Model III was more biologically plausible. The components of the 3 models are summarised in Table 1.

Table 1 Summary of the 3 mixed models, specifying the fixed and random effects in each case

Subsequently, the dependent variable was refined using just PDAC, yes or no, and applying model III. In this way, the BLUP analysis was used (1) to estimate the heritability of pancreatic cancer and (2) estimate the individual genetic risk of developing pancreatic cancer. The software RStudio [24] was used and several specific packages were employed for further analysis including, “kinship2“ [25], “pedrigreemm” [26], “pedigree” to plot the family trees of each family, the package “MCMCglmm“ [27] for the calculation of the mixed model and the ROCR package [28] to analyse the predictive character of the model obtained.

Statistical methodology for assessing individual risk of developing cancer

For this study, cancer was defined as a phenotypic trait resulting from the additive effect of a large number of genes with a medium-low effect, which supports the hypothesis that this disease has a heritable genetic component. The variable representing the cancer is binary, where a value of 1 is assigned to affected individuals and 0 to unaffected individuals. The usual model to study binary traits is a threshold model. This model assumes a continuous underlying random variable, liability, which when it is over a given threshold, this triggers the expression of one of the binary phenotypes, i.e. cancer or no cancer, and PDAC or no PDAC [29, 30]. The variance for this underlaying normal distribution was set to 1.

For the estimation of risk, the BLUP was calculated using the equations of the Henderson mixed model [31] and the Fisher infinitesimal model [32]. Generalized Linear Mixed Models (glmm) are an extension of the generalized linear models that allow the inclusion of response variables of different distributions, such as binary [27]. The linear mixed model was defined as:

where y is the observed phenotype, β and u are the fixed and random effects respectively, X and Z are matrices and e is the random error. The random effects follow a multivariate normal distribution MVN, u ∼ MV N (0, G) and e ∼ MV B(0, R) where G is the genetic covariance matrix and R the residual. Henderson proposes the following solution to the model:

$$\left[\beginX ^^X& X ^^Z\\ Z ^^X& Z^^Z+^\end\right]\left[\begin\widehat\\ \widehat\end\right]=\left[\beginX^^y\\ Z^^y\end\right]$$

(2)

The Fisher model states that genetic inheritance is based on an infinite number of loci with a small additive effect. The phenotypic variance VF is calculated as the sum of the genotypic variance VG and the environmental variance VE. In turn, the genetic variance is the sum of an additive component VA and a non-additive component VNA, related to dominance or epistasis effects. BLUP allows the calculation of the additive part of this heritability that is transmitted between generations. The heritability is calculated using Fisher’s expression:

The following formula was used to estimate the heritability from the components of the calculation of model III:

$$^=\frac_^}_^+_^+_^}$$

(4)

Where \(\theta\)2, is the variance. Denominator of formula (4) is the phenotypic variance decompose on their components, given that we are using a threshold model. Numerator of formula (4) is the additive component of the variance attributable to individual. The consistency of this estimate of h2 was assessed by testing the null hypothesis of heritability, i.e. h2 = 0 using a Bayes factor. The input parameters for the calculation include the kinship matrix, which establishes the relationships between individuals, and a matrix with the values of each individual for each of the effects included in the models. For the calculation of each of these variances necessary for the estimation, Bayesian inference was used because it is a stochastic calculation using a binary variable. The models were run with 5.25 million iterations, with an initial burn-in of 250 000 iterations and sampling every 2500 iterations, resulting in a sample size of 2000. When the model was run for only PDAC as a dependent variable, a model with 6.25 million iterations, with the same burn-in and thinning interval as previous models lead to a sample size of 2400.

For the a priori distribution, an inverse-gamma distribution with parameter expansion was used, which is that recommended by the author of the calculation package for small variances as indicated in the manual [33]. The residual variance was set to θ2 = 1. Once the models were calculated, an analysis of the convergence of the Markov chain was carried out using the Heidelberg and Welch test [34] to reject or fail to reject the results obtained in the model calculation.

Estimated genetic value calculation

The estimated genetic value (EGV) was obtained as the solution of the individual random effect, by calculating the mean of the 2000 samples obtained in the calculation of the variance associated with each one. The sampling interval chosen was sufficiently wide to reduce autocorrelation. Furthermore, to identify individuals at high risk for cancer based solely on family pedigree information, the individual risk was calculated as the mean of the values of their parents. This is because each individual inherits half of his additive genetic component from the father and half from the mother [31]. Pearson´s correlation coefficient was used to assess the correlation between family incidence of cancer and median EGV of the family. In order to assess differences in EGVs between groups non –parametric Kruskall Wallis test was used.

To assess the predictive ability of this calculation, the area under the Receiver Operating Characteristic (ROC) curve was used to assess the predictive ability of this calculation. Analysis of the ROC curve allows to distinguish between positive and negative cases during prediction, so that an area under the AUC (Area Under Curve) close to 1 indicates that the model has a high predictive ability to identify individuals at risk of developing cancer.

Comments (0)

No login
gif