A phenome-wide scan reveals convergence of common and rare variant associations

Common variant analysis

The GWAS summary statistics for common variants were obtained from the Neale Lab (http://www.nealelab.is/uk-biobank) [19]. After quality control, variants with minor allele frequency (MAF) ≥ 0.05 were included. Up to 337,199 subjects were available, with the effective sample size varying with the trait. The genome-wide association statistics had been estimated using a linear model with sex and the first 10 principal components as covariates. We mapped the common variants to protein-coding genes. The generalized gene-set analysis approach MAGMA [20] with default settings was implemented to project the SNP-level signal to a gene level signal (Pcommon). The common variant-based heritability was estimated using ldsc applied to the GWAS summary statistics.

Rare variant analysis

The rare variant analysis was performed for a set of phenotypes in the UK Biobank dataset of 450,953 individuals. A total of 263,696 rare variants were annotated using VEP v95, as implemented in Hail with default parameters, and grouped into three annotation categories, including pLoF (high-confidence by LOFTEE), missense|LC (missense variants and variants annotated as low-confidence by LOFTEE), and synonymous. For each category, gene-based burden tests and SKAT-O tests were performed for each of the 19,407 protein-coding genes [21, 22]. The summary-level results were accessed from the Genebass portal through Hail (https://hail.is/) [18]. For each gene, the minimum p-value (Prare) among the six tests (two statistical tests × three annotation categories) was used in downstream analyses.

Estimation of level of convergence on shared effector genes

To estimate the convergence of common and rare variants on shared molecular mediators, variant-level signals were mapped to gene-level signals as described above. For each approximately independent linkage disequilibrium (LD) block [23], p-value-based (the product of Pcommon and Prare) clumping was performed to reduce the complexity of the LD structure. i.e., for each LD block, one gene with the highest association signal was included.

To control for the effect of sample size on the number of significant genes, the top-ranked 100 genes were selected from both common and rare variant-based gene lists as the significant genes. Sensitivity analysis was performed using top 20, top 50, and top 200 genes. Let a, b, c, and d denote the number of genes showing significance from both the common and rare variant-based tests, only from the common variant-based test, only from the rare variant-based test, and neither from the rare nor common variant-based test, respectively. Let \(n=a+b+c+d\) be the total number of genes. Let \(_\), \(_\), \(_\), and \(_\) denote the corresponding probabilities for the 4 sets of genes, respectively.

We define the Common variant and Rare variant Convergence (CORAC) signature, which quantifies the concordance of implicated genes for common and rare variants. The proportionate agreement (\(_}=_+_\)) is estimated using the gene-count statistics:

$$\widehat_\mathrm}=\fracn$$

The expected probability that both common and rare variants show a high-ranking signal at random (\(_ }=_ \times _\)) is estimated as follows:

$$\widehat_ }}=\fracn\times\fracn$$

The expected probability that a gene implicated by neither common nor rare variants shows a high-ranking signal at random (\(_ }=_ \times _\)) is estimated as follows:

$$_}}}=\frac\times \frac$$

The overall random agreement probability \(_}\) is the sum of \(_ }\) and \(_ }\). CORAC is given by the Cohen’s kappa coefficient \(\kappa\):

$$\kappa =\frac_}-_}}_}}$$

(1)

Note the estimator \(\widehat\) is a chance-corrected statistic (via \(_}}}\)). As a measure of agreement, \(\kappa\) is to be contrasted with the Fisher’s exact test and the \(^\) test, which assign the same p-value to perfect agreement or perfect disagreement. Furthermore, odds ratio has a problematic scale; it equals 1 in the case of random agreement and infinity in the absence of error, rendering comparison difficult to interpret.

To quantify the standard error of the estimator and facilitate downstream statistical inference, we performed bootstrap. Alternatively, one can conduct posterior inference from a Bayesian model [24] to quantify the uncertainty. The likelihood is given by:

$$\mathcal=\frac[_]^_]}^_]}^_]}^=\frac[_]^_-_]}^_-_]}^_-_-_]}^$$

Note this likelihood is a function of \(_\), \(_\), and \(_\). The prior on \(_\) And \(_\) can be assumed to be:

$$_ \sim \mathrm(s,\left(1-s\right))$$

$$_ \sim \mathrm(t,\left(1-t\right))$$

where \(0<s,t<1\), respectively. The prior on \(_\) is a uniform distribution. Using the likelihood and the choice of prior, the posterior distribution can then be used to obtain the posterior mean and the credible interval.

The Spearman’s correlation coefficient ρ was then calculated between the CORAC estimate \(\widehat\) and the effective sample size.

In addition, we define a modified statistic CORACmodified, which has some methodological advantages. CORACmodified is less dependent on the prevalence. i.e., the true proportion of associated genes, which may need to be considered in interpreting the agreement rate, allowing comparisons among phenotypes. CORACmodified is given by Gwet’s AC1:

where

$$_ }=2\pi \left(1-\pi \right)\mathrm\pi =\frac_ }=\frac(_+_)$$

The difference between the two convergence coefficients stems from how the adjustment for chance agreement between the common and rare variant signals is implemented (\(_}\) in \(\kappa\) versus \(_ }\) in \(g\)).

Stratified analysis

For a given phenotype, we define statistics that quantify the extent to which rare (common, respectively) variant informed analysis improves our ability to detect genes from the common (rare, respectively) variant analysis. Following the stratified FDR [25] approach for GWAS, we calculated the posterior probability that a gene is null for the rare variants given that the associations from the rare variants and the common variants are at least as significant as the observed associations:

$$\mathrm\left(_}|_}\right)=\frac_(_})_}}(_}|_})}$$

(2)

Here, \(_}\) is the p-value of the gene from the rare variant analysis, \(_}\) is the corresponding p-value from the common variant analysis, \(_(_})\) is the conditional proportion of null genes for the rare variant analysis given that the p-values for the common variant analysis are as small as \(_}\), and \(\mathrm(_}|_})\) is the conditional cumulative distribution function. Similarly, we define the posterior probability \(\mathrm\left(_}|_}\right)\) with \(_}\) and \(_}\) switched in Eq. (2).

This analysis was illustrated using a stratified Q-Q plot. This plot can be used to visualize the degree to which the use of gene-level associations from the rare (common, respectively) variant analysis enhances our ability to detect gene-level associations from the common (rare, respectively) variant analysis. Differential departure from the null across different p-value inclusion threshold criteria quantifies the enrichment due to the prior information independently of the presence of shared subjects in the common and rare variant analyses.

Role of negative selection in convergence

We tested the extent to which negative selection impacts the functional convergence of common and rare variant associations. Negative selection has been proposed as a mechanism for the extreme polygenicity of complex traits characterized by the flattening of heritability across the genome [26]. Negative selection may also induce variant effect size to vary with linkage disequilibrium [27].

We considered a class of genetic architectures consistent with signatures of negative selection [28], i.e., where the allele frequency influences the allelic substitution effect at a causal variant as follows:

$$_ | \left(_ , _\right) \sim \mathcal(0, \mathrm_\left(1-_\right)\right]}^^\!\left/ \!__\right)}\right.\right\}}^)$$

(3)

Here, the constant of proportionality \(\mathrm=\frac_}^}_}}\) is given by the heritability \(_}^\) divided by the number of causal variants \(_}\) and independent of the variant; \(_\) is the allele frequency of the variant \(i\); \(_\) is the LD score; and \(\alpha\) is a signature of selection on the trait linking the allele frequency of \(i\) to the variance of SNP effects. We assume \(r\) is either 0 (which corresponds to a MAF-dependent distribution of effect sizes) or 1 (which corresponds to a MAF- and LD- dependent distribution of effect sizes). The parameters \(_\) and \(_\) can be estimated from an ancestry-matched reference panel. In our framework, the LD score is integrated in the effect size distribution (Eq. 3), rather than downstream in the definition of CORAC, as one model of genetic architecture, through which LD influences the convergence signature. We assume that the genotype is scaled with mean 0 and variance 1. The model (Eq. 3) has been shown to be consistent with what is observed in the UK Biobank, with \(\widehat=-0.37\) [29]. We note that the constant factor \(\mathrm\) is the expected value of the per-SNP heritability under a neutral model (\(\alpha =0\)), in which the causal effect size distribution is independent of allele frequency. An estimate of \(\alpha\) can be obtained by maximizing the profile likelihood [30].

Here, we estimated \(\widehat\) using the approximate joint log likelihood \(\mathrm_}\) that can be calculated from summary statistics [31]. We computed the partial correlation between the convergence level \(\widehat\) (Eq. 1) and \(\widehat\) (Eq. 3) while adjusting for the effective sample size in addition to the Spearman correlation (without the adjustment).

We tested the robustness of the observation concerning the effect of negative selection on the degree of convergence by using another approach to detect the selection signal. We applied a Bayesian mixed model approach [29, 32] to infer the action of natural selection on the genetic variants underlying a phenotype. The approach estimates a parameter \(S\) (with an asymptotic normal approximation to its posterior distribution) representing the relationship between the variance of SNP effects and minor allele frequency using genome-wide SNP data. We then tested the correlation of the estimate \(\widehat\) with the CORAC estimate \(\widehat\).

Simulation framework

We studied the behavior of the convergence level with respect to effective sample size in simulations. Towards this end, using actual (scaled) genotype data, we simulated genetic architectures consistent with negative selection (Eq. 3, with \(r=0\)) for comparison with a baseline class of genetic architectures (in which there is no dependence of the distribution of variant causal effect on allele frequency):

$$_ |\left(_,_\right)\sim \mathcal\left(0,\frac}^2}_}}\right)$$

(4)

We set the following parameters: heritability (0.30), the proportion of causal genes (10%), and the probability \(_ }\) (0.50), i.e., the proportion of shared causal genes between common and rare variant-based signals. We varied the effective sample size (from 1000 to 10,000). For each class of genetic architectures, we generated \(n\) simulations (100) with different seeds for sampling. We generated the phenotype and identified the gene-level signals from the common and rare variants. We investigated the Spearman correlation between \(\kappa\) and sample size for each class of genetic architectures.

To examine the behavior of the correlation before and after decorrelating common and rare variants, we fixed the genotype for common variants and shuffled the genotype for each rare variant across individuals to break any potential correlation. Furthermore, we calculated the convergence level under different degrees of polygenicity by varying the proportion of causal genes across the genome (from 5 to 20%).

Verification using independent data sources for common and rare variants

We further estimated the correlation between the convergence level and effective sample size using independent data sources for the common and rare variant-based signals. This analysis enabled us to evaluate the impact of the use of a shared biobank dataset for the common and rare variant analyses. We used the UK Biobank-free GWAS results from the GWAS ATLAS [33] as the source for the common variant-based associations. The GWAS results generated from any UK Biobank samples were excluded. The WES-based results from the UK Biobank were used as the source for the rare variant-based signals. Only European ancestry samples were included. To harmonize the phenotype data between the GWAS ATLAS and the UK Biobank, the Sentence-BERT [34] word embedding model was implemented [34]. The resulting embeddings from the Transformer-based network were used to search for semantic similarity. Phenotype pairs with cosine similarity less than or equal to 0.75 were filtered out. We then manually confirmed the resulting phenotype pairs and removed duplicated ones. The correlation between CORAC and effective sample size was estimated in this dataset.

留言 (0)

沒有登入
gif