Investigating the dark-side of the genome: a barrier to human disease variant discovery?

To date, no one has examined the potential impact of dark regions on gene discovery, likely in part because of the difficulty of investigating null findings or the absence of data. The aim of this analysis was to investigate whether dark regions could affect our ability to identify disease-relevant variants, both when fine-mapping genome-wide significant GWAS loci and when performing whole exome (WES) or whole genome (WGS) sequencing studies.

We investigated the overlap between a curated list of dark regions and dark genes from Ebbert et al. [4] and annotated GWAS loci, hereafter referred to as Genomic Risk Loci (GRL), for eight different diseases and complex traits: autism spectrum disorders (ASD); schizophrenia (SCZ); body mass index (BMI); bipolar disorder (BD); major depressive disorder (MDD); cholesterol; amyotrophic lateral sclerosis (ALS); and Crohn’s disease. These eight GWAS were taken from the FUMA public database of GWAS studies and each GRL was annotated with summary information for all genes in LD (R > 0.6) with the tagging SNP [5]. Across the eight studies, 33–73% of GRLs contained dark regions (Table 1). The amount of dark sequence within these regions varied from 92 bp (ASD) to more than 1 Mb (SCZ and BMI). Furthermore, 7–20% of the genes at each locus were found to overlap dark regions, with up to 2.5% of these genes having dark-CDS regions (dark protein-coding regions).
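The overlap summary described above can be illustrated with a minimal sketch. It is not the pipeline used in this study: the function names, the BED-style half-open coordinates and the toy intervals are assumptions made for illustration, and overlapping loci would have shared dark sequence counted once per locus.

```python
from typing import Dict, List, Tuple

# (chromosome, start, end); half-open, 0-based coordinates as in BED (assumed).
Interval = Tuple[str, int, int]


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals so no base is counted twice."""
    merged: List[Interval] = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def dark_bp_in_locus(locus: Interval, dark: List[Interval]) -> int:
    """Number of dark base pairs falling inside one locus."""
    chrom, l_start, l_end = locus
    total = 0
    for d_chrom, d_start, d_end in dark:
        if d_chrom == chrom:
            total += max(0, min(l_end, d_end) - max(l_start, d_start))
    return total


def summarise_dark_overlap(loci: List[Interval],
                           dark_regions: List[Interval]) -> Dict[str, float]:
    """Count loci containing dark sequence and the total dark bp within loci."""
    dark = merge_intervals(dark_regions)
    per_locus = [dark_bp_in_locus(locus, dark) for locus in loci]
    n_with_dark = sum(1 for bp in per_locus if bp > 0)
    pct = 100 * n_with_dark / len(loci) if loci else 0.0
    return {
        "n_loci": len(loci),
        "n_with_dark": n_with_dark,
        "pct_with_dark": pct,
        "total_dark_bp": sum(per_locus),
    }


# Toy data (hypothetical coordinates, not taken from the study).
loci = [("chr1", 1000, 5000), ("chr1", 10000, 12000), ("chr2", 0, 3000)]
dark = [("chr1", 1500, 1600), ("chr1", 1550, 1700), ("chr2", 5000, 6000)]
summary = summarise_dark_overlap(loci, dark)
print(summary)
```

Merging the dark intervals first avoids double counting where two dark annotations overlap; for genome-scale inputs an interval tree or a dedicated tool would be a more efficient choice than this quadratic loop.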

Table 1 Summary of dark regions overlapping genome-wide significant loci from GWAS studies

While only a small percentage of GWAS genes are affected by dark-CDS, it is not expected that all genes at each GRL will play a role in disease aetiology, as demonstrated by fine-mapping, pathway analysis and other downstream analyses of GWAS data [6, 7]. To assess their potential functional impact, the genes with dark regions were investigated for enrichment for biologically relevant gene ontology (GO) terms [8]. All eight sets of dark GWAS genes were enriched for GO terms previously associated with their corresponding disease and trait (Table 1, Additional file 1). In particular, the dark genes from the SCZ, BMI and MDD GWAS studies (the GWAS with the greatest number of GRL genes) returned FDR-significant GO terms. For these three datasets, a comparison of the dark GRL genes against the remaining (not-dark) GRL genes further refined the biological relevance of the GO terms identified (p-value < 0.05, but not FDR-significant) (Additional file 1). In summary, GWAS dark genes and dark-CDS genes are enriched for biologically relevant GO terms, suggesting there are biologically relevant genes in regions of the genome significantly associated with disease that are not fully accessible to SRS technology. Therefore, fine-mapping studies may fail because the pathogenic variants are in dark regions and cannot be accessed.
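For readers unfamiliar with GO enrichment, a generic over-representation test can be sketched as follows. This is an assumption about the general form of such a test (a one-sided hypergeometric test followed by Benjamini-Hochberg FDR adjustment), not a description of the specific tool cited as [8]; all function names and numbers are hypothetical.

```python
from math import comb
from typing import List


def enrichment_p(population: int, annotated: int, sample: int, hits: int) -> float:
    """One-sided hypergeometric p-value: P(X >= hits).

    population: background genes (e.g. all genes in the GRLs)
    annotated:  background genes carrying the GO term
    sample:     genes in the test set (e.g. dark GRL genes)
    hits:       test-set genes carrying the GO term
    """
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(hits, upper + 1)
    )
    return favourable / comb(population, sample)


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical example: 10 background genes, 5 carry the term,
# and all 5 genes in the test set carry it.
p = enrichment_p(population=10, annotated=5, sample=5, hits=5)
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
print(p, adj)
```

Comparing dark GRL genes against the remaining GRL genes, as done for the SCZ, BMI and MDD datasets, corresponds to using the full GRL gene set as `population` rather than the whole genome.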

To investigate the impact of dark regions on the discovery of rare variant associations from WES studies, we looked at the overlap of dark regions with the protein-coding regions of genes from the Schizophrenia Exome Sequencing Meta-analysis (SCHEMA) consortium and the Autism Exome Sequencing consortium (ASC). Despite the size of the SCHEMA cohort (24,248 cases and 97,322 controls), only ten genes were found by the authors to be significantly associated with SCZ [9]. Of these ten, only TRIO has a partially dark CDS (0.4% of the CDS is dark). Extending the search space to include all genes from SCHEMA with p-value < 0.05 (928 genes), 222 have partially dark gene bodies (including non-coding regions and introns) and 22 have partially dark CDS, ten of which are > 5% dark-CDS. Of these ten, six have supporting evidence from the literature of a neuro-developmental or psychiatric function (Additional file 1).
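The percentage of a gene's CDS that is dark, and the > 5% threshold used above, can be computed along the following lines. This is a sketch under stated assumptions: coordinates are half-open intervals on a single chromosome, and the function names and example intervals are hypothetical rather than taken from the study.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end), half-open, single chromosome (assumed)


def merge(spans: List[Span]) -> List[Span]:
    """Merge overlapping or touching spans, e.g. exons shared by transcripts."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def dark_cds_summary(cds_exons: List[Span],
                     dark_regions: List[Span]) -> Tuple[float, int]:
    """Return (percent of CDS that is dark, number of CDS exons touched)."""
    exons = merge(cds_exons)
    dark = merge(dark_regions)
    cds_len = sum(end - start for start, end in exons)
    dark_bp = 0
    exons_hit = 0
    for start, end in exons:
        bp = sum(max(0, min(end, d_end) - max(start, d_start))
                 for d_start, d_end in dark)
        dark_bp += bp
        exons_hit += int(bp > 0)
    pct = 100 * dark_bp / cds_len if cds_len else 0.0
    return pct, exons_hit


# Hypothetical gene with two 100 bp coding exons.
pct_dark, n_exons = dark_cds_summary(
    cds_exons=[(100, 200), (300, 400)],
    dark_regions=[(150, 250), (390, 500)],
)
flagged = pct_dark > 5  # threshold used in the text for "> 5% dark-CDS"
print(pct_dark, n_exons, flagged)
```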

Of the 102 putative ASD-associated genes identified by the ASC (FDR < 0.1) [10], four have dark-CDS, with CORO1A and SHANK3 being more than 5% dark (Additional file 1). Of these 102 genes, 101 are annotated by SFARI Gene 3.0 [11] as Score 1 (High Confidence ASD gene), with one gene being Score 2 (Strong Candidate). Across the full set of SFARI genes we found an enrichment of dark regions in Score 2 and Syndromic (ASD with co-morbid phenotypes) genes with ASC q-values > 0.3, suggesting that some candidate genes for ASD may not perform well in genetic association studies due to their gene bodies being partially dark to sequencing (Additional file 1: Fig. S1).

Two examples of dark candidate disease genes from SCHEMA and ASC are SHANK3 and C4B, shown in Fig. 1. SHANK3 is a top hit from ASC, nominally-associated in SCHEMA, and has also been implicated by common variant GWAS for schizophrenia [12]. As can be seen in Fig. 1, the coding regions of SHANK3 are 7.7% dark and WES in particular is unable to identify genetic variants from 5 different exons. Many studies have supported SHANK3’s role in both SCZ and ASD [13,14,15,16]. C4B was also found to be within the nominally-significant SCHEMA gene set and is a SFARI Score 2 gene. Figure 1 shows that C4B is substantially dark (73% dark-CDS), preventing the discovery of genetic variants across most exons. Both C4B and its paralog C4A (also ~ 74% dark-CDS) have been suggested to play a role in SCZ [6, 17,18,19]. These examples support the theory that candidate disease genes overlapping dark regions may contain rare variants that are not accessible to SRS technology and thus are missed when calculating gene-disease associations.

Fig. 1

Examples of two genes affected by dark regions overlapping their CDS, showing modified browser views of SHANK3 and C4B from GnomAD Browser of human genetic variation (showing the average read depth of both whole exome and whole genome sequencing data); SCHEMA Browser of SCZ associated rare variants (and for SHANK3, the Autism Sequencing Consortium Browser of rare variants). Note for each browser the conspicuous absence of any genetic variants (pathogenic or benign) from low read-depth (dark regions) from exome and whole genome sequencing data

Ebbert et al. [4] showed that dark genes are involved in many diseases, including neuropsychiatric disorders. We have confirmed this and provided evidence of further neuropsychiatric genes affected by dark regions. As this analysis is based on a conservative number of dark regions and dark genes (749 genes), we propose that we have reported the lower rather than the upper limit of potential disease-associated genes affected by dark regions. However, it should also be noted that the number of dark regions, both within genes and in intergenic regions, varies dramatically depending on both the technology and the genome build used. Longer read lengths (Illumina 250 bp) have up to 35% fewer dark regions than shorter read lengths (Illumina 100 bp), as longer reads map more uniquely than shorter reads [4]. GRCh38 appears to have up to a three-fold greater proportion of dark regions than GRCh37 for all read lengths, possibly due to the inclusion of alternative contigs and additional haplotypes from heterozygous regions, which increases the amount of non-unique sequence in the GRCh38 + alternative contigs reference assembly relative to the GRCh37 assembly [4]. Thus both read length and genome reference build appear to be important factors for the proportion of dark regions present in SRS WGS data.

This study makes use of publicly available GWAS data from FUMA. Larger, better-powered GWAS have since been performed for a number of these diseases, identifying an even greater number of GRLs, each likely to also contain dark regions overlapping putative risk genes. Despite these limitations, we have shown that dark regions overlap with genome-wide significant GWAS loci across a range of traits and disorders, affecting as much as 1.3 Mb of sequence under these peaks and that the genes with dark regions are enriched for biologically relevant GO terms, showing they are relevant to disease-risk. Care must be taken when fine-mapping GWAS regions as the causal variants may be located in regions that are dark to SRS and will therefore be missed. A similar issue can be seen when looking at rare variant association studies. From our analysis, dark regions are likely to contribute to missing heritability.

There needs to be greater awareness of the potential effects of dark regions when using SRS to investigate both common and rare genetic variants contributing to disease. Genes of interest may be partially inaccessible to the technology being used, meaning that variants at these locations cannot be identified using standard protocols. To overcome this, short-read WES and WGS data can be re-analysed using alignment methods specifically developed to correctly align ambiguous reads (such as from camouflaged regions, repetitive sequences, insertions and deletions) and successfully map non-unique sequences which would normally be discarded [4, 20, 21]. Furthermore, long read sequencing technologies (such as PacBio and ONT) have been shown to reduce the amount of dark gene-body regions by up to 77% [1, 3, 4]. The most recent reference assembly, T2T-CHM13, was generated using a combination of PacBio HiFi and Oxford Nanopore ultralong-read sequencing and represents the first complete genome [22], including the 8% of the genome that has remained hidden since the first human reference genome was published in 2000 [23]. LRS could therefore be used to re-investigate dark genes with evidence of disease effects from other studies (such as animal knock-out models, protein expression studies, etc.). However, limitations of LRS technologies need to be addressed before this technology can be generally adopted [24]. LRS is currently more expensive than SRS, though the costs are coming down quickly. Library preparation is less forgiving than for SRS, as fresh material or even intact cells are recommended to minimise degradation of ultra-long high molecular weight DNA (which also requires specialised DNA isolation protocols). Both PacBio and ONT have higher error rates for SNV detection compared to SRS, though LRS has been shown to be better at calling SNVs in problematic areas [3]. Fewer tools are available for raw data analysis, mapping and variant calling with LRS than with SRS, although these are constantly being improved [25].
