Near real-time, portable genomic prediction has a number of applications in agriculture and healthcare. Here, we have demonstrated in cattle that when combined with an appropriate imputation strategy skim-GBS using portable ONT sequencing is a suitable method for genomic prediction, with coverages as low as 0.1 × . Using the imputation packages QUILT and GLIMPSE, GEBV correlations were higher than or equal to correlations from low-density SNP arrays for sequencing coverages as low as 0.1 × and, at this coverage, the performance of animals relative to each other could accurately be separated into quartiles. We have also demonstrated that using large SNP reference panels with many more SNP than used in the genomic prediction equation significantly increases the accuracy of genomic prediction from low-coverage GBS. Overall, QUILT [21] in combination with large imputation reference panels outperformed the alternative genotyping and imputation methods across all accuracy metrics: imputation accuracy, genotype concordance, GEBV correlations and GEBV regression coefficients.
While a number of genotype specific accuracy metrics are available to compare imputation strategies, this study focused primarily on the accuracy of genomic prediction from the imputed genotypes. This approach was chosen as the study primarily aimed to investigate the optimum imputation strategy for the purpose of in situ genomic prediction. Imputation specific metrics such as genotype accuracy and concordance were used as a secondary means to evaluate the differences between QUILT and GLIMPSE, as they performed similarly based on genomic prediction metrics alone. The relative performance of the different strategies with regard to GEBV correlation and bias, highlighted the robustness of genomic prediction using a GBLUP model to a modest degree of imputation errors. This is in line with other literature that has reported little to no significant decrease in GEBV accuracies when imputation errors are introduced by decreasing SNP panel density from 50 k to as low as 3 k [35]. Nonetheless, irrespective of the accuracy metric used these results demonstrate skim-GBS using ONT data can objectively outperform low-density SNP arrays on both metrics when an appropriate imputation strategy is employed.
Calculating the prediction bias in the ONT predictions revealed a general trend of underpredicting GEBVs for CL score, BCS and HH, while over-predicting GEBVs for BW. This pattern was seen across all of the methods; however, QUILT still had the least prediction bias of the four methods across all coverages. A possible explanation for the trends in prediction bias observed could be systematic genotyping errors in SNP loci with a large phenotypic effect. For example, a SNP in the genomic region encoding PLAG1 on chromosome 14 has previously been identified to have a significant effect on heifer fertility in Australian Brahman cattle [36], and if a systematic error were to occur for this region, predictions would likely be biased. Therefore, the presence of systematic errors in the QUILT imputed genotypes should be investigated and their subsequent proximity to SNP of large phenotypic effect determined.
Another possible explanation is that the bias is actually introduced by the process of imputing 35 k SNP array genotypes up to HD SNP array density. This implies that the prediction bias observed in body weight, for example, is actually caused by the low-density SNP array genotyping and not by the ONT GBS method. This is supported by the fact that when the ONT GEBVs are compared to the HD SNP array GEBVs, less prediction bias is observed with a more random spread, than when they are compared to the 35 k SNP array GEBVs. Furthermore, when comparing the 35 k SNP array GEBVs to the HD SNP array GEBVs a reasonable degree of bias was observed in body weight and hip height. This would suggest that some systematic genotype errors, which predominantly affect hip height and body weight GEBVs, are being introduced when imputing from the low-density 35 k array to the HD SNP array density. Further research should be carried out to verify if this is the case and investigate why the errors are being introduced.
QUILT’s ability to more accurately genotype and impute SNP from long-read sequence data is likely a result of the way in which it assigns haplotypes. Imputation packages that use VCF genotypes such as Beagle and GLIMPSE use SNP panels to construct known ancestral haplotypes in order to haplotype phase the genotypes. Missing loci are then imputed using the ancestral haplotypes. This approach is suitable for SNP array genotypes, where the genotypes themselves are independent observations. However, when using sequence data this approach ignores the fact that a single sequence read can span multiple SNP, meaning nearby SNP no longer need to be considered independent [21, 37]. QUILT uses the co-localization of nearby SNP within sequence reads to assign parental gametes and then imputes these two read groups (derived from each of the parental gametes) as haploids [21]. As SNP reference panel sizes increase this assumption becomes even more powerful because, on average, a single read spans more SNP. Given the average read length of 1792 bp for the ONT sequence data used for this study [31], reads on average spanned 30, 9, 6, 3 and 0.3 SNP for the five SNP reference panels, in order of largest to smallest, respectively. This likely meant that QUILT was able to more accurately assign SNP allele haplotypes, which ultimately increased the accuracy of genomic prediction. This implies that nearby SNP should not be treated as independent observations in future skim-GBS work. That is to say, when dealing with low-coverage sequence data (in particular long-read sequence data) genotyping and imputation approaches that use the co-localization of SNP on a sequence read to assign haplotypes should be used. We anticipate the popularity of such methods for genotyping and imputation of sequence data will increase as the method is further refined.
Although GLIMPSE was overall not as accurate as QUILT, it performed well even when used in combination with the smallest imputation reference panel (the HD SNP array) and was significantly more accurate than either Beagle method. This is likely a result of GLIMPSE using genotype likelihoods rather than hard-coded genotypes, such as the case with Beagle5.2. Genotype likelihoods allow for some of the individual read information to be maintained through to imputation when using GBS, however, are more computationally demanding to calculate and impute. Genotype likelihoods do not maintain haplotype of origin information which is available when a single read spans multiple SNP. This likely explains why GLIMPSE is relatively unaffected by decreasing imputation reference panel size. This means that in circumstances where very large imputation reference panels are available, QUILT is the better option as it can use the higher SNP density to accurately phase reads. When only small imputation reference panels are available, GLIMPSE may offer some increase in accuracy over QUILT. Overall, however, using QUILT with very large imputation reference panels was the most accurate for genomic prediction.
When comparing the accuracy of the QUILT and GLIMPSE imputed GEBVs to GEBVs derived from the bovine HD SNP array, there was no difference between 35 k SNP array GEBVs and ONT sequencing coverages as low as 0.1 × for QUILT and 0.5 × for GLIMPSE. This suggests for the purpose of genomic selection, 0.1 × sequencing coverage with ONT is as informative as low-medium density SNP arrays in beef cattle. Interestingly, no trend in prediction bias was observed when comparing the QUILT ONT GEBVs to the HD SNP array GEBVs. This suggests the underlying variation in GEBVs between the QUILT ONT and bovine HD SNP array predictions is random, unlike the variation in GEBVs between the ONT and 35 k array predictions. This means the long ONT sequence reads are potentially better able to account for decay in linkage disequilibrium (LD) than imputation from 35 k SNP array density up to HD array density. For example, at 0.1 × sequence coverage, theoretically, almost 5 million SNPs had sequence coverage from which to assign haplotypes using the whole genome SNP panel with no filter. Assuming random sampling of the haplotypes, we get 2.5 million SNPs for each parental haplotype, which is significantly more information than low-density SNP arrays for populations with greater LD decay. Examples of such populations include cattle breeds of B. indicus origin, such as Brahman and Droughtmaster, which were used in this study. This finding may suggest that low-coverage ONT genomic prediction may be particularly advantageous in mixed breed populations with a larger effective population size, such as Australia’s northern beef industry.
Previous studies have also found 0.5 × sequencing coverage suitable for skim-GBS for the purpose of genomic selection. Using Illumina sequencing data, Zhang et al. [38] reported high GEBV accuracies for coverages equal to or greater than 0.5 × in aquaculture species. While in the root crop cassava Long et al. [39] reported imputation accuracies from 0.5 × sequencing coverage were not different to accuracies from 5 × sequencing coverage using ONT data. By demonstrating 0.1 × sequencing coverage with ONT data is able to generate accurate GEBVs, we have further decreased the minimum sequencing coverage by five-fold, which has significant economic and practical implications for skim-GBS with ONT.
While the sequencing coverages for accurate imputation reported here are not as low as those reported for RR-GBS methods, the additional wet-lab procedures necessary for RR-GBS methods limit their portability. Additionally, given the low complexity of the genomes of most economically important livestock species, reduced-representation GBS methods are likely not necessary. For economically important species with highly complex genomes, such as wheat and sugar cane, reduced-representation GBS may still be the best option [10].
A potential solution to help handle complex genomes without the additional laboratory steps could be to use ONT’s adaptive sequencing technology [40] as a type of reduced-representation GBS method. Adaptive sequencing is an ONT-specific method to enrich or deplete desired genomic regions in real time. By enriching target SNP regions using adaptive sequencing, genome complexity could be decreased without the additional wet-lab steps of restriction enzyme digests. This method of ONT specific RR-GBS could decrease the required sequencing coverage for GBS on ONT platforms. This method could also be used to target QTL for monogenic traits alongside skim-GBS. Currently, limitations around computing power exist to make this method feasible outside of the lab. However, ONT’s partnership with NVIDIA could deliver the necessary improvements to make this method feasible at scale [41].
In this study, we compared ONT GEBVs to SNP array GEBVs using the same number of SNP markers for predictions. Future work should look to investigate the accuracy of ONT GEBVs against true breeding values and compare these results to SNP array GEBVs against the same true breeding values. Using this comparison, other methods to further increase the accuracy of ONT GEBVs could be better investigated. Such methods include using Bayesian-derived SNP effects to better estimate the distribution of effects, and incorporating more SNP into the prediction, rather than imputing to the whole genome sequence and then sub-sampling back to 641,000 SNPs. Both these methods have been shown to slightly increase prediction accuracies [42, 43] and would add little computational time once the SNP effects are calculated.
Reducing the size of the SNP reference panel had a significant effect on the imputation speed for all methods. Beagle5.2 was significantly faster than QUILT and GLIMPSE, which is a result of the increased computational complexity of using genotype likelihoods. Browning et al. [18] reported similar advantages in speed using Beagle5.0 over other imputation methods such as Minimac3, Minimac4 and Beagle4.1, particularly for large SNP panels. Although not as fast as Beagle5.2, QUILT offers a significant advantage in removing the need for prior genotyping. This decreases the complexity of file handling and depending on the genotyping approach used, decreases the overall turnaround time. Furthermore, QUILT was able to accurately impute from lower sequencing coverages than Beagle5.2 and GLIMPSE, meaning QUILT would further save time during data acquisition. Additionally, QUILT also allows pre-processed haplotype reference files to be loaded. This can be used to further decrease computational time where the same SNP reference panel is being used repeatedly. However, based on the imputation times reported here, data acquisition (i.e., sequencing time) will remain the rate-limiting step for real-time genomic prediction. Therefore, efforts to decrease the turnaround time should focus on decreasing the required sequencing coverage, not the speed of the bioinformatics pipeline.
In general, increasing the size of the reference panel increased imputation accuracy when using QUILT. However, this was not the case when increasing the panel size from the second largest panel (MAF > 0.1) to the largest reference panel (no MAF filter). Instead, a small decrease in imputation accuracy and GEBV correlations was observed across all coverages. It is believed this is related to the increased likelihood of sequencing errors appearing at rare variant loci, possibly causing non-reference SNP to be called at these loci and consequently introducing errors in the haplotype phasing and imputation. Given 33,202,930 rare SNP (MAF < 0.1) were removed from the unfiltered reference panel to create the second largest reference panel, it could be reasoned that up to 0.1% or 33,202 of those variants may overlap a sequencing error at 0.05 × , using a conservative ONT error rate of 2%. Therefore, it seems reasonable to assume sequencing errors may be the cause of the decrease in accuracy.
Despite, increasing the accuracy of genomic prediction and imputation accuracy, the removal of rare variants from the reference panels represents a limitation to the wider application of the imputation strategies evaluated here. For example, rare variants are useful for GWAS and genetic disease screening of some monogenic traits. Therefore, the filtered SNP panels in this study are only suitable for genomic prediction of polygenic traits. Incorporating rare disease loci, such as Pompes disease [44, 45] or horn/poll markers [45, 46] for cattle into the SNP panel would allow for screening of these dominant traits as well. However, it is unlikely accurate imputation of these genotypes will be possible at sequencing coverages below 1 × , due to the inverse correlation between imputation accuracy and MAF [39]. Additionally, the economic significance of traits such as Pompes disease and horn/poll would mean less error would be tolerated by producers. For instance, a bull incorrectly genotyped as homozygous poll (the desirable genotype) rather than heterozygous would mistakenly fetch a premium at a sale. Therefore, the ability for low coverage ONT data to accurately genotype rare causative variants must be validated. This is a potential area where targeted enrichment through adaptive sequencing or other methods could prove useful.
The accuracies achievable with QUILT at 0.1 × sequencing coverage suggest multiplexing of up to 40 human genome size samples on a single MinION (or up to 200 on a PromethION) flow cell could be possible. This would bring the reagent cost of this method down to below USD $40 per sample, while providing results in under 24 h [47]. With further optimisation of SNP panel design and the incorporation of new ONT technologies, the level of multiplexing possible will increase further still, making near real-time genomic prediction practically and economically feasible.
Comments (0)