GWAS aims to identify links between genetic variation and an outcome of interest. To achieve this, genetic variants are statistically tested for association with the outcome, in an unbiased genome-wide manner. While GWAS can be used to study all types of genetic variants including copy-number variants and rarer sequence variations, it typically refers to the assessment of common single nucleotide polymorphisms (SNPs) and small insertion-deletions which are highly prevalent across the human genome (exceeding ten million instances) and which are relatively easy to genotype at scale on genotyping arrays. In line with their relatively high population frequency, common genetic variants generally contribute to phenotypic variability with subtle effects. As such, large cohorts of patients/participants are usually required for a GWAS to detect statistically significant associations between a genetic variant and the phenotype or outcome of interest.
The fundamental idea behind a GWAS is to systematically test millions of variant sites across the genome and assess whether different alleles at each given variant site significantly co-vary with the phenotype of interest. For instance, in the case of a binary disease outcome, the frequency of an allele at a given variant site is compared between individuals with the condition (cases) and those without (controls). Associations of variants with an outcome of interest can be assessed using various statistical tests (e.g., logistic or linear regression), depending largely on whether the trait of interest is binary (e.g., presence or absence of a disease) or continuous (e.g., diameter or ejection fraction of the left ventricle). Additionally, covariates such as sex, genetic ancestry, and year of birth should be included in such statistical models to account for stratification and to prevent confounding [10].
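The case-control comparison described above can be illustrated, in its simplest form, with a chi-square test on allele counts at a single variant site. The sketch below is only an illustration: the function name and the allele counts are hypothetical, and it deliberately omits the covariate adjustment that a regression-based GWAS would include.

```python
import math

def allelic_chi2(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts."""
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    row_totals = [case_alt + case_ref, ctrl_alt + ctrl_ref]
    col_totals = [case_alt + ctrl_alt, case_ref + ctrl_ref]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, the upper-tail probability equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    # Allelic odds ratio: odds of carrying the alternate allele in cases vs controls.
    odds_ratio = (case_alt * ctrl_ref) / (case_ref * ctrl_alt)
    return chi2, p_value, odds_ratio

# Hypothetical counts: 1,000 cases and 1,000 controls, i.e. 2,000 alleles per group.
chi2, p_value, odds_ratio = allelic_chi2(case_alt=460, case_ref=1540,
                                         ctrl_alt=400, ctrl_ref=1600)
print(f"chi2={chi2:.2f}, p={p_value:.3g}, OR={odds_ratio:.2f}")
```

In a real GWAS this test would be repeated for every variant site, and a regression model would typically replace the unadjusted test so that covariates can be taken into account.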
Overview of the Steps
The first step for any GWAS involves selecting a study population. The study population could be derived from large biorepository datasets (such as UK Biobank [11] or All of Us [12]), or from disease-focused patient cohorts that focus on recruiting disease cases specifically for the study [10].
After the study population is selected, the genetic data need to be acquired. Most GWASs to date have utilized SNP microarrays, which can typically genotype ~200–700k common variants (SNPs) across the genome. Given that microarrays assess only a small fraction of genomic variation, imputation is subsequently used to expand the analysis to non-genotyped SNPs that are in linkage disequilibrium with the genotyped SNPs, thus increasing the number of variant sites studied. This process utilizes ancestry-matched haplotype reference panels and has been highly standardized by means of imputation servers such as the TOPMed Imputation Server and the Michigan Imputation Server [13]. More recently, studies have started to utilize whole-genome sequencing (WGS) as a means of genotyping for GWAS approaches. Besides allowing genotyping of both common and rare genetic variants, this approach circumvents the need for imputation, as it allows for genotyping of variants genome-wide.
Before imputation or association analyses are performed, stringent quality control of the genetic data is required to exclude variant sites and individuals that do not meet predefined criteria [14]. Since GWASs are susceptible to bias from population stratification, quality control also involves crucial steps to infer the genetic ancestry of all samples; subsequent steps may be performed to exclude ancestral outliers. Once the final dataset is of high quality on both phenotypic and genetic levels, the data undergo a genome-wide scan using systematic association analyses, as described above.
Output
The main results of a GWAS are called summary statistics, which are essentially an overview of all tested variants and their respective effect sizes, p-values, and other relevant association metadata (Fig. 1A). Summary statistics are often visualized using a Manhattan plot that displays the p-value for each tested SNP and its association with the phenotype/outcome of interest. In such a plot, the highest points represent the variants with the most statistically significant associations. Summary statistics can subsequently be processed in downstream analyses such as meta-analysis and multi-trait analysis of GWAS (MTAG, a method in which GWASs of genetically correlated traits are combined to increase statistical power) [15], and can be used for polygenic scores [10].
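To make the structure of summary statistics concrete, the following sketch derives a two-sided p-value from an effect size (Beta) and its standard error (SE), and combines two per-cohort estimates for the same variant with a fixed-effect, inverse-variance weighted meta-analysis. The function names and numbers are hypothetical. It also assumes that both estimates refer to the same effect allele, and it shows plain meta-analysis, not MTAG.

```python
import math

def two_sided_p(beta, se):
    """Two-sided p-value from a Wald z-statistic (normal approximation)."""
    z = beta / se
    return math.erfc(abs(z) / math.sqrt(2))

def inverse_variance_meta(estimates):
    """Fixed-effect meta-analysis of (beta, se) pairs for a single variant."""
    weights = [1 / se ** 2 for _, se in estimates]
    total_weight = sum(weights)
    beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / total_weight
    se = math.sqrt(1 / total_weight)
    return beta, se

# Hypothetical per-cohort results for one variant, aligned to the same effect allele.
cohort_estimates = [(0.10, 0.04), (0.14, 0.05)]
meta_beta, meta_se = inverse_variance_meta(cohort_estimates)
meta_p = two_sided_p(meta_beta, meta_se)
print(f"beta={meta_beta:.4f}, se={meta_se:.4f}, p={meta_p:.3g}")
```

The pooled standard error is smaller than either input standard error, which is why meta-analysis of several cohorts can bring a variant below a significance threshold that no single cohort reaches.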
Fig. 1 Polygenic scores for dilated cardiomyopathy: development, validation, clinical perspectives, and limitations. (A) General overview of GWAS and PRS development, with an example of summary statistics (the output of a GWAS) and the formula used for PRS development. (B) PRS analyses performed by Zheng et al., 2024 and Jurgens et al., 2024, divided into 3 panels: (1) DCM risk prediction across ancestries and genetic sex; (2) PGS distribution in cases compared to controls; (3) greater polygenic burden in genotype-negative (gen–) versus genotype-positive (gen +) DCM cases. (C) Left panel: PRS applications in DCM risk prediction, severity assessment, personalized treatment, and integration with clinical scores. Right panel: Current limitations include ancestry bias, lack of clinical validation, no standardized risk thresholds, and exclusion of rare pathogenic variants. DCM: dilated cardiomyopathy, PGS: polygenic score, GWAS: genome-wide association study, MTAG: multi-trait analysis of GWAS, LVEF: left ventricular ejection fraction, LVEDV: left ventricular end-diastolic volume, MRI: magnetic resonance imaging, AoU: All of Us (cohort), AUMC: Amsterdam UMC (cohort), UKB: UK Biobank (cohort), OR: odds ratio, SD: standard deviation, gen +: genotype-positive DCM cases (carrying a rare pathogenic variant in DCM-causing genes), gen–: genotype-negative DCM cases (rare pathogenic variant was not identified), rsID: reference SNP identifier, A1: effect allele, A2: non-effect allele, Freq: allele frequency, Beta: effect size estimate, SE: standard error, P-value: statistical significance of association
Fig. 2 GWASs and PGSs by Zheng et al. and Jurgens et al. (A) Design of GWAS and PGS development. In each study, the DCM GWAS was derived through meta-analysis of smaller GWASs conducted in biobank and clinical cohorts for NICM, NI-DCM, and DCM. It was subsequently meta-analyzed with cardiac MRI GWAS traits using MTAG, incorporating key left ventricular phenotypes such as LVEF, LVESV, and myocardial strain. (B) DCM PGS associations. The PGSs tested in both studies were constructed out-of-sample, excluding the cohort in which they were evaluated. P-values are two-sided, were calculated from a logistic regression model, and were not adjusted for multiple testing. All P-values were less than 0.01. Zheng et al.: Study cohort: 347,585 unrelated UKB participants with and without DCM. All models included age, age², sex, and the first ten genetic PCs as covariates. Jurgens et al.: AUMC cohort: Amsterdam UMC cohort with 8,185 participants, of which 978 DCM cases (European ancestry: 7,761 individuals, of which 783 cases; Females: 4,453 individuals, of which 418 cases; Males: 3,732 individuals, of which 560 cases; Genotype positive: 193 individuals; Genotype negative: 294 individuals). Model: logistic regression analyses were adjusted for sex and ancestral principal components 1–10. AoU cohort: All of Us cohort with 182,701 participants, of which 928 DCM cases (European ancestry only (N = 506 cases and 95,510 controls), African ancestry only (N = 246 cases and 36,864 controls), and Admixed-American ancestry only (N = 107 cases and 28,784 controls)). Model: logistic regression analyses were adjusted for age, age², sex, and ancestral principal components 1–10. (C) PGS prediction accuracy comparison within three European ancestry datasets between the Jurgens et al. and Zheng et al. DCM PGSs. Association results for the PGS constructed from MTAG-DCM with DCM status across three different datasets. Study cohorts: AUMC (DCM cases (N = 783), controls (N = 6,978)). AoU dataset: samples from Massachusetts General Hospital (MGB) (NI-DCM cases (N = 506), controls (N = 95,510)). UKB: UK Biobank dataset (NI-DCM cases (N = 793), controls (N = 325,313)). Model: logistic regression, adjusted for sex, ancestral principal components 1–12, age, and age² in UKB and AoU, and only for sex and ancestral principal components 1–12 in the AUMC cohort. Data are presented as estimated odds ratios with 95% confidence intervals. R² is shown for each PGS in the respective dataset, where R² represents the residual variance explained by the PGS (computed as the improvement of model R² inclusive of PGS as compared to the model without PGS, divided by the proportion of residual variance); all R² values were computed on the liability scale to allow better comparisons across datasets. GWAS: genome-wide association study, NI-DCM: nonischemic dilated cardiomyopathy, MTAG: multi-trait analysis of GWAS, OR: odds ratio, 95%CI: 95% confidence interval, SD: standard deviation, HR: hazard ratio, AUC: area under the receiver operating characteristic curve (AUC provided in part B for models with covariates, AUC in part C without covariates), R²: variance explained, gen +: DCM cases with a rare pathogenic or likely pathogenic variant in DCM-causing genes with strong or definitive evidence based on ClinGen curation, gen–: DCM cases without an identified rare pathogenic or likely pathogenic variant
Given the large number of statistical tests performed in a GWAS, a stringent statistical significance level is required. A Bonferroni-corrected p-value threshold of P < 5 × 10⁻⁸ is typically applied (referred to as the genome-wide statistical significance threshold) [16]. The ability to identify SNP associations with a p-value below this threshold rests on statistical power, which in turn depends in part on the size of the (case-control) sample used, the variant effect size, and the frequency of the variants being studied.
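The arithmetic behind this threshold can be sketched as follows. The conventional rationale, stated here as an assumption and not taken from the text above, is a family-wise error rate of 0.05 divided by roughly one million effectively independent common-variant tests. The variable and function names are hypothetical.

```python
import math

# Assumed inputs: a family-wise error rate of 0.05 and about one million
# effectively independent tests across the genome.
family_wise_alpha = 0.05
n_independent_tests = 1_000_000

# Bonferroni correction: divide the overall alpha by the number of tests.
genome_wide_threshold = family_wise_alpha / n_independent_tests

def is_genome_wide_significant(p_value, threshold=genome_wide_threshold):
    """Return True if a variant's p-value passes the Bonferroni-corrected threshold."""
    return p_value < threshold

print(f"threshold = {genome_wide_threshold:.1e}")
print(is_genome_wide_significant(3e-9), is_genome_wide_significant(1e-6))
```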
Polygenic Scores Methodology
PGSs are ‘personalized’ scores that are calculated per individual and are based on GWAS summary statistics. Such scores usually represent a weighted sum of the risk alleles that are carried by an individual, with weights derived from the respective effect sizes obtained in GWAS (usually the logarithm of the odds ratio for binary traits, and the beta coefficient for continuous traits) (Fig. 1A). Exactly how variants are included and weighted in PGSs differs widely between different approaches and methods: the selection of risk alleles for the score can be restricted by a certain significance threshold, or the score can be genome-wide, including thousands or even millions of variants [17].
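The weighted sum described above can be sketched in a few lines. All variant names, weights, and genotypes below are hypothetical. The weights are written as log odds ratios, as would be the case for a binary trait. Standardizing the scores within a cohort is an added assumption that allows a score to be expressed per standard deviation.

```python
import math
from statistics import mean, pstdev

def polygenic_score(dosages, weights):
    """Weighted sum of effect-allele dosages (0, 1, or 2 copies per variant)."""
    return sum(weight * dosages[variant] for variant, weight in weights.items())

def standardize(scores):
    """Convert raw scores to z-scores within a cohort."""
    mu, sd = mean(scores), pstdev(scores)
    return [(score - mu) / sd for score in scores]

# Hypothetical GWAS weights: log odds ratio per copy of the effect allele.
weights = {
    "variant_a": math.log(1.20),
    "variant_b": math.log(1.10),
    "variant_c": math.log(0.90),
}

# Hypothetical genotypes: number of effect alleles carried at each variant.
cohort = {
    "person_1": {"variant_a": 2, "variant_b": 1, "variant_c": 0},
    "person_2": {"variant_a": 0, "variant_b": 1, "variant_c": 2},
    "person_3": {"variant_a": 1, "variant_b": 0, "variant_c": 1},
}

raw_scores = {person: polygenic_score(d, weights) for person, d in cohort.items()}
z_scores = dict(zip(raw_scores, standardize(list(raw_scores.values()))))
for person in cohort:
    print(person, round(raw_scores[person], 4), round(z_scores[person], 3))
```

Genome-wide scores follow the same principle but use many more variants, and the various PGS methods differ mainly in how the weights are selected or shrunk before this summation step.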
Discussing all available methods is out of scope for the current review, although it must be noted that newer methods tend to only improve prediction marginally compared to the previous gold standard. A crucial component in the evaluation of any PGS, however, pertains to out-of-sample prediction: it is important that the performance of PGS is evaluated in a dataset that is independent of the original GWAS on which it is based (and independent of other data used for the training or tuning of the PGS).
Early GWASs and PGS Studies in DCM
One of the first common variant association studies for DCM was published by Stark et al. in 2010, comparing 664 DCM cases to 1,874 controls [18]. This study analyzed only 30,920 SNPs and revealed four SNPs displaying association with idiopathic DCM with a p-value below the genome-wide significance threshold (5 × 10⁻⁸). Only one of these SNPs, rs1739843 near the HSPB7 gene, could be replicated in independent cohorts.
Subsequent GWASs were conducted using either genome-wide genotyping arrays followed by imputation [19,20,21,22,23,24,25,26,27,28] or exome-based genotyping arrays [29, 30], testing a considerably larger number of SNPs spread throughout the genome. The first, smaller studies were performed on clinical cohorts, where participants were recruited from (specialized) hospital clinics [19,20,21,22,23,24,25,26,27,28,29]. In 2011, Villard et al., utilizing a discovery cohort comprising 1,179 DCM cases and 1,108 controls, discovered three DCM-associated loci, two of which were replicated in independent samples (rs2234962 located in BAG3; rs10927875 located in an intron of ZBTB17 near HSPB7) [19]. In 2014, a GWAS by Meder et al. [21] identified a locus near HCG22 (rs9262636), which was replicated in an independent cohort.
In 2017, Esslinger et al. were the first to assemble a genotyping dataset of several thousand DCM cases. The authors used exome chips to genotype the protein-coding regions in 2,796 DCM patients and 6,877 controls. They identified previously reported associations near BAG3 (rs2234962) and ZBTB17 (rs10927875), as well as novel loci near TTN (rs3829746), SLC39A8 (rs13107325), MLIP (rs4712056), FLNC (rs2291569), ALPK3 (rs3803403) and FHOD3 (rs2303510) [29].
While the above-mentioned GWASs were conducted on participants predominantly of European ancestry, in 2018 Xu et al. published the first DCM GWAS on individuals of African ancestry. They revealed a novel locus in an intron of the CACNB4 gene (rs150793926). However, they were not able to perform a replication analysis for this SNP [22].
Tadros et al. (2021) performed a meta-analysis of previously published DCM GWASs, as well as MTAG including LV traits, resulting in 17 DCM-associated loci, 7 of which were novel. Although this study explored the concept of PGS in the context of HCM, it did not report a DCM-specific PGS [26].
Beyond the GWASs that studied risk variants through clinical DCM case recruitment (Table 1A), broader genetic studies leveraging biobank data have since been conducted (Table 1B). These studies leveraged resources such as Biobank Japan [23] and the UK Biobank [23, 25, 30], and comprise atlases of genetic associations that link phenotypes (such as diseases, biomarkers, and medication usage) to genomic variations through large-scale genome-wide association analyses. While these biobank-based studies included DCM as a phenotype in their analyses, it was not their primary focus. Even though none of the papers reporting on these large-scale analyses mentioned SNPs associated with DCM, summary statistics for DCM are available in the GWAS Catalog and can be used in meta-analyses with other case-control sets.
The first PGS for DCM was published in 2021 by Garnier et al., based on a GWAS comprising 2,651 DCM cases and 4,329 controls. The investigators identified and replicated two additional DCM-associated loci in SLC6A6 (rs62232870) and SMARCB1 (rs7284877) and confirmed two previously identified DCM loci near BAG3 and HSPB7. Based upon these four SNPs, they constructed several weighted and unweighted PGSs, each composed of a different number of risk alleles (ranging from one to eight), with the five-allele score serving as the reference. The weighted PGSs assigned weights to each SNP based on the beta value derived from a sub-meta-analysis of two replication cohorts, and the performance of the weighted and unweighted PGSs was compared. Compared to the reference, the score that included eight risk alleles was associated with a three-fold increased risk of DCM, indicating that a higher number of risk alleles correlates with greater disease susceptibility. However, the capacity of this score to discriminate between cases and controls remains unclear. Additionally, a limitation of the study was that the PGSs were tested within the discovery cohort, raising concerns about overfitting and generalizability.
Studies identifying genetic loci associated with MRI traits, such as left ventricular volumes and ejection fraction, have also significantly advanced our understanding of DCM genetics. An important study in this regard is the one conducted by Pirruccello et al. [31]. Using summary statistics from GWASs on different MRI parameters, the authors constructed several PGSs and assessed their association with the incidence of DCM. They demonstrated a strong relationship between a 28-SNP PGS based on SNPs associated with LVESVi and the occurrence of DCM after adjusting for age, sex, genotyping batch, and the first five principal components of ancestry [31]. Summary statistics of GWASs on MRI traits from this and similar work [25, 32] have since also been used to boost GWAS discovery for DCM through multi-trait analysis (MTAG) [15] frameworks, which leverage genetic correlation across different phenotypes (see below) [26,27,28].
Latest GWASs and PGS Studies in DCM
In 2024, the two largest GWAS meta-analysis studies for DCM to date, Jurgens et al. [28] and Zheng et al. [27], were published. These studies yielded a substantial boost in genetic discovery for DCM, as they identified several dozen (novel) loci and genes associated with DCM or non-ischemic cardiomyopathy (NI-CM). Both studies conducted meta-analyses of clinical and biobank cohorts and subsequently performed multi-trait analyses, incorporating GWASs for MRI-derived left ventricular (LV) traits (MTAG framework) (Table 1C).
The discovery cohort of Jurgens et al. consisted of 9,365 strict DCM cases and 946,368 controls. Through the GWAS and MTAG, they identified 70 genomic risk loci (38 loci from DCM GWAS and 65 loci from MTAG) at genome-wide statistical significance. The discovered loci showed broad replication in independent samples and were mapped to 63 prioritized genes. The discovery cohort of Zheng et al. consisted of 14,256 DCM/NI-CM cases and 1,199,156 controls [27]. They identified 80 loci (62 loci from DCM GWAS using an FDR threshold of 1% and 54 from DCM MTAG at genome-wide significance) and prioritized 62 putative effector genes. These two studies included approximately 3–4 thousand overlapping cases from 3 cohorts. Additionally, a subset of samples from Zheng et al. were used by Jurgens et al. for replication analyses. A detailed side-by-side comparison of both studies is provided in Fig. 2.
Based on these analyses, several genome-wide PGSs were constructed in both studies, which included hundreds of thousands of genetic variants. In the findings discussed below, we will be referring to the results pertaining to the PGSs that showed the best out-of-sample performance in either study.
Interestingly, Jurgens et al. systematically compared the performance of the PGSs from both papers in three datasets of European ancestry (Fig. 2C; details in Jurgens et al., Supplementary materials). In their analysis, the PGSs based on MTAG summary statistics from both studies showed the best predictive performance. Overall, the authors found that the score by Jurgens et al. performed slightly better, although the confidence intervals still overlapped. Nonetheless, both studies produced PGSs that were able to significantly (but rather weakly) discriminate DCM patients from healthy controls.
Perspectives on the Potential Clinical Utility of PGSs
An important aspect of PGSs pertains to their potential clinical utility and the possible clinical scenarios in which these scores could be deployed to improve the current clinical practice with regard to diagnosis, risk stratification, and/or (prophylactic) treatment. Below we describe how PGSs might potentially improve the management of patients with DCM or genetic risk for DCM in different clinical settings (Fig. 1C).
Individuals at Risk of DCM
One of the most important questions in the evaluation of an asymptomatic patient with an increased risk for DCM is whether the individual is likely to develop disease, and if so, how severe it is expected to become and when it might occur. These questions are particularly relevant for asymptomatic individuals who incidentally have been found to have an increased risk of DCM, for example, those with abnormal ECG findings associated with cardiomyopathy but normal echocardiogram or MRI, those with a family history of DCM where the proband is no longer available for genetic or clinical evaluation, and/or family members of a DCM patient who carry the familial pathogenic variant in a DCM-associated gene.
Patients with Possible Risk of DCM as an Incidental Finding
Recent PGSs developed by Jurgens et al. and Zheng et al. have demonstrated the ability to differentiate DCM cases from controls, with DCM odds ratios ranging from 1.5 to 1.9 per standard deviation increase in PGS and an area under the receiver operating characteristic curve (AUROC) of approximately 0.7 (Fig. 1B) [27, 28]. Individuals with very high PGSs had a markedly elevated risk of DCM; for example, individuals with a PGS in the highest 1% had over four- and six-fold higher odds of DCM compared to those with median and low PGSs, respectively. With further development, these scores could serve as a stratification tool to identify individuals at higher risk for DCM, warranting closer clinical evaluation and potential early interventions.
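To give a feel for what an odds ratio per standard deviation implies at the extremes of the score distribution, the sketch below converts it into relative odds at a given PGS z-score. It rests on two simplifying assumptions that are not taken from the cited studies: that the PGS is normally distributed, and that the log-odds of DCM increase linearly with the PGS. The function name is hypothetical, and the result is an illustration of the arithmetic, not a reproduction of the published percentile analyses.

```python
from statistics import NormalDist

def relative_odds(or_per_sd, z_score):
    """Odds relative to an individual at the mean PGS (z = 0), log-linear model."""
    return or_per_sd ** z_score

# z-score that marks the boundary of the top 1% of a standard normal distribution.
z_top_1_percent = NormalDist().inv_cdf(0.99)

# Range of odds ratios per standard deviation quoted in the text above.
for or_per_sd in (1.5, 1.9):
    odds = relative_odds(or_per_sd, z_top_1_percent)
    print(f"OR per SD = {or_per_sd}: relative odds at the 99th percentile = {odds:.2f}")
```

Because the effect compounds multiplicatively with each standard deviation, a modest per-SD odds ratio can still translate into a several-fold difference between individuals in the tail and those near the middle of the distribution.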
Unaffected Family Members with a Family History of DCM
Since recent case-control GWASs include predominantly probands in their study cohorts, they did not evaluate the performance of DCM PRSs for risk stratification in family members. However, a previously published 28-SNP PRS for LVESVi [31] demonstrated a significant difference in DCM odds ratios (OR) between clinically affected family members of DCM probands, unaffected relatives, and healthy controls [33]. These data suggest that PRSs could be useful to identify family members who may be at increased risk for DCM in situations where the proband is no longer available for genetic evaluation, or in gene-elusive families where no causative variant has been found.
Asymptomatic Carriers of (Familial) Pathogenic Variants
Most carriers of a known pathogenic DCM-causing variant do not develop the disease. For example, in a UK Biobank cohort, which includes predominantly White British individuals older than 40 years, more than 90% of the carriers show no signs or history of DCM [34]. This pattern has also been observed for TTN truncating variants (TTNtv) located in cardiac exons. While these variants are the most commonly identified genetic cause of DCM, occurring in 15–30% of genotype-positive cases, they are also present in 0.5% of the general population without DCM [35]. This incomplete penetrance of disease-causing variants suggests that other factors, such as common variants, may influence the chance of actually developing DCM. Zheng et al. reported that DCM risk in carriers of DCM-causing variants (predominantly TTNtv) was higher compared with gene-negative individuals at the highest PGS centile [27], which is consistent with the results of a comparable analysis for HCM [36]. Pirruccello et al. addressed this question by testing the 28-SNP LVESVi polygenic score in a cohort of TTNtv carriers and showed a positive correlation with LVEDV and LVESV, and a negative correlation with LVEF, in individuals without clinical DCM [31]. While follow-up data to assess whether these individuals develop DCM were not available, the observed association between the polygenic score and LV function suggests its potential utility in this setting.
PRS Application in DCM Patients
In patients with clinical manifestation and an established diagnosis of DCM, PGSs could be of potential value for predicting disease severity and risk for the occurrence of adverse events, such as arrhythmias or heart failure-related outcomes.
SCD Risk
Traditional risk prediction models for malignant ventricular arrhythmias, sudden cardiac death (SCD), or heart failure in DCM include only clinical factors [37]. More recently, the ESC Guidelines for the management of cardiomyopathies highlighted that patients carrying pathogenic variants in high-risk genes (i.e., PLN, DSP, LMNA, FLNC, TMEM43, and RBM20) have a significantly increased risk of major arrhythmic events. In line with this, gene-based risk models to predict arrhythmia risk have been developed and incorporated into clinical decision-making algorithms with regard to prophylactic ICD implantation [38]. Nevertheless, it is still unclear whether common genetic variation could improve the current risk stratification algorithms for primary prevention of SCD. To date, PGSs for DCM have been tested only in the context of association with risk of disease development. Recent data suggesting that a PRS based on HCM susceptibility variants modulates disease severity in HCM underscore the need to explore similar approaches in DCM [36].
Personalized Therapy
Rare genetic variants have already been linked to left ventricular reverse remodeling (LVRR), which can help to identify DCM patients more likely to experience myocardial recovery and improved cardiac function following pharmacological therapy. For example, TTN-related cardiomyopathies have been demonstrated to have a higher likelihood of LVRR compared to other DCM subtypes [39]. In the same manner, common variants may play a role in the response to heart failure therapy and/or antiarrhythmic agents. This concept has been explored in studies of statins, where individuals with the highest polygenic scores received the greatest benefit from treatment [40]. Another potential application for PRS lies in identifying genetic variants associated with adverse drug reactions, as successfully demonstrated in diabetes [41]. These findings highlight the potential of polygenic scores as an enrichment strategy for clinical trials [42] and for guiding personalized treatment, enabling the selection of the most effective therapy for each patient.
Combined Scores
However, using PGSs as a stand-alone predictive tool might not be the most effective way to predict risk, considering that recent PGSs explain 7–11% of the variance in DCM susceptibility in the various cohorts (Fig. 2C) [28]. More promising results could be obtained by using PGSs alongside age, sex, and other traditional clinical risk factors. Similar work has already been done for other diseases, such as atrial fibrillation, coronary artery disease, and diabetes [43]. Notably, additional covariates such as sex, ancestry, and age already improve score performance. For instance, in such analyses, the area under the curve (AUC) was shown to improve by 0.1–0.2, reaching 0.7–0.75 [28,