Comprehensive evaluation of the implementation of episignatures for diagnosis of neurodevelopmental disorders (NDDs)

Subjects and cohort

The study includes peripheral blood from 26 samples collected at the ULB Erasme Hospital Center for Human Genetics (Brussels, Belgium). Informed consent was obtained from all research participants according to the protocol approved by the Ethics committee of the Erasme Hospital. These are composed of 22 controls, 1 patient with a confirmed diagnosis of Kabuki syndrome (OMIM: #147920) and 3 patients with a confirmed diagnosis of Sotos syndrome (OMIM: #117550) through the identification of class IV or V variants by genetic testing. Control specimens were healthy individuals without any developmental delay, intellectual disability, or congenital anomalies. Additional peripheral blood datasets were retrieved from the gene expression omnibus. They include data used to generate the Kabuki (IM450K, GSE97362) and Sotos (IM450K, GSE74432) episignatures (Choufani et al. 2015; Butcher et al. 2017), a cohort of 489 unresolved cases (IM450K, GSE89353) (Barbosa et al. 2018), 521 control samples (IMEPIC, GSE152026) (Noguera-Castells et al. 2023), 656 control samples (IM450K, GSE40279) and 4 additional control samples (IMEPICv2, GSE222919) (Teschendorff et al. 2013).

DNA methylation experiment

Whole-blood DNA samples were bisulfite converted using the EZ DNA Methylation™ Kit from Zymo Research according to manufacturer standard procedure. The sodium bisulfite converted DNA was then hybridized to the Illumina Infinium MethylationEPIC v1.0 BeadChip to interrogate over 850,000 CpG sites in the human genome. The scan of the array has been done by the BRIGHTcore facility of UZ Brussels.

Training and testing sets

Training and testing sets for the Kabuki and Sotos episignatures were derived from datasets GSE97362 and GSE74432, respectively. Additional Files 7 and 8 contain the GEO ID of the training sets for the Kabuki and Sotos cohort, respectively. The training sets were balanced with the same number of control and case samples (Table 2). Two additional testing sets composed of IMEPIC samples generated at the ULB Erasme Hospital Center for Human Genetics (Brussels, Belgium) were used to assess the model performances on external IMEPIC data (Table 2).

Table 2 Description of the three datasets used in this work. Two testing sets based on 450 K and EPIC samples have been used to test the performances of the classifiers. One training set based on 450 K samples has been used to train the models

The Sotos case training set is composed of 8 females and 11 males, with an average age of 11.67 ± 10.97 years (range 0–41 years), while the control group is composed of 12 females and 7 males with an average age at sample collection of 10.94 ± 5.45 years (range 3–18 years) (Additional File 8). The Kabuki case training group is composed of two females and nine males with an average age at sample collection of 13.64 ± 5.64 years (range 1–20 years), while the control group is composed of two females and nine males with an average age at sample collection of 10 ± 3.43 years (range 5.5–13 years) (Additional File 7). The testing cohort includes the rest of the respective datasets without the samples with variants of unknown significance (VUS), as well as the data from Erasme Hospital.

Data preprocessing

Data from the original articles and in-house data were processed using the minfi library in R (Aryee et al. 2014). Probes with low-quality signals (detection p value > 0.01), cross-reactive probes, probes harboring single nucleotide polymorphisms (SNPs), and probes on sex chromosomes were removed. The cross-reactive probes were removed using the R library maxprobes. The total number of probes retained for each dataset is shown in Table 3. Normalization of the data was performed using the 6 different methods available in minfi using default parameters: Noob (Triche et al. 2013), Illumina, Funnorm (Fortin et al. 2014), Raw (no normalization), Quantile (Touleimat and Tost 2012) and Swan (Maksimovic et al. 2012). The Illumina normalization method implemented in minfi is the same method used in Choufani et al. (2015) and Butcher et al. (2017) to normalize the beta-values. The normalization, as for the preprocessing, was performed separately on training and testing sets in order to reduce the bias and avoid overestimating the accuracy of the classifiers in the testing sets. Quality Control (QC) of the samples was assessed using minfi package (Aryee et al. 2014): no bad quality samples were found, thus, all samples were kept for the analysis. We used the Illumina annotations based on the hg19 genome version in order to have the same, comparable positions and annotations for all array versions.

Table 3 Nnumber of probes remaining after preprocessing and filtering together with their percentagesDifferentially methylated regions (DMRs) episignatures

To identify DMRs in the Kabuki and Sotos training sets, we used the ChAMP library available in R on the training set (Tian et al. 2017). Particularly, the DMR function has been adopted specifying bumphunter as DMRs detection function (Jaffe et al. 2012). The following non-default parameters were used: number of bootstraps = 1000; minimum number of probes per DMR: 10. For both syndromes, this function was run on the training set 6 times: once per normalization method. The cutoff parameter was decided by the software, except for the Sotos dataset, where we applied a more stringent filter in order to reduce the number of DMRs found. Thus, the cutoff parameter was set as 0.7 for Sotos syndrome and kept as 0.292 for Kabuki syndrome. Only DMRs with an adjusted p value < 0.05 were retained. Once the DMRs were computed, the median methylation value of the probes included in each DMR was extracted and considered as the methylation value for that DMR. The methylation values of the DMRs were computed separately for the training and the testing sets.

Hierarchical clustering and UMAP

Hierarchical clustering was performed using the seaborn clustermap function, with average as linkage method and cityblock (manhattan) as the distance metric to cluster samples, which is preferred for Euclidean distance for high-dimensional application (Aggarwal et al. 2001). Python umap package was used to visualize the samples in a two-dimensional space, using beta-values from the DMC-esig or median beta-values from the DMRs in the DMR-esig, respectively. Data were scaled with Python scikit-learn StandardScaler function prior to using UMAP.

Machine learning classifiers for discriminating between syndrome and non-syndrome samples

Support Vector Machine (SVM), Random Forest (RF), and Logistic Regression regularized with penalization term (PLR) were trained either on the beta-values of the DMC-esig, or on the median beta-values of the DMRs constituting the DMR-esig. Moreover, for each syndrome, the models were trained on each of the six normalized training sets as described above. The optimization algorithm of PLR was set as liblinear within the solver parameter, while we kept the default L2 penalty with a dual formulation. The hyperparameters were specified within a GridSearch Analysis and the optimal ones were chosen through a leave-one-out cross-validation (LOOCV) (Additional File 9). The classifiers were then tested on the test sets to assess their performances on new, unknown samples. The prediction probabilities were also extracted from the results. Matthew’s correlation coefficient (MCC) was used to evaluate the models’ performances. Indeed, MCC is preferred to the F1 score in binary classification tasks due to its higher robustness to imbalanced datasets (Chicco and Jurman 2020). Model training, testing, and performance assessment were done using scikit-learn. The procedure adopted to build classification models for both Sotos and Kabuki syndrome is summarized in Fig. 6.

Fig. 6figure 6

Procedure adopted to build classification models based on DMR-esigs for both Kabuki and Sotos syndromes

Testing robustness for missing probes

To avoid missing values in the DMC-esig, nan resulting from CpGs removal (due to new array version and/or low-quality signal probes) were imputed using pandas.fillna(method = ‘ffill’). In order to test the robustness of the different classifiers, we also removed from all the testing datasets 5, 10, 20 or 30% of CpGs from the whole array after normalization. We then imputed the missing probes with the same procedure adopted above. This process simulates a scenario in which the release of new technology and the removal of low-quality signal probes introduce missing values in the samples to be classified with existing models. In the DMR-esig, no imputation was performed and for each DMR the median methylation value has been computed on the remaining probes within that DMR. Then, prediction performances and probabilities for each model have been extracted in order to evaluate the robustness of both approaches. We repeated this process ten times, using different random seeds to select the CpGs to remove and computed the mean and the standard deviation of the MCC from all individual performance evaluations.

Validation of the DMR-esig models on external datasets

To assess the specificity of the models trained on DMR-esig across all array versions, a cohort of 521 IMEPIC control samples (GSE152026), 656 IM450K control samples (GSE40279) and 4 additional IMEPICv2 control samples (GSE222919) were used. Furthermore, the models were applied to a cohort of 489 IM450K unresolved cases (GSE89353). Data from GSE89353, GSE222919, GSE152026, and GSE40279 were retrieved already preprocessed and normalized using BMIQ (Quantile), Illumina GenomeStudio, the wateRmelon dasen function and Illumina GenomeStudio, respectively, as described in the original studies (Barbosa et al. 2018; Noguera-Castells et al. 2023; Teschendorff et al. 2013; Hannum et al. 2013). Beta-values were used as such without any further processing.

Functional analysis

All analyses were performed using R (version 4.1.2). For both Kabuki and Sotos DMR-esigs using Illumina, Noob and Funnorm normalization methods, we extracted the CpGs belonging to the DMRs of each episignature. Then, the CpGs were annotated to the genome assembly GRCh37 (hg19) using annotatr (v. 1.24.0). Genic annotations included regions 1-5 Kb upstream of the transcription start site (TSS), gene promoters (< 1 Kb upstream of the TSS), intergenic regions, 5’UTRs and 3’UTRs, exons, introns and intron–exon boundaries, using the following built-in annotations: ‘hg19_basicgenes’, ‘hg19_genes_intergenic’ and ‘hg19_genes_intronexonboundaries’. The UTRs, exons, introns and intron–exon boundaries were considered part of the “gene body” category. The CpGs genomic annotations of the episignatures were then compared to the CpGs genomic annotations of the entire corresponding training set in order to understand which regions are most enriched in the DMR/esigs.

Gene ontology enrichment analysis was performed using DAVID (Huang et al. 2009) to identify the biological processes most enriched. The genes used for the enrichment analysis were only the in-common genes among the DMRIllumina, DMRNoob and DMRFunnorm episignatures. Enriched terms were filtered at adjusted P value < 0.05 (Benjamini–Hochberg procedure).

Software versions

Analyses were performed either in Python (version: 3.8.10) or R (version 4.1.2).

The following version of the packages were used: minfi (version 1.40.0); maxprobes (version 0.0.2); ChAMP (version 2.24.0); seaborn (version 0.12.2); umap (version 0.5.3); scikit-learn (version: 1.2.2); pandas (version: 1.5.3).

留言 (0)

沒有登入
gif